Tear-down Races

Primacy of SIGKILL
Final callbacks
Engine and task pointers
Serialization of DEATH and REAP
Interlock with final callbacks
Using utrace_barrier

Primacy of SIGKILL

Ordinarily, synchronization issues for tracing engines are kept fairly straightforward by using UTRACE_STOP. You ask a thread to stop, and once it makes the report_quiesce callback it cannot do anything else that would result in another callback until you let it resume with a utrace_control call. This simple arrangement avoids complex and error-prone code in each of a tracing engine's event callbacks to keep them serialized with the engine's other operations done on that thread from another thread of control.

However, giving tracing engines complete power to keep a traced thread stuck in place runs afoul of a more important kind of simplicity that the kernel overall guarantees: nothing can prevent or delay SIGKILL from making a thread die and release its resources. To preserve this property, SIGKILL is treated as a special case: it can break UTRACE_STOP as nothing else normally can. This includes both explicit SIGKILL signals and the implicit SIGKILL sent to every other thread in the same thread group by a thread doing an exec, processing a fatal signal, or making an exit_group system call. A tracing engine attached to a thread can prevent that thread from beginning an exit or exec, or from dying by a signal other than SIGKILL; but once the operation begins, no tracing engine can prevent or delay the death of all other threads in the same thread group.

Final callbacks

The report_reap callback is always the final event in the life cycle of a traced thread. Tracing engines can use this as the trigger to clean up their own data structures. The report_death callback is always the penultimate event a tracing engine might see; it's seen unless the thread was already in the midst of dying when the engine attached. Many tracing engines will have no interest in when a parent reaps a dead process, and nothing they want to do with a zombie thread once it dies; for them, the report_death callback is the natural place to clean up data structures and detach. To facilitate writing such engines robustly, given the asynchrony of SIGKILL, and without error-prone manual implementation of synchronization schemes, the utrace infrastructure provides some special guarantees about the report_death and report_reap callbacks. It still takes some care to be sure your tracing engine is robust to tear-down races, but these rules make it reasonably straightforward and concise to handle a lot of corner cases correctly.

Engine and task pointers

The first sort of guarantee concerns the core data structures themselves. struct utrace_engine is a reference-counted data structure. While you hold a reference, an engine pointer will always stay valid so that you can safely pass it to any utrace call. Each call to utrace_attach_task or utrace_attach_pid returns an engine pointer with a reference belonging to the caller. You own that reference until you drop it using utrace_engine_put. There is an implicit reference on the engine while it is attached. So if you drop your only reference, and then use utrace_attach_task without UTRACE_ATTACH_CREATE to look up that same engine, you will get the same pointer with a new reference to replace the one you dropped, just like calling utrace_engine_get. When an engine has been detached, either explicitly with UTRACE_DETACH or implicitly after report_reap, then any references you hold are all that keep the old engine pointer alive.

There is nothing a kernel module can do to keep a struct task_struct alive outside of rcu_read_lock. When the task dies and is reaped by its parent (or itself), that structure can be freed so that any dangling pointers you have stored become invalid. utrace will not prevent this, but it can help you detect it safely. By definition, a task that has been reaped has had all its engines detached. All utrace calls can be safely called on a detached engine if the caller holds a reference on that engine pointer, even if the task pointer passed in the call is invalid. All calls return -ESRCH for a detached engine, which tells you that the task pointer you passed could be invalid now. Since utrace_control and utrace_set_events do not block, you can call them inside an rcu_read_lock section; if they return anything other than -ESRCH, you can be sure the task pointer remains valid until rcu_read_unlock. The infrastructure never holds task references of its own. Though neither rcu_read_lock nor any other lock is held while making a callback, it's always guaranteed that the struct task_struct and the struct utrace_engine passed as arguments remain valid until the callback function returns.
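The rcu_read_lock pattern described above can be sketched as follows. This is a minimal user-space model, not kernel code: the struct layouts, the rcu_depth counter, and the bodies of rcu_read_lock, rcu_read_unlock and utrace_control are stand-ins invented for illustration, and stop_and_get_pid is a hypothetical helper name. Only the order of the calls and the -ESRCH check are taken from the text.

```c
#include <errno.h>
#include <stddef.h>

/* User-space stand-ins (hypothetical); the real ones live in the kernel. */
struct task_struct { int pid; };
struct utrace_engine { int detached; };     /* mock state, not the real layout */
enum utrace_resume_action { UTRACE_STOP };  /* only the value used here */

static int rcu_depth;                       /* mock: tracks read-side sections */
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Mock: reports -ESRCH for a detached engine and never touches the task. */
static int utrace_control(struct task_struct *task,
                          struct utrace_engine *engine,
                          enum utrace_resume_action action)
{
    (void)task;
    (void)action;
    return engine->detached ? -ESRCH : 0;
}

/*
 * Ask the thread to stop and, only if the engine is still attached,
 * read a field from the task. Returns the pid, or -ESRCH when the
 * task pointer may already be stale and must not be dereferenced.
 */
static int stop_and_get_pid(struct task_struct *task,
                            struct utrace_engine *engine)
{
    int ret;

    rcu_read_lock();
    ret = utrace_control(task, engine, UTRACE_STOP);
    if (ret != -ESRCH)
        ret = task->pid;    /* task stays valid until rcu_read_unlock() */
    rcu_read_unlock();
    return ret;
}
```

The point of the sketch is that the task pointer is dereferenced only between a non-ESRCH return and rcu_read_unlock; anything the caller needs after that has to be copied out first.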

The common means for safely holding task pointers that is available to kernel modules is to use struct pid, which permits put_pid from kernel modules. When using that, the calls utrace_attach_pid, utrace_control_pid, utrace_set_events_pid, and utrace_barrier_pid are available.

Serialization of DEATH and REAP

The second guarantee is the serialization of DEATH and REAP event callbacks for a given thread. The actual reaping by the parent (release_task call) can occur simultaneously while the thread is still doing the final steps of dying, including the report_death callback. If a tracing engine has requested both DEATH and REAP event reports, it's guaranteed that the report_reap callback will not be made until after the report_death callback has returned. If the report_death callback itself detaches from the thread, then the report_reap callback will never be made. Thus it is safe for a report_death callback to clean up data structures and detach.
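A report_death callback that relies on this guarantee might look like the sketch below. It is a user-space model under stated assumptions: the callback signature (engine, task, group_dead, signal, returning a u32 resume action) and the engine's data field follow one version of the utrace API and may differ in yours; struct my_state and my_report_death are hypothetical names; free stands in for kfree.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int u32;

/* User-space stand-ins (hypothetical); the real ones live in the kernel. */
struct task_struct;                           /* opaque here */
struct utrace_engine { void *data; };         /* engine-private data pointer */
enum utrace_resume_action { UTRACE_DETACH };  /* only the value used here */

/* Hypothetical per-thread state owned by the tracing engine. */
struct my_state { int events_seen; };

/*
 * Clean up and detach in one step. Because the callback detaches,
 * report_reap is never made for this engine, so nothing else will
 * try to free the same state.
 */
static u32 my_report_death(struct utrace_engine *engine,
                           struct task_struct *task,
                           bool group_dead, int signal)
{
    (void)task;
    (void)group_dead;
    (void)signal;

    free(engine->data);     /* kfree() in a real module */
    engine->data = NULL;
    return UTRACE_DETACH;
}
```

An engine written this way only needs to request DEATH events; requesting REAP as well would be harmless but redundant, since the detach prevents the later callback.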

Interlock with final callbacks

The final sort of guarantee is that a tracing engine will know for sure whether or not the report_death and/or report_reap callbacks will be made for a certain thread. These tear-down races are disambiguated by the error return values of utrace_set_events and utrace_control. Normally utrace_control called with UTRACE_DETACH returns zero, and this means that no more callbacks will be made. If the thread is in the midst of dying, it returns -EALREADY to indicate that the report_death callback may already be in progress; when you get this error, you know that any cleanup your report_death callback does is about to happen or has just happened. Note that if the report_death callback does not detach, the engine remains attached until the thread gets reaped. If the thread is in the midst of being reaped, utrace_control returns -ESRCH to indicate that the report_reap callback may already be in progress; this means the engine is implicitly detached when the callback completes. This makes it possible for a tracing engine that has decided asynchronously to detach from a thread to safely clean up its data structures, knowing that no report_death or report_reap callback will try to do the same. utrace_control with UTRACE_DETACH also returns -ESRCH when the struct utrace_engine has already been detached but is still a valid pointer because of its reference count. A tracing engine can use this to safely synchronize its own independent multiple threads of control with each other and with its event callbacks that detach.

In the same vein, utrace_set_events normally returns zero; if the target thread was stopped before the call, then after a successful call only the event callbacks requested in the new flags will be made. It fails with -EALREADY if you try to clear UTRACE_EVENT(DEATH) when the report_death callback may already have begun, if you try to clear UTRACE_EVENT(REAP) when the report_reap callback may already have begun, or if you try to newly set UTRACE_EVENT(DEATH) or UTRACE_EVENT(QUIESCE) when the target is already dead or dying. Like utrace_control, it returns -ESRCH when the engine has already been detached (including forcible detach on reaping). This lets the tracing engine know for sure which event callbacks it will or won't see after utrace_set_events has returned. By checking for errors, it can know whether to clean up its data structures immediately or to let its callbacks do the work.
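The way an asynchronous detach can act on these return values is sketched below. This is a user-space model, not kernel code: utrace_control is replaced by a stub that returns a preset value, and enum cleanup_owner and detach_async are hypothetical names. The sketch also assumes that the engine's own report_death or report_reap callback is where cleanup happens when the caller loses the race.

```c
#include <errno.h>
#include <stddef.h>

/* User-space stand-ins (hypothetical); the real ones live in the kernel. */
struct task_struct;                           /* opaque here */
struct utrace_engine { int mock_ret; };       /* value the stub will return */
enum utrace_resume_action { UTRACE_DETACH };  /* only the value used here */

static int utrace_control(struct task_struct *task,
                          struct utrace_engine *engine,
                          enum utrace_resume_action action)
{
    (void)task;
    (void)action;
    return engine->mock_ret;
}

/* Who is responsible for freeing the engine's data structures. */
enum cleanup_owner { CLEANUP_CALLER, CLEANUP_ELSEWHERE, CLEANUP_UNDECIDED };

static enum cleanup_owner detach_async(struct task_struct *task,
                                       struct utrace_engine *engine)
{
    switch (utrace_control(task, engine, UTRACE_DETACH)) {
    case 0:
        /* Detached: no more callbacks will be made, so clean up here. */
        return CLEANUP_CALLER;
    case -EALREADY:
        /* report_death is about to run or has just run. */
    case -ESRCH:
        /* report_reap may be running, or the engine is already detached. */
        return CLEANUP_ELSEWHERE;
    default:
        /* Other results, such as -EINPROGRESS, are not handled in this sketch. */
        return CLEANUP_UNDECIDED;
    }
}
```

A caller would free its data only on CLEANUP_CALLER; in every other case it drops its engine reference and leaves the data alone.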

Using utrace_barrier

When a thread is safely stopped, calling utrace_control with UTRACE_DETACH or calling utrace_set_events to disable some events ensures synchronously that your engine won't get any more of the callbacks that have been disabled (none at all when detaching). But these can also be used while the thread is not stopped, when it might be simultaneously making a callback to your engine. For this situation, these calls return -EINPROGRESS when it's possible a callback is in progress. If you are not prepared to have your old callbacks still run, then you can synchronize to be sure all the old callbacks are finished, using utrace_barrier. This is necessary if the kernel module containing your callback code is going to be unloaded.

After using UTRACE_DETACH once, further calls to utrace_control with the same engine pointer will return -ESRCH. In contrast, after getting -EINPROGRESS from utrace_set_events, you can call utrace_set_events again later; if it then returns zero, you know the old callbacks have finished.

Unlike all other calls, utrace_barrier (and utrace_barrier_pid) will accept any engine pointer you hold a reference on, even if UTRACE_DETACH has already been used. After any utrace_control or utrace_set_events call (these do not block), you can call utrace_barrier to block until callbacks have finished. This returns -ESRCH only if the engine is completely detached (finished all callbacks). Otherwise it waits until the thread is definitely not in the midst of a callback to this engine and then returns zero, but can return -ERESTARTSYS if its wait is interrupted.
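The detach-then-barrier sequence for a thread that is not stopped can be sketched as follows. This is a user-space model under stated assumptions: utrace_control and utrace_barrier are stubs that return preset values, the struct layout is invented, detach_and_quiesce is a hypothetical helper name, and ERESTARTSYS is defined locally because it is a kernel-internal error code that user-space headers do not provide.

```c
#include <errno.h>
#include <stddef.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512   /* kernel-internal errno, absent in user space */
#endif

/* User-space stand-ins (hypothetical); the real ones live in the kernel. */
struct task_struct;                           /* opaque here */
struct utrace_engine {
    int control_ret;                          /* what the control stub returns */
    int barrier_ret;                          /* what the barrier stub returns */
};
enum utrace_resume_action { UTRACE_DETACH };  /* only the value used here */

static int utrace_control(struct task_struct *task,
                          struct utrace_engine *engine,
                          enum utrace_resume_action action)
{
    (void)task;
    (void)action;
    return engine->control_ret;
}

static int utrace_barrier(struct task_struct *task,
                          struct utrace_engine *engine)
{
    (void)task;
    return engine->barrier_ret;
}

/*
 * Detach and, if a callback might still be running, wait for it.
 * Returns 0 once no callback to this engine is in progress,
 * -ERESTARTSYS if the wait was interrupted, and passes through any
 * other utrace_control result for the caller to interpret.
 */
static int detach_and_quiesce(struct task_struct *task,
                              struct utrace_engine *engine)
{
    int ret = utrace_control(task, engine, UTRACE_DETACH);

    if (ret != -EINPROGRESS)
        return ret;

    ret = utrace_barrier(task, engine);
    /* -ESRCH here means fully detached with all callbacks finished. */
    return ret == -ESRCH ? 0 : ret;
}
```

A module preparing to unload would treat -ERESTARTSYS as "not yet safe" and either retry the barrier or abandon the unload, since its callback code may still be executing.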