Commit f6a38e0

authored
Wall-time/all tasks profiler (#55889)
One limitation of sampling CPU/thread profiles, as is currently done in Julia, is that they primarily capture samples from CPU-intensive tasks. If many tasks are performing IO or contending for concurrency primitives like semaphores, these tasks won’t appear in the profile, as they aren't scheduled on OS threads sampled by the profiler. A wall-time profiler, like the one implemented in this PR, samples tasks regardless of OS thread scheduling. This enables profiling of IO-heavy tasks and detecting areas of heavy contention in the system. Co-developed with @nickrobinson251.
1 parent f1a90e0 commit f6a38e0

19 files changed: +796 -231 lines

NEWS.md (+2)

@@ -213,4 +213,6 @@ External dependencies
 Tooling Improvements
 --------------------
 
+- A wall-time profiler is now available for users who need a sampling profiler that captures tasks regardless of their scheduling or running state. This type of profiler enables profiling of I/O-heavy tasks and helps detect areas of heavy contention in the system ([#55889]).
+
 <!--- generated by NEWS-update.jl: -->
doc/src/manual/img/cpu-profile.png (144 KB, new binary file)
(additional new binary image files, including one of 351 KB)
doc/src/manual/profile.md (+214)

@@ -297,6 +297,220 @@ Of course, you can decrease the delay as well as increase it; however, the overhead of profiling
 grows once the delay becomes similar to the amount of time needed to take a backtrace (~30 microseconds
 on the author's laptop).
## Wall-time Profiler

### Introduction & Problem Motivation

The profiler described in the previous section is a sampling CPU profiler. At a high level, the profiler periodically stops all Julia compute threads to collect their backtraces and estimates the time spent in each function based on the number of backtrace samples that include a frame from that function. However, note that only tasks currently running on system threads just before the profiler stops them will have their backtraces collected.

While this profiler is typically well-suited for workloads where the majority of tasks are compute-bound, it is less helpful for systems where most tasks are IO-heavy or for diagnosing contention on synchronization primitives in your code.

Let's consider this simple workload:

```julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const N_SPAWNED_TASKS = (1 << 10)
const WAIT_TIME_NS = 10_000_000

function spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            take!(ch)
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        put!(ch, i)
        busywait()
    end
end

Profile.@profile main()
```
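The listing above loads PProf but stops after collecting samples. To actually render the flame graph referenced below, one would typically export the collected profile through PProf's top-level `pprof` entry point; this is a sketch of the usual workflow, not part of the committed example:

```julia
using Profile, PProf

# After Profile.@profile main() has filled the sample buffer,
# serialize it and open PProf's interactive web viewer on a local port.
pprof()
```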
Our goal is to detect whether there is contention on the `ch` channel, i.e., whether the number of waiters is excessive given the rate at which work items are being produced in the channel.

If we run this, we obtain the following [PProf](https://github.com/JuliaPerf/PProf.jl) flame graph:

![CPU Profile](./img/cpu-profile.png)

This profile provides no information to help determine where contention occurs in the system's synchronization primitives. Waiters on a channel will be blocked and descheduled, meaning no system thread will be running the tasks assigned to those waiters, and as a result, they won't be sampled by the profiler.
### Wall-time Profiler

Instead of sampling threads (and thus only sampling tasks that are running), a wall-time task profiler samples tasks independently of their scheduling state. For example, tasks that are sleeping on a synchronization primitive at the time the profiler is running will be sampled with the same probability as tasks that were actively running when the profiler attempted to capture backtraces.

This approach allows us to construct a profile where backtraces from tasks blocked on the `ch` channel, as in the example above, are actually represented.

Let's run the same example, but now with a wall-time profiler:

```julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const N_SPAWNED_TASKS = (1 << 10)
const WAIT_TIME_NS = 10_000_000

function spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            take!(ch)
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        put!(ch, i)
        busywait()
    end
end

Profile.@profile_walltime main()
```
We obtain the following flame graph:

![Wall-time Profile Channel](./img/wall-time-profiler-channel-example.png)

We see that a large number of samples come from channel-related `take!` functions, which allows us to determine that there is indeed an excessive number of waiters in `ch`.
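Besides the PProf flame graph, the same sample buffer can be summarized with the standard library's text report; a minimal sketch (assuming the profile collected above is still in the buffer):

```julia
using Profile

# Flat text listing of the collected samples, sorted so the most
# frequently sampled frames (e.g. channel-related `take!`) come first.
Profile.print(format = :flat, sortedby = :count)
```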
### A Compute-Bound Workload

Despite the wall-time profiler sampling all live tasks in the system and not just the currently running ones, it can still be helpful for identifying performance hotspots, even if your code is compute-bound. Let's consider a simple example:

```julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const MAX_ITERS = (1 << 22)
const N_TASKS = (1 << 12)

function spawn_a_task_waiting_on_channel()
    Threads.@spawn begin
        take!(ch)
    end
end

function sum_of_sqrt()
    sum_of_sqrt = 0.0
    for i in 1:MAX_ITERS
        sum_of_sqrt += sqrt(i)
    end
    return sum_of_sqrt
end

function spawn_a_bunch_of_compute_heavy_tasks()
    Threads.@sync begin
        for i in 1:N_TASKS
            Threads.@spawn begin
                sum_of_sqrt()
            end
        end
    end
end

function main()
    spawn_a_task_waiting_on_channel()
    spawn_a_bunch_of_compute_heavy_tasks()
end

Profile.@profile_walltime main()
```

After collecting a wall-time profile, we get the following flame graph:

![Wall-time Profile Compute-Bound](./img/wall-time-profiler-compute-bound-example.png)

Notice how many of the samples contain `sum_of_sqrt`, which is the expensive compute function in our example.
### Identifying Task Sampling Failures in your Profile

In the current implementation, the wall-time profiler attempts to sample from tasks that have been alive since the last garbage collection, along with those created afterward. However, if most tasks are extremely short-lived, you may end up sampling tasks that have already completed, resulting in missed backtrace captures.

If you encounter samples containing `failed_to_sample_task_fun` or `failed_to_stop_thread_fun`, this likely indicates a high volume of short-lived tasks, which prevented their backtraces from being collected.

Let's consider this simple example:
```julia
using Base.Threads
using Profile
using PProf

const N_SPAWNED_TASKS = (1 << 16)
const WAIT_TIME_NS = 100_000

function spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            # Do nothing
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    GC.enable(false)
    spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        busywait()
    end
    GC.enable(true)
end

Profile.@profile_walltime main()
```
505+
506+
Notice that the tasks spawned in `spawn_a_bunch_of_short_lived_tasks` are extremely short-lived. Since these tasks constitute the majority in the system, we will likely miss capturing a backtrace for most sampled tasks.
507+
508+
After collecting a wall-time profile, we obtain the following flame graph:
509+
510+
![Task Sampling Failure](./img/task-sampling-failure.png)()
511+
512+
The large number of samples from `failed_to_stop_thread_fun` confirms that we have a significant number of short-lived tasks in the system.
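The wall-time profiler shares its sample buffer and sampling period with the CPU profiler; before a long profiling run over very many tasks, it can help to enlarge the buffer and lengthen the delay with `Profile.init` (a general knob documented earlier in this chapter, not a fix specific to sampling failures):

```julia
using Profile

# Allocate room for 10^7 sample slots and sample every 10 ms, so a
# long task-heavy run does not overflow the buffer mid-profile.
Profile.init(n = 10^7, delay = 0.01)
```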
## Memory allocation analysis

One of the most common techniques to improve performance is to reduce memory allocation. Julia

src/gc-stacks.c (+33)

@@ -296,6 +296,39 @@ void sweep_stack_pool_loop(void) JL_NOTSAFEPOINT
     jl_atomic_fetch_add(&gc_n_threads_sweeping_stacks, -1);
 }
 
+// Builds a list of the live tasks. Racy: `live_tasks` can expand at any time.
+arraylist_t *jl_get_all_tasks_arraylist(void) JL_NOTSAFEPOINT
+{
+    arraylist_t *tasks = (arraylist_t*)malloc_s(sizeof(arraylist_t));
+    arraylist_new(tasks, 0);
+    size_t nthreads = jl_atomic_load_acquire(&jl_n_threads);
+    jl_ptls_t *allstates = jl_atomic_load_relaxed(&jl_all_tls_states);
+    for (size_t i = 0; i < nthreads; i++) {
+        // skip GC threads...
+        if (gc_is_collector_thread(i)) {
+            continue;
+        }
+        jl_ptls_t ptls2 = allstates[i];
+        if (ptls2 == NULL) {
+            continue;
+        }
+        jl_task_t *t = ptls2->root_task;
+        if (t->ctx.stkbuf != NULL) {
+            arraylist_push(tasks, t);
+        }
+        small_arraylist_t *live_tasks = &ptls2->gc_tls_common.heap.live_tasks;
+        size_t n = mtarraylist_length(live_tasks);
+        for (size_t i = 0; i < n; i++) {
+            jl_task_t *t = (jl_task_t*)mtarraylist_get(live_tasks, i);
+            assert(t != NULL);
+            if (t->ctx.stkbuf != NULL) {
+                arraylist_push(tasks, t);
+            }
+        }
+    }
+    return tasks;
+}
+
 JL_DLLEXPORT jl_array_t *jl_live_tasks(void)
 {
     size_t nthreads = jl_atomic_load_acquire(&jl_n_threads);
src/gc-stock.c (+22 -2)

@@ -1025,7 +1025,22 @@ void gc_sweep_wait_for_all_stacks(void) JL_NOTSAFEPOINT
     }
 }
 
-void sweep_stack_pools(jl_ptls_t ptls) JL_NOTSAFEPOINT
+void sweep_mtarraylist_buffers(void) JL_NOTSAFEPOINT
+{
+    for (int i = 0; i < gc_n_threads; i++) {
+        jl_ptls_t ptls = gc_all_tls_states[i];
+        if (ptls == NULL) {
+            continue;
+        }
+        small_arraylist_t *buffers = &ptls->lazily_freed_mtarraylist_buffers;
+        void *buf;
+        while ((buf = small_arraylist_pop(buffers)) != NULL) {
+            free(buf);
+        }
+    }
+}
+
+void sweep_stack_pools_and_mtarraylist_buffers(jl_ptls_t ptls) JL_NOTSAFEPOINT
 {
     // initialize ptls index for parallel sweeping of stack pools
     assert(gc_n_threads);

@@ -1035,9 +1050,12 @@ sweep_stack_pools(jl_ptls_t ptls) JL_NOTSAFEPOINT
     else
         jl_atomic_store_relaxed(&gc_stack_free_idx, stack_free_idx + 1);
     jl_atomic_store_release(&gc_ptls_sweep_idx, gc_n_threads - 1); // idx == gc_n_threads = release stacks to the OS so it's serial
+    uv_mutex_lock(&live_tasks_lock);
     gc_sweep_wake_all_stacks(ptls);
     sweep_stack_pool_loop();
     gc_sweep_wait_for_all_stacks();
+    sweep_mtarraylist_buffers();
+    uv_mutex_unlock(&live_tasks_lock);
 }
 
 static void gc_pool_sync_nfree(jl_gc_pagemeta_t *pg, jl_taggedvalue_t *last) JL_NOTSAFEPOINT

@@ -3084,7 +3102,7 @@ static int _jl_gc_collect(jl_ptls_t ptls, jl_gc_collection_t collection)
     current_sweep_full = sweep_full;
     sweep_weak_refs();
     uint64_t stack_pool_time = jl_hrtime();
-    sweep_stack_pools(ptls);
+    sweep_stack_pools_and_mtarraylist_buffers(ptls);
     stack_pool_time = jl_hrtime() - stack_pool_time;
     gc_num.total_stack_pool_sweep_time += stack_pool_time;
     gc_num.stack_pool_sweep_time = stack_pool_time;

@@ -3453,6 +3471,8 @@ void jl_init_thread_heap(jl_ptls_t ptls)
     jl_atomic_store_relaxed(&q->bottom, 0);
     jl_atomic_store_relaxed(&q->array, wsa2);
     arraylist_new(&mq->reclaim_set, 32);
+    // Initialize `lazily_freed_mtarraylist_buffers`
+    small_arraylist_new(&ptls->lazily_freed_mtarraylist_buffers, 0);
 
     memset(&ptls->gc_tls_common.gc_num, 0, sizeof(ptls->gc_tls_common.gc_num));
     jl_atomic_store_relaxed(&ptls->gc_tls_common.gc_num.allocd, -(int64_t)gc_num.interval);
src/init.c (+4)

@@ -744,6 +744,10 @@ JL_DLLEXPORT void julia_init(JL_IMAGE_SEARCH rel)
 
     // initialize symbol-table lock
     uv_mutex_init(&symtab_lock);
+    // initialize the live tasks lock
+    uv_mutex_init(&live_tasks_lock);
+    // initialize the profiler buffer lock
+    uv_mutex_init(&bt_data_prof_lock);
 
     // initialize backtraces
     jl_init_profile_lock();
src/julia_internal.h (+29)

@@ -211,6 +211,35 @@ JL_DLLEXPORT void jl_unlock_profile_wr(void) JL_NOTSAFEPOINT JL_NOTSAFEPOINT_LEAVE
 int jl_lock_stackwalk(void) JL_NOTSAFEPOINT JL_NOTSAFEPOINT_ENTER;
 void jl_unlock_stackwalk(int lockret) JL_NOTSAFEPOINT JL_NOTSAFEPOINT_LEAVE;
 
+arraylist_t *jl_get_all_tasks_arraylist(void) JL_NOTSAFEPOINT;
+typedef struct {
+    size_t bt_size;
+    int tid;
+} jl_record_backtrace_result_t;
+JL_DLLEXPORT jl_record_backtrace_result_t jl_record_backtrace(jl_task_t *t, struct _jl_bt_element_t *bt_data,
+                                                              size_t max_bt_size, int all_tasks_profiler) JL_NOTSAFEPOINT;
+extern volatile struct _jl_bt_element_t *profile_bt_data_prof;
+extern volatile size_t profile_bt_size_max;
+extern volatile size_t profile_bt_size_cur;
+extern volatile int profile_running;
+extern volatile int profile_all_tasks;
+// Ensures that we can safely read the `live_tasks` field of every TLS when profiling.
+// We want to avoid the case that a GC gets interleaved with `jl_profile_task` and shrinks
+// the `live_tasks` array while we are reading it or frees tasks that are being profiled.
+// Because of that, this lock must be held in `jl_profile_task` and `sweep_stack_pools_and_mtarraylist_buffers`.
+extern uv_mutex_t live_tasks_lock;
+// Ensures that we can safely write to `profile_bt_data_prof` and `profile_bt_size_cur`.
+// We want to avoid the case that:
+// - We start to profile a task very close to the profiling time window end.
+// - The profiling time window ends and we start to read the profile data in a compute thread.
+// - We write to the profile in a profiler thread while the compute thread is reading it.
+// Locking discipline: `bt_data_prof_lock` must be held inside the scope of `live_tasks_lock`.
+extern uv_mutex_t bt_data_prof_lock;
+#define PROFILE_STATE_THREAD_NOT_SLEEPING (1)
+#define PROFILE_STATE_THREAD_SLEEPING (2)
+#define PROFILE_STATE_WALL_TIME_PROFILING (3)
+void jl_profile_task(void);
+
 // number of cycles since power-on
 static inline uint64_t cycleclock(void) JL_NOTSAFEPOINT
 {
src/julia_threads.h (+1)

@@ -157,6 +157,7 @@ typedef struct _jl_tls_states_t {
     int finalizers_inhibited;
     jl_gc_tls_states_t gc_tls; // this is very large, and the offset of the first member is baked into codegen
     jl_gc_tls_states_common_t gc_tls_common; // common tls for both GCs
+    small_arraylist_t lazily_freed_mtarraylist_buffers;
     volatile sig_atomic_t defer_signal;
     _Atomic(struct _jl_task_t*) current_task;
     struct _jl_task_t *next_task;
src/mtarraylist.c (+1 -1)

@@ -37,7 +37,7 @@ static void mtarraylist_resizeto(small_mtarraylist_t *a, size_t len, size_t newlen)
         a->max = nm;
         if (olditems != (void*)&a->_space[0]) {
             jl_task_t *ct = jl_current_task;
-            jl_gc_add_quiescent(ct->ptls, (void**)olditems, free);
+            small_arraylist_push(&ct->ptls->lazily_freed_mtarraylist_buffers, olditems);
         }
     }
 }