One limitation of sampling CPU/thread profiles, as is currently done in
Julia, is that they primarily capture samples from CPU-intensive tasks.
If many tasks are performing IO or contending for concurrency primitives
like semaphores, these tasks won’t appear in the profile, as they aren't
scheduled on OS threads sampled by the profiler.
A wall-time profiler, like the one implemented in this PR, samples tasks
regardless of OS thread scheduling. This enables profiling of IO-heavy
tasks and detection of areas of heavy contention in the system.
Co-developed with @nickrobinson251.
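
For a quick sense of the workflow, here is a minimal sketch (`my_workload` is a placeholder for your own entry point; the rendering step assumes PProf.jl's `pprof()` entry point):

```Julia
using Profile, PProf

# Sample `my_workload` (a placeholder) on wall-clock time, capturing tasks
# whether they are running, waiting on IO, or blocked on locks/channels.
Profile.@profile_walltime my_workload()

# Inspect the samples in the terminal...
Profile.print()

# ...or render them as an interactive flame graph via PProf.
pprof()
```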
**NEWS.md**

Tooling Improvements
--------------------

- A wall-time profiler is now available for users who need a sampling profiler that captures tasks regardless of their scheduling or running state. This type of profiler enables profiling of I/O-heavy tasks and helps detect areas of heavy contention in the system ([#55889]).
**doc/src/manual/profile.md**

Of course, you can decrease the delay as well as increase it; however, the overhead grows once the delay becomes similar to the amount of time needed to take a backtrace (~30 microseconds on the author's laptop).
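For reference, the sampling delay is configured through the standard `Profile.init` API (the values below are purely illustrative):

```Julia
using Profile

# Query the current buffer size and sampling delay (in seconds).
Profile.init()

# Example: a larger sample buffer and a 1 ms delay between samples.
Profile.init(n = 10^7, delay = 0.001)
```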
## Wall-time Profiler

### Introduction & Problem Motivation

The profiler described in the previous section is a sampling CPU profiler. At a high level, the profiler periodically stops all Julia compute threads to collect their backtraces and estimates the time spent in each function based on the number of backtrace samples that include a frame from that function. However, note that only tasks currently running on system threads just before the profiler stops them will have their backtraces collected.

While this profiler is typically well-suited for workloads where the majority of tasks are compute-bound, it is less helpful for systems where most tasks are IO-heavy or for diagnosing contention on synchronization primitives in your code.

Let's consider this simple workload:
```Julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const N_SPAWNED_TASKS = (1 << 10)
const WAIT_TIME_NS = 10_000_000

function spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            take!(ch)
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        put!(ch, i)
        busywait()
    end
end

Profile.@profile main()
```
Our goal is to detect whether there is contention on the `ch` channel—i.e., whether the number of waiters is excessive given the rate at which work items are being produced in the channel.

If we run this, we obtain the following [PProf](https://github.com/JuliaPerf/PProf.jl) flame graph:

*(flame graph image)*
353
+
354
+
This profile provides no information to help determine where contention occurs in the system’s synchronization primitives. Waiters on a channel will be blocked and descheduled, meaning no system thread will be running the tasks assigned to those waiters, and as a result, they won't be sampled by the profiler.
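
For reference, the flame graphs shown in this section can be produced from the collected samples with PProf; a minimal sketch, assuming PProf.jl's default `pprof()` entry point:

```Julia
using Profile, PProf

# Collect the CPU profile as in the example above, then serve the
# interactive flame graph in the browser.
Profile.@profile main()
pprof()
```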
355
+
356
+
### Wall-time Profiler
357
+
358
+
Instead of sampling threads—and thus only sampling tasks that are running—a wall-time task profiler samples tasks independently of their scheduling state. For example, tasks that are sleeping on a synchronization primitive at the time the profiler is running will be sampled with the same probability as tasks that were actively running when the profiler attempted to capture backtraces.
359
+
360
+
This approach allows us to construct a profile where backtraces from tasks blocked on the `ch` channel, as in the example above, are actually represented.
361
+
362
+
Let's run the same example, but now with a wall-time profiler:
```Julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const N_SPAWNED_TASKS = (1 << 10)
const WAIT_TIME_NS = 10_000_000

function spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            take!(ch)
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        put!(ch, i)
        busywait()
    end
end

Profile.@profile_walltime main()
```

After collecting a wall-time profile, we obtain the following flame graph:

*(flame graph image)*

We see that a large number of samples come from channel-related `take!` functions, which allows us to determine that there is indeed an excessive number of waiters in `ch`.
### A Compute-Bound Workload

Despite the wall-time profiler sampling all live tasks in the system and not just the currently running ones, it can still be helpful for identifying performance hotspots, even if your code is compute-bound. Let’s consider a simple example:
```Julia
using Base.Threads
using Profile
using PProf

ch = Channel(1)

const MAX_ITERS = (1 << 22)
const N_TASKS = (1 << 12)

function spawn_a_task_waiting_on_channel()
    Threads.@spawn begin
        take!(ch)
    end
end

function sum_of_sqrt()
    sum_of_sqrt = 0.0
    for i in 1:MAX_ITERS
        sum_of_sqrt += sqrt(i)
    end
    return sum_of_sqrt
end

function spawn_a_bunch_of_compute_heavy_tasks()
    Threads.@sync begin
        for i in 1:N_TASKS
            Threads.@spawn begin
                sum_of_sqrt()
            end
        end
    end
end

function main()
    spawn_a_task_waiting_on_channel()
    spawn_a_bunch_of_compute_heavy_tasks()
end

Profile.@profile_walltime main()
```
After collecting a wall-time profile, we get the following flame graph:

*(flame graph image)*

Notice how many of the samples contain `sum_of_sqrt`, which is the expensive compute function in our example.
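
The same hotspot can also be inspected in the terminal; a sketch using standard `Profile.print` options, assuming they apply to wall-time samples as well (the thresholds are only illustrative):

```Julia
using Profile

# Tree view of the collected samples; `noisefloor = 2` hides frames that fall
# within the sampling noise of their parent, and `mincount = 10` drops rarely
# sampled frames, which should make the hot `sum_of_sqrt` frames easier to spot.
Profile.print(noisefloor = 2, mincount = 10)
```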
### Identifying Task Sampling Failures in your Profile

In the current implementation, the wall-time profiler attempts to sample from tasks that have been alive since the last garbage collection, along with those created afterward. However, if most tasks are extremely short-lived, you may end up sampling tasks that have already completed, resulting in missed backtrace captures.

If you encounter samples containing `failed_to_sample_task_fun` or `failed_to_stop_thread_fun`, this likely indicates a high volume of short-lived tasks, which prevented their backtraces from being collected.

Let's consider this simple example:
```Julia
using Base.Threads
using Profile
using PProf

const N_SPAWNED_TASKS = (1 << 16)
const WAIT_TIME_NS = 100_000

function spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            # Do nothing
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    GC.enable(false)
    spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        busywait()
    end
    GC.enable(true)
end

Profile.@profile_walltime main()
```
Notice that the tasks spawned in `spawn_a_bunch_of_short_lived_tasks` are extremely short-lived. Since these tasks constitute the majority in the system, we will likely miss capturing a backtrace for most sampled tasks.

After collecting a wall-time profile, we obtain the following flame graph:

*(flame graph image)*
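
One way to gauge how many captures were missed is to look for these frames in a flat report; a sketch, again assuming the standard `Profile.print` options apply to wall-time samples:

```Julia
using Profile

# Flat report sorted by sample count; `failed_to_sample_task_fun` or
# `failed_to_stop_thread_fun` frames near the top indicate that many
# backtrace captures were missed.
Profile.print(format = :flat, sortedby = :count)
```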