std::slice::partition_point performance regression when using rustc 1.82+ #138796

marvin-j97 · 2025-03-21T17:54:27Z

In https://github.com/fjall-rs/lsm-tree I am heavily using partition_point for various binary searches.

I found the std implementation to be somewhat slow.

As you can see below, the performance regression happens between rustc 1.81 and 1.82.

Comparing std between rustc versions:

git clone https://github.com/fjall-rs/lsm-tree && cd lsm-tree
cargo +1.81 bench --bench tli
cargo +1.82 bench --bench tli

The std implementation performs much worse in rustc 1.82 compared to 1.81.

As a sanity check, I wrote a simple alternative and found it performed much better (comparable to how partition_point used to be):

https://github.com/fjall-rs/lsm-tree/blob/0dca39eb54690beee2481cc3e31c9fc78d9e3fca/src/binary_search.rs#L9-L35

Reproduce (using monkey patch):

git clone https://github.com/fjall-rs/lsm-tree && cd lsm-tree
cargo +1.85 bench --bench tli
git checkout perf/replace-partition-point
cargo +1.85 bench --bench tli

The hand-written binary search performs much better in 1.82+.

Looking at flame graphs, you can see a whole chunk of PartialOrd that only occurs when using the std implementation in 1.85:

The text was updated successfully, but these errors were encountered:

marvin-j97 · 2025-03-22T01:53:14Z

Using

cargo bisect-rustc --start 2024-08-01 --end 2024-08-30 --prompt -- bench --bench tli

********************************************************************************
Regression in nightly-2024-08-03
********************************************************************************

get_commits_between returning commits, len: 8
commit[0] 2024-08-01: Auto merge of #127276 - aDotInTheVoid:no-opaque, r=camelid
commit[1] 2024-08-02: Auto merge of #127624 - Oneirical:a-test-of-lime, r=jieyouxu
commit[2] 2024-08-02: Auto merge of #128147 - lolbinarycat:fmt-write-bloat-rmake, r=jieyouxu
commit[3] 2024-08-02: Auto merge of #128529 - matthiaskrgr:rollup-gzq2slo, r=matthiaskrgr
commit[4] 2024-08-02: Auto merge of #128254 - Amanieu:orig-binary-search, r=tgross35
commit[5] 2024-08-02: Auto merge of #128352 - Oneirical:daLTOnist-vision, r=jieyouxu
commit[6] 2024-08-02: Auto merge of #127926 - Oneirical:classical-orctestra, r=jieyouxu
commit[7] 2024-08-02: Auto merge of #128361 - Oneirical:testle-deforestation, r=jieyouxu

moxian · 2025-03-22T02:06:13Z

that would be

Rewrite binary search implementation #128254

@rustbot labels: +S-has-bisection -E-needs-bisection +T-libs -needs-triage -regression-untriaged +regression-from-stable-to-stable

apiraino · 2025-03-24T14:36:57Z

cc @Amanieu do you have a comment on this regression? thanks

Amanieu · 2025-03-24T14:56:36Z

If you change the benchmark in lsm-tree to search for a random element instead of a fixed one then the new implementation is consistently faster:

diff --git a/benches/partition_point.rs b/benches/partition_point.rs
index 58d55bf..e2adcc4 100644
--- a/benches/partition_point.rs
+++ b/benches/partition_point.rs
@@ -7,12 +7,18 @@ fn bench_partition_point(c: &mut Criterion) {
     for item_count in [10, 100, 1_000, 10_000, 100_000, 1_000_000] {
         let items = (0..item_count).collect::<Vec<_>>();

+        let mut r = 0usize;
         group.bench_function(format!("native {item_count}"), |b| {
-            b.iter(|| items.partition_point(|&x| x <= 5_000))
+            r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+            let i = r % item_count;
+            b.iter(|| items.partition_point(|&x| x <= i))
         });

+        let mut r = 0usize;
         group.bench_function(format!("rewrite {item_count}"), |b| {
-            b.iter(|| partition_point(&items, |&x| x <= 5_000))
+            r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+            let i = r % item_count;
+            b.iter(|| partition_point(&items, |&x| x <= i))
         });
     }
 }

The reason for this is that the new implementation uses conditional move instructions instead of branches after a comparison. Branches are faster if the branch predictor can predict them accurately, but you suffer a large penalty on misses. Using conditional moves avoids this penalty but instead introduces a data dependency on the comparison. This prevents out-of-order CPUs from going ahead and performing later comparisons before the previous one has been resolved.

marvin-j97 · 2025-03-24T16:07:14Z

@Amanieu Do you think T can make a difference? For the TLI, which is an array of Vec<u8> essentially, and adjusted to use a random search key every iteration, I get:

main branch:

monkey patch:

Maybe there is no general purpose solution and I just have to stick to my version, but maybe there needs to be more investigation into how the binary_search performs for different types.

hanna-kruppe · 2025-03-24T16:30:09Z

The branch vs conditional move effects @Amanieu mentioned matter the most when each comparison requires few instructions (high throughput) relative to its latency. For comparing byte slices, if the branch-predictable outcome for most comparisons is that they bail out after a couple of bytes, that would be the case. Again, testing with random keys is likely more insightful than testing the same key over and over again.

marvin-j97 · 2025-03-24T16:44:37Z

I'm gonna close this for now, I think my monkey patch will work well enough for my use case.

marvin-j97 added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Mar 21, 2025

rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Mar 21, 2025

This comment has been minimized.

Sign in to view

rustbot added E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc I-slow Issue: Problems and improvements with respect to performance of generated code. labels Mar 22, 2025

marvin-j97 mentioned this issue Mar 22, 2025

perf: replace std partition_point fjall-rs/lsm-tree#112

Open

marvin-j97 closed this as completed Mar 24, 2025

jieyouxu removed regression-from-stable-to-stable Performance or correctness regression from one stable version to another. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std::slice::partition_point performance regression when using rustc 1.82+ #138796

std::slice::partition_point performance regression when using rustc 1.82+ #138796

marvin-j97 commented Mar 21, 2025 •

edited

Loading

This comment has been minimized.

marvin-j97 commented Mar 22, 2025 •

edited

Loading

moxian commented Mar 22, 2025 •

edited

Loading

apiraino commented Mar 24, 2025

Amanieu commented Mar 24, 2025

marvin-j97 commented Mar 24, 2025 •

edited

Loading

hanna-kruppe commented Mar 24, 2025

marvin-j97 commented Mar 24, 2025

std::slice::partition_point performance regression when using rustc 1.82+ #138796

std::slice::partition_point performance regression when using rustc 1.82+ #138796

Comments

marvin-j97 commented Mar 21, 2025 • edited Loading

This comment has been minimized.

marvin-j97 commented Mar 22, 2025 • edited Loading

moxian commented Mar 22, 2025 • edited Loading

apiraino commented Mar 24, 2025

Amanieu commented Mar 24, 2025

marvin-j97 commented Mar 24, 2025 • edited Loading

hanna-kruppe commented Mar 24, 2025

marvin-j97 commented Mar 24, 2025

marvin-j97 commented Mar 21, 2025 •

edited

Loading

marvin-j97 commented Mar 22, 2025 •

edited

Loading

moxian commented Mar 22, 2025 •

edited

Loading

marvin-j97 commented Mar 24, 2025 •

edited

Loading