Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std::slice::partition_point performance regression when using rustc 1.82+ #138796

Closed
marvin-j97 opened this issue Mar 21, 2025 · 8 comments
Closed
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. S-has-bisection Status: a bisection has been found for this issue T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@marvin-j97
Copy link

marvin-j97 commented Mar 21, 2025

In https://github.com/fjall-rs/lsm-tree I am heavily using partition_point for various binary searches.

I found the std implementation to be somewhat slow.

As you can see below, the performance regression happens between rustc 1.81 and 1.82.


Comparing std between rustc versions:

git clone https://github.com/fjall-rs/lsm-tree && cd lsm-tree
cargo +1.81 bench --bench tli
cargo +1.82 bench --bench tli

Image

The std implementation performs much worse in rustc 1.82 compared to 1.81.


As a sanity check, I wrote a simple alternative and found it performed much better (comparable to how partition_point used to be):

https://github.com/fjall-rs/lsm-tree/blob/0dca39eb54690beee2481cc3e31c9fc78d9e3fca/src/binary_search.rs#L9-L35

Reproduce (using monkey patch):

git clone https://github.com/fjall-rs/lsm-tree && cd lsm-tree
cargo +1.85 bench --bench tli
git checkout perf/replace-partition-point
cargo +1.85 bench --bench tli

Image

The hand-written binary search performs much better in 1.82+.

Looking at flame graphs, you can see a whole chunk of PartialOrd that only occurs when using the std implementation in 1.85:

Image

Image

@marvin-j97 marvin-j97 added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Mar 21, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Mar 21, 2025
@moxian

This comment has been minimized.

@rustbot rustbot added E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc I-slow Issue: Problems and improvements with respect to performance of generated code. labels Mar 22, 2025
@marvin-j97
Copy link
Author

marvin-j97 commented Mar 22, 2025

Using

cargo bisect-rustc --start 2024-08-01 --end 2024-08-30 --prompt -- bench --bench tli
********************************************************************************
Regression in nightly-2024-08-03
********************************************************************************

get_commits_between returning commits, len: 8
commit[0] 2024-08-01: Auto merge of #127276 - aDotInTheVoid:no-opaque, r=camelid
commit[1] 2024-08-02: Auto merge of #127624 - Oneirical:a-test-of-lime, r=jieyouxu
commit[2] 2024-08-02: Auto merge of #128147 - lolbinarycat:fmt-write-bloat-rmake, r=jieyouxu
commit[3] 2024-08-02: Auto merge of #128529 - matthiaskrgr:rollup-gzq2slo, r=matthiaskrgr
commit[4] 2024-08-02: Auto merge of #128254 - Amanieu:orig-binary-search, r=tgross35
commit[5] 2024-08-02: Auto merge of #128352 - Oneirical:daLTOnist-vision, r=jieyouxu
commit[6] 2024-08-02: Auto merge of #127926 - Oneirical:classical-orctestra, r=jieyouxu
commit[7] 2024-08-02: Auto merge of #128361 - Oneirical:testle-deforestation, r=jieyouxu

@moxian
Copy link
Contributor

moxian commented Mar 22, 2025

that would be

@rustbot labels: +S-has-bisection -E-needs-bisection +T-libs -needs-triage -regression-untriaged +regression-from-stable-to-stable

@rustbot rustbot added S-has-bisection Status: a bisection has been found for this issue T-libs Relevant to the library team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. regression-untriaged Untriaged performance or correctness regression. labels Mar 22, 2025
@apiraino
Copy link
Contributor

cc @Amanieu do you have a comment on this regression? thanks

@Amanieu
Copy link
Member

Amanieu commented Mar 24, 2025

If you change the benchmark in lsm-tree to search for a random element instead of a fixed one then the new implementation is consistently faster:

diff --git a/benches/partition_point.rs b/benches/partition_point.rs
index 58d55bf..e2adcc4 100644
--- a/benches/partition_point.rs
+++ b/benches/partition_point.rs
@@ -7,12 +7,18 @@ fn bench_partition_point(c: &mut Criterion) {
     for item_count in [10, 100, 1_000, 10_000, 100_000, 1_000_000] {
         let items = (0..item_count).collect::<Vec<_>>();

+        let mut r = 0usize;
         group.bench_function(format!("native {item_count}"), |b| {
-            b.iter(|| items.partition_point(|&x| x <= 5_000))
+            r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+            let i = r % item_count;
+            b.iter(|| items.partition_point(|&x| x <= i))
         });

+        let mut r = 0usize;
         group.bench_function(format!("rewrite {item_count}"), |b| {
-            b.iter(|| partition_point(&items, |&x| x <= 5_000))
+            r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+            let i = r % item_count;
+            b.iter(|| partition_point(&items, |&x| x <= i))
         });
     }
 }

The reason for this is that the new implementation uses conditional move instructions instead of branches after a comparison. Branches are faster if the branch predictor can predict them accurately, but you suffer a large penalty on misses. Using conditional moves avoids this penalty but instead introduces a data dependency on the comparison. This prevents out-of-order CPUs from going ahead and performing later comparisons before the previous one has been resolved.

@marvin-j97
Copy link
Author

marvin-j97 commented Mar 24, 2025

@Amanieu Do you think T can make a difference? For the TLI, which is an array of Vec<u8> essentially, and adjusted to use a random search key every iteration, I get:

main branch:
Image

monkey patch:
Image

Maybe there is no general purpose solution and I just have to stick to my version, but maybe there needs to be more investigation into how the binary_search performs for different types.

@hanna-kruppe
Copy link
Contributor

The branch vs conditional move effects @Amanieu mentioned matter the most when each comparison requires few instructions (high throughput) relative to its latency. For comparing byte slices, if the branch-predictable outcome for most comparisons is that they bail out after a couple of bytes, that would be the case. Again, testing with random keys is likely more insightful than testing the same key over and over again.

@marvin-j97
Copy link
Author

I'm gonna close this for now, I think my monkey patch will work well enough for my use case.

@jieyouxu jieyouxu removed regression-from-stable-to-stable Performance or correctness regression from one stable version to another. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. S-has-bisection Status: a bisection has been found for this issue T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants