-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std::slice::partition_point performance regression when using rustc 1.82+ #138796
Comments
This comment has been minimized.
This comment has been minimized.
Using
|
that would be @rustbot labels: +S-has-bisection -E-needs-bisection +T-libs -needs-triage -regression-untriaged +regression-from-stable-to-stable |
cc @Amanieu do you have a comment on this regression? thanks |
If you change the benchmark in lsm-tree to search for a random element instead of a fixed one then the new implementation is consistently faster: diff --git a/benches/partition_point.rs b/benches/partition_point.rs
index 58d55bf..e2adcc4 100644
--- a/benches/partition_point.rs
+++ b/benches/partition_point.rs
@@ -7,12 +7,18 @@ fn bench_partition_point(c: &mut Criterion) {
for item_count in [10, 100, 1_000, 10_000, 100_000, 1_000_000] {
let items = (0..item_count).collect::<Vec<_>>();
+ let mut r = 0usize;
group.bench_function(format!("native {item_count}"), |b| {
- b.iter(|| items.partition_point(|&x| x <= 5_000))
+ r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+ let i = r % item_count;
+ b.iter(|| items.partition_point(|&x| x <= i))
});
+ let mut r = 0usize;
group.bench_function(format!("rewrite {item_count}"), |b| {
- b.iter(|| partition_point(&items, |&x| x <= 5_000))
+ r = r.wrapping_mul(1664525).wrapping_add(1013904223);
+ let i = r % item_count;
+ b.iter(|| partition_point(&items, |&x| x <= i))
});
}
} The reason for this is that the new implementation uses conditional move instructions instead of branches after a comparison. Branches are faster if the branch predictor can predict them accurately, but you suffer a large penalty on misses. Using conditional moves avoids this penalty but instead introduces a data dependency on the comparison. This prevents out-of-order CPUs from going ahead and performing later comparisons before the previous one has been resolved. |
@Amanieu Do you think T can make a difference? For the TLI, which is an array of Maybe there is no general purpose solution and I just have to stick to my version, but maybe there needs to be more investigation into how the |
The branch vs conditional move effects @Amanieu mentioned matter the most when each comparison requires few instructions (high throughput) relative to its latency. For comparing byte slices, if the branch-predictable outcome for most comparisons is that they bail out after a couple of bytes, that would be the case. Again, testing with random keys is likely more insightful than testing the same key over and over again. |
I'm gonna close this for now, I think my monkey patch will work well enough for my use case. |
In https://github.com/fjall-rs/lsm-tree I am heavily using
partition_point
for various binary searches.I found the std implementation to be somewhat slow.
As you can see below, the performance regression happens between rustc 1.81 and 1.82.
Comparing std between rustc versions:
The std implementation performs much worse in rustc 1.82 compared to 1.81.
As a sanity check, I wrote a simple alternative and found it performed much better (comparable to how
partition_point
used to be):https://github.com/fjall-rs/lsm-tree/blob/0dca39eb54690beee2481cc3e31c9fc78d9e3fca/src/binary_search.rs#L9-L35
Reproduce (using monkey patch):
The hand-written binary search performs much better in 1.82+.
Looking at flame graphs, you can see a whole chunk of
PartialOrd
that only occurs when using the std implementation in 1.85:The text was updated successfully, but these errors were encountered: