-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rustdoc search is excruciatingly slow on very large crates #131156
Comments
We should probably do automatic benchmarking of rustdoc search similar to how its done with rustc, it's very easy to accidentally cause huge perf regressions by adding something significant to an inner loop. i've also often wondered if it would be worth it to start using wasm in rustdoc search... |
There is benchmarking for rustdoc search, but it's only run when we make changes to it (and its also mostly handled by @notriddle ). |
https://gitlab.com/notriddle/rustdoc-js-profile is where the current effort is. |
And here's where we might find some guidance for a faster name matching tactic https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html https://fossies.org/linux/lucene/gradle/generation/moman/createLevAutomata.py |
…=GuillaumeGomez rustdoc: use a trie for name-based search Potentially rust-lang#131156 — need to try reproducing the problem with `windows` Preview and profiler results ---------------------------- Here's some quick profiling in Firefox done on the rust compiler docs: - Before: https://share.firefox.dev/3UPm3M8 - After: https://share.firefox.dev/40LXvYb Here's the results for the node.js profiler: - https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly. - https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html [the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/ The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary. The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie. [Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf [Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java [moman]: https://gitlab.com/notriddle/moman-rustdoc User-visible changes -------------------- I don't expect anybody to notice anything, but it does cause two changes: - Substring matches, in the middle of a name, only apply if there's three or more characters in the search query. - Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them. - It uses more RAM. - It's faster (assuming you don't swap thrash). [^1]: technically utf-16 code units
Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez rustdoc: use a trie for name-based search Potentially rust-lang#131156 — need to try reproducing the problem with `windows` Preview and profiler results ---------------------------- Here's some quick profiling in Firefox done on the rust compiler docs: - Before: https://share.firefox.dev/3UPm3M8 - After: https://share.firefox.dev/40LXvYb Here's the results for the node.js profiler: - https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly. - https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html [the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/ The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary. The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie. [Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf [Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java [moman]: https://gitlab.com/notriddle/moman-rustdoc User-visible changes -------------------- I don't expect anybody to notice anything, but it does cause two changes: - Substring matches, in the middle of a name, only apply if there's three or more characters in the search query. - Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them. - It uses more RAM. - It's faster (assuming you don't swap thrash). [^1]: technically utf-16 code units
|
From what I could find online, IndexedDB doesn't seem to work on local files, so it's not compatible with rustdoc requirements. |
Am I correct in assuming that the idea with indexdb is that we would still have to load the 28MiB JSON file into memory in order to populate the db in the first place, but once it was in there we could reuse it without having to load it again? |
I suppose the idea is that the IndexedDB database would be readily queryable. So not only avoids the 38 MB download (although browsers already have caching functionality for that so it wouldn't be necessary anyway), it avoids the call to buildIndex. But it would not help with the edit distance issue. I think ideally the index could be shipped in a format that requires no further client-side processing before it can be used, and relies on browser caching to avoid repeated downloads. Perhaps data structures that facilitate efficient search could be embedded directly in the file (at the risk of inflating it). |
Does IndexedDB offer such a format? What I'm finding in MDN are functions to manipulate POJOs, which means we still need to deserialize the JSON before it can be inserted into the local database. |
Not strictly related but also worth mentioning that the generation of those Windows docs is manual (typically by me) as it requires 20GB+ of memory and generates ~quarter million files that need to get checked in. |
Steps to reproduce:
Expected behavior:
rustdoc quickly shows matching results while typing and when done typing.
Actual behavior:
Web page slows down to a crawl after every delayed keystroke. This is not just due to the search index download, which takes me 1.4 s and can probably not be improved much, but also due to matching the search query against the search index once the download is finished.
Here is Firefox Profiler output: https://share.firefox.dev/4eOgrcE. Of interest are the yellow areas in the graph (the salmon ones are mostly idle GC). Some observations:
convertNameToId
, which is documented to be used only for the In Parameters and In Return Types tabs. If the user is not going to click on those tabs then perhaps these calls are not necessary.The text was updated successfully, but these errors were encountered: