rustdoc search is excruciatingly slow on very large crates #131156

zopsicle · 2024-10-02T15:08:00Z

Steps to reproduce:

1. Use Firefox on a computer with Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz or comparable CPU.
2. Go to https://microsoft.github.io/windows-docs-rs. This rustdoc has a search index of 38 MB.
3. Focus on the search bar and slowly type "ID3D12GraphicsPipelineState".

Expected behavior:

rustdoc quickly shows matching results while typing and when done typing.

Actual behavior:

The web page slows to a crawl after every delayed keystroke. This is not just due to the search index download, which takes me 1.4 s and probably cannot be improved much, but also due to matching the search query against the search index once the download has finished.

Here is Firefox Profiler output: https://share.firefox.dev/4eOgrcE. Of interest are the yellow areas in the graph (the salmon ones are mostly idle GC). Some observations:

- buildIndex takes about a second.
- A lot of time seems to be spent in calculating edit distance. This is definitely a useful feature (as can be seen in the example query, "ID3D12GraphicsPipelineState", which has relevant but no exact matches).
- Some of the edit distance calculations are done by convertNameToId, which is documented to be used only for the In Parameters and In Return Types tabs. If the user is not going to click on those tabs then perhaps these calls are not necessary.
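To illustrate why edit distance dominates the profile, here is a minimal Python sketch of a bounded Levenshtein distance and a linear scan over all names. It is not rustdoc's actual JavaScript implementation; `edit_distance`, `fuzzy_scan`, and the `limit` parameter are hypothetical names chosen for this example. The point is that every keystroke pays roughly one dynamic-programming table per name in the index.

```python
def edit_distance(a: str, b: str, limit: int) -> int:
    """Levenshtein distance between a and b, or limit + 1 if it exceeds limit."""
    # A length gap larger than the limit can never be bridged.
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        # Row minima never decrease, so we can stop early.
        if min(current) > limit:
            return limit + 1
        previous = current
    return min(previous[-1], limit + 1)


def fuzzy_scan(query: str, names: list[str], limit: int) -> list[str]:
    """Naive search: one distance computation per name, per keystroke."""
    return [name for name in names if edit_distance(query, name, limit) <= limit]
```

With a 38 MB index, a scan like `fuzzy_scan` runs over every item name on each keystroke, which is consistent with the hot spots in the profile above.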

lolbinarycat · 2024-10-19T18:28:51Z

We should probably do automatic benchmarking of rustdoc search, similar to how it's done with rustc. It's very easy to accidentally cause huge perf regressions by adding something significant to an inner loop.

I've also often wondered if it would be worth it to start using wasm in rustdoc search...

GuillaumeGomez · 2024-10-24T16:30:23Z

There is benchmarking for rustdoc search, but it's only run when we make changes to it (and it's also mostly handled by @notriddle).

notriddle · 2024-11-02T02:37:39Z

> There is benchmarking for rustdoc search

https://gitlab.com/notriddle/rustdoc-js-profile is where the current effort is.

notriddle · 2024-11-02T02:51:48Z

And here's where we might find some guidance for a faster name-matching tactic: https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

https://fossies.org/linux/lucene/gradle/generation/moman/createLevAutomata.py

https://github.com/jpbarrette/moman

Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez

rustdoc: use a trie for name-based search

Potentially rust-lang#131156; need to try reproducing the problem with `windows`

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`.

As suggested in the paper, a trie is used to represent the FSA dictionary. The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).

[^1]: technically utf-16 code units
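The three lookups described in that commit message (prefix, fuzzy, and substring) can be sketched roughly as follows. This is a simplified illustration in Python, not the code from the PR: it swaps the parametric Levenshtein automaton for the simpler technique of carrying one dynamic-programming row down the trie, and its three-character table points at whole names rather than at trie nodes. All class and function names here are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """All stored words starting with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(node, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                out.append(text)
            stack.extend((child, text + ch) for ch, child in node.children.items())
        return sorted(out)

    def fuzzy(self, query, limit=2):
        """(distance, word) pairs within `limit` edits of query (non-empty words only)."""
        out = []
        first_row = list(range(len(query) + 1))
        stack = [(child, ch, first_row) for ch, child in self.root.children.items()]
        while stack:
            node, text, prev = stack.pop()
            ch = text[-1]
            row = [prev[0] + 1]
            for j, qc in enumerate(query, start=1):
                row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + (qc != ch)))
            if node.is_word and row[-1] <= limit:
                out.append((row[-1], text))
            # Prune: once every cell exceeds the limit, no descendant can match.
            if min(row) <= limit:
                stack.extend((child, text + c, row) for c, child in node.children.items())
        return sorted(out)


def trigram_table(names):
    """Side table mapping each three-character window to the names containing it."""
    table = {}
    for name in names:
        for i in range(len(name) - 2):
            table.setdefault(name[i:i + 3], set()).add(name)
    return table


def substring_search(table, query):
    """Substring matches; only applies to queries of three or more characters."""
    if len(query) < 3:
        return []
    return sorted(n for n in table.get(query[:3], set()) if query in n)
```

Pruning in `fuzzy` is what makes the cap of two matter: whole subtrees are skipped as soon as no cell of the row is within the limit. Under the old w/3 rule, a 27-character query such as "ID3D12GraphicsPipelineState" would have allowed a distance of 9, which leaves far less to prune.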

workingjubilee · 2025-02-13T22:09:42Z

From @abgros in #136990:

Microsoft's Windows crate has its documentation generated with rustdoc which can be found at https://microsoft.github.io/windows-docs-rs/doc/windows/.

Trying to use the search bar on this documentation causes the browser to hang, due to the site trying to load a 28 MB JSON file (the problematic code is at https://microsoft.github.io/windows-docs-rs/doc/search-index.js). On my laptop this causes the fans to spin up like jet engines, followed by (about half the time) the tab crashing with an out of memory error.

Instead of a massive JSON file, I would recommend using the IndexedDB API, which also has the benefit of not needing to be reloaded every time the user visits the site.

GuillaumeGomez · 2025-02-14T10:03:07Z

From what I could find online, IndexedDB doesn't seem to work on local files, so it's not compatible with rustdoc requirements.

notriddle · 2025-02-14T16:56:09Z

Am I correct in assuming that the idea with IndexedDB is that we would still have to load the 28MiB JSON file into memory in order to populate the db in the first place, but once it was in there we could reuse it without having to load it again?

zopsicle · 2025-02-16T11:55:35Z

I suppose the idea is that the IndexedDB database would be readily queryable. So it not only avoids the 38 MB download (although browsers already have caching functionality for that, so it wouldn't be necessary anyway), it also avoids the call to buildIndex. But it would not help with the edit distance issue.

I think ideally the index could be shipped in a format that requires no further client-side processing before it can be used, and relies on browser caching to avoid repeated downloads. Perhaps data structures that facilitate efficient search could be embedded directly in the file (at the risk of inflating it).
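As a rough illustration of what "no further client-side processing" could look like, here is a minimal Python sketch of a packed index that is searched directly in its serialized form. The layout (a count, an offset table, then sorted UTF-8 names) and the names `build_index` and `PackedIndex` are assumptions made up for this example; this is not rustdoc's format. In a browser, the same idea would presumably use an `ArrayBuffer` read through a `DataView`.

```python
import struct


def build_index(names):
    """Serialize names as: u32 count, (count + 1) u32 offsets, sorted UTF-8 bytes."""
    encoded = sorted(n.encode("utf-8") for n in names)
    offsets, pos = [], 0
    for e in encoded:
        offsets.append(pos)
        pos += len(e)
    offsets.append(pos)  # end sentinel
    header = struct.pack("<I", len(encoded))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(encoded)


class PackedIndex:
    """Queries the blob in place; nothing is parsed into objects up front."""

    def __init__(self, blob):
        self.view = memoryview(blob)
        (self.count,) = struct.unpack_from("<I", self.view, 0)
        self.data_start = 4 + 4 * (self.count + 1)

    def name_at(self, i):
        start, end = struct.unpack_from("<II", self.view, 4 + 4 * i)
        return bytes(self.view[self.data_start + start:self.data_start + end])

    def prefix_range(self, prefix):
        """Binary search for the first name >= prefix, then collect matches."""
        p = prefix.encode("utf-8")
        lo, hi = 0, self.count
        while lo < hi:
            mid = (lo + hi) // 2
            if self.name_at(mid) < p:
                lo = mid + 1
            else:
                hi = mid
        out = []
        while lo < self.count and self.name_at(lo).startswith(p):
            out.append(self.name_at(lo).decode("utf-8"))
            lo += 1
        return out
```

The trade-off mentioned above applies: embedding search structures such as offset tables makes the file larger, in exchange for skipping a build step like buildIndex on the client.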

notriddle · 2025-02-17T15:48:04Z

> I think ideally the index could be shipped in a format that requires no further client-side processing before it can be used

Does IndexedDB offer such a format? What I'm finding in MDN are functions to manipulate POJOs, which means we still need to deserialize the JSON before it can be inserted into the local database.

riverar · 2025-02-24T17:15:56Z

Not strictly related, but also worth mentioning: the generation of those Windows docs is manual (typically by me), as it requires 20GB+ of memory and generates roughly a quarter million files that need to get checked in.

zopsicle added the C-bug Category: This is a bug. label Oct 2, 2024

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 2, 2024

lolbinarycat mentioned this issue Oct 24, 2024

"single tab mode" for rustdoc search #132110

Open

notriddle mentioned this issue Nov 13, 2024

rustdoc: use a trie for name-based search #133005

Merged

cyrgani mentioned this issue Feb 13, 2025

Searching the windows crate uses massive amounts of CPU and memory #136990

Closed

workingjubilee marked this as a duplicate of #136990 Feb 13, 2025

ChrisDenton mentioned this issue Feb 24, 2025

Clicking on search will crash browser microsoft/windows-rs#3509

Closed

kennykerr mentioned this issue Mar 21, 2025

Docs' index seems broken and crashes web tabs microsoft/windows-rs#3555

Closed
