
Commit 037df45

Unrolled build for #144476
Rollup merge of #144476 - notriddle:notriddle/stringdex, r=lolbinarycat,GuillaumeGomez

rustdoc-search: search backend with partitioned suffix tree

Before:

- https://notriddle.com/windows-docs-rs/doc-old/windows/
- https://doc.rust-lang.org/1.89.0/std/index.html
- https://doc.rust-lang.org/1.89.0/nightly-rustc/rustc_hir/index.html

After:

- https://notriddle.com/windows-docs-rs/doc/windows/
- https://notriddle.com/rustdoc-html-demo-12/stringdex/doc/std/index.html
- https://notriddle.com/rustdoc-html-demo-12/stringdex/compiler-doc/rustc_hir/index.html

## Summary

Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called [stringdex](https://crates.io/crates/stringdex), that allows it to perform modified-levenshtein distance calculations, substring matches, and prefix matches in a reasonably efficient, and, more importantly, *incremental* algorithm.

## Motivation

Fixes #131156

While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the 90's.

Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever [place new matches on top of old ones](https://web.dev/articles/cls). It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.

## Explanation

A longer description of how name search works can be found in stringdex's [HACKING.md](https://gitlab.com/notriddle/stringdex/-/blob/main/HACKING.md) file.

Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.
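The three type-search stages named above (per-element name search, bitmap narrowing, unification) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in stringdex or search.js: `Bitmap`, `narrow`, `ids`, and `search` are hypothetical names, a plain `u64` stands in for whatever bitmap representation the real index uses, and the unification step is reduced to a caller-supplied predicate.

```rust
/// Hypothetical stand-in for the index's bitmaps: bit `i` is set when
/// function number `i` mentions the type that a name search matched.
type Bitmap = u64;

/// Stage 2: intersect the per-element bitmaps, so that only functions
/// mentioning every queried type name remain candidates.
fn narrow(per_element: &[Bitmap]) -> Bitmap {
    per_element.iter().fold(Bitmap::MAX, |acc, b| acc & *b)
}

/// Expand a bitmap into the function ids it contains, in ascending order.
fn ids(bitmap: Bitmap) -> Vec<u32> {
    (0..64u32).filter(|&i| (bitmap & (1u64 << i)) != 0).collect()
}

/// Stage 3: run the expensive check (here an arbitrary predicate standing in
/// for type unification) only on the candidates that survived narrowing.
fn search(per_element: &[Bitmap], unifies: impl Fn(u32) -> bool) -> Vec<u32> {
    ids(narrow(per_element))
        .into_iter()
        .filter(|&id| unifies(id))
        .collect()
}
```

The point of the ordering is that the cheap bitwise intersection discards most functions before the per-candidate check runs at all.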
## Drawbacks

It's rather complex, and takes up more disk space than the current flat list of strings.

## Rationale and alternatives

Instead of a suffix tree, I could've used a different [approximate matching data structure](https://en.wikipedia.org/wiki/Approximate_string_matching). I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match [oepn](https://doc.rust-lang.org/nightly/std/?search=oepn) like the current system does).

## Prior art

[Sherlodoc](https://github.com/art-w/sherlodoc) is based on a similar concept, but they:

- use edit distance over a suffix tree for type-based search, instead of the binary matching that's implemented here
- use substring matching for name-based search, [but not fuzzy name matching](art-w/sherlodoc#21)
- actually implement body text search, which is a potential-future feature, but not implemented in this PR

## Future possibilities

### Low-level optimization in stringdex

There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.

- Stringdex decides whether to bundle two nodes into the same file based on size. To figure out a node's size, I have to run compression on it. This is probably slower than it needs to be.
- Stack compression is limited to the same 256-slot sliding windows as backref compression, and it doesn't have to be. (stack and backref compression are used to optimize the representation of the edge pointer from a parent node to its child; backref uses one byte, while stack is entirely implicit)
- The JS-side decoder is pretty naive. It performs unnecessary hash table lookups when decoding compressed nodes, and retains a list of hashes that it doesn't need. It needs to calculate the hashes in order to construct the merkle tree correctly, but it doesn't need to keep them.
- Data compression happens at the end, while emitting the node. This means it's not being counted when deciding on how to bundle, which is pretty dumb.

### Improved recall in type-driven search

Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.

What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both offer avenues for improvement, but edit distance is going to be faster while type checking is going to be more precise.

For example, a type theoretical improvement would fix [`Iterator<T>, (T -> U) -> Iterator<U>`](https://doc.rust-lang.org/nightly/std/?search=Iterator%3CT%3E%2C%20(T%20-%3E%20U)%20-%3E%20Iterator%3CU%3E) to give [`Iterator::map`](https://doc.rust-lang.org/nightly/std/iter/trait.Iterator.html#method.map), because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.

## Full-text search

Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.

Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:

- You have to pick between index bloat and the use of stopwords. Stopwords are bad because they might actually be important (try searching "if let" in mdBook if you're feeling brave), but without them, you spend a lot of space on text that doesn't matter.
- Example code also tends to have a lot of irrelevant stuff in it. As with stopwords, the choice is between potentially confusing behavior and bloat.

Neither of these problems is a deal-breaker, but they're worth keeping in mind.
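The rationale's remark that a straightforward trigram index won't match `oepn` can be checked with a small sketch. The PR says only "modified-levenshtein" and does not spell out the modification, so treating an adjacent swap as a single edit (the "optimal string alignment" variant) is an assumption made here for illustration; `trigrams` and `osa_distance` are hypothetical helper names, not stringdex APIs.

```rust
use std::collections::HashSet;

/// All overlapping 3-character windows of `s`.
fn trigrams(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Edit distance in which insertion, deletion, substitution, and swapping
/// two adjacent characters each cost 1 (optimal string alignment).
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // d[i][j] = distance between the first i chars of a and first j of b.
    let mut d = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() {
        d[i][0] = i;
    }
    for j in 0..=b.len() {
        d[0][j] = j;
    }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            d[i][j] = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);
            // Adjacent transposition: "ep" <-> "pe" counts as one edit.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                d[i][j] = d[i][j].min(d[i - 2][j - 2] + 1);
            }
        }
    }
    d[a.len()][b.len()]
}
```

`oepn` yields the trigrams `oep` and `epn`, while `open` yields `ope` and `pen`, so a pure trigram lookup finds no shared key, even though the two strings are a single adjacent swap apart under the distance above.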
2 parents b96868f + 79a40c9 commit 037df45

148 files changed (+9088, -5055 lines changed)

Cargo.lock

Lines changed: 10 additions & 0 deletions
@@ -4812,6 +4812,7 @@ dependencies = [
  "serde_json",
  "sha2",
  "smallvec",
+ "stringdex",
  "tempfile",
  "threadpool",
  "tracing",
@@ -5225,6 +5226,15 @@ dependencies = [
  "quote",
 ]
 
+[[package]]
+name = "stringdex"
+version = "0.0.1-alpha4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2841fd43df5b1ff1b042e167068a1fe9b163dc93041eae56ab2296859013a9a0"
+dependencies = [
+ "stacker",
+]
+
 [[package]]
 name = "strsim"
 version = "0.11.1"
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-8.6.0
+8.57.1

src/etc/htmldocck.py

Lines changed: 6 additions & 0 deletions
@@ -15,6 +15,7 @@
 import re
 import shlex
 from collections import namedtuple
+from pathlib import Path
 
 try:
     from html.parser import HTMLParser
@@ -242,6 +243,11 @@ def resolve_path(self, path):
             return self.last_path
 
     def get_absolute_path(self, path):
+        if "*" in path:
+            paths = list(Path(self.root).glob(path))
+            if len(paths) != 1:
+                raise FailedCheck("glob path does not resolve to one file")
+            path = str(paths[0])
         return os.path.join(self.root, path)
 
     def get_file(self, path):

src/librustdoc/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ rustdoc-json-types = { path = "../rustdoc-json-types" }
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1.0"
 smallvec = "1.8.1"
+stringdex = { version = "0.0.1-alpha4" }
 tempfile = "3"
 threadpool = "1.8.1"
 tracing = "0.1"

src/librustdoc/build.rs

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ fn main() {
         "static/css/normalize.css",
         "static/js/main.js",
         "static/js/search.js",
+        "static/js/stringdex.js",
         "static/js/settings.js",
         "static/js/src-script.js",
         "static/js/storage.js",

src/librustdoc/formats/cache.rs

Lines changed: 2 additions & 4 deletions
@@ -1,6 +1,5 @@
 use std::mem;
 
-use rustc_ast::join_path_syms;
 use rustc_data_structures::fx::{FxHashMap, FxHashSet, FxIndexMap, FxIndexSet};
 use rustc_hir::StabilityLevel;
 use rustc_hir::def_id::{CrateNum, DefId, DefIdMap, DefIdSet};
@@ -574,7 +573,6 @@ fn add_item_to_search_index(tcx: TyCtxt<'_>, cache: &mut Cache, item: &clean::It
         clean::ItemKind::ImportItem(import) => import.source.did.unwrap_or(item_def_id),
         _ => item_def_id,
     };
-    let path = join_path_syms(parent_path);
     let impl_id = if let Some(ParentStackItem::Impl { item_id, .. }) = cache.parent_stack.last() {
         item_id.as_def_id()
     } else {
@@ -593,11 +591,11 @@ fn add_item_to_search_index(tcx: TyCtxt<'_>, cache: &mut Cache, item: &clean::It
         ty: item.type_(),
         defid: Some(defid),
         name,
-        path,
+        module_path: parent_path.to_vec(),
         desc,
         parent: parent_did,
         parent_idx: None,
-        exact_path: None,
+        exact_module_path: None,
         impl_id,
         search_type,
         aliases,

src/librustdoc/formats/item_type.rs

Lines changed: 51 additions & 1 deletion
@@ -4,7 +4,7 @@ use std::fmt;
 
 use rustc_hir::def::{CtorOf, DefKind, MacroKinds};
 use rustc_span::hygiene::MacroKind;
-use serde::{Serialize, Serializer};
+use serde::{Deserialize, Deserializer, Serialize, Serializer, de};
 
 use crate::clean;
 
@@ -68,6 +68,52 @@ impl Serialize for ItemType {
     }
 }
 
+impl<'de> Deserialize<'de> for ItemType {
+    fn deserialize<D>(deserializer: D) -> Result<ItemType, D::Error>
+    where
+        D: Deserializer<'de>,
+    {
+        struct ItemTypeVisitor;
+        impl<'de> de::Visitor<'de> for ItemTypeVisitor {
+            type Value = ItemType;
+            fn expecting(&self, formatter: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+                write!(formatter, "an integer between 0 and 25")
+            }
+            fn visit_u64<E: de::Error>(self, v: u64) -> Result<ItemType, E> {
+                Ok(match v {
+                    0 => ItemType::Keyword,
+                    1 => ItemType::Primitive,
+                    2 => ItemType::Module,
+                    3 => ItemType::ExternCrate,
+                    4 => ItemType::Import,
+                    5 => ItemType::Struct,
+                    6 => ItemType::Enum,
+                    7 => ItemType::Function,
+                    8 => ItemType::TypeAlias,
+                    9 => ItemType::Static,
+                    10 => ItemType::Trait,
+                    11 => ItemType::Impl,
+                    12 => ItemType::TyMethod,
+                    13 => ItemType::Method,
+                    14 => ItemType::StructField,
+                    15 => ItemType::Variant,
+                    16 => ItemType::Macro,
+                    17 => ItemType::AssocType,
+                    18 => ItemType::Constant,
+                    19 => ItemType::AssocConst,
+                    20 => ItemType::Union,
+                    21 => ItemType::ForeignType,
+                    23 => ItemType::ProcAttribute,
+                    24 => ItemType::ProcDerive,
+                    25 => ItemType::TraitAlias,
+                    _ => return Err(E::missing_field("unknown number")),
+                })
+            }
+        }
+        deserializer.deserialize_any(ItemTypeVisitor)
+    }
+}
+
 impl<'a> From<&'a clean::Item> for ItemType {
     fn from(item: &'a clean::Item) -> ItemType {
         let kind = match &item.kind {
@@ -198,6 +244,10 @@ impl ItemType {
     pub(crate) fn is_adt(&self) -> bool {
         matches!(self, ItemType::Struct | ItemType::Union | ItemType::Enum)
     }
+    /// Keep this the same as isFnLikeTy in search.js
+    pub(crate) fn is_fn_like(&self) -> bool {
+        matches!(self, ItemType::Function | ItemType::Method | ItemType::TyMethod)
+    }
 }
 
 impl fmt::Display for ItemType {
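The integer-to-variant mapping added in `visit_u64` can be illustrated without serde by a reduced sketch. The `Kind` enum below is a hypothetical mirror of only a handful of `ItemType` variants, and `TryFrom<u64>` stands in for the visitor. It uses the same tag numbers as the diff and, like the diff, rejects any tag without an arm (the diff has no arm for 22).

```rust
use std::convert::TryFrom;

/// Reduced, hypothetical mirror of a few `ItemType` variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Struct,
    Function,
    TyMethod,
    Method,
    TraitAlias,
}

impl TryFrom<u64> for Kind {
    type Error = String;

    /// Map a serialized tag back to a variant, using the numbers from the diff.
    fn try_from(v: u64) -> Result<Self, Self::Error> {
        Ok(match v {
            5 => Kind::Struct,
            7 => Kind::Function,
            12 => Kind::TyMethod,
            13 => Kind::Method,
            25 => Kind::TraitAlias,
            // Every tag without an arm is an error rather than a default.
            other => return Err(format!("unknown item type tag {other}")),
        })
    }
}

impl Kind {
    /// Same shape as `is_fn_like` in the diff: free functions, methods,
    /// and required trait methods all count as function-like.
    fn is_fn_like(self) -> bool {
        matches!(self, Kind::Function | Kind::Method | Kind::TyMethod)
    }
}
```

Returning an error for unmapped tags keeps a stale or corrupted index from silently decoding into the wrong item kind.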

src/librustdoc/html/layout.rs

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ pub(crate) struct Layout {
 
 pub(crate) struct Page<'a> {
     pub(crate) title: &'a str,
+    pub(crate) short_title: &'a str,
     pub(crate) css_class: &'a str,
     pub(crate) root_path: &'a str,
     pub(crate) static_root_path: Option<&'a str>,

src/librustdoc/html/render/context.rs

Lines changed: 14 additions & 0 deletions
@@ -204,6 +204,18 @@ impl<'tcx> Context<'tcx> {
         if !is_module {
             title.push_str(it.name.unwrap().as_str());
         }
+        let short_title;
+        let short_title = if is_module {
+            let module_name = self.current.last().unwrap();
+            short_title = if it.is_crate() {
+                format!("Crate {module_name}")
+            } else {
+                format!("Module {module_name}")
+            };
+            &short_title[..]
+        } else {
+            it.name.as_ref().unwrap().as_str()
+        };
         if !it.is_primitive() && !it.is_keyword() {
             if !is_module {
                 title.push_str(" in ");
@@ -240,6 +252,7 @@ impl<'tcx> Context<'tcx> {
                 root_path: &self.root_path(),
                 static_root_path: self.shared.static_root_path.as_deref(),
                 title: &title,
+                short_title,
                 description: &desc,
                 resource_suffix: &self.shared.resource_suffix,
                 rust_logo: has_doc_flag(self.tcx(), LOCAL_CRATE.as_def_id(), sym::rust_logo),
@@ -617,6 +630,7 @@ impl<'tcx> FormatRenderer<'tcx> for Context<'tcx> {
         let shared = &self.shared;
         let mut page = layout::Page {
             title: "List of all items in this crate",
+            short_title: "All",
             css_class: "mod sys",
             root_path: "../",
             static_root_path: shared.static_root_path.as_deref(),