From c2286e84518de122e4aa68e7c57f48020ed99dd6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 12:01:35 +0100 Subject: [PATCH 01/24] docfind blog post first draft --- blogs/2026/01/docfind.md | 204 ++++++++++++++++++++++++++++++++++++++ blogs/2026/01/docfind.mp4 | 3 + 2 files changed, 207 insertions(+) create mode 100644 blogs/2026/01/docfind.md create mode 100644 blogs/2026/01/docfind.mp4 diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md new file mode 100644 index 0000000000..d3c1e31449 --- /dev/null +++ b/blogs/2026/01/docfind.md @@ -0,0 +1,204 @@ +--- +Order: 122 +TOCTitle: "Building docfind" +PageTitle: "Building docfind: Fast Client-Side Search with Rust and WebAssembly" +MetaDescription: How we built docfind, a high-performance client-side search engine using Rust and WebAssembly, and how GitHub Copilot accelerated development. +MetaSocialImage: TBD +Date: 2026-01-07 +Author: João Moreno +--- + +# Building docfind: Fast Client-Side Search with Rust and WebAssembly + +January 7, 2026 by [João Moreno](https://github.com/joaomoreno) + +If you've visited the [VS Code website](https://code.visualstudio.com/) recently, you might have noticed something new: a fast, responsive search experience that feels almost instant. + + + +Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using WebAssembly. In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries, all empowered by GitHub Copilot. + +## The problem + +I'm currently a software engineering manager on the VS Code team, after spending over ten years as a software engineer. These days, I don't get much time to write code, and when I do, it's rarely in unfamiliar territory. But some problems just nag at you until you do something about them. 
+ +For years, our documentation website had a basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Functional, but not the experience you'd expect from a product like VS Code. I wanted something better—something that felt as snappy as VS Code's Quick Open (`Ctrl+P`), where results appear instantly as you type. + +Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researched the alternatives. The landscape looked something like this: + +- **[Algolia](https://www.algolia.com/)**: Excellent paid search-as-a-service. +- **[TypeSense](https://typesense.org/)**: Powerful open-source search, but requires running server-side code — another service to maintain and monitor. +- **[Lunr.js](https://lunrjs.com/)**: Client-side search in JavaScript, which sounded promising. We tried it with our docs (~3 MB of markdown), but it produced index files around 10 MB. Too large. +- **[Stork Search](https://stork-search.net/)**: WebAssembly-powered client-side search with a nice demo. But when we tested it, the indexes were still quite large, and the project appeared to be unmaintained. + +None of these options hit the sweet spot we were looking for: fast, client-side, compact, and maintenance-free. I started to wonder if we could build something ourselves. + +## The inspiration + +Thinking about client-side search reminded me of a blog post I'd read years ago. It was written by [Andrew Gallant](https://github.com/BurntSushi) (burntsushi), the creator of ripgrep, and it's titled [Index 1,600,000,000 Keys with Automata and Rust](https://burntsushi.net/transducers/). Published nearly a decade ago, it explains how to use **Finite State Transducers (FSTs)** to index massive amounts of string data in a compact binary format that supports fast lookups—including regex and fuzzy matching. 
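To make "fuzzy matching" concrete: a Levenshtein automaton accepts exactly the keys that lie within a given edit distance of the query. The sketch below computes that distance directly with dynamic programming and filters a small keyword list by it. This is only an illustration of the acceptance condition, using the standard library alone; it is not how the `fst` crate or docfind implement it, and the names `levenshtein` and `fuzzy_matches` as well as the sample keywords are hypothetical.

```rust
/// Classic dynamic-programming edit distance between two strings,
/// counted in Unicode scalar values (chars), not bytes.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the chars of `a` processed so far
    // and the first j chars of `b`. Initially no chars of `a` are processed.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical helper: keep the keywords within `max_distance` edits of `query`.
/// This scans every keyword; an automaton walked over an FST avoids that scan.
fn fuzzy_matches<'a>(keywords: &[&'a str], query: &str, max_distance: usize) -> Vec<&'a str> {
    keywords
        .iter()
        .copied()
        .filter(|k| levenshtein(k, query) <= max_distance)
        .collect()
}

fn main() {
    // Made-up keywords, kept sorted the way FST keys would be.
    let keywords = ["debugger", "debugging", "extension", "terminal"];
    let hits = fuzzy_matches(&keywords, "debuger", 1);
    println!("{hits:?}");
}
```

With a maximum distance of 1, the misspelled query `debuger` matches only `debugger`. The point of pairing an automaton with an FST is to get the same answer without computing a distance for every key: the two state machines are walked together, so whole branches of keys that can no longer match are skipped.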
+ +The key insight is that FSTs can store sorted string keys in a state machine that's both memory-efficient and fast to query. Better yet, Andrew had published a Rust library called [fst](https://docs.rs/fst/latest/fst/) that implements exactly this. + +What if we could use FSTs to index keywords extracted from our documentation? The user types a query, we match it against keywords using the FST, and we get back a list of relevant documents—all in the browser, with no server round-trip. + +This led me to two more pieces of the puzzle: + +- **[RAKE](https://docs.rs/rake/latest/rake/)** (Rapid Automatic Keyword Extraction): An algorithm for extracting meaningful keywords and phrases from text. Feed it a document, and it returns keywords ranked by importance. +- **[FSST](https://docs.rs/fsst-rs/latest/fsst/index.html)** (Fast Static Symbol Table): A compression algorithm optimized for short strings. Since we'd need to store document titles, categories, and snippets in memory, compression would help keep the index small. + +With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it—in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. + +## The solution + +The architecture of docfind is straightforward, broken into three phases: + +```mermaid +flowchart LR + A([documents.json]) --> B[docfind] + B --> C[Keyword Extraction
RAKE] + B --> E[FSST Compression
document strings] + C --> D[FST Map
keywords → docs] + D --> F[[Index]] + E --> F + F --> G([docfind_bg.wasm
+ docfind.js]) + + style A fill:#e1f5ff + style G fill:#e1f5ff + style F fill:#ffffcc +``` + +**Phase 1: Indexing.** The CLI tool reads a JSON file containing your documents (title, category, URL, body text). For each document, it extracts keywords using RAKE, assigns relevance scores, and builds an FST that maps keywords to document indices. All the document strings are compressed using FSST and packed into a compact binary index. + +**Phase 2: Embedding.** Here's where things get interesting. Rather than shipping the index as a separate file that needs to be fetched at runtime, we embed it directly into the WebAssembly binary. The CLI tool patches the compiled WASM file to include the index as a data segment. When the browser loads the WASM module, the index is already in memory—no additional network request required. + +**Phase 3: Search.** When the user types a query, the WASM module searches the FST using a Levenshtein automaton (for typo tolerance) and prefix matching. It combines scores from multiple matching keywords, decompresses the relevant document strings on demand, and returns ranked results as JavaScript objects. + +The core data structure is surprisingly simple: + +```rust +pub struct Index { + /// FST mapping keywords to entry indices + fst: Vec, + + /// FSST-compressed document strings (title, category, href, body) + document_strings: FsstStrVec, + + /// For each keyword index, a list of (document_index, score) pairs + keyword_to_documents: Vec>, +} +``` + +The FST stores keywords and maps them to indices into `keyword_to_documents`. Each entry there points to the relevant documents with their relevance scores. Document strings are stored compressed and decompressed only when needed for display. + +## The challenge + +The trickiest part of this project wasn't the search algorithm or the keyword extraction—it was embedding the index into the WebAssembly binary. 
+ +The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile time. But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with any index. + +This meant I needed to: + +1. Parse the existing WASM binary to understand its structure +2. Find the memory section and calculate how much additional space the index needs +3. Add the index as a new data segment, updating the data count section accordingly +4. Locate placeholder global variables and patch them with the actual index location +5. Write out a valid WASM binary + +The WASM template declares two placeholder globals with a distinctive marker value: + +```rust +#[unsafe(no_mangle)] +pub static mut INDEX_BASE: u32 = 0xdead_beef; + +#[unsafe(no_mangle)] +pub static mut INDEX_LEN: u32 = 0xdead_beef; +``` + +At runtime, the search function uses these to locate the embedded index and parse it from the raw bytes: + +```rust +static INDEX: OnceLock = OnceLock::new(); + +pub fn search(query: &str, max_results: Option) -> Result { + let index = INDEX.get_or_init(|| { + let raw_index = unsafe { + std::slice::from_raw_parts(INDEX_BASE as *const u8, INDEX_LEN as usize) + }; + Index::from_bytes(raw_index).expect("Failed to deserialize index") + }); + // ... 
perform search +} +``` + +The CLI tool scans the WASM binary's export section to find these globals, reads the global section to get their memory addresses, then patches the data segment that contains those `0xdead_beef` values with the actual index base address and length: + +```rust +// Patch the data if it contains the INDEX_BASE or INDEX_LEN addresses +if index_base_global_address >= &start && index_base_global_address < &end { + data[base_relative_offset..base_relative_offset + 4] + .copy_from_slice(&(index_base as i32).to_le_bytes()); + data[length_relative_offset..length_relative_offset + 4] + .copy_from_slice(&(raw_index.len() as i32).to_le_bytes()); +} + +// Add index as new data segment +data_section.active( + 0, + &ConstExpr::i32_const(index_base as i32), + raw_index.iter().copied(), +); +``` + +This was, to put it mildly, not straightforward. Understanding the WASM binary format, figuring out how globals are stored and referenced, calculating memory offsets—these are the kinds of problems that can easily derail a side project. + +## The breakthrough: Copilot as an enabler + +I have to be honest, it's unlikely that I would have finished this project without GitHub Copilot. As a manager who doesn't code daily anymore, tackling a project in Rust—a language known for its steep learning curve—was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. + +Copilot changed the equation entirely. + +**Research and exploration.** When I was evaluating FST, RAKE, and FSST, I used Copilot to understand how these libraries worked, ask clarifying questions, and bounce ideas around. It was like having a knowledgeable colleague available at any hour. + +**Efficient Rust development.** This was perhaps the biggest win. 
Copilot's completions and [Next Edit Suggestions](/docs/copilot/ai-powered-suggestions#_next-edit-suggestions) turned me into a productive Rust programmer. I no longer spent mental energy fighting the borrow checker or looking up syntax—Copilot handled the mechanical parts, letting me focus on the logic. + +**Scaffolding the WASM target.** When I asked Copilot to add a WebAssembly output target to the project, it didn't just add the configuration—it inferred that I wanted a search function exported and scaffolded the entire `lib.rs` with the right `wasm-bindgen` annotations. It even told me which command to run to build it. + +**The [docfind library](https://github.com/microsoft/docfind).** Copilot helped me scaffold the repository for docfind, including creating a working demo page, with performance vanity numbers. + +**Getting past the hard parts.** The WASM binary manipulation was the technical crux of this project. Understanding how to locate globals, patch data segments, and update memory sections required diving into details I'd never encountered before. Copilot helped me understand the WASM binary format, suggested the right `wasmparser` and `wasm-encoder` APIs, and helped debug issues when my patched binaries weren't valid. + +I'm confident this project would have taken me considerably longer without Copilot—and that's assuming I wouldn't have given up somewhere along the way. When you're time-constrained and working outside your expertise, having an AI assistant that can fill knowledge gaps and handle boilerplate isn't just convenient—it's the difference between shipping and abandoning. + +## The results + +Today, docfind powers the search experience on the VS Code documentation website. The numbers speak for themselves—you can see the current performance metrics in the [docfind README](https://github.com/microsoft/docfind#live-demo), which includes an interactive demo searching through 50,000 news articles entirely in your browser. 
+ +For the VS Code website (~3 MB of markdown, ~3,700 documents partitioned by heading): + +- **Index size**: ~5.9 MB uncompressed, ~2.7 MB with Brotli compression +- **Search speed**: ~0.4 ms per query, on my M2 MacBook Air +- **Network**: Single WASM file, downloaded only when the user shows intent to search + +No servers to maintain. No API keys to manage. No ongoing costs. Just a self-contained WebAssembly module that runs entirely in the browser, created at build time. + +## Try it yourself + +We've open-sourced docfind, and you can use it for your own static sites today. Installation is straightforward: + +```sh +curl -fsSL https://microsoft.github.io/docfind/install.sh | sh +``` + +Or, if you're on Windows: + +```powershell +irm https://microsoft.github.io/docfind/install.ps1 | iex +``` + +Prepare a [JSON file](https://github.com/microsoft/docfind?tab=readme-ov-file#creating-a-search-index) with your documents, run `docfind documents.json output`, and you'll get a `docfind.js` and `docfind_bg.wasm` ready to use in your site. You need to bring your own client-side UI to show the search results. + +Building docfind was a reminder of why I became an engineer in the first place: the joy of solving a real problem with elegant technology. And it was a testament to how AI tools like Copilot are changing what's possible—letting us tackle projects that would have been out of reach given our constraints of time and expertise. + +If you have questions or feedback, feel free to open an issue on the [docfind repository](https://github.com/microsoft/docfind/issues). We'd love to hear how you're using it. + +Happy coding! 
💙 \ No newline at end of file diff --git a/blogs/2026/01/docfind.mp4 b/blogs/2026/01/docfind.mp4 new file mode 100644 index 0000000000..2bb6355c37 --- /dev/null +++ b/blogs/2026/01/docfind.mp4 @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0b2a059f5b41c2ba3fb140e4af0ff8fa10e418396e26ae3db7222cd72bbd0075 +size 2728796 From bb26d265e53aa994e0e58394439c2db84581d4aa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 15:13:03 +0100 Subject: [PATCH 02/24] Update blogs/2026/01/docfind.md --- blogs/2026/01/docfind.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md index d3c1e31449..dd5a3e2073 100644 --- a/blogs/2026/01/docfind.md +++ b/blogs/2026/01/docfind.md @@ -16,7 +16,7 @@ If you've visited the [VS Code website](https://code.visualstudio.com/) recently -Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using WebAssembly. In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries, all empowered by GitHub Copilot. +Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using WebAssembly. In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries. 
## The problem From 49d8ccf5ab125ca956c973c04abd21ba37001166 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 15:21:01 +0100 Subject: [PATCH 03/24] Update blogs/2026/01/docfind.md --- blogs/2026/01/docfind.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md index dd5a3e2073..f4d5a66fcb 100644 --- a/blogs/2026/01/docfind.md +++ b/blogs/2026/01/docfind.md @@ -20,7 +20,7 @@ Behind that experience is [docfind](https://github.com/microsoft/docfind), a sea ## The problem -I'm currently a software engineering manager on the VS Code team, after spending over ten years as a software engineer. These days, I don't get much time to write code, and when I do, it's rarely in unfamiliar territory. But some problems just nag at you until you do something about them. +I'm currently a Software Engineering Manager on the VS Code team, so these days I don't get much time to write code. When I do, it's rarely in unfamiliar territory. But some problems just nag at you until you do something about them. For years, our documentation website had a basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Functional, but not the experience you'd expect from a product like VS Code. I wanted something better—something that felt as snappy as VS Code's Quick Open (`Ctrl+P`), where results appear instantly as you type. 
From 3d0d326575de3b26720239ac2819a9161a2ff03b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 15:24:14 +0100 Subject: [PATCH 04/24] Update blogs/2026/01/docfind.md --- blogs/2026/01/docfind.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md index f4d5a66fcb..ec9e7f517d 100644 --- a/blogs/2026/01/docfind.md +++ b/blogs/2026/01/docfind.md @@ -26,8 +26,8 @@ For years, our documentation website had a basic search experience: you'd type a Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researched the alternatives. The landscape looked something like this: -- **[Algolia](https://www.algolia.com/)**: Excellent paid search-as-a-service. -- **[TypeSense](https://typesense.org/)**: Powerful open-source search, but requires running server-side code — another service to maintain and monitor. +- **[Algolia](https://www.algolia.com/)**: State of the art search-as-a-service. But I wanted a pure client-side solution. +- **[TypeSense](https://typesense.org/)**: Powerful open-source search, but requires server-side code, just like Algolia. Plus, it'd be another service to maintain and monitor. - **[Lunr.js](https://lunrjs.com/)**: Client-side search in JavaScript, which sounded promising. We tried it with our docs (~3 MB of markdown), but it produced index files around 10 MB. Too large. - **[Stork Search](https://stork-search.net/)**: WebAssembly-powered client-side search with a nice demo. But when we tested it, the indexes were still quite large, and the project appeared to be unmaintained. 
From 553d277b1488be943c5e60f424384b15fc5e9c01 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:17:05 +0100 Subject: [PATCH 05/24] updates --- blogs/2026/01/docfind.md | 42 ++++++++++++++++++++++------------------ 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md index ec9e7f517d..8d0cab077f 100644 --- a/blogs/2026/01/docfind.md +++ b/blogs/2026/01/docfind.md @@ -16,7 +16,7 @@ If you've visited the [VS Code website](https://code.visualstudio.com/) recently -Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using WebAssembly. In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries. +Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using [WebAssembly](https://en.wikipedia.org/wiki/WebAssembly). In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries. ## The problem @@ -48,32 +48,31 @@ This led me to two more pieces of the puzzle: With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it—in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. -## The solution +## The solution: A standalone CLI tool -The architecture of docfind is straightforward, broken into three phases: +I ended up creating a single CLI tool: docfind. The CLI tool is meant to create an index file out of a collection of documents. 
That index file is meant to be served to our website customers via regular HTTP and to power the search functionality. Users of the CLI tool shouldn't need any external dependencies other than docfind itself in order to create index files. + +Here's a diagram of how docfind transforms a collection of documents (`documents.json`) into the respective index file (`docfind_bg.wasm`): + +> TODO: Replace with image instead of mermaid diagram ```mermaid flowchart LR - A([documents.json]) --> B[docfind] - B --> C[Keyword Extraction
RAKE] - B --> E[FSST Compression
document strings] + A([documents.json]) --> C[Keyword Extraction
RAKE] + A --> E[FSST Compression
document strings] C --> D[FST Map
keywords → docs] - D --> F[[Index]] - E --> F - F --> G([docfind_bg.wasm
+ docfind.js]) + D --> I([Index]) + E --> I + I --> G([docfind_bg.wasm
docfind.js]) style A fill:#e1f5ff style G fill:#e1f5ff - style F fill:#ffffcc + style I fill:#e1f5ff ``` -**Phase 1: Indexing.** The CLI tool reads a JSON file containing your documents (title, category, URL, body text). For each document, it extracts keywords using RAKE, assigns relevance scores, and builds an FST that maps keywords to document indices. All the document strings are compressed using FSST and packed into a compact binary index. - -**Phase 2: Embedding.** Here's where things get interesting. Rather than shipping the index as a separate file that needs to be fetched at runtime, we embed it directly into the WebAssembly binary. The CLI tool patches the compiled WASM file to include the index as a data segment. When the browser loads the WASM module, the index is already in memory—no additional network request required. - -**Phase 3: Search.** When the user types a query, the WASM module searches the FST using a Levenshtein automaton (for typo tolerance) and prefix matching. It combines scores from multiple matching keywords, decompresses the relevant document strings on demand, and returns ranked results as JavaScript objects. +Docfind first reads a JSON file containing information about your documents (title, category, URL, body text). For each document, it extracts keywords using RAKE, assigns relevance scores, and builds an FST that maps keywords to document indices. All the document strings are compressed using FSST. Both the FST and the compressed strings are then packed into a binary blob, representing the actual index. -The core data structure is surprisingly simple: +The data structure representing the index is surprisingly simple: ```rust pub struct Index { @@ -88,11 +87,16 @@ pub struct Index { } ``` -The FST stores keywords and maps them to indices into `keyword_to_documents`. Each entry there points to the relevant documents with their relevance scores. Document strings are stored compressed and decompressed only when needed for display. 
+The index stores keywords and maps them to indices into `keyword_to_documents`. Each entry there points to the relevant documents with their relevance scores. Document strings are stored compressed and decompressed only when needed for display. + +We could dump that index to a binary file, serve it up to our website customers, and have some WebAssembly code parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WASM file. Combining the search library with the index allows us to fetch a single HTTP resource whenever the user intends to search on the website. So as a last step, docfind outputs a single WASM file containing the client-side search code as well as the entire index created from the documents. + +So what happens client-side? When the user types a query, the WASM module (containing both the code and the index) is loaded into memory and executes that query as a search operation by traversing the FST data structure. We've found it useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching to get a better experience. Finally, results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects. + -## The challenge +## The challenge: Patching the WASM library -The trickiest part of this project wasn't the search algorithm or the keyword extraction—it was embedding the index into the WebAssembly binary. +The trickiest part of this project wasn't the search algorithm or the keyword extraction — it was embedding the index into the WebAssembly binary. The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile time. 
But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with any index. From ad32f562ed718888f78f752eb2086758273b6e42 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:25:46 +0100 Subject: [PATCH 06/24] more updates --- blogs/2026/01/docfind.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/docfind.md index 8d0cab077f..570f5e8d6d 100644 --- a/blogs/2026/01/docfind.md +++ b/blogs/2026/01/docfind.md @@ -31,7 +31,7 @@ Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researc - **[Lunr.js](https://lunrjs.com/)**: Client-side search in JavaScript, which sounded promising. We tried it with our docs (~3 MB of markdown), but it produced index files around 10 MB. Too large. - **[Stork Search](https://stork-search.net/)**: WebAssembly-powered client-side search with a nice demo. But when we tested it, the indexes were still quite large, and the project appeared to be unmaintained. -None of these options hit the sweet spot we were looking for: fast, client-side, compact, and maintenance-free. I started to wonder if we could build something ourselves. +None of these options hit the sweet spot we were looking for: fast, client-side, compact, and low maintenance. I started to wonder if we could build something ourselves. ## The inspiration @@ -157,21 +157,21 @@ This was, to put it mildly, not straightforward. Understanding the WASM binary f ## The breakthrough: Copilot as an enabler -I have to be honest, it's unlikely that I would have finished this project without GitHub Copilot. As a manager who doesn't code daily anymore, tackling a project in Rust—a language known for its steep learning curve—was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. 
And I certainly didn't have deep knowledge of the WebAssembly binary format. +I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust—a language known for its steep learning curve—was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. Copilot changed the equation entirely. **Research and exploration.** When I was evaluating FST, RAKE, and FSST, I used Copilot to understand how these libraries worked, ask clarifying questions, and bounce ideas around. It was like having a knowledgeable colleague available at any hour. -**Efficient Rust development.** This was perhaps the biggest win. Copilot's completions and [Next Edit Suggestions](/docs/copilot/ai-powered-suggestions#_next-edit-suggestions) turned me into a productive Rust programmer. I no longer spent mental energy fighting the borrow checker or looking up syntax—Copilot handled the mechanical parts, letting me focus on the logic. +**Efficient Rust development.** This was perhaps the biggest win. Copilot's [Next Edit Suggestions](/docs/copilot/ai-powered-suggestions#_next-edit-suggestions) turned me into a productive Rust programmer. I no longer spent mental energy fighting the borrow checker or looking up syntax—Copilot handled the mechanical parts, letting me focus on the logic. -**Scaffolding the WASM target.** When I asked Copilot to add a WebAssembly output target to the project, it didn't just add the configuration—it inferred that I wanted a search function exported and scaffolded the entire `lib.rs` with the right `wasm-bindgen` annotations. It even told me which command to run to build it. 
+**Scaffolding the WASM target.** When I asked Copilot to add a WebAssembly output target to the project, it didn't just add the configuration — it inferred that I wanted a search function exported and scaffolded the entire `lib.rs` with the right `wasm-bindgen` annotations. It even told me which command to run to build it. -**The [docfind library](https://github.com/microsoft/docfind).** Copilot helped me scaffold the repository for docfind, including creating a working demo page, with performance vanity numbers. +**The [docfind library](https://github.com/microsoft/docfind).** [Copilot helped me scaffold the repository for docfind](https://github.com/microsoft/docfind/pulls?q=is%3Apr+author%3A%40copilot+is%3Aclosed), including creating a working demo page, with performance vanity numbers. **Getting past the hard parts.** The WASM binary manipulation was the technical crux of this project. Understanding how to locate globals, patch data segments, and update memory sections required diving into details I'd never encountered before. Copilot helped me understand the WASM binary format, suggested the right `wasmparser` and `wasm-encoder` APIs, and helped debug issues when my patched binaries weren't valid. -I'm confident this project would have taken me considerably longer without Copilot—and that's assuming I wouldn't have given up somewhere along the way. When you're time-constrained and working outside your expertise, having an AI assistant that can fill knowledge gaps and handle boilerplate isn't just convenient—it's the difference between shipping and abandoning. +I'm confident this project would have taken me considerably longer without Copilot — and that's assuming I wouldn't have given up somewhere along the way. When you're time-constrained and working outside your expertise, I've found that having an AI assistant that can fill knowledge gaps and handle boilerplate isn't just convenient — it's the difference between shipping and abandoning. 
## The results @@ -199,9 +199,9 @@ Or, if you're on Windows: irm https://microsoft.github.io/docfind/install.ps1 | iex ``` -Prepare a [JSON file](https://github.com/microsoft/docfind?tab=readme-ov-file#creating-a-search-index) with your documents, run `docfind documents.json output`, and you'll get a `docfind.js` and `docfind_bg.wasm` ready to use in your site. You need to bring your own client-side UI to show the search results. +Prepare a [JSON file](https://github.com/microsoft/docfind?tab=readme-ov-file#creating-a-search-index) with your documents, run `docfind documents.json output`, and you'll get a `docfind.js` and `docfind_bg.wasm` ready to use in your site. You need to bring your own client-side UI to show the search results (you can always create one using GitHub Copilot 😉). -Building docfind was a reminder of why I became an engineer in the first place: the joy of solving a real problem with elegant technology. And it was a testament to how AI tools like Copilot are changing what's possible—letting us tackle projects that would have been out of reach given our constraints of time and expertise. +Building docfind was a reminder of why I became an engineer in the first place: the joy of solving a real problem with elegant technology. And it was a testament to how AI tools like Copilot are changing what's possible—letting us tackle projects that would have been out of reach given our constraints of time and expertise. Finally, a quick shout-out to the [rust-analyzer](https://marketplace.visualstudio.com/items?itemName=rust-lang.rust-analyzer) VS Code extension, a must-have if you're working with Rust in VS Code. If you have questions or feedback, feel free to open an issue on the [docfind repository](https://github.com/microsoft/docfind/issues). We'd love to hear how you're using it. 
From 301ea4e2409ab5f396272273b29f203ab42efe77 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:30:53 +0100 Subject: [PATCH 07/24] move blog post --- blogs/2026/01/{ => 07}/docfind.md | 0 blogs/2026/01/{ => 07}/docfind.mp4 | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename blogs/2026/01/{ => 07}/docfind.md (100%) rename blogs/2026/01/{ => 07}/docfind.mp4 (100%) diff --git a/blogs/2026/01/docfind.md b/blogs/2026/01/07/docfind.md similarity index 100% rename from blogs/2026/01/docfind.md rename to blogs/2026/01/07/docfind.md diff --git a/blogs/2026/01/docfind.mp4 b/blogs/2026/01/07/docfind.mp4 similarity index 100% rename from blogs/2026/01/docfind.mp4 rename to blogs/2026/01/07/docfind.mp4 From a1768a94adf246553c8d5d7e0c219a209a685511 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:33:58 +0100 Subject: [PATCH 08/24] more updates --- blogs/2026/01/07/docfind.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index 570f5e8d6d..550a99d5b5 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -41,12 +41,12 @@ The key insight is that FSTs can store sorted string keys in a state machine tha What if we could use FSTs to index keywords extracted from our documentation? The user types a query, we match it against keywords using the FST, and we get back a list of relevant documents—all in the browser, with no server round-trip. -This led me to two more pieces of the puzzle: +But how could we get these document keywords? And wouldn't this just create a very large index file, given all the strings would need to be in memory? Could we use compression to create the smallest possible index? This led me to two more pieces of the puzzle: - **[RAKE](https://docs.rs/rake/latest/rake/)** (Rapid Automatic Keyword Extraction): An algorithm for extracting meaningful keywords and phrases from text. 
Feed it a document, and it returns keywords ranked by importance. - **[FSST](https://docs.rs/fsst-rs/latest/fsst/index.html)** (Fast Static Symbol Table): A compression algorithm optimized for short strings. Since we'd need to store document titles, categories, and snippets in memory, compression would help keep the index small. -With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it—in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. +With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it — in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. ## The solution: A standalone CLI tool From be0412e5755ff095f98aa67bf86f4f092bbe07c4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:42:26 +0100 Subject: [PATCH 09/24] use excalidraw --- blogs/2026/01/07/docfind.md | 18 ++---------------- blogs/2026/01/07/docfind.svg | 4 ++++ 2 files changed, 6 insertions(+), 16 deletions(-) create mode 100644 blogs/2026/01/07/docfind.svg diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index 550a99d5b5..e76bf862e0 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -50,25 +50,11 @@ With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for stri ## The solution: A standalone CLI tool -I ended up creating a single CLI tool: docfind. The CLI tool is meant to create an index file out of a collection of documents. That index file is meant to be served to our website customers via regular HTTP and empower the search functionality. Users of the CLI tool shouldn't need any extenal dependencies other than docfind itself, in order to create index files. 
+I ended up creating a single CLI tool, docfind, meant to create an index file out of a collection of documents. That index file should then be served to our website customers via regular HTTP and empower the search functionality. Users of the CLI tool shouldn't need any extenal dependencies other than docfind itself, in order to create index files. Here's an diagram of how docfind transforms a collection of documents (`documents.json`) into the respective index file (`docfind_bg.wasm`): -> TODO: Replace with image instead of mermaid diagram - -```mermaid -flowchart LR - A([documents.json]) --> C[Keyword Extraction
RAKE] - A --> E[FSST Compression
document strings] - C --> D[FST Map
keywords → docs] - D --> I([Index]) - E --> I - I --> G([docfind_bg.wasm
docfind.js]) - - style A fill:#e1f5ff - style G fill:#e1f5ff - style I fill:#e1f5ff -``` +![A diagram showing the flow of data in docfind](docfind.svg) Docfind first reads a JSON file containing information about your documents (title, category, URL, body text). For each document, it extracts keywords using RAKE, assigns relevance scores, and builds an FST that maps keywords to document indices. All the document strings are compressed using FSST. Both the FST and the compressed strings are then packed into a binary blob, representing the actual index. diff --git a/blogs/2026/01/07/docfind.svg b/blogs/2026/01/07/docfind.svg new file mode 100644 index 0000000000..becb4ddd38 --- /dev/null +++ b/blogs/2026/01/07/docfind.svg @@ -0,0 +1,4 @@ +[SVG diagram; text labels: Keyword Extraction, RAKE, documents.json, String Compression, FSST, Keyword to Documents, FST, Index, Lib + Index, WASM] \ No newline at end of file From e5c1b0333120c636537617f04a63bd016023fff8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Wed, 7 Jan 2026 16:58:05 +0100 Subject: [PATCH 10/24] add a diagram showing an example of a document and its keywords --- blogs/2026/01/07/docfind.md | 5 ++++- blogs/2026/01/07/docfind2.svg | 7 +++++++ 2 files changed, 11 insertions(+), 1 deletion(-) create mode 100644 blogs/2026/01/07/docfind2.svg diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index e76bf862e0..cab5fe9d29 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -58,11 +58,14 @@ Here's an diagram of how docfind transforms a collection of documents (`document Docfind first reads a JSON file containing information about your documents (title, category, URL, body text). For each document, it extracts keywords using RAKE, assigns relevance scores, and builds an FST that maps keywords to document indices. All the document strings are compressed using FSST. Both the FST and the compressed strings are then packed into a binary blob, representing the actual index. 
+ +![A visual explanation of what is a document, what are keywords and how are they represented in the index](docfind2.svg) + The data structure representing the index is surprisingly simple: ```rust pub struct Index { - /// FST mapping keywords to entry indices + /// FST mapping keywords to document indices fst: Vec, /// FSST-compressed document strings (title, category, href, body) diff --git a/blogs/2026/01/07/docfind2.svg b/blogs/2026/01/07/docfind2.svg new file mode 100644 index 0000000000..c2e85e306b --- /dev/null +++ b/blogs/2026/01/07/docfind2.svg @@ -0,0 +1,7 @@ +[SVG diagram; text labels: "Extract Keywords"; example document: "GitHub Copilot coding agent", /docs/copilot/copilot-coding-agent, "GitHub Copilot coding agent is a GitHub-hosted, autonomous AI developer that works independently in the background to complete development tasks. To invoke the coding agent..."; "That's a document"; keywords: Autonomous, Copilot, GitHub, Agent; "These are keywords"; index rows: Copilot, Agent, GitHub, Autonomous, ..., each with [..., GitHub Copilot Coding Agent, ...]; "The index maps each keyword to the documents it appears in"] \ No newline at end of file From bfaea0869ca6377a3377c8068e78f0f3954a6a0d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Thu, 8 Jan 2026 09:38:47 +0100 Subject: [PATCH 11/24] remove emdashes --- blogs/2026/01/07/docfind.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index cab5fe9d29..11b9d3eff3 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -16,13 +16,13 @@ If you've visited the [VS Code website](https://code.visualstudio.com/) recently -Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using [WebAssembly](https://en.wikipedia.org/wiki/WebAssembly). 
In this post, I want to share the story of how docfind came to be — a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries. +Behind that experience is [docfind](https://github.com/microsoft/docfind), a search engine we built that runs entirely in your browser using [WebAssembly](https://en.wikipedia.org/wiki/WebAssembly). In this post, I want to share the story of how docfind came to be: a journey that took me from a decade-old blog post about automata theory to patching WebAssembly binaries. ## The problem I'm currently a Software Engineering Manager on the VS Code team, so these days I don't get much time to write code. When I do, it's rarely in unfamiliar territory. But some problems just nag at you until you do something about them. -For years, our documentation website had a basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Functional, but not the experience you'd expect from a product like VS Code. I wanted something better—something that felt as snappy as VS Code's Quick Open (`Ctrl+P`), where results appear instantly as you type. +For years, our documentation website had a basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Functional, but not the experience you'd expect from a product like VS Code. I wanted something better, something that felt as snappy as VS Code's Quick Open (`Ctrl+P`), where results appear instantly as you type. Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researched the alternatives. The landscape looked something like this: @@ -35,18 +35,18 @@ None of these options hit the sweet spot we were looking for: fast, client-side, ## The inspiration -Thinking about client-side search reminded me of a blog post I'd read years ago. 
It was written by [Andrew Gallant](https://github.com/BurntSushi) (burntsushi), the creator of ripgrep, and it's titled [Index 1,600,000,000 Keys with Automata and Rust](https://burntsushi.net/transducers/). Published nearly a decade ago, it explains how to use **Finite State Transducers (FSTs)** to index massive amounts of string data in a compact binary format that supports fast lookups—including regex and fuzzy matching. +Thinking about client-side search reminded me of a blog post I'd read years ago. It was written by [Andrew Gallant](https://github.com/BurntSushi) (burntsushi), the creator of ripgrep, and it's titled [Index 1,600,000,000 Keys with Automata and Rust](https://burntsushi.net/transducers/). Published nearly a decade ago, it explains how to use **Finite State Transducers (FSTs)** to index massive amounts of string data in a compact binary format that supports fast lookups, including regex and fuzzy matching. The key insight is that FSTs can store sorted string keys in a state machine that's both memory-efficient and fast to query. Better yet, Andrew had published a Rust library called [fst](https://docs.rs/fst/latest/fst/) that implements exactly this. -What if we could use FSTs to index keywords extracted from our documentation? The user types a query, we match it against keywords using the FST, and we get back a list of relevant documents—all in the browser, with no server round-trip. +What if we could use FSTs to index keywords extracted from our documentation? The user types a query, we match it against keywords using the FST, and we get back a list of relevant documents, all in the browser, with no server round-trip. But how could we get these document keywords? And wouldn't this just create a very large index file, given all the strings would need to be in memory? Could we use compression to create the smallest possible index? 
This led me to two more pieces of the puzzle: - **[RAKE](https://docs.rs/rake/latest/rake/)** (Rapid Automatic Keyword Extraction): An algorithm for extracting meaningful keywords and phrases from text. Feed it a document, and it returns keywords ranked by importance. - **[FSST](https://docs.rs/fsst-rs/latest/fsst/index.html)** (Fast Static Symbol Table): A compression algorithm optimized for short strings. Since we'd need to store document titles, categories, and snippets in memory, compression would help keep the index small. -With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it — in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. +With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. ## The solution: A standalone CLI tool @@ -85,7 +85,7 @@ So what happens client-side? When the user types a query, the WASM module is loa ## The challenge: Patching the WASM library -The trickiest part of this project wasn't the search algorithm or the keyword extraction — it was embedding the index into the WebAssembly binary. +The trickiest part of this project wasn't the search algorithm or the keyword extraction, it was embedding the index into the WebAssembly binary. The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile time. But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with any index. @@ -142,29 +142,29 @@ data_section.active( ); ``` -This was, to put it mildly, not straightforward. 
Understanding the WASM binary format, figuring out how globals are stored and referenced, calculating memory offsets—these are the kinds of problems that can easily derail a side project. +This was, to put it mildly, not straightforward. Understanding the WASM binary format, figuring out how globals are stored and referenced, calculating memory offsets. These are the kinds of problems that can easily derail a side project. ## The breakthrough: Copilot as an enabler -I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust—a language known for its steep learning curve—was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. +I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. Copilot changed the equation entirely. **Research and exploration.** When I was evaluating FST, RAKE, and FSST, I used Copilot to understand how these libraries worked, ask clarifying questions, and bounce ideas around. It was like having a knowledgeable colleague available at any hour. -**Efficient Rust development.** This was perhaps the biggest win. Copilot's [Next Edit Suggestions](/docs/copilot/ai-powered-suggestions#_next-edit-suggestions) turned me into a productive Rust programmer. 
I no longer spent mental energy fighting the borrow checker or looking up syntax—Copilot handled the mechanical parts, letting me focus on the logic. +**Efficient Rust development.** This was perhaps the biggest win. Copilot's [Next Edit Suggestions](/docs/copilot/ai-powered-suggestions#_next-edit-suggestions) turned me into a productive Rust programmer. I no longer spent mental energy fighting the borrow checker or looking up syntax. Copilot handled the mechanical parts, letting me focus on the logic. -**Scaffolding the WASM target.** When I asked Copilot to add a WebAssembly output target to the project, it didn't just add the configuration — it inferred that I wanted a search function exported and scaffolded the entire `lib.rs` with the right `wasm-bindgen` annotations. It even told me which command to run to build it. +**Scaffolding the WASM target.** When I asked Copilot to add a WebAssembly output target to the project, it didn't just add the configuration, it inferred that I wanted a search function exported and scaffolded the entire `lib.rs` with the right `wasm-bindgen` annotations. It even told me which command to run to build it. **The [docfind library](https://github.com/microsoft/docfind).** [Copilot helped me scaffold the repository for docfind](https://github.com/microsoft/docfind/pulls?q=is%3Apr+author%3A%40copilot+is%3Aclosed), including creating a working demo page, with performance vanity numbers. **Getting past the hard parts.** The WASM binary manipulation was the technical crux of this project. Understanding how to locate globals, patch data segments, and update memory sections required diving into details I'd never encountered before. Copilot helped me understand the WASM binary format, suggested the right `wasmparser` and `wasm-encoder` APIs, and helped debug issues when my patched binaries weren't valid. 
-I'm confident this project would have taken me considerably longer without Copilot — and that's assuming I wouldn't have given up somewhere along the way. When you're time-constrained and working outside your expertise, I've found that having an AI assistant that can fill knowledge gaps and handle boilerplate isn't just convenient — it's the difference between shipping and abandoning. +I'm confident this project would have taken me considerably longer without Copilot, and that's assuming I wouldn't have given up somewhere along the way. When you're time-constrained and working outside your expertise, I've found that having an AI assistant that can fill knowledge gaps and handle boilerplate isn't just convenient, it's the difference between shipping and abandoning. ## The results -Today, docfind powers the search experience on the VS Code documentation website. The numbers speak for themselves—you can see the current performance metrics in the [docfind README](https://github.com/microsoft/docfind#live-demo), which includes an interactive demo searching through 50,000 news articles entirely in your browser. +Today, docfind powers the search experience on the VS Code documentation website. You can see the current performance metrics in the [docfind README](https://github.com/microsoft/docfind#live-demo), which includes an [interactive demo](https://microsoft.github.io/docfind) searching through 50,000 news articles entirely in your browser. For the VS Code website (~3 MB of markdown, ~3,700 documents partitioned by heading): @@ -190,7 +190,7 @@ irm https://microsoft.github.io/docfind/install.ps1 | iex Prepare a [JSON file](https://github.com/microsoft/docfind?tab=readme-ov-file#creating-a-search-index) with your documents, run `docfind documents.json output`, and you'll get a `docfind.js` and `docfind_bg.wasm` ready to use in your site. You need to bring your own client-side UI to show the search results (you can always create one using GitHub Copilot 😉). 
-Building docfind was a reminder of why I became an engineer in the first place: the joy of solving a real problem with elegant technology. And it was a testament to how AI tools like Copilot are changing what's possible—letting us tackle projects that would have been out of reach given our constraints of time and expertise. Finally, a quick shout-out to the [rust-analyzer](https://marketplace.visualstudio.com/items?itemName=rust-lang.rust-analyzer) VS Code extension, a must-have if you're working with Rust in VS Code. +Building docfind was a reminder of why I became an engineer in the first place: the joy of solving a real problem with elegant technology. And it was a testament to how AI tools like Copilot are changing what's possible, letting us tackle projects that would have been out of reach given our constraints of time and expertise. Finally, a quick shout-out to the [rust-analyzer](https://marketplace.visualstudio.com/items?itemName=rust-lang.rust-analyzer) VS Code extension, a must-have if you're working with Rust in VS Code. If you have questions or feedback, feel free to open an issue on the [docfind repository](https://github.com/microsoft/docfind/issues). We'd love to hear how you're using it. From 6dee8da85cb529509f1661209b00eacaae7ce5eb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Thu, 8 Jan 2026 09:49:35 +0100 Subject: [PATCH 12/24] update --- blogs/2026/01/07/docfind.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index 11b9d3eff3..171249f7f9 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -22,7 +22,7 @@ Behind that experience is [docfind](https://github.com/microsoft/docfind), a sea I'm currently a Software Engineering Manager on the VS Code team, so these days I don't get much time to write code. When I do, it's rarely in unfamiliar territory. 
But some problems just nag at you until you do something about them. -For years, our documentation website had a basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Functional, but not the experience you'd expect from a product like VS Code. I wanted something better, something that felt as snappy as VS Code's Quick Open (`Ctrl+P`), where results appear instantly as you type. +Until recently, our website still had that basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Not quite what developers are used to today. I wanted those search results to appear instantly as you type, similar to many other websites out there. It should be something as snappy as VS Code's Quick Open (`Ctrl+P`). Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researched the alternatives. The landscape looked something like this: @@ -144,11 +144,9 @@ data_section.active( This was, to put it mildly, not straightforward. Understanding the WASM binary format, figuring out how globals are stored and referenced, calculating memory offsets. These are the kinds of problems that can easily derail a side project. -## The breakthrough: Copilot as an enabler +## The breakthrough: Solving hard problems with AI -I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. - -Copilot changed the equation entirely. 
+I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. But I did have a general sense of direction of where I wanted to go with all of this. Copilot helped me fill in the blanks and tackle the hard problems. **Research and exploration.** When I was evaluating FST, RAKE, and FSST, I used Copilot to understand how these libraries worked, ask clarifying questions, and bounce ideas around. It was like having a knowledgeable colleague available at any hour. From db5f89b1d2c6e9e4148787cc8b7efcf2c57b6322 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Thu, 8 Jan 2026 09:50:46 +0100 Subject: [PATCH 13/24] update --- blogs/2026/01/07/docfind.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index 171249f7f9..346d3fc404 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -48,7 +48,7 @@ But how could we get these document keywords? And wouldn't this just create a v With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for string compression, I had the technical foundations. Now I just needed to build it in Rust, a language I'm not particularly experienced with, during the limited time I could carve out from my day job. -## The solution: A standalone CLI tool +## The solution I ended up creating a single CLI tool, docfind, meant to create an index file out of a collection of documents. That index file should then be served to our website customers via regular HTTP and empower the search functionality. 
Users of the CLI tool shouldn't need any extenal dependencies other than docfind itself, in order to create index files. @@ -83,7 +83,7 @@ We could dump that index to a binary file, serve it up to our website customers So what happens client-side? When the user types a query, the WASM module is loaded to memory (containing both the code and the index) to execute that query as a search operation by going through the with FST data structure. We've found useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get a better experience. Finally, results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects. -## The challenge: Patching the WASM library +## The challenge The trickiest part of this project wasn't the search algorithm or the keyword extraction, it was embedding the index into the WebAssembly binary. @@ -144,7 +144,7 @@ data_section.active( This was, to put it mildly, not straightforward. Understanding the WASM binary format, figuring out how globals are stored and referenced, calculating memory offsets. These are the kinds of problems that can easily derail a side project. -## The breakthrough: Solving hard problems with AI +## The breakthrough I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. But I did have a general sense of direction of where I wanted to go with all of this. Copilot helped me fill in the blanks and tackle the hard problems. 
From f42b4b54a5fa5e9f4ba11a8e17621cbd02f69ba6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Moreno?= Date: Thu, 8 Jan 2026 09:51:23 +0100 Subject: [PATCH 14/24] Update blogs/2026/01/07/docfind.md Co-authored-by: Nick Trogh --- blogs/2026/01/07/docfind.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md index 346d3fc404..08eb30a478 100644 --- a/blogs/2026/01/07/docfind.md +++ b/blogs/2026/01/07/docfind.md @@ -31,7 +31,7 @@ Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researc - **[Lunr.js](https://lunrjs.com/)**: Client-side search in JavaScript, which sounded promising. We tried it with our docs (~3 MB of markdown), but it produced index files around 10 MB. Too large. - **[Stork Search](https://stork-search.net/)**: WebAssembly-powered client-side search with a nice demo. But when we tested it, the indexes were still quite large, and the project appeared to be unmaintained. -None of these options hit the sweet spot we were looking for: fast, client-side, compact, and low maintenance. I started to wonder if we could build something ourselves. +None of these options hit the sweet spot: fast, client-side, compact, and easy to host and operate. I started to wonder if we could build something ourselves. 
## The inspiration

From be0047473406d4e3d67110ccd504f8036906ca57 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:51:37 +0100
Subject: [PATCH 15/24] Update blogs/2026/01/07/docfind.md

Co-authored-by: Nick Trogh
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index 08eb30a478..a83a711799 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -52,7 +52,7 @@ With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for stri
 I ended up creating a single CLI tool, docfind, meant to create an index file out of a collection of documents. That index file should then be served to our website customers via regular HTTP and empower the search functionality. Users of the CLI tool shouldn't need any extenal dependencies other than docfind itself, in order to create index files.
-Here's an diagram of how docfind transforms a collection of documents (`documents.json`) into the respective index file (`docfind_bg.wasm`):
+Here's a diagram of how docfind transforms a collection of documents (`documents.json`) into the respective index file (`docfind_bg.wasm`):
 ![A diagram showing the flow of data in docfind](docfind.svg)

From 2611de4f194b852574d84bdb93e9f06fd8737cb1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:51:47 +0100
Subject: [PATCH 16/24] Update blogs/2026/01/07/docfind.md

Co-authored-by: Nick Trogh
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index a83a711799..26e5b9b63a 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -107,7 +107,7 @@ pub static mut INDEX_BASE: u32 = 0xdead_beef;
 pub static mut INDEX_LEN: u32 = 0xdead_beef;
 ```
-At runtime, the search function uses these to locate the embedded index and parse it from the raw bytes:
+At run-time, the search function uses these to locate the embedded index and parse it from the raw bytes:
 ```rust
 static INDEX: OnceLock = OnceLock::new();

From f8b155c132aba22d9ccfc9dfca886890b062f6ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:51:57 +0100
Subject: [PATCH 17/24] Update blogs/2026/01/07/docfind.md

Co-authored-by: Nick Trogh
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index 26e5b9b63a..dce2401f0f 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -78,7 +78,7 @@ pub struct Index {
 The index stores keywords and maps them to indices into `keyword_to_documents`. Each entry there points to the relevant documents with their relevance scores. Document strings are stored compressed and decompressed only when needed for display.
-We could dump that index to a binary file, serve it up to our website customers and have some WebAssembly code which would parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WASM file. Combining the search library with the index allows us to fetch a single HTTP resource whenever the user intends to search on the website. So as a last step, docfind outputs a single WASM file containig the client-side search code as well as the entire index created from the documents.
+We could dump that index to a binary file, serve it up to our website visitors and have some WebAssembly code which would parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WASM file. Combining the search library with the index allows us to fetch a single HTTP resource whenever the user intends to search on the website. So as a last step, docfind outputs a single WASM file containing the client-side search code as well as the entire index created from the documents.
 So what happens client-side? When the user types a query, the WASM module is loaded to memory (containing both the code and the index) to execute that query as a search operation by going through the with FST data structure. We've found useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get a better experience. Finally, results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects.

From 1d2306ea9ec929fd7d9112b8e2009260007cb861 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:52:15 +0100
Subject: [PATCH 18/24] Update blogs/2026/01/07/docfind.md

Co-authored-by: Nick Trogh
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index dce2401f0f..44d4e89a7f 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -80,7 +80,7 @@ The index stores keywords and maps them to indices into `keyword_to_documents`.
 We could dump that index to a binary file, serve it up to our website visitors and have some WebAssembly code which would parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WASM file. Combining the search library with the index allows us to fetch a single HTTP resource whenever the user intends to search on the website. So as a last step, docfind outputs a single WASM file containing the client-side search code as well as the entire index created from the documents.
-So what happens client-side? When the user types a query, the WASM module is loaded to memory (containing both the code and the index) to execute that query as a search operation by going through the with FST data structure. We've found useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get a better experience. Finally, results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects.
+So what happens client-side? When the user types a query, the WASM module is loaded in memory (code and document index) to execute that query as a search operation by going through the FST data structure. We've found it useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get more relevant matches. Finally, search results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects.
 ## The challenge

From 953e788d9e4ab2cbac9baabfdbea08e28cef0eda Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:52:25 +0100
Subject: [PATCH 19/24] Update blogs/2026/01/07/docfind.md

Co-authored-by: Nick Trogh
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index 44d4e89a7f..9f72d58fe6 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -87,7 +87,7 @@ So what happens client-side? When the user types a query, the WASM module is loa
 The trickiest part of this project wasn't the search algorithm or the keyword extraction, it was embedding the index into the WebAssembly binary.
-The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile time. But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with any index.
+The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile-time. But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with an updated index.
 This meant I needed to:

From 0bbdfa77d9594235f0b9a9dfee20b248e2855812 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 09:53:18 +0100
Subject: [PATCH 20/24] update

---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index 9f72d58fe6..ae5db07c40 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -146,7 +146,7 @@ This was, to put it mildly, not straightforward. Understanding the WASM binary f
 ## The breakthrough
-I have to be honest, it's unlikely that I would have finished this project without [Copilot Agent](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. But I did have a general sense of direction of where I wanted to go with all of this. Copilot helped me fill in the blanks and tackle the hard problems.
+I have to be honest, it's unlikely that I would have finished this project without using [GitHub Copilot agents](https://code.visualstudio.com/docs/copilot/agents/overview). As a manager who doesn't code daily anymore, tackling a project in Rust, a language known for its steep learning curve, was ambitious. I'm not a Rust expert. I don't have the muscle memory for the borrow checker. And I certainly didn't have deep knowledge of the WebAssembly binary format. But I did have a general sense of direction of where I wanted to go with all of this. Copilot helped me fill in the blanks and tackle the hard problems.
 **Research and exploration.** When I was evaluating FST, RAKE, and FSST, I used Copilot to understand how these libraries worked, ask clarifying questions, and bounce ideas around. It was like having a knowledgeable colleague available at any hour.

From b647dd40b456ee29d4b47aed9ad3f3983be428bc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Thu, 8 Jan 2026 12:13:37 +0100
Subject: [PATCH 21/24] more edits

---
 blogs/2026/01/07/docfind.md | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index ae5db07c40..70a9aa9f84 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -50,7 +50,9 @@ With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for stri
 ## The solution
-I ended up creating a single CLI tool, docfind, meant to create an index file out of a collection of documents. That index file should then be served to our website customers via regular HTTP and empower the search functionality. Users of the CLI tool shouldn't need any extenal dependencies other than docfind itself, in order to create index files.
+I ended up creating a single CLI tool, docfind, meant to create an index file from our website documents whenever we build the website itself. Users of this CLI tool shouldn't need any extenal dependencies other than docfind itself in order to create index files. That index file ended up being a single WebAssembly module, easily served to visitors via HTTP. When visitors come to our website, their browser downloads the WebAssembly module in the background and is used to empower the search functionality.
+
+### Building the index
 Here's a diagram of how docfind transforms a collection of documents (`documents.json`) into the respective index file (`docfind_bg.wasm`):
@@ -78,26 +80,28 @@ pub struct Index {
 The index stores keywords and maps them to indices into `keyword_to_documents`. Each entry there points to the relevant documents with their relevance scores. Document strings are stored compressed and decompressed only when needed for display.
-We could dump that index to a binary file, serve it up to our website visitors and have some WebAssembly code which would parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WASM file. Combining the search library with the index allows us to fetch a single HTTP resource whenever the user intends to search on the website. So as a last step, docfind outputs a single WASM file containing the client-side search code as well as the entire index created from the documents.
+Now, we could dump that index data structure to a binary file, serve it up to our website visitors and have some WebAssembly module on the site which would parse it and use the FST library to perform the search operations. But here's where things get interesting. Rather than shipping the index as a separate binary file, docfind embeds it directly into the search library WebAssembly module, allowing visitors to fetch a single HTTP resource whenever they intend to search on the website.
+
+### Searching the index
-So what happens client-side? When the user types a query, the WASM module is loaded in memory (code and document index) to execute that query as a search operation by going through the FST data structure. We've found it useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get more relevant matches. Finally, search results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects.
+So what happens client-side? When the user types a query, the WebAssembly module is loaded in memory (code and document index) to execute that query as a search operation by going through the FST data structure. We've found it useful to use a [Levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton) (for typo tolerance) and prefix matching, to get more relevant matches. Finally, search results are produced by combining scores from multiple matching keywords, decompressing the relevant document strings on demand, and returning ranked results as JavaScript objects.
 ## The challenge
 The trickiest part of this project wasn't the search algorithm or the keyword extraction, it was embedding the index into the WebAssembly binary.
-The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WASM at compile-time. But that would mean recompiling the WASM module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with an updated index.
+The naive approach would be to use Rust's `include_bytes!` macro to bake the index into the WebAssembly module at compile-time. But that would mean recompiling the WebAssembly module every time the documentation changes. Instead, I wanted a pre-compiled WASM "template" that the CLI tool could patch with an updated index.
-This meant I needed to:
+This meant I needed to statically create a WebAssembly module template, with an empty index, and embed that in docfind. Then, docfind could:
-1. Parse the existing WASM binary to understand its structure
+1. Parse the embedded WebAssembly module to understand its structure
 2. Find the memory section and calculate how much additional space the index needs
 3. Add the index as a new data segment, updating the data count section accordingly
 4. Locate placeholder global variables and patch them with the actual index location
-5. Write out a valid WASM binary
+5. Write out a valid WebAssembly module
-The WASM template declares two placeholder globals with a distinctive marker value:
+The WebAssembly module template declares two placeholder globals with a distinctive marker value:
 ```rust
 #[unsafe(no_mangle)]
@@ -123,7 +127,7 @@ pub fn search(query: &str, max_results: Option) -> Result
Date: Tue, 13 Jan 2026 11:10:04 +0100
Subject: [PATCH 22/24] Fix typo

Co-authored-by: Olivia Guzzardo <95261576+olguzzar@users.noreply.github.com>
---
 blogs/2026/01/07/docfind.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index 70a9aa9f84..df8333c4eb 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -50,7 +50,7 @@ With FST for fast keyword lookup, RAKE for keyword extraction, and FSST for stri
 ## The solution
-I ended up creating a single CLI tool, docfind, meant to create an index file from our website documents whenever we build the website itself. Users of this CLI tool shouldn't need any extenal dependencies other than docfind itself in order to create index files. That index file ended up being a single WebAssembly module, easily served to visitors via HTTP. When visitors come to our website, their browser downloads the WebAssembly module in the background and is used to empower the search functionality.
+I ended up creating a single CLI tool, docfind, meant to create an index file from our website documents whenever we build the website itself. Users of this CLI tool shouldn't need any external dependencies other than docfind itself in order to create index files. That index file ended up being a single WebAssembly module, easily served to visitors via HTTP. When visitors come to our website, their browser downloads the WebAssembly module in the background and is used to empower the search functionality.
 ### Building the index

From 1d6aa5affa7b244ec496c3e08eb77e263fcddffa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Tue, 13 Jan 2026 11:42:18 +0100
Subject: [PATCH 23/24] add MetaSocialImage

---
 blogs/2026/01/07/docfind-social.png | 3 +++
 blogs/2026/01/07/docfind.md | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)
 create mode 100644 blogs/2026/01/07/docfind-social.png

diff --git a/blogs/2026/01/07/docfind-social.png b/blogs/2026/01/07/docfind-social.png
new file mode 100644
index 0000000000..410d60cac8
--- /dev/null
+++ b/blogs/2026/01/07/docfind-social.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:30922ac0a35eb7b8226cb597eba3dbd43234c67ddf024316a73d62ffc4558755
+size 913617
diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/07/docfind.md
index df8333c4eb..1e63ed6f7b 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/07/docfind.md
@@ -3,7 +3,7 @@ Order: 122
 TOCTitle: "Building docfind"
 PageTitle: "Building docfind: Fast Client-Side Search with Rust and WebAssembly"
 MetaDescription: How we built docfind, a high-performance client-side search engine using Rust and WebAssembly, and how GitHub Copilot accelerated development.
-MetaSocialImage: TBD
+MetaSocialImage: docfind-social.png
 Date: 2026-01-07
 Author: João Moreno
 ---

From 7978b5d0ef23ea37e11db08230bc8b557ac1eefb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jo=C3=A3o=20Moreno?=
Date: Tue, 13 Jan 2026 11:57:03 +0100
Subject: [PATCH 24/24] final changes

---
 blogs/2026/01/07/docfind.svg | 4 ----
 blogs/2026/01/07/docfind2.svg | 7 -------
 blogs/2026/01/{07 => 15}/docfind-social.png | 0
 blogs/2026/01/{07 => 15}/docfind.md | 8 ++++----
 blogs/2026/01/{07 => 15}/docfind.mp4 | 0
 blogs/2026/01/15/docfind.svg | 4 ++++
 blogs/2026/01/15/docfind2.svg | 7 +++++++
 7 files changed, 15 insertions(+), 15 deletions(-)
 delete mode 100644 blogs/2026/01/07/docfind.svg
 delete mode 100644 blogs/2026/01/07/docfind2.svg
 rename blogs/2026/01/{07 => 15}/docfind-social.png (100%)
 rename blogs/2026/01/{07 => 15}/docfind.md (97%)
 rename blogs/2026/01/{07 => 15}/docfind.mp4 (100%)
 create mode 100644 blogs/2026/01/15/docfind.svg
 create mode 100644 blogs/2026/01/15/docfind2.svg

diff --git a/blogs/2026/01/07/docfind.svg b/blogs/2026/01/07/docfind.svg
deleted file mode 100644
index becb4ddd38..0000000000
--- a/blogs/2026/01/07/docfind.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-
-
-Keyword ExtractionRAKEdocuments.jsonString CompressionFSSTKeyword to DocumentsFSTIndexLib + IndexWASM
\ No newline at end of file
diff --git a/blogs/2026/01/07/docfind2.svg b/blogs/2026/01/07/docfind2.svg
deleted file mode 100644
index c2e85e306b..0000000000
--- a/blogs/2026/01/07/docfind2.svg
+++ /dev/null
@@ -1,7 +0,0 @@
-
-
-ExtractKeywordsGitHub Copilot coding agent/docs/copilot/copilot-coding-agentGitHub Copilot coding agent is aGitHub-hosted, autonomous AIdeveloper that works independently inthe background to completedevelopment tasks. To invoke the codingagent...That's adocumentAutonomousCopilotGitHubAgentThese arekeywordsCopilotAgentGitHubAutonomous...[..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...]...The index maps eachkeyword to thedocuments it appears in
\ No newline at end of file
diff --git a/blogs/2026/01/07/docfind-social.png b/blogs/2026/01/15/docfind-social.png
similarity index 100%
rename from blogs/2026/01/07/docfind-social.png
rename to blogs/2026/01/15/docfind-social.png
diff --git a/blogs/2026/01/07/docfind.md b/blogs/2026/01/15/docfind.md
similarity index 97%
rename from blogs/2026/01/07/docfind.md
rename to blogs/2026/01/15/docfind.md
index 1e63ed6f7b..bbfbd64110 100644
--- a/blogs/2026/01/07/docfind.md
+++ b/blogs/2026/01/15/docfind.md
@@ -4,13 +4,13 @@ TOCTitle: "Building docfind"
 PageTitle: "Building docfind: Fast Client-Side Search with Rust and WebAssembly"
 MetaDescription: How we built docfind, a high-performance client-side search engine using Rust and WebAssembly, and how GitHub Copilot accelerated development.
 MetaSocialImage: docfind-social.png
-Date: 2026-01-07
+Date: 2026-01-15
 Author: João Moreno
 ---
 # Building docfind: Fast Client-Side Search with Rust and WebAssembly
-January 7, 2026 by [João Moreno](https://github.com/joaomoreno)
+January 15, 2026 by [João Moreno](https://github.com/joaomoreno)
 If you've visited the [VS Code website](https://code.visualstudio.com/) recently, you might have noticed something new: a fast, responsive search experience that feels almost instant.
@@ -22,7 +22,7 @@ Behind that experience is [docfind](https://github.com/microsoft/docfind), a sea
 I'm currently a Software Engineering Manager on the VS Code team, so these days I don't get much time to write code. When I do, it's rarely in unfamiliar territory. But some problems just nag at you until you do something about them.
-Until recently, our website still had that basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Not quite what developers are used to today. I wanted those searchs results to appear instantly as you type, similar to many other websites out there. It should be something as snappy as VS Code's Quick Open (`Ctrl+P`).
+Until recently, our website still had that basic search experience: you'd type a query, and it would redirect you to search results powered by a traditional search engine. Not quite what developers are used to today. I wanted those search results to appear instantly as you type, similar to many other websites out there. It should be something as snappy as VS Code's Quick Open (`Ctrl+P`).
 Together with my colleague [Nick Trogh](https://github.com/nicktrog), we researched the alternatives. The landscape looked something like this:
@@ -171,7 +171,7 @@ Today, docfind powers the search experience on the VS Code documentation website
 For the VS Code website (~3 MB of markdown, ~3,700 documents partitioned by heading):
 - **Index size**: ~5.9 MB uncompressed, ~2.7 MB with Brotli compression
-- **Search speed**: ~0.4ms per query, on my M2 Macbook Air
+- **Search speed**: ~0.4ms per query, on my M2 MacBook Air
 - **Network**: Single WebAssembly module, downloaded only when the user shows intention to search
 No servers to maintain. No API keys to manage. No ongoing costs. Just a self-contained WebAssembly module that runs entirely in the browser, created at build time.
diff --git a/blogs/2026/01/07/docfind.mp4 b/blogs/2026/01/15/docfind.mp4
similarity index 100%
rename from blogs/2026/01/07/docfind.mp4
rename to blogs/2026/01/15/docfind.mp4
diff --git a/blogs/2026/01/15/docfind.svg b/blogs/2026/01/15/docfind.svg
new file mode 100644
index 0000000000..cef4f8bde5
--- /dev/null
+++ b/blogs/2026/01/15/docfind.svg
@@ -0,0 +1,4 @@
+
+
+Keyword ExtractionRAKEdocuments.jsonString CompressionFSSTKeyword to DocumentsFSTIndexLib + IndexWASM
\ No newline at end of file
diff --git a/blogs/2026/01/15/docfind2.svg b/blogs/2026/01/15/docfind2.svg
new file mode 100644
index 0000000000..682743770a
--- /dev/null
+++ b/blogs/2026/01/15/docfind2.svg
@@ -0,0 +1,7 @@
+
+
+ExtractKeywordsGitHub Copilot coding agent/docs/copilot/copilot-coding-agentGitHub Copilot coding agent is aGitHub-hosted, autonomous AIdeveloper that works independently inthe background to completedevelopment tasks. To invoke the codingagent...That's adocumentAutonomousCopilotGitHubAgentThese arekeywordsCopilotAgentGitHubAutonomous...[..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...][..., GitHub Copilot Coding Agent, ...]...The index maps eachkeyword to thedocuments it appears in
\ No newline at end of file