
Commit 9737f50

update readme and scripts with url

Signed-off-by: cutecutecat <[email protected]>
1 parent d16d469

File tree: 4 files changed, +284 / -99 lines

Diff for: README.md (+218 / -49)
@@ -7,32 +7,58 @@
<a href="https://discord.gg/KqswhpVgdU"><img alt="discord invitation link" src="https://img.shields.io/discord/974584200327991326?style=flat&logo=discord&cacheSeconds=60"></a>
<a href="https://twitter.com/TensorChord"><img src="https://img.shields.io/twitter/follow/tensorchord?style=flat&logo=X&cacheSeconds=60" alt="Twitter" /></a>
<a href="https://hub.docker.com/r/tensorchord/vchord-postgres"><img src="https://img.shields.io/docker/pulls/tensorchord/vchord-postgres" alt="Docker pulls" /></a>
</p>

> [!NOTE]
> VectorChord serves as the successor to [pgvecto.rs](https://github.com/tensorchord/pgvecto.rs) <a href="https://hub.docker.com/r/tensorchord/pgvecto-rs"><img src="https://img.shields.io/docker/pulls/tensorchord/pgvecto-rs" alt="Previous Docker pulls" /></a>, with better stability and performance. If you are interested in this new solution, you may find the [migration guide](https://docs.vectorchord.ai/vectorchord/admin/migration.html) helpful.

VectorChord (vchord) is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search.

With VectorChord, you can store 400,000 vectors for just $1, enabling significant savings: 6x more vectors compared to Pinecone's optimized storage and 26x more than pgvector/pgvecto.rs for the same price[^1].
## Features

VectorChord introduces remarkable enhancements over pgvecto.rs and pgvector:

**⚡ Enhanced Performance**: Delivering optimized operations with up to 5x faster queries, 16x higher insert throughput, and 16x quicker[^1] index building compared to pgvector's HNSW implementation.

[^1]: Based on [MyScale Benchmark](https://myscale.github.io/benchmark/#/) with 768-dimensional vectors and 95% recall. Please check out our [blog post](https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql) for more details.
**💰 Affordable Vector Search**: Query 100M 768-dimensional vectors using just 32GB of memory, achieving 35ms P50 latency with top10 recall@95%, helping you keep infrastructure costs down while maintaining high search quality.

**🔌 Seamless Integration**: Fully compatible with pgvector data types and syntax while providing optimal defaults out of the box - no manual parameter tuning needed. Just drop in VectorChord for enhanced performance.
**🔧 External Index Build**: Leverage IVF to build indexes externally (e.g., on GPU) for faster KMeans clustering, combined with RaBitQ[^3] compression to efficiently store vectors while maintaining search quality through autonomous reranking.

[^3]: Gao, Jianyang, and Cheng Long. "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search." Proceedings of the ACM on Management of Data 2.3 (2024): 1-27.

**📏 Long Vector Support**: Store and search vectors up to 65,535[^4] dimensions, enabling the use of the best high-dimensional models like text-embedding-3-large with ease.

[^4]: pgvector currently has a [limitation](https://github.com/pgvector/pgvector#vector-type) of 16,000 dimensions. If you really need a larger dimension (`16,000 < dim < 65,535`), consider changing [VECTOR_MAX_DIM](https://github.com/pgvector/pgvector/blob/fef635c9e5512597621e5669dce845c744170822/src/vector.h#L4) and compiling pgvector yourself.

**🌐 Scale As You Want**: Built on horizontal scaling, queries over 5M / 100M 768-dimensional vectors can easily reach 10,000+ QPS with top10 recall@90% at a competitive cost[^5].

[^5]: Please check our [blog post](https://blog.vectorchord.ai/vector-search-at-10000-qps-in-postgresql-with-vectorchord) for more details; the PostgreSQL scalability is powered by [CloudNative-PG](https://github.com/cloudnative-pg/cloudnative-pg).

## Requirements

> [!TIP]
> If you are using the official [Docker image](https://hub.docker.com/r/tensorchord/vchord-postgres), you can skip this step.

VectorChord depends on [pgvector](https://github.com/pgvector/pgvector); ensure the pgvector extension is available:

```SQL
SELECT * FROM pg_available_extensions WHERE name = 'vector';
```

If pgvector is not available, install it using the [pgvector installation instructions](https://github.com/pgvector/pgvector#installation).

Also make sure to add `vchord.so` to `shared_preload_libraries` in `postgresql.conf`:

```SQL
-- Add vchord to shared_preload_libraries --
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET shared_preload_libraries = 'vchord.so';
```
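
After restarting, you can confirm that the library is loaded. This is a quick sanity check we add here, not a required step from the original guide:

```SQL
-- The output is expected to include 'vchord.so'.
SHOW shared_preload_libraries;
```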

## Quick Start

For new users, we recommend using the Docker image to get started quickly.
@@ -41,49 +67,149 @@
```bash
docker run \
  --name vectorchord-demo \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d tensorchord/vchord-postgres:pg17-v0.2.1
```

Then you can connect to the database using the `psql` command line tool. The default username is `postgres`, and the default password is `mysecretpassword`.

```bash
psql -h localhost -p 5432 -U postgres
```

Now you can play with VectorChord!

## Documentation

- [Installation](#installation)
  - [Docker](#docker)
  - [APT](#apt)
  - [More Methods](#more-methods)
- [Usage](#usage)
  - [Storing](#storing)
  - [Indexing](#indexing)
  - [Query](#query)
- [Performance Tuning](#performance-tuning)
  - [Index Build Time](#index-build-time)
  - [Query Performance](#query-performance)
- [Advanced Features](#advanced-features)
  - [Indexing Prewarm](#indexing-prewarm)
  - [Indexing Progress](#indexing-progress)
  - [External Index Precomputation](#external-index-precomputation)
  <!-- TODO: Here we have a memory leak in rerank_in_table, hide it until the feature is ready
  - [Capacity-optimized Index](#capacity-optimized-index) -->
  - [Range Query](#range-query)

## Installation

### [Docker](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#docker)

You can easily get the Docker image from:

```bash
docker pull tensorchord/vchord-postgres:pg17-v0.2.1
```

### [APT](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#from-debian-package)

Debian and Ubuntu packages can be found on the [release page](https://github.com/tensorchord/VectorChord/releases).

To install one:

```bash
wget https://github.com/tensorchord/VectorChord/releases/download/${VERSION}/postgresql-${PG}-vchord_${VERSION}-1_amd64.deb
sudo apt install ./postgresql-${PG}-vchord_${VERSION}-1_amd64.deb
```

### More Methods

VectorChord also supports other installation methods, including:

- [From ZIP package](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#from-zip-package)

## Usage

VectorChord depends on pgvector, including for the vector representation.
This way, we keep maximum compatibility with `pgvector` for both:
- [vector storing](https://github.com/pgvector/pgvector#storing)
- [vector querying](https://github.com/pgvector/pgvector#querying)

Since you can use them directly, your application can be migrated without pain!

Before anything else, you need to run the following SQL to ensure the extension is enabled.

```SQL
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
```

This installs both `pgvector` and `VectorChord`; see [requirements](#requirements) for more detail.
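
To double-check that both extensions were created, you can query the system catalog. This is an optional check we add here, not part of the original README:

```SQL
-- Both 'vector' and 'vchord' should appear with their versions.
SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector', 'vchord');
```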

### Storing

Similar to pgvector, you can create a table with a vector column in VectorChord and insert some rows into it.

```sql
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) SELECT ARRAY[random(), random(), random()]::real[] FROM generate_series(1, 1000);
```
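
If you prefer explicit values, a single row can also be inserted with a plain vector literal (standard pgvector syntax):

```sql
-- Insert one vector directly as a literal.
INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]');
```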

### Indexing

Similar to [ivfflat](https://github.com/pgvector/pgvector#ivfflat), RaBitQ (vchordrq), the index type of VectorChord, also divides vectors into lists and then searches a subset of those lists closest to the query vector. It inherits the advantages of `ivfflat`, such as fast build times and low memory usage, but offers [much better performance](https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql#heading-ivf-vs-hnsw) than hnsw and ivfflat.

The RaBitQ (vchordrq) index is supported on some pgvector types and metrics:

|                           | vector | halfvec | bit(n) | sparsevec |
| ------------------------- | ------ | ------- | ------ | --------- |
| L2 distance / `<->`       | ✅     | ✅      | 🆖     | 🔜        |
| inner product / `<#>`     | ✅     | ✅      | 🆖     | 🔜        |
| cosine distance / `<=>`   | ✅     | ✅      | 🆖     | 🔜        |
| L1 distance / `<+>`       | 🔜     | 🔜      | 🆖     | 🔜        |
| Hamming distance / `<~>`  | 🆖     | 🆖      | 🔜     | 🆖        |
| Jaccard distance / `<%>`  | 🆖     | 🆖      | 🔜     | 🆖        |

Where:
- ✅ means supported by both pgvector and VectorChord
- ❌ means supported by pgvector but not by VectorChord
- 🆖 means not planned by either pgvector or VectorChord
- 🔜 means supported by pgvector now and to be supported by VectorChord soon

To create the VectorChord RaBitQ (vchordrq) index, you can use the following SQL.

L2 distance:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [1000]
spherical_centroids = false
$$);
```

> [!NOTE]
> - Set `residual_quantization` to true and `spherical_centroids` to false for L2 distance
> - Use `halfvec_l2_ops` for `halfvec`
> - The recommended `lists` value is rows / 1000 for up to 1M rows and 4 * sqrt(rows) for over 1M rows; see the sizing sketch below
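
To make the sizing rule concrete, here is a quick worked example (the row counts are hypothetical, not from the original README):

```sql
-- 500,000 rows (up to 1M):   lists ≈ 500000 / 1000 = 500
-- 25,000,000 rows (over 1M): lists ≈ 4 * sqrt(25000000) = 20000
SELECT (4 * sqrt(25000000))::int AS suggested_lists;
```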

Inner product:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_ip_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```

Cosine distance:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```

> [!NOTE]
> - Set `residual_quantization` to false and `spherical_centroids` to true for inner product / cosine distance
> - Use `halfvec_cosine_ops` / `halfvec_ip_ops` for `halfvec`; a sketch follows below
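
For `halfvec` columns, the same pattern applies with the corresponding operator class. A minimal sketch, assuming a half-precision table (`items_half` is a hypothetical name):

```sql
CREATE TABLE items_half (id bigserial PRIMARY KEY, embedding halfvec(3));
CREATE INDEX ON items_half USING vchordrq (embedding halfvec_cosine_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```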

### Query

@@ -96,20 +222,23 @@ Supported distance functions are:
- `<#>` - (negative) inner product
- `<=>` - cosine distance
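
For example, a top-5 nearest-neighbor query over the `items` table created above (standard pgvector query syntax):

```sql
-- Find the 5 rows whose embeddings are closest to the query vector by L2 distance.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```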

## Performance Tuning

### Index Build Time

Index building can be parallelized, and with external centroid precomputation, the total time is primarily limited by disk speed. Optimize parallelism using the following settings:

```SQL
-- Set these to the number of CPU cores available for parallel operations.
SET max_parallel_maintenance_workers = 8;
SET max_parallel_workers = 8;

-- Adjust the total number of worker processes.
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET max_worker_processes = 8;
```

## Query Performance

You can fine-tune the search performance by adjusting the `probes` and `epsilon` parameters:

```sql
@@ -118,8 +247,8 @@
SET vchordrq.probes = 100;

-- Set epsilon to control the reranking precision.
-- A larger value means more reranking, giving a higher recall rate but also higher latency.
-- If you need a less precise query, setting it to 1.0 might be appropriate.
-- Recommended range: 1.0–1.9. Default value is 1.9.
SET vchordrq.epsilon = 1.9;

@@ -146,25 +275,21 @@ SET jit = off;

ALTER SYSTEM SET shared_buffers = '8GB';
```

## Advanced Features

### Indexing Prewarm

For disk-first indexing, RaBitQ (vchordrq) is loaded from disk for the first query,
and then cached in memory if `shared_buffers` is sufficient.

In most cases, reading from disk is 10x slower than reading from memory,
resulting in a significant cold-start slowdown.

To improve performance for the first query, you can try the following SQL, which preloads the index into memory.

```SQL
-- vchordrq_prewarm(index_name::regclass) prewarms the index into the shared buffer
SELECT vchordrq_prewarm('gist_train_embedding_idx'::regclass);
```
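
To see how much of the index actually sits in the buffer cache afterwards, you can use the standard `pg_buffercache` extension. This is a quick check we add here; `pg_buffercache` ships with PostgreSQL's contrib modules and is not part of VectorChord:

```SQL
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count cached pages belonging to the index from the example above.
SELECT count(*) AS cached_pages
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'gist_train_embedding_idx';
```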

### Indexing Progress
@@ -204,6 +329,50 @@ $$);
To simplify the workflow, we provide end-to-end scripts for external index precomputation; see [scripts](./scripts/README.md#run-external-index-precomputation-toolkit).

<!-- TODO: Here we have a memory leak in rerank_in_table, hide it until the feature is ready

### Capacity-optimized Index

The default behavior of VectorChord is `performance-optimized`,
which uses more disk space but has better latency:
- About `80G` for `5M` 768-dim vectors
- About `800G` for `100M` 768-dim vectors

Although this is acceptable for such large data, you can switch to the `capacity-optimized` index and save about **50%** of your disk space.

For the `capacity-optimized` index, just enable the `rerank_in_table` option when creating the index:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
rerank_in_table = true
[build.internal]
...
$$);
```

> [!CAUTION]
> Compared to the `performance-optimized` index, the `capacity-optimized` index will have a **30-50%** increase in latency and QPS loss at query time.

-->

### Range Query

To query vectors within a certain distance range, you can use the following syntax.

```SQL
-- Query vectors within a certain distance range
SELECT vec FROM t WHERE vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012)
ORDER BY vec <-> '[0.24, 0.24, 0.24]' LIMIT 5;
```

In this expression, `vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012)` is equal to `vec <-> '[0.24, 0.24, 0.24]' < 0.012`. However, the latter will trigger an **exact nearest neighbor search**, as that predicate cannot be pushed down to the index.

Supported range functions are:
- `<<->>` - L2 distance
- `<<#>>` - (negative) inner product
- `<<=>>` - cosine distance
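
The same pattern works with the other range operators. For example, a cosine-distance range query might look like this (an illustrative sketch; the radius `0.1` is a hypothetical value, not from the original README):

```SQL
SELECT vec FROM t WHERE vec <<=>> sphere('[0.24, 0.24, 0.24]'::vector, 0.1)
ORDER BY vec <=> '[0.24, 0.24, 0.24]' LIMIT 5;
```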

## Development

### Build the Postgres Docker Image with VectorChord extension

Follow the steps in [Dev Guidance](./scripts/README.md#build-docker).

Diff for: scripts/README.md (+4 / -6)

@@ -87,7 +87,7 @@ conda install conda-forge::pgvector-python numpy pytorch::faiss-gpu conda-forge:
```shell
python script/train.py -i [dataset file(export.hdf5)] -o [centroid filename(centroid.npy)] --lists [lists] -m [metric(l2/cos/dot)] -g --mmap
```

`lists` is the number of centroids for clustering, and a typical value for large datasets (>5M) could range from:

$$
4*\sqrt{len(vectors)} \le lists \le 16*\sqrt{len(vectors)}
$$

@@ -96,13 +96,11 @@
3. To insert vectors and centroids into the database, and then create an index:

```shell
python script/index.py -n [table name] -i [dataset file(export.hdf5)] -c [centroid filename(centroid.npy)] -m [metric(l2/cos/dot)] -d [dim] --url postgresql://postgres:123@localhost:5432/postgres
```

4. Let's start our tour and check the benchmark result of VectorChord:

```shell
python script/bench.py -n [table name] -i [dataset file(export.hdf5)] -m [metric(l2/cos/dot)] --nprob 100 --epsilon 1.0 --url postgresql://postgres:123@localhost:5432/postgres
```
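
For reference, a concrete end-to-end run might look like this (hypothetical dataset, table name, lists count, and credentials; substitute your own):

```shell
# Cluster a 128-dim L2 dataset on GPU, then load and benchmark it.
python script/train.py -i export.hdf5 -o centroid.npy --lists 4000 -m l2 -g --mmap
python script/index.py -n sift -i export.hdf5 -c centroid.npy -m l2 -d 128 --url postgresql://postgres:123@localhost:5432/postgres
python script/bench.py -n sift -i export.hdf5 -m l2 --nprob 100 --epsilon 1.0 --url postgresql://postgres:123@localhost:5432/postgres
```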
