
Commit 9737f50

update readme and scripts with url

Signed-off-by: cutecutecat <[email protected]>
1 parent d16d469

File tree: 4 files changed, +284 / -99 lines

Diff for: README.md (+218 / -49)
@@ -7,32 +7,58 @@
<a href="https://discord.gg/KqswhpVgdU"><img alt="discord invitation link" src="https://img.shields.io/discord/974584200327991326?style=flat&logo=discord&cacheSeconds=60"></a>
<a href="https://twitter.com/TensorChord"><img src="https://img.shields.io/twitter/follow/tensorchord?style=flat&logo=X&cacheSeconds=60" alt="Twitter" /></a>
<a href="https://hub.docker.com/r/tensorchord/vchord-postgres"><img src="https://img.shields.io/docker/pulls/tensorchord/vchord-postgres" alt="Docker pulls" /></a>
</p>

> [!NOTE]
> VectorChord serves as the successor to [pgvecto.rs](https://github.com/tensorchord/pgvecto.rs) <a href="https://hub.docker.com/r/tensorchord/pgvecto-rs"><img src="https://img.shields.io/docker/pulls/tensorchord/pgvecto-rs" alt="Previous Docker pulls" /></a>, with better stability and performance. If you are interested in this new solution, you may find the [migration guide](https://docs.vectorchord.ai/vectorchord/admin/migration.html) helpful.

VectorChord (vchord) is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search.

With VectorChord, you can store 400,000 vectors for just $1, enabling significant savings: 6x more vectors compared to Pinecone's optimized storage and 26x more than pgvector/pgvecto.rs for the same price[^1].
## Features

VectorChord introduces remarkable enhancements over pgvecto.rs and pgvector:

**⚡ Enhanced Performance**: Delivering optimized operations with up to 5x faster queries, 16x higher insert throughput, and 16x quicker[^1] index building compared to pgvector's HNSW implementation.

[^1]: Based on [MyScale Benchmark](https://myscale.github.io/benchmark/#/) with 768-dimensional vectors and 95% recall. Please check out our [blog post](https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql) for more details.
**💰 Affordable Vector Search**: Query 100M 768-dimensional vectors using just 32GB of memory, achieving 35ms P50 latency with top10 recall@95%, helping you keep infrastructure costs down while maintaining high search quality.

**🔌 Seamless Integration**: Fully compatible with pgvector data types and syntax while providing optimal defaults out of the box - no manual parameter tuning needed. Just drop in VectorChord for enhanced performance.
**🔧 External Index Build**: Leverage IVF to build indexes externally (e.g., on GPU) for faster KMeans clustering, combined with RaBitQ[^3] compression to efficiently store vectors while maintaining search quality through autonomous reranking.

[^3]: Gao, Jianyang, and Cheng Long. "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search." Proceedings of the ACM on Management of Data 2.3 (2024): 1-27.

**📏 Long Vector Support**: Store and search vectors up to 65,535[^4] dimensions, enabling the use of the best high-dimensional models like text-embedding-3-large with ease.

[^4]: pgvector currently has a [limitation](https://github.com/pgvector/pgvector#vector-type) of 16,000 dimensions. If you really need a larger dimension (`16,000 < dim < 65,535`), consider changing [VECTOR_MAX_DIM](https://github.com/pgvector/pgvector/blob/fef635c9e5512597621e5669dce845c744170822/src/vector.h#L4) and compiling pgvector yourself.

**🌐 Scale As You Want**: Built on horizontal scaling, queries over 5M / 100M 768-dimensional vectors can easily reach 10,000+ QPS with top10 recall@90% at a competitive cost[^5].

[^5]: Please check our [blog post](https://blog.vectorchord.ai/vector-search-at-10000-qps-in-postgresql-with-vectorchord) for more details; the PostgreSQL scalability is powered by [CloudNative-PG](https://github.com/cloudnative-pg/cloudnative-pg).

## Requirements

> [!TIP]
> If you are using the official [Docker image](https://hub.docker.com/r/tensorchord/vchord-postgres), you can skip this step.

VectorChord depends on [pgvector](https://github.com/pgvector/pgvector); ensure the pgvector extension is available:

```SQL
SELECT * FROM pg_available_extensions WHERE name = 'vector';
```

If pgvector is not available, install it using the [pgvector installation instructions](https://github.com/pgvector/pgvector#installation).

Also make sure to add `vchord.so` to `shared_preload_libraries` in `postgresql.conf`:

```SQL
-- Add vchord to shared_preload_libraries --
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET shared_preload_libraries = 'vchord.so';
```
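
After restarting, you can confirm that the library is loaded. This is a quick sanity check we add here, not a required step from the original guide:

```SQL
-- The output is expected to include 'vchord.so'.
SHOW shared_preload_libraries;
```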

## Quick Start

For new users, we recommend using the Docker image to get started quickly.
@@ -41,49 +67,149 @@
```bash
docker run \
  --name vectorchord-demo \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d tensorchord/vchord-postgres:pg17-v0.2.1
```

Then you can connect to the database using the `psql` command line tool. The default username is `postgres`, and the default password is `mysecretpassword`.

```bash
psql -h localhost -p 5432 -U postgres
```

Now you can play with VectorChord!

## Documentation

- [Installation](#installation)
  - [Docker](#docker)
  - [APT](#apt)
  - [More Methods](#more-methods)
- [Usage](#usage)
  - [Storing](#storing)
  - [Indexing](#indexing)
  - [Query](#query)
- [Performance Tuning](#performance-tuning)
  - [Index Build Time](#index-build-time)
  - [Query Performance](#query-performance)
- [Advanced Features](#advanced-features)
  - [Indexing Prewarm](#indexing-prewarm)
  - [Indexing Progress](#indexing-progress)
  - [External Index Precomputation](#external-index-precomputation)
  <!-- TODO: Here we have a memory leak in rerank_in_table, hide it until the feature is ready
  - [Capacity-optimized Index](#capacity-optimized-index) -->
  - [Range Query](#range-query)

## Installation

### [Docker](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#docker)

You can easily get the Docker image from:

```bash
docker pull tensorchord/vchord-postgres:pg17-v0.2.1
```

### [APT](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#from-debian-package)

Debian and Ubuntu packages can be found on the [release page](https://github.com/tensorchord/VectorChord/releases).

To install one:

```bash
wget https://github.com/tensorchord/VectorChord/releases/download/${VERSION}/postgresql-${PG}-vchord_${VERSION}-1_amd64.deb
sudo apt install ./postgresql-${PG}-vchord_${VERSION}-1_amd64.deb
```

### More Methods

VectorChord also supports other installation methods, including:

- [From ZIP package](https://docs.vectorchord.ai/vectorchord/getting-started/installation.html#from-zip-package)

## Usage

VectorChord depends on pgvector, including for the vector representation.
This way, we keep maximum compatibility with `pgvector` for both:
- [vector storing](https://github.com/pgvector/pgvector#storing)
- [vector querying](https://github.com/pgvector/pgvector#querying)

Since you can use them directly, your application can be migrated without pain!

Before anything else, you need to run the following SQL to ensure the extension is enabled.

```SQL
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
```

This installs both `pgvector` and `VectorChord`; see [requirements](#requirements) for more detail.
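
To double-check that both extensions were created, you can query the system catalog. This is an optional check we add here, not part of the original README:

```SQL
-- Both 'vector' and 'vchord' should appear with their versions.
SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector', 'vchord');
```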

### Storing

Similar to pgvector, you can create a table with a vector column in VectorChord and insert some rows into it.

```sql
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) SELECT ARRAY[random(), random(), random()]::real[] FROM generate_series(1, 1000);
```
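
If you prefer explicit values, a single row can also be inserted with a plain vector literal (standard pgvector syntax):

```sql
-- Insert one vector directly as a literal.
INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]');
```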

### Indexing

Similar to [ivfflat](https://github.com/pgvector/pgvector#ivfflat), RaBitQ (vchordrq), the index type of VectorChord, also divides vectors into lists and then searches a subset of those lists closest to the query vector. It inherits the advantages of `ivfflat`, such as fast build times and low memory usage, but offers [much better performance](https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql#heading-ivf-vs-hnsw) than hnsw and ivfflat.

The RaBitQ (vchordrq) index is supported on some pgvector types and metrics:

|                           | vector | halfvec | bit(n) | sparsevec |
| ------------------------- | ------ | ------- | ------ | --------- |
| L2 distance / `<->`       | ✅     | ✅      | 🆖     | 🔜        |
| inner product / `<#>`     | ✅     | ✅      | 🆖     | 🔜        |
| cosine distance / `<=>`   | ✅     | ✅      | 🆖     | 🔜        |
| L1 distance / `<+>`       | 🔜     | 🔜      | 🆖     | 🔜        |
| Hamming distance / `<~>`  | 🆖     | 🆖      | 🔜     | 🆖        |
| Jaccard distance / `<%>`  | 🆖     | 🆖      | 🔜     | 🆖        |

Where:
- ✅ means supported by both pgvector and VectorChord
- ❌ means supported by pgvector but not by VectorChord
- 🆖 means not planned by either pgvector or VectorChord
- 🔜 means supported by pgvector now and to be supported by VectorChord soon

To create the VectorChord RaBitQ (vchordrq) index, you can use the following SQL.

L2 distance:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [1000]
spherical_centroids = false
$$);
```

> [!NOTE]
> - Set `residual_quantization` to true and `spherical_centroids` to false for L2 distance
> - Use `halfvec_l2_ops` for `halfvec`
> - The recommended `lists` value is rows / 1000 for up to 1M rows and 4 * sqrt(rows) for over 1M rows; see the sizing sketch below
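
To make the sizing rule concrete, here is a quick worked example (the row counts are hypothetical, not from the original README):

```sql
-- 500,000 rows (up to 1M):   lists ≈ 500000 / 1000 = 500
-- 25,000,000 rows (over 1M): lists ≈ 4 * sqrt(25000000) = 20000
SELECT (4 * sqrt(25000000))::int AS suggested_lists;
```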

Inner product:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_ip_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```

Cosine distance:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```

> [!NOTE]
> - Set `residual_quantization` to false and `spherical_centroids` to true for inner product / cosine distance
> - Use `halfvec_cosine_ops` / `halfvec_ip_ops` for `halfvec`; a sketch follows below
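
For `halfvec` columns, the same pattern applies with the corresponding operator class. A minimal sketch, assuming a half-precision table (`items_half` is a hypothetical name):

```sql
CREATE TABLE items_half (id bigserial PRIMARY KEY, embedding halfvec(3));
CREATE INDEX ON items_half USING vchordrq (embedding halfvec_cosine_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);
```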

### Query

@@ -96,20 +222,23 @@ Supported distance functions are:
- `<#>` - (negative) inner product
- `<=>` - cosine distance
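
For example, a top-5 nearest-neighbor query over the `items` table created above (standard pgvector query syntax):

```sql
-- Find the 5 rows whose embeddings are closest to the query vector by L2 distance.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```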

## Performance Tuning

### Index Build Time

Index building can be parallelized, and with external centroid precomputation, the total time is primarily limited by disk speed. Optimize parallelism using the following settings:

```SQL
-- Set these to the number of CPU cores available for parallel operations.
SET max_parallel_maintenance_workers = 8;
SET max_parallel_workers = 8;

-- Adjust the total number of worker processes.
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET max_worker_processes = 8;
```

## Query Performance

You can fine-tune the search performance by adjusting the `probes` and `epsilon` parameters:

```sql
@@ -118,8 +247,8 @@
SET vchordrq.probes = 100;

-- Set epsilon to control the reranking precision.
-- A larger value means more reranking, giving a higher recall rate but also higher latency.
-- If you need a less precise query, setting it to 1.0 might be appropriate.
-- Recommended range: 1.0–1.9. Default value is 1.9.
SET vchordrq.epsilon = 1.9;

@@ -146,25 +275,21 @@ SET jit = off;

ALTER SYSTEM SET shared_buffers = '8GB';
```

## Advanced Features

### Indexing Prewarm

For disk-first indexing, RaBitQ (vchordrq) is loaded from disk for the first query,
and then cached in memory if `shared_buffers` is sufficient.

In most cases, reading from disk is 10x slower than reading from memory,
resulting in a significant cold-start slowdown.

To improve performance for the first query, you can try the following SQL, which preloads the index into memory.

```SQL
-- vchordrq_prewarm(index_name::regclass) prewarms the index into the shared buffer
SELECT vchordrq_prewarm('gist_train_embedding_idx'::regclass);
```
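
To see how much of the index actually sits in the buffer cache afterwards, you can use the standard `pg_buffercache` extension. This is a quick check we add here; `pg_buffercache` ships with PostgreSQL's contrib modules and is not part of VectorChord:

```SQL
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count cached pages belonging to the index from the example above.
SELECT count(*) AS cached_pages
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'gist_train_embedding_idx';
```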

### Indexing Progress
@@ -204,6 +329,50 @@ $$);
To simplify the workflow, we provide end-to-end scripts for external index precomputation; see [scripts](./scripts/README.md#run-external-index-precomputation-toolkit).

<!-- TODO: Here we have a memory leak in rerank_in_table, hide it until the feature is ready

### Capacity-optimized Index

The default behavior of VectorChord is `performance-optimized`,
which uses more disk space but has better latency:
- About `80G` for `5M` 768-dim vectors
- About `800G` for `100M` 768-dim vectors

Although this is acceptable for such large data, you can switch to the `capacity-optimized` index and save about **50%** of your disk space.

For the `capacity-optimized` index, just enable the `rerank_in_table` option when creating the index:
```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
rerank_in_table = true
[build.internal]
...
$$);
```

> [!CAUTION]
> Compared to the `performance-optimized` index, the `capacity-optimized` index will have a **30-50%** increase in latency and QPS loss at query time.

-->

### Range Query

To query vectors within a certain distance range, you can use the following syntax.

```SQL
-- Query vectors within a certain distance range
SELECT vec FROM t WHERE vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012)
ORDER BY vec <-> '[0.24, 0.24, 0.24]' LIMIT 5;
```

In this expression, `vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012)` is equal to `vec <-> '[0.24, 0.24, 0.24]' < 0.012`. However, the latter will trigger an **exact nearest neighbor search**, as that predicate cannot be pushed down to the index.

Supported range functions are:
- `<<->>` - L2 distance
- `<<#>>` - (negative) inner product
- `<<=>>` - cosine distance
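
The same pattern works with the other range operators. For example, a cosine-distance range query might look like this (an illustrative sketch; the radius `0.1` is a hypothetical value, not from the original README):

```SQL
SELECT vec FROM t WHERE vec <<=>> sphere('[0.24, 0.24, 0.24]'::vector, 0.1)
ORDER BY vec <=> '[0.24, 0.24, 0.24]' LIMIT 5;
```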

## Development

### Build the Postgres Docker Image with VectorChord extension

Follow the steps in [Dev Guidance](./scripts/README.md#build-docker).

Diff for: scripts/README.md (+4 / -6)

@@ -87,7 +87,7 @@ conda install conda-forge::pgvector-python numpy pytorch::faiss-gpu conda-forge:
```shell
python script/train.py -i [dataset file(export.hdf5)] -o [centroid filename(centroid.npy)] --lists [lists] -m [metric(l2/cos/dot)] -g --mmap
```

`lists` is the number of centroids for clustering, and a typical value for large datasets (>5M) could range from:

$$
4*\sqrt{len(vectors)} \le lists \le 16*\sqrt{len(vectors)}
$$

@@ -96,13 +96,11 @@
3. To insert vectors and centroids into the database, and then create an index:

```shell
python script/index.py -n [table name] -i [dataset file(export.hdf5)] -c [centroid filename(centroid.npy)] -m [metric(l2/cos/dot)] -d [dim] --url postgresql://postgres:123@localhost:5432/postgres
```

4. Let's start our tour and check the benchmark result of VectorChord:

```shell
python script/bench.py -n [table name] -i [dataset file(export.hdf5)] -m [metric(l2/cos/dot)] --nprob 100 --epsilon 1.0 --url postgresql://postgres:123@localhost:5432/postgres
```
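
For reference, a concrete end-to-end run might look like this (hypothetical dataset, table name, lists count, and credentials; substitute your own):

```shell
# Cluster a 128-dim L2 dataset on GPU, then load and benchmark it.
python script/train.py -i export.hdf5 -o centroid.npy --lists 4000 -m l2 -g --mmap
python script/index.py -n sift -i export.hdf5 -c centroid.npy -m l2 -d 128 --url postgresql://postgres:123@localhost:5432/postgres
python script/bench.py -n sift -i export.hdf5 -m l2 --nprob 100 --epsilon 1.0 --url postgresql://postgres:123@localhost:5432/postgres
```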
