TypeScript framework for extending DuckDB with user-defined functions written in TypeScript, served to DuckDB over the Vector Gateway Interface (VGI) — an Apache Arrow IPC protocol.
You write scalar, table, aggregate, and table-in-out functions (plus catalogs) in
TypeScript; a worker exposes them; DuckDB attaches to the worker and calls them as
if they were native. It is wire-compatible with vgi-python,
so a worker written in either language is interchangeable.
LOAD vgi;
ATTACH 'demo' AS demo (TYPE vgi, LOCATION '/path/to/your-worker');
SELECT demo.upper_case(name) FROM users;npm install @query-farm/vgi
# or: bun add @query-farm/vgiRequires Node.js ≥ 22.15 or Bun.
Peer dependencies (since 0.3.0).
@query-farm/apache-arrow(^21.1.1) and@query-farm/vgi-rpc(^0.7.5) are peerDependencies — install them directly alongside@query-farm/vgi:npm install @query-farm/vgi @query-farm/apache-arrow @query-farm/vgi-rpcThe SDK bundles both as external, so a single shared instance avoids duplicate-type errors — notably the
vgi-rpcProtocolclash you hit when you also importcreateHttpHandler(or anyvgi-rpctype) directly and two copies get installed. SeeMIGRATION.md.
Using VGI from DuckDB also requires the vgi DuckDB extension. The easiest way to
get it is via Haybarn — Query Farm's distribution of
DuckDB — which publishes the extension (and a DuckDB-Wasm engine) as prebuilt npm
packages, so there's no C++ toolchain or manual INSTALL/compile step:
# The vgi extension as prebuilt binaries, from npm:
npm install @haybarn/ext-vgi-h1-5-3 # built for Haybarn / DuckDB 1.5.3
# Haybarn's DuckDB-Wasm engine, for running VGI in the browser / Cloudflare Workers:
npm install @haybarn/haybarn-wasmThe @haybarn/ext-* packages are meta-packages: npm pulls only the prebuilt binary
matching your platform (native leaves for linux/darwin/win, plus
wasm-mvp/wasm-eh/wasm-threads variants for DuckDB-Wasm). The Wasm engine is what
lets VGI run in the browser and on Cloudflare Workers — see
Runtimes & entry points.
Types are plain aliases (str, int, float, bool, bytes, …) and the input
arrives as a columnar Arrow batch:
// worker.ts
import { Worker, defineScalarFunction, str, int } from "@query-farm/vgi";
const upperCase = defineScalarFunction({
name: "upper_case", // SQL name → demo.upper_case(...)
description: "Convert a string to uppercase",
// Declares the argument types *and order*. DuckDB calls scalar functions
// positionally, so the keys are just parameter names (for docs/metadata) —
// inside compute you read the input columns by position, not by this name.
// Type aliases: str, int, int32, float, float32, bool, bytes — or any Arrow
// DataType (e.g. `new Decimal(18, 2)`).
params: { value: str },
// Static output type. For a type that depends on the input, omit `returns`
// and use `outputType: (bind) => <DataType>` (may be async).
returns: str,
// compute runs once per input batch. Read each argument as a column by
// position (column 0 = first arg); return one output value per row (null ok).
// Full signature: compute(batch, consts, info) where
// consts → values of any `constParams` (literals folded at bind time)
// info → { settings, secrets, auth } — session settings, secrets, caller
compute: (batch) => {
const values = batch.getChildAt(0)!; // first positional argument
return Array.from(values, (v) =>
v == null ? null : String(v).toUpperCase(),
);
},
// Other optional fields: constParams, nullHandling, stability (volatility),
// examples, categories, tags, requiredSettings, requiredSecrets, maxWorkers.
});
// A two-argument scalar — the args are positional, so column 0 is the first
// argument and column 1 is the second (the param keys are just names).
const multiply = defineScalarFunction({
name: "multiply",
description: "Multiply two numbers",
params: { a: int, b: int }, // two positional arguments
returns: int, // int = Int64 → values are bigint
compute: (batch) => {
const a = batch.getChildAt(0)!; // first argument
const b = batch.getChildAt(1)!; // second argument
return Array.from({ length: batch.numRows }, (_, i) => {
const x = a.get(i);
const y = b.get(i);
return x == null || y == null ? null : x * y; // NULL in → NULL out
});
},
});
// Functions are served through a catalog — that's what DuckDB ATTACHes to.
// `name` is the catalog DuckDB sees (matches the ATTACH target below).
new Worker({
catalog: { name: "demo", schemas: [{ name: "main", functions: [upperCase, multiply] }] },
}).run();A worker speaks Arrow IPC over stdin/stdout (or AF_UNIX / HTTP — see Transports). It is not interactive; DuckDB drives it.
Table functions are incremental producers: build a schema with toSchema, keep your
own state, and emit batches until you finish. DuckDB pulls lazily, so this
streams without materializing everything up front:
import { Worker, defineTableFunction, batchFromColumns, toSchema, int } from "@query-farm/vgi";
const schema = toSchema({ n: int }); // output columns → n BIGINT
const sequence = defineTableFunction({
name: "sequence",
description: "Emit integers 0..n-1, streamed in batches of 1000",
// Positional args by name. Also available: argDefaults, varargs, argDocs.
args: { n: int },
// onBind runs once at plan time → declare the output schema. May be async,
// and may also return opaqueData / secret + scope lookups for the call.
onBind: () => ({ outputSchema: schema }),
// Per-execution mutable state, built from the bound args.
initialState: ({ args }) => ({ i: 0, n: Number(args.n) }),
// process is the pull loop. DuckDB calls it repeatedly and consumes lazily:
// emit zero or more batches per call, then finish() to signal end-of-stream.
// You never have to materialize the whole result. May be async.
process: (_params, state, out) => {
if (state.i >= state.n) return out.finish(); // end-of-stream
const end = Math.min(state.i + 1000, state.n);
const ns: bigint[] = [];
for (let k = state.i; k < end; k++) ns.push(BigInt(k));
out.emit(batchFromColumns({ n: ns }, schema)); // hand a batch to DuckDB
state.i = end;
},
// Optional optimizer/runtime hooks — all omittable:
// cardinality: (bind) => ({ estimate, max }) row-count hints
// statistics: (bind) => ColumnStatistics[] min/max → filter folding
// projectionPushdown / filterPushdown / autoApplyFilters
// dynamicToString: (...) => Record<string,string> EXPLAIN ANALYZE counters
// onInit + shared storage for partitioned producers / work queues
// partitionKind, preservesOrder, lateMaterialization, samplingPushdown, …
});
// Serve every function from the same catalog.
new Worker({
catalog: { name: "demo", schemas: [{ name: "main", functions: [upperCase, multiply, sequence] }] },
}).run();The LOCATION is a shell command DuckDB runs to spawn the worker. For Bun:
LOAD vgi;
ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'bun run /abs/path/to/worker.ts');
SELECT demo.upper_case('hello'); -- HELLO
SELECT demo.multiply(6, 7); -- 42
SELECT * FROM demo.sequence(5); -- 0,1,2,3,4Five factories, each mapping to a DuckDB function shape with its own lifecycle:
defineScalarFunction— row-in / row-out (SELECT f(x)). DuckDB hands you a columnar batch; you return one output value per input row. For transforms, parsing, formatting, crypto, encoding, per-row lookups.defineTableFunction— a set-returning producer (SELECT * FROM f(...)). You keep your ownstateandemitbatches untilfinish, and DuckDB pulls lazily. The richest type: projection / filter / sampling pushdown, cardinality and per-column statistics for the optimizer, Hive-style partitioning, late materialization, and EXPLAIN ANALYZE counters.defineAggregate— aggregate / window function (f(x) … GROUP BY). ImplementinitialState/update/combine/finalize;combinelets DuckDB aggregate partitions in parallel and merge the results.defineTableInOutFunction— table-in / table-out (SELECT * FROM f(TABLE t, …)). Per-partitionstate, a streamingprocessper input batch, and afinalizethat sees every worker's state — for windowing, reshaping, enrichment, sessionization.defineTableBufferingFunction— table-in / table-out that must observe all input before producing output (global sort, top-N, pivot). Sink (process) →combine→ source (finalize), with explicit sink/source ordering controls.
| Capability | Scalar | Table | Aggregate | Table-in-out | Buffering |
|---|---|---|---|---|---|
| Columnar input rows | ✓ | — | ✓ | ✓ | ✓ |
Streaming output (emit / finish) |
— | ✓ | — | ✓ | ✓ |
| Per-group / per-partition state | — | — | ✓ | ✓ | ✓ |
Parallel merge (combine) |
— | — | ✓ | — | ✓ |
| Bind-time dynamic output type | ✓ | ✓ | ✓ | ✓ | ✓ |
| Named args · defaults · varargs | ✓¹ | ✓ | ✓ | ✓ | ✓ |
| Constant (bind-folded) params | ✓ | ✓² | ✓ | ✓² | ✓² |
| Cardinality hints | — | ✓ | — | — | ✓ |
| Column statistics (optimizer) | — | ✓ | — | — | — |
| Projection pushdown | — | ✓ | — | ✓ | ✓ |
| Filter pushdown (+ auto-apply) | — | ✓ | — | ✓ | ✓ |
| Sampling pushdown | — | ✓ | — | — | — |
| Late materialization | — | ✓ | — | — | — |
| Hive-style partitioning | — | ✓ | — | — | — |
| Batch-index threading | — | ✓ | — | — | ✓ |
| Order preservation / dependence | — | ✓ | — | — | ✓³ |
| EXPLAIN ANALYZE diagnostics | — | ✓ | — | — | — |
| Shared storage (work queues, cross-worker state) | — | ✓ | — | ✓ | ✓ |
| Settings & secrets access | ✓ | ✓ | — | ✓ | ✓ |
| Volatility / stability hint | ✓ | ✓ | — | ✓ | ✓ |
| NULL-handling control | ✓ | ✓ | ✓ | ✓ | ✓ |
| Async lifecycle | bind⁴ | ✓ | bind⁴ | ✓ | ✓ |
¹ via the ordered parameters form. ² supplied as bind-time args.
³ via sinkOrderDependent / sourceOrderDependent. ⁴ onBind / outputType
may be async; compute / update / finalize run synchronously.
Workers can also expose a catalog (schemas, tables, views, macros, secrets) via
ReadOnlyCatalogInterface / CompositeCatalogInterface, so an ATTACHed worker
presents browsable database objects, not just functions.
Every value DuckDB exchanges with a worker has an Arrow type, and each Arrow type maps to exactly one JS value shape. A single codec layer is the only authority for that mapping: it converts between the JS value you read/write and an internal canonical pivot (the raw Arrow wire unit), identically across both Arrow backends.
There are two author-facing representations, selected per scalar function with
repr:
rich(the default) — the ergonomic shape. Identical to the canonical wire unit for every type exceptdate32/date64, which surface as a JSDate. Sub-second temporal types (time,timestamp,duration) stay numeric/bigint, becauseDatecannot hold microsecond/nanosecond precision losslessly.raw(opt-in) — the canonical wire unit with a branded TypeScript type that carries the unit (e.g.TimestampMicros,Date32,UnscaledDecimal). At runtime a branded value is the underlyingnumber/bigint; the brand exists only at compile time so a wrong-unit mix-up is a type error.
| Arrow type | rich JS value |
raw branded type |
|---|---|---|
bool |
boolean |
boolean |
int8 / int16 / int32 |
number |
number |
uint8 / uint16 / uint32 |
number |
number |
int64 |
bigint |
Int64 |
uint64 |
bigint |
Uint64 (exported as Uint64Raw) |
float16 / float32 / float64 |
number |
number |
utf8 / largeUtf8 |
string |
string |
binary / largeBinary / fixedSizeBinary |
Uint8Array |
Uint8Array |
date32 |
Date |
Date32 (days since epoch, number) |
date64 |
Date |
Date64Ms (ms since epoch, bigint) |
time32[s] / time32[ms] |
number (raw unit) |
Time32S / Time32Ms |
time64[us] / time64[ns] |
bigint (raw unit) |
Time64Us / Time64Ns |
timestamp[s/ms/us/ns] |
bigint (raw unit) |
TimestampSeconds / TimestampMillis / TimestampMicros / TimestampNanos |
duration[s/ms/us/ns] |
bigint (raw unit) |
DurationSeconds / DurationMillis / DurationMicros / DurationNanos |
decimal128 / decimal256 |
bigint (UNSCALED integer) |
UnscaledDecimal |
struct |
{ field: richValue } |
{ field: rawValue } |
list / largeList / fixedSizeList |
Array<richValue | null> |
Array<rawValue | null> |
map |
Array<[richKey, richValue]> |
Array<[rawKey, rawValue]> |
dictionary |
the decoded value's rich |
the decoded value's raw |
date32 / date64 are the only types where rich differs from the canonical
wire unit. Everywhere else, rich is the canonical value and raw is the same
value with a branded type. null / undefined pass through as null in every type.
Notes:
-
Discriminate Arrow types by
typeId/ predicates, neverconstructor.name. The factories return Arrow type instances, not classes named after the factory.dateDay()returns an arrow-jsDate_instance withtypeId === Type.Dateandunit === DateUnit.DAY— there is no class namedDateDay, so a check liketype.constructor.name === "DateDay"will always fail (and is brittle across the two Arrow backends and minified builds). Use the exportedisDate(type)predicate, or comparetype.typeId(toType.Date/ the backend-agnosticTypeId.Date) andtype.unit(toDateUnit.DAY/DateUnit.MILLISECOND) to distinguish day vs. millisecond dates:import { dateDay, isDate, DateUnit, TypeId } from "@query-farm/vgi"; const t = dateDay(); isDate(t); // ✓ true — backend-agnostic predicate t.typeId === TypeId.Date; // ✓ true t.unit === DateUnit.DAY; // ✓ true — day-resolution date32 t.constructor.name === "DateDay" // ✗ NEVER — no such class; this is always false
-
Decimals are unscaled. A
decimal128(18, 2)value of123.45is the bigint12345n; apply the scale yourself (Number(v) / 100). The declared precision/scale travel with the column type, not the value. -
Temporal units are lossless
bigint. Atimestamp[us]round-trips as the exact microsecond count — noDatenarrowing, no precision loss.
Reads and writes are symmetric: a value read from a column rebuilds into the same
column. build(read(x)) round-trips, in either representation —
iterRows(batch) (and scalar inputs) return rich values, and a rich value fed
back through batchFromColumns/a scalar compute return rebuilds the original
column (pass "raw" / repr: 'raw' on both ends for the branded form).
The codec validates and throws on invalid or lossy input: a non-integer where an
integer is required, a bigint that overflows the declared width or the safe-integer
range when narrowing to number, an out-of-range Date, the wrong number of bytes
for a fixedSizeBinary, etc. You get a clear codec[<type>]: … TypeError at build
time rather than silently corrupt data on the wire.
Declare params and returns (or args) with the typed factories and compute is
statically typed end to end — the input columns and the return value are checked
against the declared Arrow types and the chosen representation:
import { Worker, defineScalarFunction, timestampMicros, int64 } from "@query-farm/vgi";
// rich (default): timestamp values are plain bigint microsecond counts.
const addHour = defineScalarFunction({
name: "add_hour",
params: { ts: timestampMicros }, // input column: bigint (us)
returns: timestampMicros, // output: bigint (us)
compute: (batch) => {
const ts = batch.getChildAt(0)!;
return Array.from(ts, (v: bigint | null) =>
v == null ? null : v + 3_600_000_000n, // +1h in microseconds
); // returning a Date here is a COMPILE error
},
});import {
Worker, defineScalarFunction, timestampMicros,
asTimestampMicros, type TimestampMicros,
} from "@query-farm/vgi";
// raw mode: outputs are branded units, constructed with `asTimestampMicros`.
const epoch = defineScalarFunction({
name: "epoch_us",
params: { ts: timestampMicros },
returns: timestampMicros,
repr: "raw", // opt in to branded raw units
compute: (batch) => {
const ts = batch.getChildAt(0)!;
return Array.from(ts, (v: TimestampMicros | null) =>
v == null ? null : asTimestampMicros(v + 1n), // branded in, branded out
);
},
});For manual conversions outside a function, codecFor(type) returns the codec with
richToCanonical / canonicalToRich / rawToCanonical / canonicalToRaw.
The typed Arrow type factories int, int32, float32, and bool are not
re-exported from the package root, because @query-farm/vgi already re-exports
vgi-rpc argument builders of the same names. Import those four from the arrow facade
(they ship as the typed factories there); the rest of the typed factory set
(int8/int16/int64, uint*, float16/float64, decimal*, dateDay,
timestampMicros, struct, list, map, …) is exported from the package root as
usual.
This is a pre-1.0 breaking change to how columnar values are represented in and out of functions. The contract is now uniform across both Arrow backends and both directions (read and write). For consumers upgrading:
date32/date64columns are now JSDatein and out by default. Previously dates were inconsistent — a day-number went in but aDatecame back out. Both directions are nowDateunder the defaultrichrepresentation.- Reads return rich values.
iterRows, scalar inputs, and setting/secret reads all surface therichvalue for their type. - Non-date temporal types are lossless
bigintraw units.time64,timestamp, anddurationare the exactbigintcount in their declared unit (us, ns, …) — noDate, no precision loss. - Decimals are unscaled
bigint. Adecimal(18,2)of123.45is12345n. - Opt into
repr: 'raw'for branded, unit-tagged raw units everywhere (includingdate32/date64as plain day-number / ms-bigintrather thanDate).
Before / after for the common date case:
// BEFORE (old, inconsistent): wrote a day-number, read back a Date.
returns: dateDay,
compute: () => [20000], // 20000 days since epoch
// AFTER (rich, default): write a Date, read a Date — symmetric.
returns: dateDay,
compute: () => [new Date("2024-10-19")],
// AFTER (raw): opt in to the branded day-number.
returns: dateDay,
repr: "raw",
compute: () => [asDate32(20000)], // branded number, not a DateA worker serves the same functions over any of:
- stdin/stdout — the default; how DuckDB subprocess-spawns a worker.
- AF_UNIX — a long-lived warm worker (
--unix <path>), reused across calls. - HTTP — stateless; all state round-trips in a self-contained token, so requests can be load-balanced across hosts.
Transport handling lives in @query-farm/vgi-rpc.
The package ships a backend-agnostic Arrow facade and selects an implementation at build time per runtime, so the same source runs on servers, Cloudflare Workers, and browsers:
| Runtime | Arrow backend |
|---|---|
| Node.js / Bun | @query-farm/apache-arrow (arrow-js) |
Cloudflare Workers (workerd) / browser |
@query-farm/flechette |
Subpath exports:
@query-farm/vgi— main API (define functions,Worker,VgiClient, catalogs).@query-farm/vgi/client— client-only entry (no server-side code).@query-farm/vgi/worker-cf— Cloudflare Workers entry (workerd/browser bundle).
- Haybarn — Query Farm's distribution of DuckDB,
with the
vgiextension (and others) as prebuilt npm packages (@haybarn/haybarn-wasm,@haybarn/ext-vgi-h1-5-3). The host side: it runs the DuckDB engine that attaches to your worker. @query-farm/vgi-rpc— the RPC layer: protocol, server, transports, Arrow IPC framing. A runtime dependency of this package.vgi-python— the reference implementation. This package is a wire-compatible TypeScript port; workers from either side interoperate.
make install # bun install
make build # types + JS bundles
make test # integration tests (launcher transport)
make -j8 test-all # launcher + HTTP suites in parallelSee CLAUDE.md for the full build/test reference.
Copyright © 2025, 2026 Query Farm LLC — https://query.farm
Distributed under the Query Farm Source-Available License, Version 1.0. Use,
modification, redistribution, and non-production use are permitted; some production
uses (notably competing offerings and commercial marketplaces) require a separate
commercial license. Each version converts to Apache-2.0 ten years after its release.
See LICENSE for the full terms, or contact hello@query.farm for
commercial licensing.