Skip to content

ATTACH (TYPE LANCE) on a dataset with lance.json (pa.json_()) columns fails with "DuckDB does not support Strings over 4GB" on any query that projects the JSON column #201

@erdinc-aura

Description

@erdinc-aura

Environment

  • duckdb CLI: v1.4.4
  • DuckDB lance extension: version 7d49b91 (core, not community)
  • pylance: 4.0.1
  • lancedb: 0.30.2
  • Lance file format: 2.2 (data_storage_version="2.2" / storage_options={"new_table_data_storage_version": "2.2"})
  • OS: macOS (Darwin 24.6.0, arm64)
  • Python: 3.14

Symptom

ATTACH succeeds. SHOW ALL TABLES works. Selecting non-JSON columns works. Any query that projects the JSON column — or even operates on it (octet_length, CAST, hex, substring) — fails:

Conversion Error: DuckDB does not support Strings over 4GB

The column is exposed as BLOB in DuckDB but the values are not actually oversized — the test rows are tens of bytes each. The error appears to be a misclassification at the Arrow→DuckDB bridge: the lance.json encoding (large_binary with ARROW:extension:name="lance.json") is being handed to DuckDB's string path without the length/bounds normalization that
regular LargeBinary → BLOB goes through.

Minimal reproducer

  import lancedb, pyarrow as pa, json, subprocess, tempfile

  with tempfile.TemporaryDirectory() as d:
      db = lancedb.connect(
          d,
          storage_options={"new_table_data_storage_version": "2.2"},
      )
      schema = pa.schema([
          pa.field("id", pa.int64()),
          pa.field("attrs", pa.json_()),
      ])
      rows = [
          {"id": i, "attrs": json.dumps({"amount": i * 10, "kind": "x" if i % 2 else "y"})}
          for i in range(1, 6)
      ]
      db.create_table("t", data=rows, schema=schema)

      # Verify the on-disk schema is the JSONB ext type
      # (large_binary + ARROW:extension:name == b'lance.json')
      tbl = db.open_table("t")
      print(tbl.schema.field("attrs").metadata)
      # => {b'ARROW:extension:name': b'lance.json', b'ARROW:extension:metadata': b''}

      # SHOW ALL TABLES works — exposed as BLOB
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SHOW ALL TABLES;"],
          capture_output=True, text=True).stdout)
      # => id BIGINT, attrs BLOB  ✅

      # SELECT on non-JSON column works
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id FROM ldb.t;"],
          capture_output=True, text=True).stdout)
      # => 1..5  ✅

      # SELECT including attrs fails
      r = subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t;"],
          capture_output=True, text=True)
      print("stderr:", r.stderr)
      # => Conversion Error: DuckDB does not support Strings over 4GB  ❌

      # All of these fail with the same error:
      for expr in [
          "attrs",
          "octet_length(attrs)",
          "CAST(attrs AS VARCHAR)",
          "hex(attrs)",
          "substring(CAST(attrs AS VARCHAR), 1, 10)",
      ]:
          r = subprocess.run(
              ["duckdb", "-c",
               f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, {expr} FROM ldb.t;"],
              capture_output=True, text=True)
          print(expr, "->", r.stderr.strip().splitlines()[0])

Expected output of that last loop:
attrs -> Conversion Error: DuckDB does not support Strings over 4GB
octet_length(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
CAST(attrs AS VARCHAR) -> Conversion Error: DuckDB does not support Strings over 4GB
hex(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
substring(CAST(attrs AS VARCHAR), 1, 10) -> Conversion Error: DuckDB does not support Strings over 4GB

Data that actually lives on disk

Payloads are tiny — the largest attrs value is {"amount":50,"kind":"y"} (~24 bytes). So the "over 4GB" text is misleading; it's a type-path error, not a size problem.

Reading the dataset directly via pylance returns the values as JSON strings without issue:

ds = tbl.to_lance()
ds.to_table().column("attrs").to_pylist()

=> ['{"amount":10,"kind":"x"}', '{"amount":20,"kind":"y"}', ...]

And DataFusion SQL against the same dataset works perfectly:

ds.sql("SELECT json_extract(attrs, '$.amount') AS a FROM dataset")
.build().to_stream_reader().read_all().to_pydict()

=> {'a': ['10', '20', '30', '40', '50']}

So the issue is specifically the DuckDB extension's handling of the lance.json / arrow.json extension-typed column.

Expected behavior

One of:

  1. Expose the column as VARCHAR (or JSON) — lance.json values are already UTF-8 JSON text; the underlying storage type is large_binary but the bytes are valid UTF-8.
  2. Expose it as BLOB but handle it correctly in projections and scalar functions that accept BLOB / VARCHAR casts.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions