diff --git a/sql-research/2025-12-16-metadata.md b/sql-research/2025-12-16-metadata.md
new file mode 100644
index 0000000..8b55b68
--- /dev/null
+++ b/sql-research/2025-12-16-metadata.md
@@ -0,0 +1,649 @@
+# Metadata
+
+## 1. Purpose and Scope of this Document
+
+This document represents a first iteration intended to initiate discussion on metadata
+for **generic data modeling** and **interoperability**.
+It does not constitute a complete specification;
+instead, it identifies key topics and proposes possible approaches.
+In particular, the document discusses:
+
+- [Overall Goals](#2-goals)
+
+- [Assumptions and Overall Structure](#3-elements-of-metadata)
+
+- [Schema definition](#34-schemas)
+
+- [Type system](#35-types)
+
+- [Keys and Indices](#36-keys-and-indices)
+
+- [Archive and Real-time data](#37-archive-and-real-time-data)
+
+- [Statistics](#38-statistics)
+
+- [Format](#4-metadata-format)
+
+Vector databases are, for the moment, out of scope.
+
+## 2. Goals
+
+The overall goals of defining metadata are to provide
+
+- interoperability for **generic data** between different components
+(portals, workers, SDKs, SQL tool integrations - such as the DuckDB extension)
+
+- a **modeling facility** to define generic data in the first place.
+
+We are currently working with known and - to some extent - homogeneous data.
+Generic interoperability is significantly harder.
+We will need
+
+- Unambiguous data definition
+
+- Expressive power for data modeling.
+
+### 2.1 Long Term
+
+In the long run, metadata may enable
+
+- Validation on ingestion and data loading
+
+- Parsing of data "for free"
+
+- Automated ingestion
+
+- Automated data transformation and Conversion
+
+- Automated data repair
+
+- Automated Export
+
+- Data Generation
+
+- Integration with standard tools
+
+- Industry-standard modeling facilities.
+
+### 2.2 Short Term
+
+In the short run, particularly to implement the
+[Proof-of-Product](https://github.com/subsquid/specs/blob/main/sql-research/2025-11-27-agents-pop-1.md)
+the goals are less ambitious:
+
+- Provide data modeling by means of hand-written metadata
+
+- Interoperability rooted in such hand-written documents
+
+- Hand-crafted ingestion pipelines
+
+- Validation and Parsing of data for different components
+(portals, workers, SDKs, the DuckDB extension).
+
+For the first iteration we are currently working on, this means:
+
+OUT OF SCOPE:
+- IPLD
+- schema-on-write
+- real-time data
+- generic keys
+
+IN SCOPE:
+- Basic set of statistics (can be computed once, since we have no real-time data)
+- Block numbers as keys (done)
+- Timestamps as keys (on the way)
+- Complex data types
+- Consistent error handling
+- Better integration in Portal (use Portal datasets, keep datasets + schema in memory)
+
+## 3. Elements of Metadata
+
+### 3.1 Assumptions
+
+1. Schemas are (largely) immutable.
+I say "largely" because the possibility for corrections and adaptations cannot be foreclosed.
+The question is then whether such occasional changes should be automated or can be treated as mainenance.
+
+1. Data are inserted and occasionally deleted, but never updated.
+
+1. Data are stored in chunks.
+
+1. Chunks consist of parquet files with related data; each parquet corresponds to one table (e.g. `blocks`, `transactions`, etc.).
+
+1. New data (a.k.a. "hotblocks") are handled separately until a new chunk is ready and assigned to workers.
+
+1. Data is always redundant (on s3, on more than one worker, etc.);
+ no further backups are needed.
+
+### 3.2 Elements and their Lifecycles
+
+The structure of metadata as a whole may be:
+
+![Metadata structure of a dataset](attachments/metadata-structure.png)
+
+A dataset node contains all the metadata describing a single dataset (sketched below). It concerns:
+
+- ownership and access
+- data location (e.g. s3)
+- the schema
+- the "archive", i.e. the chunks containing the data (including statistics)
+- the "real-time" data (a.k.a. "hotblocks"), i.e. data before they are appended to a chunk.
+
+#### Access and Owner
+
+This object contains the owner of this dataset (our datasets today would all have the same owner "Squid").
+It also contains access information (public, private, etc.) and elements of access control.
+It may contain additional routing information, indicating portals that should serve the dataset in question.
+Datasets should be submitted only to the portals for which they are intended.
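+
+For illustration, a hypothetical access node (all field names and values below are made up) might carry:
+
+```json
+{
+  "owner": "Squid",
+  "access": "public",
+  "allowed_clients": [],
+  "portals": ["portal-eu-1", "portal-us-1"]
+}
+```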
+
+This node may change within the life of one dataset:
+- when the owner changes (rare)
+- when access information changes (more frequent)
+- when routing through portals changes (frequent)
+
+#### Location
+
+Locations from where data chunks can be downloaded (e.g.: an s3 bucket, a github repository, an ftp server).
+
+This node may change within the life of one dataset:
+- when new locations are added
+- when locations are removed or updated
+
+#### Schema
+
+See [below](#34-schemas).
+
+This node may change within the life of one dataset with changes to the schema (shall be rare!).
+
+#### Archive
+
+Data concerning chunks:
+- Statistics at dataset, table, row, and column level.
+
+This node may change within the life of one dataset:
+- Chunks are added (frequent)
+- Statistics are recomputed (frequent)
+- Chunks are reassigned (frequent)
+
+#### Real-Time Data
+
+Data enter the database in "real-time" mode. They are at this point not part of any chunk.
+They exist in a database that can be accessed by a portal and served directly from there.
+
+The real-time node contains information about data sources and ingestion procedures.
+The mechanism should generalise today's "hotblocks". For instance (sketched below):
+- A publisher address to which the portal subscribes;
+- A stored procedure that is invoked with the data in question;
+  this procedure may transform the data and should store them in the "hotblock" database;
+- A connect string for the local database.
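+
+A minimal sketch of such a real-time node, with an invented publisher URL, procedure name, and connect string:
+
+```json
+{
+  "publisher": "wss://provider.example.com/blocks",
+  "procedure": "ingest_hotblocks",
+  "database": "host=localhost port=5432 dbname=hotblocks"
+}
+```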
+
+This node may change within the life of one dataset:
+- when the publisher or the ingestion process change (occasional).
+
+### 3.3 Self-Certification
+
+The whole metadata tree shall be self-certifiable in the sense that
+
+- Its identifier cryptographically commits to its contents
+
+- Any tampering is detectable without a trusted authority
+
+- References between document parts are verifiable.
+
+For details, see [IPLD](#43-ipld) below.
+
+### 3.4 Schemas
+
+A schema is a set of **integrity constraints** imposed on a set of data
+that together forms a queryable unit of data.
+In our terminology, such units of data are called **datasets**.
+
+Integrity Constraints are formulas (recipes) that describe **structured data**.
+They describe how data can be read from and written to **data stores** and
+how they can be queried efficiently.
+
+A query retrieves data, either in isolation or - more typically and more relevantly -
+in combination according to its integrity constraints. How data can be meaningfully combined
+is defined by **relations**.
+
+Some relations are predefined in the schema, others are produced by queries.
+Predefined relations are called **tables**.
+Tables are the building blocks of data modeling.
+They carry additional metadata, in particular **statistics**.
+
+Relations are organised into **rows and columns**.
+At present we don't exploit column-level metadata for retrieval or storage optimisation -
+we rely on storage engines to provide these capabilities.
+In the future, we may add statistics to columns or row groups
+to accelerate ingestion and, in particular, retrieval.
+
+Integrity Constraints are
+
+- Types, which restrict the domain of data assigned to a column;
+
+- Additional Column constraints (e.g. nullability);
+
+- Keys (including **referential integrity constraints**)
+
+**Keys** define internal relations between rows in a table;
+they constrain column values across rows.
+Keys are
+
+- **Primary Keys** (PK) uniquely define one row in a table.
+ A table has at most one PK;
+
+- **Secondary keys** (SKs) are one or more columns whose values may repeat
+ and are used to identify and retrieve sets of rows.
+ A table may have zero, one, or many SKs.
+ (Note: “secondary key” is not standard relational-algebra terminology.)
+
+- **Foreign Keys** (FK) define a secondary key in one table
+ that corresponds to the primary key in another table;
+
+Note: multi-column keys are possible and should not cause too much overhead.
+But they have an impact on memory requirements.
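+
+As a sketch of how keys could be declared for a table, using the tables from the Solana example in the attachments
+(the JSON layout is illustrative, and the secondary key is an arbitrary choice):
+
+```json
+{
+  "table": "transactions",
+  "primary_key": ["block_number", "transaction_index"],
+  "secondary_keys": [["fee_payer"]],
+  "foreign_keys": [
+    {
+      "columns": ["block_number"],
+      "references": {"table": "blocks", "columns": ["number"]}
+    }
+  ]
+}
+```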
+
+**Indices** are search structures defined over tables.
+They are associated with keys.
+(Indices are not part of relational algebra. They are tools to implement it efficiently.)
+
+Today, tables together with indices defined over them are the only predefined relations.
+In the future, other types of relations and other objects may be considered, such as:
+
+- Views
+
+- Materialised Views
+
+- Triggers
+
+- Stored Procedures.
+
+Schemas shall also include elements for defining real-time data.
+This may include an endpoint from which data is read,
+and a stored procedure (or equivalent processing step)
+that transforms data and passes it on to an internal API.
+
+The **lifecycles** of schemas and the data they describe are independent.
+The same schema may be applied to multiple datasets.
+Therefore, the principal metadata of a dataset is a reference to a schema.
+
+Today we refer to a dataset's schema as its **kind**.
+From the perspective of generic datasets, this term is unfortunate.
+The natural term is schema.
+
+#### Schema-on-Read or Schema-on-Write
+
+In the long run, schemas shall be "on-write": data is validated and/or transformed to a fixed schema at ingestion time,
+so all stored data conforms to that schema before it is written.
+Reaching this goal will take a bit of learning. In the short run, schemas will therefore be "on-read":
+data is stored with minimal or no upfront structure - it is in particular stored **manually** -
+and a schema is applied only when the data is read or queried.
+
+Non-conforming data is either
+- ignored
+- converted to NULL (in case of an invalid column)
+- flagged as an error.
+
+Which strategy is chosen depends on the context. For instance: the DuckDB extension converts unknown or non-conformant fields to NULL;
+DuckDB itself may reject data that do not conform to constraints defined in the catalog (e.g. NOT NULL, UNIQUE, etc.).
+
+### 3.5 Types
+
+Types restrict the domain of data that may be associated with a column.
+Unlike types in programming languages, they don't imply how data should be laid out in memory.
+The physical in-memory layout depends on how the specific language (or tool) interprets metadata.
+
+The proposed **type system** is therefore **much simpler** than those found in modern programming languages.
+In particular, it does not define a type hierarchy or generics and provides only minimal type semantics.
+More advanced facilities can be delegated to surrounding languages and tools.
+
+However, types indicate how data may be used.
+Therefore, the **type name** shall support an intuitive understanding of the type's semantics.
+
+Types are distinguished into **primitive types** and **complex types**.
+
+Primitive types are defined in terms of
+
+- Type name (mandatory)
+
+- Size
+
+- Unit
+
+- Range
+
+- Encoding.
+
+A `uint8` can be defined as:
+
+```json
+{
+ "name": "integer",
+ "size": 1,
+ "range": "0..255",
+}
+```
+
+The unit of size is always the byte. (Or would it be more convenient to define integers and bit patterns in bits?)
+
+A boolean:
+
+```json
+{
+ "name": "bits",
+ "size": 1,
+}
+```
+
+A more involved example:
+
+```json
+{
+ "name": "blockchain_timestamp",
+ "size": 8,
+ "unit": "second",
+ "range": "1230940800..",
+}
+```
+
+Fundamental types like `bool`, `double`, and `integer` variants should be built in.
+For convenience, the system may support **type labels** to define common types.
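+
+A type label could, for instance, bind a convenient name to a primitive definition
+(the `label`/`type` wrapper is just one possible encoding):
+
+```json
+{
+  "label": "uint8",
+  "type": {
+    "name": "int",
+    "size": 1,
+    "range": "0..255"
+  }
+}
+```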
+
+Note: We can use the SI to predefine a meaningful set of units.
+If we want to be crazy, we can implement the whole thing.
+
+Base types are:
+
+- Bit Pattern (`bits`), limit: 8 bytes
+
+- Integer (`int`), limit: 8 bytes
+
+- Floating Point Number (`float`), always 8 bytes
+
+- Timestamp (`timestamp`)
+
+- Char (`char`), one or more bytes, but always the indicated number of bytes. Encoded in UTF-8 by default.
+
+- String (`varchar`), one or more bytes, may vary within this range. Encoded in UTF-8 by default.
+
+- Text (`text`), unlimited number of bytes encoded in UTF-8 by default.
+
+- Blob (`blob`), unlimited binary object.
+
+Complex Types are defined in terms of
+
+- Type name (mandatory)
+
+- Base type
+
+- Size
+
+- Precision.
+
+Complex types are (an example follows the list):
+
+- Biginteger (`bigint`), integers with more than 64 bits, of arbitrary length.
+ It can be restricted explicitly (and then corresponds to an array) or
+ unrestricted (and then corresponds to a list), e.g.: `bigint` or `bigint(256)`;
+
+- Arbitrary Precision Number (`decimal`), arbitrary size and arbitrary precision, e.g.:
+ `decimal(10,2)`, `decimal(2)`;
+
+- Array (`array`): `array(uint64, 1024)`;
+
+- List (`list`): `list(uint64)`;
+
+- Enum (`enum`): `enum(A|B|C|D)`;
+
+- Union (`union`): `union(A(typeOfA), B(typeOfB), C(typeOfC))`;
+
+- Structure (`struct`): `struct(a: typeOfA, b: typeOfB)`;
+
+- Mapping (`map`): `map(typeOfKey, typeOfValue)`;
+
+- JSON (`json`).
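+
+As an example, a column using complex types might be declared with a textual type expression
+that mirrors the notation above (the column is taken from the Solana example in the attachments;
+the JSON layout is illustrative):
+
+```json
+{
+  "column": "loaded_addresses",
+  "type": "struct(readonly: list(varchar), writable: list(varchar))"
+}
+```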
+
+#### Conversions
+
+Metadata are used by
+
+- Portals
+
+- Workers
+
+- Clients (e.g. DuckDB extension, SDKs)
+
+- AI Agents
+
+Type information therefore needs to be converted to different languages / tools:
+Rust, C++, Typescript, Python, etc.
+
+Strategies to deal with this include
+
+1. A library in a language that can easily be integrated into other languages (e.g. ANSI C / Rust)
+
+1. Native implementation for each language.
+
+Note: The first option would be my choice for Rust, C++ and Python - but not for Typescript.
+
+In any case, it should be clear how to interpret types and the associated data, either by
+
+- A detailed specification (which is hard for the general case; if we know the set of languages,
+we can simply indicate the target types and behaviour for each language - see the sketch below), or
+
+- A reference implementation.
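+
+A hypothetical specification entry for an 8-byte integer could simply list the target type per language
+(the mappings shown are plausible choices, not decisions):
+
+```json
+{
+  "type": "int",
+  "size": 8,
+  "targets": {
+    "rust": "i64",
+    "cpp": "int64_t",
+    "python": "int",
+    "typescript": "bigint"
+  }
+}
+```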
+
+### 3.6 Keys and Indices
+
+Today, we have only one _PK_ - block number -
+which follows naturally from the predefined structure of blockchain data.
+Chunk summaries held in memory enable portals to locate the relevant block ranges quickly.
+
+This logic can easily be generalised to keys that **increase over time**
+(and therefore with ingestion order).
+Since this is by far the fastest way to locate data by key, schemas should
+provide an indicator (e.g. a column modifier) for keys with this property.
+Auto-increment keys, for instance, behave in this way.
+
+With time-incrementing keys, a **timestamp search** index can be obtained almost for free.
+The last timestamp in a chunk is part of our **chunk summary** and can therefore be used
+to quickly locate a key range nearby. This yields a two-level lookup, but it is still fast in practice.
+
+For PKs that are not aligned with time - and for all other indices -
+we need **explicit search structures**.
+
+I assume that datasets, although generic, can still be stored in chunks of reasonable size.
+However, chunks may lose some of the performance advantages they have for our data,
+where they are organised into small hierarchies of related tables (_blocks_ - _transactions_ - etc.).
+
+Note:
+Datasets with many tables may need to be broken down into smaller sets of tables,
+which would result in different kinds of chunks for the same dataset.
+This can be modelled in terms of _subdatasets_.
+These subsets must be defined by the user.
+
+Generic chunks without time-increasing keys are not nicely aligned like our chunks.
+Rather they contain **arbitrary collections** of keys. Today we may ingest the keys
+
+`chunk_1: [1 5 9 3]`
+
+Tomorrow we may have
+
+`chunk_2: [8 2 7 6]`
+
+In other words, chunks remain tied to ingestion time, whereas keys may not.
+
+A reasonable approach is a **two-level key-to-chunk routing** structure.
+At the top level, we compute a stable hash of the key and use a fixed-width hash prefix (e.g. 16 bits)
+to map the key to a small set of candidate chunks.
+This directory is compact, grows slowly, and bounds the expected number of candidate chunks per lookup to a small constant,
+independent of the total number of chunks.
+
+At the chunk level, we attach a space-efficient, static membership filter (e.g. a binary fuse, XOR, or ribbon filter) to each chunk.
+These filters confirm key membership with low false-positive rates.
+Combined with hash-prefix routing, the expected number of unnecessary worker probes can be kept low.
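+
+A sketch of what such a routing structure could look like as metadata; the hash function,
+prefix entries, filter kind, and locations are invented for illustration:
+
+```json
+{
+  "hash": "xxhash64",
+  "prefix_bits": 16,
+  "directory": {
+    "0x00a3": ["chunk_0001", "chunk_0042"],
+    "0x00a4": ["chunk_0017"]
+  },
+  "filters": {
+    "chunk_0001": {"kind": "binary_fuse8", "location": "s3://bucket/indices/chunk_0001.filter"}
+  }
+}
+```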
+
+Note:
+As the number of user datasets grows, we cannot keep all relevant data
+(maps, assignments, chunks, etc.) in memory in a single portal. We will need
+file storage or to **shard datasets across portals**.
+
+We may also want to explore other kinds of indices, for example:
+
+- Bitmap Index
+
+- Zone Index
+
+- Filters.
+
+### 3.7 Archive and Real-Time Data
+
+#### Archive
+
+The archive contains all chunk-related data (a sketch of a chunk entry follows the list):
+
+- The assignment file.
+  Note: this is now implemented separately with FlatBuffers.
+ I don't want to imply that it *must* be changed. Conceptually, it belongs to metadata.
+ But as long as the data are available in portal, there is no need to change it.
+
+- Chunk metadata (key range if applicable, timestamp range, size, number of keys)
+
+- Statistics per table, chunk, and column.
+  Alternative:
+  Statistics could be attached directly to the tables in the schema.
+  That would avoid repeating the table information in the archive.
+  On the other hand, keeping statistics out of the schema keeps the schema slimmer and easier to use.
+
+- Search structures for generic keys;
+ Note: the document would most likely only point to a location from where
+ the structures could be downloaded (e.g. s3);
+ the portal would then store them locally.
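+
+A sketch of one chunk entry in the archive (all names and values are placeholders):
+
+```json
+{
+  "chunk": "0000000000-0000812999",
+  "block_range": [0, 812999],
+  "timestamp_range": ["2020-03-16T14:29:00Z", "2020-04-02T11:15:00Z"],
+  "size_bytes": 1073741824,
+  "tables": {
+    "blocks": {"rows": 813000},
+    "transactions": {"rows": 24500000}
+  },
+  "index": "s3://bucket/indices/0000000000-0000812999.filter"
+}
+```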
+
+#### Real-Time Data
+
+I assume that the ingestion cycle for real-time data is as follows:
+
+1. Receive data from a provider through an endpoint
+
+1. Transform the data
+
+1. Pass them on to an ingestion facility (e.g. in a local database).
+
+We thus need:
+
+- A set of endpoints from where to obtain the input
+
+- A stored procedure (which is user-defined) to transform the input
+
+- A library that stores real-time data (which is provided by us).
+
+Relevant for metadata:
+
+- The real-time node shall contain the endpoint URL (or the set of URLs),
+ the name of the stored procedure that processes and ingests the data,
+ and information about the local database that stores the data.
+
+- Schemas shall provide the concept of **stored procedures**.
+
+- The portal shall provide a mechanism to invoke stored procedures periodically or in response to events.
+
+### 3.8 Statistics
+
+The purpose of statistics is to enable
+
+- Efficient queries
+
+- Estimates of duration (for the user's convenience)
+
+Query engines (e.g. DuckDB) use statistics for **query planning**.
+The engine may decide, for example, to download a small table (and possibly cache it persistently)
+to execute joins more efficiently. Without estimates of table size that would not be possible.
+
+For a user working with an interactive client, it is inconvenient not to know how long the query will take (10 seconds? 10 hours?).
+Indicating an approximate duration (e.g. a progress bar) is extremely helpful.
+
+Relevant statistics are:
+
+- Row count per table and chunk
+
+- Row count per foreign key
+
+- Cardinality
+
+- Min/Max
+
+- NULL count.
+
+Unlike schema definitions, statistics **change frequently** and shall be refreshed periodically.
+Portals shall provide an endpoint
+
+- from where updates on statistics can be polled periodically or
+
+- to which clients can subscribe for updates.
+
+For the first iteration, statistics will be generated by the ingestion process and stored in plain JSON.
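+
+A plain-JSON statistics record for one table in one chunk might then look like this
+(layout and values are placeholders):
+
+```json
+{
+  "table": "transactions",
+  "chunk": "0000000000-0000812999",
+  "row_count": 24500000,
+  "columns": {
+    "fee": {"min": 5000, "max": 2000000, "null_count": 0, "cardinality": 1800}
+  }
+}
+```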
+
+## 4. Metadata Format
+
+### 4.1 JSON
+
+At present, metadata is hand-written in JSON and contains only the information
+required by the DuckDB extension.
+Here is an [example](https://github.com/subsquid/sql4sqd-prototype/blob/main/charts/sql-central/config/ethereum_holesky_1.json).
+
+### 4.2 SQL
+I propose using SQL, in particular DDL (`CREATE|DROP|ALTER table_name...` statements), for schema management.
+Users can then use DuckDB (or other SQL clients) to manage their schemas.
+See [solana.sql](attachments/solana.sql) for a sketch of what such DDL, including type definitions, could look like.
+
+Note:
+Even if we consider schemas largely immutable, a way to correct and adapt schemas is necessary.
+
+### 4.3 IPLD
+IPLD (InterPlanetary Linked Data) is a content-addressed data model
+for building verifiable, immutable, and linkable data structures,
+where each piece of data is identified by a cryptographic hash
+and can securely reference other data by hash.
+It generalizes Merkle trees into a flexible graph (a Merkle DAG, where DAG stands for Directed Acyclic Graph),
+letting systems evolve data by creating new versions while reusing unchanged parts,
+enabling integrity-by-construction, efficient deduplication, selective retrieval,
+and transport-agnostic distribution across untrusted networks.
+
+Every IPLD block is identified by a CID, which is:
+
+```
+CID = version + codec + multihash(data)
+```
+
+where `version` is the CID version, `codec` is a numerical identifier of the format (e.g. `dag-json` or `protobuf`),
+and `multihash` is the hash of the data, which may themselves contain hashes (links).
+If any byte changes,
+(1) the hash changes,
+(2) the CID changes,
+(3) all parent links break.
+
+This guarantees strong, built-in integrity verification, but not authenticity.
+It is the same core property that makes Git objects, blockchains, and Merkle trees self-verifying —
+IPLD merely generalises it across formats and graphs.
+
+But every change creates a new immutable version; *current* is just a pointer.
+This is the rationale for some design decisions,
+in particular that real-time data should not be part of the IPLD document;
+otherwise we would get multiple versions per second.
+The document would just point to where and how to obtain new data.
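+
+In `dag-json`, a link is encoded as an object of the form `{"/": "<cid>"}`.
+A dataset root could therefore reference its slow-changing parts by CID
+and keep only a pointer for real-time data (the CIDs and URL below are placeholders):
+
+```json
+{
+  "access": {"/": "bafy-access-cid"},
+  "schema": {"/": "bafy-schema-cid"},
+  "archive": {"/": "bafy-archive-cid"},
+  "realtime": {
+    "publisher": "wss://provider.example.com/blocks"
+  }
+}
+```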
+
+A common way to distribute new IPLD versions is through pub/sub:
+- New versions (the root CID) are announced on a topic
+- Subscribers fetch the root and only those parts that (1) they require and (2) have changed
+  (usually not through the topic directly but through some content-routing mechanism).
+
+IPLD is *not* recommended if
+- There is no benefit in content addressing, deduplication, or audit history
+- Snapshot semantics is too much overhead, that is,
+  when in-place updates and high-rate writes are required
+- SQL-like query/update semantics over the config is required.
+
+In such cases, a conventional database is far more appropriate.
diff --git a/sql-research/attachments/metadata-structure.png b/sql-research/attachments/metadata-structure.png
new file mode 100644
index 0000000..80fb85a
Binary files /dev/null and b/sql-research/attachments/metadata-structure.png differ
diff --git a/sql-research/attachments/solana.sql b/sql-research/attachments/solana.sql
new file mode 100644
index 0000000..057ebc9
--- /dev/null
+++ b/sql-research/attachments/solana.sql
@@ -0,0 +1,60 @@
+----------------------------------------------------------------------------------------------------
+-- TABLE BLOCK
+----------------------------------------------------------------------------------------------------
+TYPE block_number AS UINT64;
+TYPE unix_timestamp AS TIMESTAMP(8){unit=second};
+TYPE hash_as_hex AS CHAR(64); -- or: hash_as_base64 as VARCHAR(44)
+
+TABLE blocks (
+ number block_number PRIMARY KEY, --#timeseries
+ hash hash_as_hex,
+ parent_number block_number,
+ parent_hash hash_as_hex,
+    height UINT64,
+ timestamp unix_timestamp
+);
+
+----------------------------------------------------------------------------------------------------
+-- TABLE TRANSACTIONS
+----------------------------------------------------------------------------------------------------
+TYPE tx_index AS UINT32;
+TYPE tx_version AS UINT16;
+TYPE key_list AS LIST(VARCHAR);
+
+TYPE address_st AS STRUCT(
+ account_key LIST(VARCHAR),
+ readonly_indexes LIST(UINT8),
+ writeable_indexes LIST(UINT8)
+);
+
+TYPE address_lookup_list AS LIST(address_st);
+TYPE signature_list AS LIST(VARCHAR);
+
+TYPE loaded_addresses_st AS STRUCT(
+ readonly LIST(VARCHAR),
+ writable LIST(VARCHAR)
+);
+
+TABLE transactions (
+ block_number block_number PRIMARY KEY, --#timeseries
+ transaction_index tx_index PRIMARY KEY,
+ version tx_version,
+ account_keys key_list,
+ address_table_lookups address_lookup_list,
+ num_readonly_signed_accounts UINT8,
+ num_readonly_unsigned_accounts UINT8,
+ num_required_signatures UINT8,
+ recent_blockhash VARCHAR,
+ signatures signature_list,
+ err VARCHAR,
+ compute_units_consumed UINT64,
+ fee UINT64,
+ loaded_addresses loaded_addresses_st,
+ has_dropped_log_messages BOOLEAN,
+ fee_payer VARCHAR,
+ account_keys_size UINT64,
+ address_table_lookups_size UINT64,
+ signatures_size UINT64,
+ loaded_addresses_size UINT64,
+ accounts_bloom BLOB
+);