Comprehensive reference datasets of countries, intergovernmental organizations, and country groups. Source YAML files in data/countries/ and data/intblocks/ are validated, enriched, and exported to multiple formats in data/datasets/. The project serves as a data source for the Dateno search engine.
- Multi-format export: JSONL, YAML, Parquet, and DuckDB (Zstandard compression, level 22)
- Countries quality pipeline: schema validation, completeness gates, entity status policy, and field-level provenance
- Profile enrichment: population, area, gini, timezones, and native names from World Bank, Wikidata, and IANA tzdata
- Build metadata: `countries.manifest.json` with version, commit, row count, and schema hash
- CI validation: pull-request checks via `.github/workflows/validate.yml`
- CLI tools: Typer-based scripts with tqdm progress bars
pip install -r requirements.txt

# Inspect data sources
python3 scripts/builder.py info
# Validate country YAML (no build)
python3 scripts/validate_countries.py
# Build all datasets
python3 scripts/builder.py build
# Build specific formats only
python3 scripts/builder.py build --formats parquet,duckdb

Each build writes to data/datasets/:
| File | Description |
|---|---|
| `countries.jsonl.zst` | Countries (JSONL, zstd) |
| `countries.yaml.zst` | Countries (YAML, zstd) |
| `countries.parquet` | Countries (Parquet, zstd) |
| `countries.manifest.json` | Build metadata (version, commit, row count, schema hash) |
| `intblocks.jsonl.zst` | International blocks (JSONL, zstd) |
| `intblocks.yaml.zst` | International blocks (YAML, zstd) |
| `intblocks.parquet` | International blocks (Parquet, zstd) |
| `blocktypes.jsonl.zst` | Block types (JSONL, zstd) |
| `blocktypes.yaml.zst` | Block types (YAML, zstd) |
| `blocktypes.parquet` | Block types (Parquet, zstd) |
| `internacia.duckdb` | DuckDB database (countries, intblocks, blocktypes tables) |

Current row counts: 252 countries, 1065 intblocks, 85 blocktypes.
The builder runs validate_countries.py before export. Validation covers:
- JSON Schema conformance (`data/schemas/countries.schema.json`)
- ISO identifier formats and duplicate detection
- Completeness thresholds (`data/schemas/countries_completeness.yaml`)
- Entity status policy (`entity_type`, `code_status`)
- Intblock cross-references (country `includes` resolve to country sources)

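
The identifier-format and duplicate checks in the list above can be illustrated with a small standalone sketch. It is not the project's `validate_countries.py`: the regular expressions and the function name `check_identifiers` are assumptions made for illustration, while the field names `code`, `iso3code`, and `numeric_code` follow the countries schema described in this README.

```python
import re
from collections import Counter

# Assumed formats: alpha-2 and alpha-3 are upper-case letters, numeric is three digits.
PATTERNS = {
    "code": re.compile(r"[A-Z]{2}"),
    "iso3code": re.compile(r"[A-Z]{3}"),
    "numeric_code": re.compile(r"[0-9]{3}"),
}


def check_identifiers(records):
    """Return a list of human-readable problems found in country records."""
    problems = []
    for field, pattern in PATTERNS.items():
        # Records without the field (None or missing) are skipped, not flagged.
        values = [r.get(field) for r in records if r.get(field) is not None]
        # Format check: the whole value must match the pattern.
        for value in values:
            if not pattern.fullmatch(str(value)):
                problems.append(f"{field}: bad format {value!r}")
        # Duplicate check: each identifier may appear only once.
        for value, count in Counter(values).items():
            if count > 1:
                problems.append(f"{field}: duplicate {value!r} ({count} records)")
    return problems
```

Calling `check_identifiers` on a list of record dictionaries returns an empty list when every identifier is well formed and unique. A real gate would probably also distinguish errors from warnings, which this sketch does not attempt.
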
# Full validation with JSON report
python3 scripts/validate_countries.py --report completeness-report.json
# Enrich profile fields from external sources
python3 scripts/enrich_countries.py
python3 scripts/enrich_countries.py backfill-provenance
# Apply entity status annotations
python3 scripts/annotate_entity_status.py
# Audit intblock include name aliases (warn-only)
python3 scripts/report_country_include_names.py
# Compare manifest to main branch baseline
python3 scripts/diff_countries_baseline.py

Country code policy (ISO vs user-assigned, filtering examples): docs/country-code-policy.md
Breaking and semantic changes in the latest countries schema (see CHANGELOG.md):
- Population / area / gini: structured as `{value, year, source, source_id}`; use `.value` for the numeric field.
- Borders: land neighbors as ISO alpha-3 codes (e.g. `CAN`, `MEX`), not alpha-2.
- Entity filter: `code_status == 'official_iso3166_1'` returns 249 current ISO-style records.
- Build metadata: compare the `schema_hash` in `countries.manifest.json` when upgrading downstream pipelines.

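
For consumers that read records as plain dictionaries (for example from the JSONL export), the structured-field and border changes listed above can be handled with two small helpers. This is a minimal sketch: `numeric_value` and `non_alpha3_borders` are hypothetical names, the fallback for a bare number is an assumption about older exports, and the sample record is a placeholder rather than real data.

```python
def numeric_value(record, field):
    """Return the numeric part of a structured field such as population.

    Accepts the structured form {value, year, source, source_id} and,
    as an assumption about older exports, a bare number.
    """
    raw = record.get(field)
    if isinstance(raw, dict):
        return raw.get("value")
    return raw  # bare number or None


def non_alpha3_borders(record):
    """List border entries that are not three upper-case ASCII letters."""
    return [
        b for b in record.get("borders") or []
        if not (len(b) == 3 and b.isascii() and b.isalpha() and b.isupper())
    ]


# Placeholder record for illustration only; the values are made up.
sample = {
    "code": "XX",
    "population": {"value": 1000, "year": 2020, "source": "example", "source_id": "example-id"},
    "borders": ["CAN", "MX"],
}
print(numeric_value(sample, "population"))  # 1000
print(non_alpha3_borders(sample))           # ['MX']
```

Running `non_alpha3_borders` over a whole export is a quick way to spot downstream data that still carries alpha-2 border codes.
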
Pandas example (structured population):
import pandas as pd
# The .struct accessor needs Arrow-backed dtypes (pandas >= 2.1)
df = pd.read_parquet("data/datasets/countries.parquet", dtype_backend="pyarrow")
pop = df["population"].struct.field("value")

DuckDB example (nested intblock translations):
import duckdb
con = duckdb.connect("data/datasets/internacia.duckdb")
con.execute("""
SELECT id, name, t.name AS english_name
FROM intblocks, UNNEST(translations) AS t
WHERE t.lang = 'en'
LIMIT 5
""").fetchall()

252 country and territory records. Key fields:
| Field | Type | Description |
|---|---|---|
| `code` | String | ISO 3166-1 alpha-2 code (e.g. `US`) |
| `entity_type` | String | `sovereign_state`, `dependent_territory`, `historical_entity`, etc. |
| `code_status` | String | `official_iso3166_1`, `user_assigned`, `obsolete` |
| `recognition_status` | Struct | Optional recognition/dispute metadata |
| `name` | String | Common name |
| `iso3code` | String | ISO 3166-1 alpha-3 code |
| `capital_city` | Struct | {name, lng, lat} |
| `region` | Struct | World Bank region {id, value} |
| `adminregion` | Struct | World Bank admin region {id, value} |
| `incomeLevel` | Struct | World Bank income level {id, value} |
| `lendingType` | Struct | World Bank lending type {id, value} |
| `numeric_code` | String | ISO 3166-1 numeric code |
| `wikidata_id` | String | Wikidata item ID |
| `official_name` | String | Official full name |
| `languages` | List[Struct] | {code, name, official} |
| `currencies` | List[Struct] | {code, name, symbol} |
| `un_member` | Boolean | UN member |
| `independent` | Boolean | Independent state |
| `subregion` | String | UN subregion |
| `continents` | List[String] | Continents |
| `borders` | List[String] | Land borders as ISO alpha-3 codes |
| `landlocked` | Boolean | Landlocked |
| `tld` | String | Top-level domain |
| `calling_codes` | List[String] | Telephone codes |
| `flag_emoji` | String | Flag emoji |
| `car_side` | String | Driving side |
| `start_of_week` | String | Start of week |
| `demonyms` | Struct | {female, male} |
| `m49_code` | String | UN M49 code |
| `population` | Struct | {value, year, source, source_id} |
| `area` | Struct | Land area sq km {value, year, source, source_id} |
| `gini` | Struct | Gini index {value, year, source, source_id} |
| `timezones` | List[String] | IANA timezone identifiers |
| `timezone_status` | String | `not_applicable` when no zones apply |
| `native_names` | Map | Lang code → {official, common} |
| `other_names` | List[Struct] | Translations {id, name} |
| `common_names` | List[String] | Aliases and common names |
| `provenance` | List[Struct] | Field sourcing {field, source, retrieved_at, url, license} |

Non-standard codes retained with explicit status: AN (obsolete), JG (user-assigned grouping), KV (user-assigned, disputed).
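
Downstream code that wants only official ISO entries can group or filter on `code_status`. The sketch below is illustrative: `split_by_code_status` is a hypothetical helper, and the stub records carry only the two fields needed here, with statuses taken from the sentence above.

```python
def split_by_code_status(records):
    """Group country codes by their code_status value."""
    groups = {}
    for record in records:
        groups.setdefault(record.get("code_status"), []).append(record["code"])
    return groups


# Stub records: only code and code_status are filled in.
records = [
    {"code": "US", "code_status": "official_iso3166_1"},
    {"code": "AN", "code_status": "obsolete"},
    {"code": "JG", "code_status": "user_assigned"},
    {"code": "KV", "code_status": "user_assigned"},
]

groups = split_by_code_status(records)
print(groups["official_iso3166_1"])  # ['US']
print(groups["user_assigned"])       # ['JG', 'KV']
print(groups["obsolete"])            # ['AN']
```

Grouping rather than dropping keeps the non-standard codes available for reporting, which matches the policy of retaining them with an explicit status.
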
1065 international block records. Key fields:

| Field | Type | Description |
|---|---|---|
| `id` | String | Unique identifier |
| `blocktype` | List[String] | Block types |
| `status` | String | Current status |
| `name` | String | Name |
| `languages` | List[String] | Official languages |
| `links` | List[Struct] | {url, type} |
| `other_names` | List[Struct] | {id, name} translations |
| `founded` | String | Foundation date |
| `geographic_scope` | String | Scope |
| `regions` | List[String] | Regions covered |
| `includes` | List[Struct] | Members {id, name, type, status, joined, role, note}; `id` is authoritative, `name` is a source label |
| `membership_count` | Integer | Member count |
| `wikidata_id` | String | Wikidata item ID |
| `legal_status` | String | Legal status |
| `description` | String | Description |
| `tags` | List[String] | Tags |
| `topics` | List[Struct] | {key, name} |
| `headquarters` | Struct | {city, country, coordinates} |
| `acronyms` | List[Struct] | {lang, value} |
| `partof` | List[String] | Parent organizations |
| `dissolved` | String | Dissolution date |
| `predecessor` | String | Predecessor |
| `successor` | String | Successor |

YAML sources
- `data/countries/*.yaml`: 252 country/territory records
- `data/intblocks/**/*.yaml`: 1065 international block records

External enrichment
- World Bank: population, area, gini, income classifications
- Wikidata: entity linking, native names, fallbacks
- IANA tzdata: timezone mapping (`scripts/data/zone1970.tab`)

| Script | Purpose |
|---|---|
| `scripts/builder.py` | Validate and export datasets |
| `scripts/validate_countries.py` | Country schema, completeness, and cross-dataset checks |
| `scripts/validate_links.py` | Intblock URL and Wikidata validation |
| `scripts/enrich_countries.py` | Enrich country profiles; `backfill-provenance` subcommand |
| `scripts/annotate_entity_status.py` | Set `entity_type` and `code_status` |
| `scripts/report_country_include_names.py` | Intblock include name alias audit |
| `scripts/diff_countries_baseline.py` | Manifest diff vs git baseline |

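
A downstream pipeline can make a similar comparison on its own side by checking the manifest's `schema_hash` before loading a new build. The sketch below is not the logic of `diff_countries_baseline.py`: the helper names are hypothetical, and it assumes the manifest is a flat JSON object with a `schema_hash` key.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a manifest JSON file as a dict (UTF-8)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def schema_changed(old_manifest, new_manifest):
    """True when the two manifests carry different schema_hash values."""
    return old_manifest.get("schema_hash") != new_manifest.get("schema_hash")


# Placeholder manifests; real ones would come from load_manifest(...).
previous = {"schema_hash": "aaa"}
current = {"schema_hash": "bbb"}
if schema_changed(previous, current):
    print("schema changed: review field mappings before upgrading")
```

Keeping a copy of the last manifest a pipeline was tested against makes this check a one-line guard at load time.
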
- All text files use UTF-8 encoding; generated outputs overwrite existing files.
- Decompress zstd files: `zstd -d data/datasets/countries.jsonl.zst`
- Gap analysis research: `dev/research/countries_gaps_,manus_20260528.md`

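
The compressed JSONL files can also be read from Python without decompressing them to disk first. This is a sketch under assumptions: it relies on the third-party `zstandard` package, which may or may not be listed in `requirements.txt`, and the helper names `iter_jsonl` and `open_zst_text` are hypothetical.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path):
    """Open a .zst file as a UTF-8 text stream (needs the zstandard package)."""
    import zstandard  # third-party; imported lazily so iter_jsonl works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Usage, assuming a build has written the file:
# with open_zst_text("data/datasets/countries.jsonl.zst") as fh:
#     for record in iter_jsonl(fh):
#         print(record["code"])
```

`iter_jsonl` accepts any iterable of text lines, so it works equally well on a file already decompressed with `zstd -d`.
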
- internacia-api — REST API
- internacia-python — Python SDK