Workshop: Production-Ready Data Ingestion for Recovering Pandas Users

Supporting materials:

Presented by Adrian Brudaru, Co-founder @ dltHub

Presenter

Name: Adrian Brudaru
Role: Co-founder, dltHub; data engineer since 2012
Motivation: Created dlt (data load tool) as the tool he wished he had for building data warehouses and data products

Narrative

The workshop addresses the friction between Data Scientists ("Pandas Users") and Data Engineers.

Pandas / ML Perspective:
- pandas is used for everything: loading, transforming, memory management, unnesting, cleaning
- Perfect for prototyping in local Jupyter notebooks
- Problems begin when "moving to production"
Data Engineer Perspective:
- Focus on scale, efficiency, resilience, maintainability, and testing
- A simple pandas.load_json() + df.to_sql() hides a long production checklist
- Requirements include: schema management, atomicity, idempotency, state persistence, incremental loading, memory management, parallelism, retries, schema evolution, data normalization, and data contracts

The slides contrast a simple 4-step pandas flow with the complex, multi-component system required to be production-ready.

The Proposed Solution: `dlt`

dlt is introduced as the solution: as easy to use as df.to_sql() while providing real-life production features.

Shallow learning curve
Transparent: not a black box; users own their code
Vendor-agnostic: avoid lock-in; move between destinations easily

Workshop Agenda (Deep Dive Topics)

Rapid-fire explanations across four areas, with self-paced examples in notebooks

Topic 1: Schema Management

Schema Inference: Automatic schema is inferred on first run. Weakly typed JSON becomes strongly typed relational structures by flattening dictionaries and unpacking lists into sub-tables.
Schema Evolution: Handles source changes by adding new columns/tables by default; can be configured to alert (e.g., Slack) on change.
Data Contracts: When you "hate change", freeze the schema via schema_contract to control tables, columns, and data_type with modes like evolve, freeze (stop load), discard_row, or discard_value. Pydantic models can be used as

Topic 2: Hardware Bottleneck Management

Memory (RAM): Use Python generators to yield data in chunks; configure buffer_max_items so dlt buffers to files after N items.
Disk: For small disks (e.g., serverless):
- Chunk a source using chunk_size and loop pipeline.run() (one-offs/backfills)
- Mount storage and set DLT_DATA_DIR as an "infinite disk"
CPU (Async I/O): Speed up I/O-bound APIs:
- Use @dlt.resource(parallelized=True) to parallelize
- Use async def resources; dlt runs them in parallel automatically (configurable worker count)
CPU (Parallelism): Fully utilize CPU during normalize via a process pool; configure [normalize] workers.
Network (Retries): Use from dlt.sources.helpers import requests for retries, exponential backoff, and HTTP 429 handling (respecting Retry-After).

Topic 3: Incremental Loading & State

Write Dispositions: replace (full load), append (stateless), merge (upsert with primary key), scd2 (Type 2, valid_from/valid_to)
State Handling: dlt state is a Python dict persisted in a separate destination table (more robust than orchestrator/implicit state)

Topic 4: Bonus Round & dlt Ecosystem

REST API Source: Templated source for many APIs without writing Python; configure client, auth, paginator, and resources in a dictionary. Docs link: rest api source
Workspace Dashboard: Browse schemas, debug pipelines, overall view. Docs link: validate with dasbboard
Ibis Integration: Query datasets with a single, backend-agnostic API in ~20 systems (DuckDB, Snowflake, BigQuery, etc.). Docs link: Ibis
Marimo Notebooks: Reactive notebooks saved as clean .py files (not .ipynb). Docs link:
LLM-native Scaffolding: 4,100+ scaffolds generated from API docs to solve long-tail connectors. Link: marimo
LLM-native Workflow: (1) Init scaffold, (2) Generate running code, (3) Debug in dashboard, (4) Explore in Marimo.
Docs link: LLM native workflow

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Workshop: Production-Ready Data Ingestion for Recovering Pandas Users

Presenter

Narrative

The Proposed Solution: `dlt`

Workshop Agenda (Deep Dive Topics)

Topic 1: Schema Management

Topic 2: Hardware Bottleneck Management

Topic 3: Incremental Loading & State

Topic 4: Bonus Round & dlt Ecosystem

About

Uh oh!

Releases

Packages

dlt-hub/odsc-ai-west-2025

Folders and files

Latest commit

History

Repository files navigation

Workshop: Production-Ready Data Ingestion for Recovering Pandas Users

Presenter

Narrative

The Proposed Solution: dlt

Workshop Agenda (Deep Dive Topics)

Topic 1: Schema Management

Topic 2: Hardware Bottleneck Management

Topic 3: Incremental Loading & State

Topic 4: Bonus Round & dlt Ecosystem

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

The Proposed Solution: `dlt`

Packages