219 changes: 219 additions & 0 deletions book/40-operations/000-workflow-operations.md
@@ -0,0 +1,219 @@
# Workflow Operations

## Executing the Workflow

The previous sections established the **Relational Workflow Model** and schema design principles.
Your schema defines *what* entities exist, *how* they depend on each other, and *when* they are created in the workflow.
**Operations** are the actions that execute this workflow—populating your pipeline with actual data.

In DataJoint, operations fall into two categories:

1. **Manual operations** — Actions initiated *outside* the pipeline using `insert`, `delete`, and occasionally `update`
2. **Automatic operations** — Pipeline-driven population using `populate` for Imported and Computed tables

The term "manual" does not imply human involvement—it means the operation originates *external to the pipeline*.
A script that parses instrument files and inserts session records is performing manual operations, even though no human is involved.
The key distinction is *who initiates the action*: external processes (manual) versus the pipeline's own `populate` mechanism (automatic).

This distinction maps directly to the table tiers introduced in the [Relational Workflow Model](../20-concepts/05-workflows.md):

| Table Tier | How Data Enters | Typical Operations |
|------------|-----------------|-------------------|
| **Lookup** | Schema definition (`contents` property) | None—predefined |
| **Manual** | External to pipeline | `insert`, `delete` |
| **Imported** | Pipeline-driven acquisition | `populate` |
| **Computed** | Pipeline-driven computation | `populate` |

## Lookup Tables: Part of the Schema

**Lookup tables are not part of the workflow**—they are part of the schema definition itself.

Lookup tables contain reference data, controlled vocabularies, parameter sets, and configuration values that define the *context* in which the workflow operates.
This data is:

- Defined in the table class using the `contents` property
- Automatically present when the schema is activated
- Shared across all workflow executions

Examples include:
- Species names and codes
- Experimental protocols
- Processing parameter sets
- Instrument configurations

Because lookup data defines the problem space rather than recording workflow execution, it is specified declaratively as part of the table definition:

```python
@schema
class BlobParamSet(dj.Lookup):
definition = """
blob_paramset : int
---
min_sigma : float
max_sigma : float
threshold : float
"""
contents = [
(1, 1.0, 5.0, 0.1),
(2, 2.0, 10.0, 0.05),
]
```

When the schema is activated, an "empty" pipeline already has its lookup tables populated.
This ensures that reference data is always available and consistent across all installations of the pipeline.
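
A quick way to verify this, using the `BlobParamSet` table defined above (assuming no additional rows have been inserted yet):

```python
# The rows declared in `contents` are present without any explicit insert
assert len(BlobParamSet()) == 2   # both predefined parameter sets exist
BlobParamSet()                    # preview the reference data
```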

## Manual Tables: The Workflow Entry Points

**Manual tables** are where new information enters the workflow from external sources.
The term "manual" refers to the data's origin—*outside the pipeline*—not to how it gets there.

Manual tables capture information that originates external to the computational pipeline:

- Experimental subjects and sessions
- Observations and annotations
- External system identifiers
- Curated selections and decisions

Data enters Manual tables through explicit `insert` operations from various sources:

- **Human entry**: Data entry forms, lab notebooks, manual curation
- **Automated scripts**: Parsing instrument files, syncing from external databases
- **External systems**: Laboratory information management systems (LIMS), scheduling software
- **Integration pipelines**: ETL processes that import data from other sources

Each insert into a Manual table potentially triggers downstream computations—this is the "data enters the system" event that drives the pipeline forward.
Whether a human clicks a button or a cron job runs a script, the effect is the same: new data enters the pipeline and becomes available for automatic processing.
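
For example, a scheduled script might sync an instrument log into the pipeline. The sketch below is hypothetical: it assumes a CSV file named `instrument_log.csv` with `subject` and `date` columns, and the `Session` table used throughout this chapter.

```python
import csv

# A "manual" operation with no human in the loop: an external script
# parses an instrument log and inserts the sessions it finds.
with open("instrument_log.csv", newline="") as f:   # assumed file and layout
    rows = [
        {"subject_id": row["subject"], "session_date": row["date"]}
        for row in csv.DictReader(f)
    ]

# skip_duplicates makes the script safe to re-run on the same log
Session.insert(rows, skip_duplicates=True)
```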

## Automatic Population: The Workflow Engine

**Imported** and **Computed** tables are populated automatically through the `populate` mechanism.
This is the core of workflow automation in DataJoint.

When you call `populate()` on an auto-populated table, DataJoint:

1. Identifies what work is missing by examining upstream dependencies
2. Executes the table's `make()` method for each pending item
3. Wraps each computation in a transaction for integrity
4. Continues through the remaining work, either stopping at the first error or, with `suppress_errors=True`, logging errors and moving on

This automation embodies the Relational Workflow Model's key principle: **the schema is an executable specification**.
You don't write scripts to orchestrate computations—you define dependencies, and the system figures out what to run.

```python
# The schema defines what should be computed
# populate() executes it
Detection.populate(display_progress=True)
```
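
What `populate()` actually runs is the table's `make()` method. The class below is a minimal sketch of what such a table might look like; it assumes a hypothetical upstream `Image` table, the `BlobParamSet` lookup from earlier, and a `detect_blobs()` analysis function.

```python
@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    n_blobs : int  # number of detected blobs
    """

    def make(self, key):
        # fetch the inputs identified by this key
        image = (Image & key).fetch1("image")
        params = (BlobParamSet & key).fetch1()
        blobs = detect_blobs(image, params)   # hypothetical analysis routine
        # insert exactly one result row per key, inside populate()'s transaction
        self.insert1(dict(key, n_blobs=len(blobs)))
```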

## The Three Core Operations

### Insert: Adding Data

The `insert` operation adds new entities to Manual tables, representing new information entering the workflow from external sources.

```python
# Single row
Subject.insert1({"subject_id": "M001", "species": "mouse", "sex": "M"})

# Multiple rows
Session.insert([
{"subject_id": "M001", "session_date": "2024-01-15"},
{"subject_id": "M001", "session_date": "2024-01-16"},
])
```

### Delete: Removing Data with Cascade

The `delete` operation removes entities and **all their downstream dependents**.
This cascading behavior is fundamental to maintaining **computational validity**—the guarantee that derived data remains consistent with its inputs.

When you delete an entity:
- All entities that depend on it (via foreign keys) are also deleted
- This cascades through the entire dependency graph
- The result is a consistent database state

```python
# Deleting a session removes all its downstream analysis
(Session & {"subject_id": "M001", "session_date": "2024-01-15"}).delete()
```

Cascading delete is the primary mechanism for:
- **Correcting errors**: Delete incorrect upstream data; downstream results disappear automatically
- **Reprocessing**: Delete computed results to regenerate them with updated code (see the sketch after this list)
- **Data lifecycle**: Remove obsolete data and everything derived from it
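
For the reprocessing case, a minimal sketch that reuses the hypothetical `Detection` table from earlier:

```python
# Recompute results derived from parameter set 1 after a code change:
# delete removes the stale rows, populate() regenerates them.
(Detection & "blob_paramset = 1").delete()
Detection.populate(display_progress=True)
```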

### Update: Rare and Deliberate

The `update` operation modifies existing values **in place**.
In DataJoint, updates are deliberately rare because they can violate computational validity.

Consider: if you update an upstream value, downstream computed results become inconsistent—they were derived from the old value but now coexist with the new one.
The proper approach is usually **delete and reinsert** (sketched below):

1. Delete the incorrect data (cascading removes dependent computations)
2. Insert the corrected data
3. Re-run `populate()` to regenerate downstream results
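
A minimal sketch of this cycle, using the `Session` and `Detection` tables from earlier examples and assuming a session whose date was entered incorrectly:

```python
# 1. Delete the bad entry; the cascade removes everything derived from it
(Session & {"subject_id": "M001", "session_date": "2024-01-15"}).delete()

# 2. Insert the corrected record
Session.insert1({"subject_id": "M001", "session_date": "2024-01-18"})

# 3. Regenerate downstream results
Detection.populate()
```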

The `update1` method exists for cases where in-place correction is truly needed—typically for:
- Fixing typos in descriptive fields that don't affect computations
- Correcting metadata that has no downstream dependencies
- Administrative changes to non-scientific attributes

```python
# Use sparingly—only for corrections that don't affect downstream data
Subject.update1({"subject_id": "M001", "notes": "Corrected housing info"})
```

## The Workflow Execution Pattern

A typical DataJoint workflow follows this pattern:

```
┌──────────────────────────────────────────────────────────────┐
│ 1. SCHEMA ACTIVATION                                          │
│    - Define tables and dependencies                           │
│    - Lookup tables are automatically populated (contents)     │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 2. EXTERNAL DATA ENTRY                                        │
│    - Insert subjects, sessions, trials into Manual tables     │
│    - Each insert is a potential trigger for downstream        │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 3. AUTOMATIC POPULATION                                       │
│    - Call populate() on Imported tables (data acquisition)    │
│    - Call populate() on Computed tables (analysis)            │
│    - System determines order from dependency graph            │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 4. ITERATION                                                  │
│    - New manual entries trigger new computations              │
│    - Errors corrected via delete + reinsert + repopulate      │
│    - Pipeline grows incrementally                             │
└──────────────────────────────────────────────────────────────┘
```
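
In code, a compressed run-through of these stages might look like the following sketch. `my_pipeline` is an assumed module containing the tables used in this chapter, including a hypothetical Imported `Image` table.

```python
# 1. Schema activation: importing the pipeline module declares the tables;
#    Lookup tables such as BlobParamSet are filled from their `contents`.
from my_pipeline import Subject, Session, Image, Detection   # assumed module

# 2. External data entry into Manual tables
Subject.insert1({"subject_id": "M002", "species": "mouse", "sex": "F"})
Session.insert1({"subject_id": "M002", "session_date": "2024-02-01"})

# 3. Automatic population, in dependency order
Image.populate()       # Imported: bring in raw data for the new session
Detection.populate()   # Computed: run the analysis

# 4. Iteration: later inserts simply create more pending work for populate()
```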

## Transactions and Integrity

All operations in DataJoint respect **ACID transactions** and **referential integrity**:

- **Inserts** verify that all referenced foreign keys exist
- **Deletes** cascade to maintain referential integrity
- **Populate** wraps each `make()` call in a transaction

This ensures that the database always represents a consistent state—there are no orphaned records, no dangling references, and no partially completed computations visible to other users.
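
For example, an insert that references a missing parent row is rejected outright. A sketch (the exact exception subclass may vary):

```python
import datajoint as dj

# Referential integrity on insert: the parent Subject must already exist
try:
    Session.insert1({"subject_id": "UNKNOWN", "session_date": "2024-03-01"})
except dj.DataJointError as err:
    print("Insert rejected:", err)   # nothing is written; no orphaned session
```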

## Chapter Overview

The following chapters detail each operation:

- **[Insert](010-insert.ipynb)** — Adding data to Manual tables
- **[Delete](020-delete.ipynb)** — Removing data with cascading dependencies
- **[Updates](030-updates.ipynb)** — Rare in-place modifications
- **[Transactions](040-transactions.ipynb)** — ACID semantics and consistency
- **[Populate](050-populate.ipynb)** — Automatic workflow execution
- **[The `make` Method](055-make.ipynb)** — Defining computational logic
- **[Orchestration](060-orchestration.ipynb)** — Infrastructure for running at scale
105 changes: 2 additions & 103 deletions book/40-operations/010-insert.ipynb
@@ -3,108 +3,7 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Insert \n",
"\n",
"(This is an AI-generated placeholder -- to be updated soon.)\n",
"\n",
"DataJoint provides two primary commands for adding data to tables: `insert` and `insert1`. Both commands are essential for populating tables while ensuring data integrity, but they are suited for different scenarios depending on the quantity and structure of the data being inserted.\n",
"\n",
"## Overview of `insert1`\n",
"\n",
"The `insert1` command is used for adding a single row of data to a table. It expects a dictionary where each key corresponds to a table attribute and the associated value represents the data to be inserted.\n",
"\n",
"### Syntax\n",
"\n",
"```python\n",
"<Table>.insert1(data, ignore_extra_fields=False)\n",
"```\n",
"\n",
"### Parameters\n",
"\n",
"1. **`data`**: A dictionary representing a single row of data, with keys matching the table's attributes.\n",
"2. **`ignore_extra_fields`** *(default: False)*:\n",
" - If `True`, attributes in the dictionary that are not part of the table schema are ignored.\n",
" - If `False`, the presence of extra fields will result in an error.\n",
"\n",
"### Example\n",
"\n",
"```python\n",
"import datajoint as dj\n",
"\n",
"schema = dj.Schema('example_schema')\n",
"\n",
"@schema\n",
"class Animal(dj.Manual):\n",
" definition = \"\"\"\n",
" animal_id: int # Unique identifier for the animal\n",
" ---\n",
" species: varchar(64) # Species of the animal\n",
" age: int # Age of the animal in years\n",
" \"\"\"\n",
"\n",
"# Insert a single row into the Animal table\n",
"Animal.insert1({\n",
" 'animal_id': 1,\n",
" 'species': 'Dog',\n",
" 'age': 5\n",
"})\n",
"```\n",
"\n",
"### Key Points\n",
"\n",
"- `insert1` is ideal for inserting a single, well-defined record.\n",
"- It ensures clarity when adding individual entries, reducing ambiguity in debugging.\n",
"\n",
"## Overview of `insert`\n",
"\n",
"The `insert` command is designed for batch insertion, allowing multiple rows to be added in a single operation. It accepts a list of dictionaries, where each dictionary represents a single row of data.\n",
"\n",
"### Syntax\n",
"\n",
"```python\n",
"<Table>.insert(data, ignore_extra_fields=False, skip_duplicates=False)\n",
"```\n",
"\n",
"### Parameters\n",
"\n",
"1. **`data`**: A list of dictionaries, where each dictionary corresponds to a row of data to insert.\n",
"2. **`ignore_extra_fields`** *(default: False)*:\n",
" - If `True`, any extra keys in the dictionaries are ignored.\n",
" - If `False`, extra keys result in an error.\n",
"3. **`skip_duplicates`** *(default: False)*:\n",
" - If `True`, rows with duplicate primary keys are skipped.\n",
" - If `False`, duplicate rows trigger an error.\n",
"\n",
"### Example\n",
"\n",
"```python\n",
"# Insert multiple rows into the Animal table\n",
"Animal.insert([\n",
" {'animal_id': 2, 'species': 'Cat', 'age': 3},\n",
" {'animal_id': 3, 'species': 'Rabbit', 'age': 2}\n",
"])\n",
"```\n",
"\n",
"### Key Points\n",
"\n",
"- `insert` is efficient for adding multiple records in a single operation.\n",
"- Use `skip_duplicates=True` to gracefully handle re-insertions of existing data.\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Use ****`insert1`**** for Single Rows**: Prefer `insert1` when working with individual entries to maintain clarity.\n",
"2. **Validate Data Consistency**: Ensure the input data adheres to the schema definition.\n",
"3. **Batch Insert for Performance**: Use `insert` for larger datasets to minimize database interactions.\n",
"4. **Handle Extra Fields Carefully**: Use `ignore_extra_fields=False` to detect unexpected keys.\n",
"5. **Avoid Duplicates**: Use `skip_duplicates=True` when re-inserting known data to avoid errors.\n",
"\n",
"## Summary\n",
"\n",
"- Use `insert1` for single-row insertions and `insert` for batch operations.\n",
"- Both commands enforce schema constraints and maintain the integrity of the database.\n",
"- Proper use of these commands ensures efficient, accurate, and scalable data entry into your DataJoint pi\n"
]
"source": "# Insert\n\nThe `insert` operation adds new entities to Manual tables.\nIn the context of the [Relational Workflow Model](../20-concepts/05-workflows.md), inserting data is how information enters the pipeline from external sources.\n\n## Insert in the Workflow\n\nThe `insert` operation applies to **Manual tables**—tables that receive data from outside the pipeline:\n\n| Table Tier | How Data Enters |\n|------------|-----------------|\n| **Lookup** | `contents` property (part of schema definition) |\n| **Manual** | `insert` from external sources |\n| **Imported/Computed** | `populate()` mechanism |\n\nFor **Manual tables**, each insert represents new information entering the workflow from an external source.\nThe term \"manual\" refers to the data's origin—*outside the pipeline*—not to how it arrives.\nInserts into Manual tables can come from human data entry, automated scripts parsing instrument files, or integrations with external systems.\nWhat matters is that the pipeline's `populate` mechanism does not create this data—it comes from outside.\n\nEach insert into a Manual table potentially triggers downstream computations: when you insert a new session, all Imported and Computed tables that depend on it become candidates for population.\n\n## The `insert1` Method\n\nUse `insert1` to add a single row:\n\n```python\n<Table>.insert1(row, ignore_extra_fields=False)\n```\n\n**Parameters:**\n- **`row`**: A dictionary with keys matching table attributes\n- **`ignore_extra_fields`**: If `True`, extra dictionary keys are silently ignored; if `False` (default), extra keys raise an error\n\n**Example:**\n```python\n# Insert a single subject into a Manual table\nSubject.insert1({\n 'subject_id': 'M001',\n 'species': 'mouse',\n 'sex': 'M',\n 'date_of_birth': '2023-06-15'\n})\n```\n\nUse `insert1` when:\n- Adding individual records interactively\n- Processing items one at a time in a loop where you need error handling per item\n- Debugging, where single-row operations provide clearer error messages\n\n## The `insert` Method\n\nUse `insert` for batch insertion of multiple rows:\n\n```python\n<Table>.insert(rows, ignore_extra_fields=False, skip_duplicates=False)\n```\n\n**Parameters:**\n- **`rows`**: A list of dictionaries (or any iterable of dict-like objects)\n- **`ignore_extra_fields`**: If `True`, extra keys are ignored\n- **`skip_duplicates`**: If `True`, rows with existing primary keys are silently skipped; if `False` (default), duplicates raise an error\n\n**Example:**\n```python\n# Batch insert multiple sessions (could be from a script parsing log files)\nSession.insert([\n {'subject_id': 'M001', 'session_date': '2024-01-15', 'session_notes': 'baseline'},\n {'subject_id': 'M001', 'session_date': '2024-01-16', 'session_notes': 'treatment'},\n {'subject_id': 'M001', 'session_date': '2024-01-17', 'session_notes': 'follow-up'},\n])\n```\n\nUse `insert` when:\n- Loading data from files or external sources\n- Importing from external databases or APIs\n- Migrating or synchronizing data between systems\n\n## Referential Integrity\n\nDataJoint enforces referential integrity on insert.\nIf a table has foreign key dependencies, the referenced entities must already exist:\n\n```python\n# This will fail if subject 'M001' doesn't exist in Subject table\nSession.insert1({\n 'subject_id': 'M001', # Must exist in Subject\n 'session_date': '2024-01-15'\n})\n```\n\nThis constraint ensures the dependency graph remains valid—you cannot create downstream entities without their upstream dependencies.\nNote 
that Lookup table data (defined via `contents`) is automatically available when the schema is activated, so foreign key references to Lookup tables are always satisfied.\n\n## Best Practices\n\n1. **Match insert method to use case**: Use `insert1` for single records, `insert` for batches\n2. **Keep `ignore_extra_fields=False`** (default): Helps catch data mapping errors early\n3. **Insert upstream before downstream**: Respect the dependency order defined by foreign keys\n4. **Use `skip_duplicates=True` for idempotent scripts**: When re-running import scripts, this avoids errors on existing data\n5. **Let `populate()` handle auto-populated tables**: Never insert directly into Imported or Computed tables"
}
],
"metadata": {
@@ -114,4 +13,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}