219 changes: 219 additions & 0 deletions book/40-operations/000-workflow-operations.md
@@ -0,0 +1,219 @@
# Workflow Operations

## Executing the Workflow

The previous sections established the **Relational Workflow Model** and schema design principles.
Your schema defines *what* entities exist, *how* they depend on each other, and *when* they are created in the workflow.
**Operations** are the actions that execute this workflow—populating your pipeline with actual data.

In DataJoint, operations fall into two categories:

1. **Manual operations** — Actions initiated *outside* the pipeline using `insert`, `delete`, and occasionally `update`
2. **Automatic operations** — Pipeline-driven population using `populate` for Imported and Computed tables

The term "manual" does not imply human involvement—it means the operation originates *external to the pipeline*.
A script that parses instrument files and inserts session records is performing manual operations, even though no human is involved.
The key distinction is *who initiates the action*: external processes (manual) versus the pipeline's own `populate` mechanism (automatic).

This distinction maps directly to the table tiers introduced in the [Relational Workflow Model](../20-concepts/05-workflows.md):

| Table Tier | How Data Enters | Typical Operations |
|------------|-----------------|-------------------|
| **Lookup** | Schema definition (`contents` property) | None—predefined |
| **Manual** | External to pipeline | `insert`, `delete` |
| **Imported** | Pipeline-driven acquisition | `populate` |
| **Computed** | Pipeline-driven computation | `populate` |

## Lookup Tables: Part of the Schema

**Lookup tables are not part of the workflow**—they are part of the schema definition itself.

Lookup tables contain reference data, controlled vocabularies, parameter sets, and configuration values that define the *context* in which the workflow operates.
This data is:

- Defined in the table class using the `contents` property
- Automatically present when the schema is activated
- Shared across all workflow executions

Examples include:
- Species names and codes
- Experimental protocols
- Processing parameter sets
- Instrument configurations

Because lookup data defines the problem space rather than recording workflow execution, it is specified declaratively as part of the table definition:

```python
@schema
class BlobParamSet(dj.Lookup):
definition = """
blob_paramset : int
---
min_sigma : float
max_sigma : float
threshold : float
"""
contents = [
(1, 1.0, 5.0, 0.1),
(2, 2.0, 10.0, 0.05),
]
```

When the schema is activated, an "empty" pipeline already has its lookup tables populated.
This ensures that reference data is always available and consistent across all installations of the pipeline.
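
A quick way to verify this, using the `BlobParamSet` table defined above (assuming no additional rows have been inserted yet):

```python
# The rows declared in `contents` are present without any explicit insert
assert len(BlobParamSet()) == 2   # both predefined parameter sets exist
BlobParamSet()                    # preview the reference data
```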

## Manual Tables: The Workflow Entry Points

**Manual tables** are where new information enters the workflow from external sources.
The term "manual" refers to the data's origin—*outside the pipeline*—not to how it gets there.

Manual tables capture information that originates external to the computational pipeline:

- Experimental subjects and sessions
- Observations and annotations
- External system identifiers
- Curated selections and decisions

Data enters Manual tables through explicit `insert` operations from various sources:

- **Human entry**: Data entry forms, lab notebooks, manual curation
- **Automated scripts**: Parsing instrument files, syncing from external databases
- **External systems**: Laboratory information management systems (LIMS), scheduling software
- **Integration pipelines**: ETL processes that import data from other sources

Each insert into a Manual table potentially triggers downstream computations—this is the "data enters the system" event that drives the pipeline forward.
Whether a human clicks a button or a cron job runs a script, the effect is the same: new data enters the pipeline and becomes available for automatic processing.
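
For example, a scheduled script might sync an instrument log into the pipeline. The sketch below is hypothetical: it assumes a CSV file named `instrument_log.csv` with `subject` and `date` columns, and the `Session` table used throughout this chapter.

```python
import csv

# A "manual" operation with no human in the loop: an external script
# parses an instrument log and inserts the sessions it finds.
with open("instrument_log.csv", newline="") as f:   # assumed file and layout
    rows = [
        {"subject_id": row["subject"], "session_date": row["date"]}
        for row in csv.DictReader(f)
    ]

# skip_duplicates makes the script safe to re-run on the same log
Session.insert(rows, skip_duplicates=True)
```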

## Automatic Population: The Workflow Engine

**Imported** and **Computed** tables are populated automatically through the `populate` mechanism.
This is the core of workflow automation in DataJoint.

When you call `populate()` on an auto-populated table, DataJoint:

1. Identifies what work is missing by examining upstream dependencies
2. Executes the table's `make()` method for each pending item
3. Wraps each computation in a transaction for integrity
4. Continues through the remaining work, either stopping at the first error or, with `suppress_errors=True`, logging errors and moving on

This automation embodies the Relational Workflow Model's key principle: **the schema is an executable specification**.
You don't write scripts to orchestrate computations—you define dependencies, and the system figures out what to run.

```python
# The schema defines what should be computed
# populate() executes it
Detection.populate(display_progress=True)
```
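
What `populate()` actually runs is the table's `make()` method. The class below is a minimal sketch of what such a table might look like; it assumes a hypothetical upstream `Image` table, the `BlobParamSet` lookup from earlier, and a `detect_blobs()` analysis function.

```python
@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    n_blobs : int  # number of detected blobs
    """

    def make(self, key):
        # fetch the inputs identified by this key
        image = (Image & key).fetch1("image")
        params = (BlobParamSet & key).fetch1()
        blobs = detect_blobs(image, params)   # hypothetical analysis routine
        # insert exactly one result row per key, inside populate()'s transaction
        self.insert1(dict(key, n_blobs=len(blobs)))
```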

## The Three Core Operations

### Insert: Adding Data

The `insert` operation adds new entities to Manual tables, representing new information entering the workflow from external sources.

```python
# Single row
Subject.insert1({"subject_id": "M001", "species": "mouse", "sex": "M"})

# Multiple rows
Session.insert([
{"subject_id": "M001", "session_date": "2024-01-15"},
{"subject_id": "M001", "session_date": "2024-01-16"},
])
```

### Delete: Removing Data with Cascade

The `delete` operation removes entities and **all their downstream dependents**.
This cascading behavior is fundamental to maintaining **computational validity**—the guarantee that derived data remains consistent with its inputs.

When you delete an entity:
- All entities that depend on it (via foreign keys) are also deleted
- This cascades through the entire dependency graph
- The result is a consistent database state

```python
# Deleting a session removes all its downstream analysis
(Session & {"subject_id": "M001", "session_date": "2024-01-15"}).delete()
```

Cascading delete is the primary mechanism for:
- **Correcting errors**: Delete incorrect upstream data; downstream results disappear automatically
- **Reprocessing**: Delete computed results to regenerate them with updated code (see the sketch after this list)
- **Data lifecycle**: Remove obsolete data and everything derived from it
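
For the reprocessing case, a minimal sketch that reuses the hypothetical `Detection` table from earlier:

```python
# Recompute results derived from parameter set 1 after a code change:
# delete removes the stale rows, populate() regenerates them.
(Detection & "blob_paramset = 1").delete()
Detection.populate(display_progress=True)
```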

### Update: Rare and Deliberate

The `update` operation modifies existing values **in place**.
In DataJoint, updates are deliberately rare because they can violate computational validity.

Consider: if you update an upstream value, downstream computed results become inconsistent—they were derived from the old value but now coexist with the new one.
The proper approach is usually **delete and reinsert** (sketched below):

1. Delete the incorrect data (cascading removes dependent computations)
2. Insert the corrected data
3. Re-run `populate()` to regenerate downstream results
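
A minimal sketch of this cycle, using the `Session` and `Detection` tables from earlier examples and assuming a session whose date was entered incorrectly:

```python
# 1. Delete the bad entry; the cascade removes everything derived from it
(Session & {"subject_id": "M001", "session_date": "2024-01-15"}).delete()

# 2. Insert the corrected record
Session.insert1({"subject_id": "M001", "session_date": "2024-01-18"})

# 3. Regenerate downstream results
Detection.populate()
```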

The `update1` method exists for cases where in-place correction is truly needed—typically for:
- Fixing typos in descriptive fields that don't affect computations
- Correcting metadata that has no downstream dependencies
- Administrative changes to non-scientific attributes

```python
# Use sparingly—only for corrections that don't affect downstream data
Subject.update1({"subject_id": "M001", "notes": "Corrected housing info"})
```

## The Workflow Execution Pattern

A typical DataJoint workflow follows this pattern:

```
┌──────────────────────────────────────────────────────────────┐
│ 1. SCHEMA ACTIVATION                                          │
│    - Define tables and dependencies                           │
│    - Lookup tables are automatically populated (contents)     │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 2. EXTERNAL DATA ENTRY                                        │
│    - Insert subjects, sessions, trials into Manual tables     │
│    - Each insert is a potential trigger for downstream        │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 3. AUTOMATIC POPULATION                                       │
│    - Call populate() on Imported tables (data acquisition)    │
│    - Call populate() on Computed tables (analysis)            │
│    - System determines order from dependency graph            │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 4. ITERATION                                                  │
│    - New manual entries trigger new computations              │
│    - Errors corrected via delete + reinsert + repopulate      │
│    - Pipeline grows incrementally                             │
└──────────────────────────────────────────────────────────────┘
```
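
In code, a compressed run-through of these stages might look like the following sketch. `my_pipeline` is an assumed module containing the tables used in this chapter, including a hypothetical Imported `Image` table.

```python
# 1. Schema activation: importing the pipeline module declares the tables;
#    Lookup tables such as BlobParamSet are filled from their `contents`.
from my_pipeline import Subject, Session, Image, Detection   # assumed module

# 2. External data entry into Manual tables
Subject.insert1({"subject_id": "M002", "species": "mouse", "sex": "F"})
Session.insert1({"subject_id": "M002", "session_date": "2024-02-01"})

# 3. Automatic population, in dependency order
Image.populate()       # Imported: bring in raw data for the new session
Detection.populate()   # Computed: run the analysis

# 4. Iteration: later inserts simply create more pending work for populate()
```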

## Transactions and Integrity

All operations in DataJoint respect **ACID transactions** and **referential integrity**:

- **Inserts** verify that all referenced foreign keys exist
- **Deletes** cascade to maintain referential integrity
- **Populate** wraps each `make()` call in a transaction

This ensures that the database always represents a consistent state—there are no orphaned records, no dangling references, and no partially completed computations visible to other users.
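
For example, an insert that references a missing parent row is rejected outright. A sketch (the exact exception subclass may vary):

```python
import datajoint as dj

# Referential integrity on insert: the parent Subject must already exist
try:
    Session.insert1({"subject_id": "UNKNOWN", "session_date": "2024-03-01"})
except dj.DataJointError as err:
    print("Insert rejected:", err)   # nothing is written; no orphaned session
```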

## Chapter Overview

The following chapters detail each operation:

- **[Insert](010-insert.ipynb)** — Adding data to Manual tables
- **[Delete](020-delete.ipynb)** — Removing data with cascading dependencies
- **[Updates](030-updates.ipynb)** — Rare in-place modifications
- **[Transactions](040-transactions.ipynb)** — ACID semantics and consistency
- **[Populate](050-populate.ipynb)** — Automatic workflow execution
- **[The `make` Method](055-make.ipynb)** — Defining computational logic
- **[Orchestration](060-orchestration.ipynb)** — Infrastructure for running at scale
105 changes: 2 additions & 103 deletions book/40-operations/010-insert.ipynb
@@ -3,108 +3,7 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Insert \n",
"\n",
"(This is an AI-generated placeholder -- to be updated soon.)\n",
"\n",
"DataJoint provides two primary commands for adding data to tables: `insert` and `insert1`. Both commands are essential for populating tables while ensuring data integrity, but they are suited for different scenarios depending on the quantity and structure of the data being inserted.\n",
"\n",
"## Overview of `insert1`\n",
"\n",
"The `insert1` command is used for adding a single row of data to a table. It expects a dictionary where each key corresponds to a table attribute and the associated value represents the data to be inserted.\n",
"\n",
"### Syntax\n",
"\n",
"```python\n",
"<Table>.insert1(data, ignore_extra_fields=False)\n",
"```\n",
"\n",
"### Parameters\n",
"\n",
"1. **`data`**: A dictionary representing a single row of data, with keys matching the table's attributes.\n",
"2. **`ignore_extra_fields`** *(default: False)*:\n",
" - If `True`, attributes in the dictionary that are not part of the table schema are ignored.\n",
" - If `False`, the presence of extra fields will result in an error.\n",
"\n",
"### Example\n",
"\n",
"```python\n",
"import datajoint as dj\n",
"\n",
"schema = dj.Schema('example_schema')\n",
"\n",
"@schema\n",
"class Animal(dj.Manual):\n",
" definition = \"\"\"\n",
" animal_id: int # Unique identifier for the animal\n",
" ---\n",
" species: varchar(64) # Species of the animal\n",
" age: int # Age of the animal in years\n",
" \"\"\"\n",
"\n",
"# Insert a single row into the Animal table\n",
"Animal.insert1({\n",
" 'animal_id': 1,\n",
" 'species': 'Dog',\n",
" 'age': 5\n",
"})\n",
"```\n",
"\n",
"### Key Points\n",
"\n",
"- `insert1` is ideal for inserting a single, well-defined record.\n",
"- It ensures clarity when adding individual entries, reducing ambiguity in debugging.\n",
"\n",
"## Overview of `insert`\n",
"\n",
"The `insert` command is designed for batch insertion, allowing multiple rows to be added in a single operation. It accepts a list of dictionaries, where each dictionary represents a single row of data.\n",
"\n",
"### Syntax\n",
"\n",
"```python\n",
"<Table>.insert(data, ignore_extra_fields=False, skip_duplicates=False)\n",
"```\n",
"\n",
"### Parameters\n",
"\n",
"1. **`data`**: A list of dictionaries, where each dictionary corresponds to a row of data to insert.\n",
"2. **`ignore_extra_fields`** *(default: False)*:\n",
" - If `True`, any extra keys in the dictionaries are ignored.\n",
" - If `False`, extra keys result in an error.\n",
"3. **`skip_duplicates`** *(default: False)*:\n",
" - If `True`, rows with duplicate primary keys are skipped.\n",
" - If `False`, duplicate rows trigger an error.\n",
"\n",
"### Example\n",
"\n",
"```python\n",
"# Insert multiple rows into the Animal table\n",
"Animal.insert([\n",
" {'animal_id': 2, 'species': 'Cat', 'age': 3},\n",
" {'animal_id': 3, 'species': 'Rabbit', 'age': 2}\n",
"])\n",
"```\n",
"\n",
"### Key Points\n",
"\n",
"- `insert` is efficient for adding multiple records in a single operation.\n",
"- Use `skip_duplicates=True` to gracefully handle re-insertions of existing data.\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Use ****`insert1`**** for Single Rows**: Prefer `insert1` when working with individual entries to maintain clarity.\n",
"2. **Validate Data Consistency**: Ensure the input data adheres to the schema definition.\n",
"3. **Batch Insert for Performance**: Use `insert` for larger datasets to minimize database interactions.\n",
"4. **Handle Extra Fields Carefully**: Use `ignore_extra_fields=False` to detect unexpected keys.\n",
"5. **Avoid Duplicates**: Use `skip_duplicates=True` when re-inserting known data to avoid errors.\n",
"\n",
"## Summary\n",
"\n",
"- Use `insert1` for single-row insertions and `insert` for batch operations.\n",
"- Both commands enforce schema constraints and maintain the integrity of the database.\n",
"- Proper use of these commands ensures efficient, accurate, and scalable data entry into your DataJoint pi\n"
]
"source": "# Insert\n\nThe `insert` operation adds new entities to Manual tables.\nIn the context of the [Relational Workflow Model](../20-concepts/05-workflows.md), inserting data is how information enters the pipeline from external sources.\n\n## Insert in the Workflow\n\nThe `insert` operation applies to **Manual tables**—tables that receive data from outside the pipeline:\n\n| Table Tier | How Data Enters |\n|------------|-----------------|\n| **Lookup** | `contents` property (part of schema definition) |\n| **Manual** | `insert` from external sources |\n| **Imported/Computed** | `populate()` mechanism |\n\nFor **Manual tables**, each insert represents new information entering the workflow from an external source.\nThe term \"manual\" refers to the data's origin—*outside the pipeline*—not to how it arrives.\nInserts into Manual tables can come from human data entry, automated scripts parsing instrument files, or integrations with external systems.\nWhat matters is that the pipeline's `populate` mechanism does not create this data—it comes from outside.\n\nEach insert into a Manual table potentially triggers downstream computations: when you insert a new session, all Imported and Computed tables that depend on it become candidates for population.\n\n## The `insert1` Method\n\nUse `insert1` to add a single row:\n\n```python\n<Table>.insert1(row, ignore_extra_fields=False)\n```\n\n**Parameters:**\n- **`row`**: A dictionary with keys matching table attributes\n- **`ignore_extra_fields`**: If `True`, extra dictionary keys are silently ignored; if `False` (default), extra keys raise an error\n\n**Example:**\n```python\n# Insert a single subject into a Manual table\nSubject.insert1({\n 'subject_id': 'M001',\n 'species': 'mouse',\n 'sex': 'M',\n 'date_of_birth': '2023-06-15'\n})\n```\n\nUse `insert1` when:\n- Adding individual records interactively\n- Processing items one at a time in a loop where you need error handling per item\n- Debugging, where single-row operations provide clearer error messages\n\n## The `insert` Method\n\nUse `insert` for batch insertion of multiple rows:\n\n```python\n<Table>.insert(rows, ignore_extra_fields=False, skip_duplicates=False)\n```\n\n**Parameters:**\n- **`rows`**: A list of dictionaries (or any iterable of dict-like objects)\n- **`ignore_extra_fields`**: If `True`, extra keys are ignored\n- **`skip_duplicates`**: If `True`, rows with existing primary keys are silently skipped; if `False` (default), duplicates raise an error\n\n**Example:**\n```python\n# Batch insert multiple sessions (could be from a script parsing log files)\nSession.insert([\n {'subject_id': 'M001', 'session_date': '2024-01-15', 'session_notes': 'baseline'},\n {'subject_id': 'M001', 'session_date': '2024-01-16', 'session_notes': 'treatment'},\n {'subject_id': 'M001', 'session_date': '2024-01-17', 'session_notes': 'follow-up'},\n])\n```\n\nUse `insert` when:\n- Loading data from files or external sources\n- Importing from external databases or APIs\n- Migrating or synchronizing data between systems\n\n## Referential Integrity\n\nDataJoint enforces referential integrity on insert.\nIf a table has foreign key dependencies, the referenced entities must already exist:\n\n```python\n# This will fail if subject 'M001' doesn't exist in Subject table\nSession.insert1({\n 'subject_id': 'M001', # Must exist in Subject\n 'session_date': '2024-01-15'\n})\n```\n\nThis constraint ensures the dependency graph remains valid—you cannot create downstream entities without their upstream dependencies.\nNote 
that Lookup table data (defined via `contents`) is automatically available when the schema is activated, so foreign key references to Lookup tables are always satisfied.\n\n## Best Practices\n\n1. **Match insert method to use case**: Use `insert1` for single records, `insert` for batches\n2. **Keep `ignore_extra_fields=False`** (default): Helps catch data mapping errors early\n3. **Insert upstream before downstream**: Respect the dependency order defined by foreign keys\n4. **Use `skip_duplicates=True` for idempotent scripts**: When re-running import scripts, this avoids errors on existing data\n5. **Let `populate()` handle auto-populated tables**: Never insert directly into Imported or Computed tables"
}
],
"metadata": {
@@ -114,4 +13,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}