diff --git a/book/40-operations/000-workflow-operations.md b/book/40-operations/000-workflow-operations.md new file mode 100644 index 0000000..7527b31 --- /dev/null +++ b/book/40-operations/000-workflow-operations.md @@ -0,0 +1,219 @@ +# Workflow Operations + +## Executing the Workflow + +The previous sections established the **Relational Workflow Model** and schema design principles. +Your schema defines *what* entities exist, *how* they depend on each other, and *when* they are created in the workflow. +**Operations** are the actions that execute this workflow—populating your pipeline with actual data. + +In DataJoint, operations fall into two categories: + +1. **Manual operations** — Actions initiated *outside* the pipeline using `insert`, `delete`, and occasionally `update` +2. **Automatic operations** — Pipeline-driven population using `populate` for Imported and Computed tables + +The term "manual" does not imply human involvement—it means the operation originates *external to the pipeline*. +A script that parses instrument files and inserts session records is performing manual operations, even though no human is involved. +The key distinction is *who initiates the action*: external processes (manual) versus the pipeline's own `populate` mechanism (automatic). + +This distinction maps directly to the table tiers introduced in the [Relational Workflow Model](../20-concepts/05-workflows.md): + +| Table Tier | How Data Enters | Typical Operations | +|------------|-----------------|-------------------| +| **Lookup** | Schema definition (`contents` property) | None—predefined | +| **Manual** | External to pipeline | `insert`, `delete` | +| **Imported** | Pipeline-driven acquisition | `populate` | +| **Computed** | Pipeline-driven computation | `populate` | + +## Lookup Tables: Part of the Schema + +**Lookup tables are not part of the workflow**—they are part of the schema definition itself. + +Lookup tables contain reference data, controlled vocabularies, parameter sets, and configuration values that define the *context* in which the workflow operates. +This data is: + +- Defined in the table class using the `contents` property +- Automatically present when the schema is activated +- Shared across all workflow executions + +Examples include: +- Species names and codes +- Experimental protocols +- Processing parameter sets +- Instrument configurations + +Because lookup data defines the problem space rather than recording workflow execution, it is specified declaratively as part of the table definition: + +```python +@schema +class BlobParamSet(dj.Lookup): + definition = """ + blob_paramset : int + --- + min_sigma : float + max_sigma : float + threshold : float + """ + contents = [ + (1, 1.0, 5.0, 0.1), + (2, 2.0, 10.0, 0.05), + ] +``` + +When the schema is activated, an "empty" pipeline already has its lookup tables populated. +This ensures that reference data is always available and consistent across all installations of the pipeline. + +## Manual Tables: The Workflow Entry Points + +**Manual tables** are where new information enters the workflow from external sources. +The term "manual" refers to the data's origin—*outside the pipeline*—not to how it gets there. 
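+
+For example, an automated import job performs manual operations in this sense: it inserts records that originate outside the pipeline. A minimal sketch, in which `parse_session_log`, `new_log_files`, and the exact `Session` attributes are hypothetical:
+
+```python
+# Scheduled job: parse instrument log files and register sessions in the pipeline
+records = [parse_session_log(f) for f in new_log_files]  # hypothetical parser
+Session.insert(records, skip_duplicates=True)  # idempotent on re-runs
+```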
+ +Manual tables capture information that originates external to the computational pipeline: + +- Experimental subjects and sessions +- Observations and annotations +- External system identifiers +- Curated selections and decisions + +Data enters Manual tables through explicit `insert` operations from various sources: + +- **Human entry**: Data entry forms, lab notebooks, manual curation +- **Automated scripts**: Parsing instrument files, syncing from external databases +- **External systems**: Laboratory information management systems (LIMS), scheduling software +- **Integration pipelines**: ETL processes that import data from other sources + +Each insert into a Manual table potentially triggers downstream computations—this is the "data enters the system" event that drives the pipeline forward. +Whether a human clicks a button or a cron job runs a script, the effect is the same: new data enters the pipeline and becomes available for automatic processing. + +## Automatic Population: The Workflow Engine + +**Imported** and **Computed** tables are populated automatically through the `populate` mechanism. +This is the core of workflow automation in DataJoint. + +When you call `populate()` on an auto-populated table, DataJoint: + +1. Identifies what work is missing by examining upstream dependencies +2. Executes the table's `make()` method for each pending item +3. Wraps each computation in a transaction for integrity +4. Continues through all pending work, handling errors gracefully + +This automation embodies the Relational Workflow Model's key principle: **the schema is an executable specification**. +You don't write scripts to orchestrate computations—you define dependencies, and the system figures out what to run. + +```python +# The schema defines what should be computed +# populate() executes it +Detection.populate(display_progress=True) +``` + +## The Three Core Operations + +### Insert: Adding Data + +The `insert` operation adds new entities to Manual tables, representing new information entering the workflow from external sources. + +```python +# Single row +Subject.insert1({"subject_id": "M001", "species": "mouse", "sex": "M"}) + +# Multiple rows +Session.insert([ + {"subject_id": "M001", "session_date": "2024-01-15"}, + {"subject_id": "M001", "session_date": "2024-01-16"}, +]) +``` + +### Delete: Removing Data with Cascade + +The `delete` operation removes entities and **all their downstream dependents**. +This cascading behavior is fundamental to maintaining **computational validity**—the guarantee that derived data remains consistent with its inputs. + +When you delete an entity: +- All entities that depend on it (via foreign keys) are also deleted +- This cascades through the entire dependency graph +- The result is a consistent database state + +```python +# Deleting a session removes all its downstream analysis +(Session & {"subject_id": "M001", "session_date": "2024-01-15"}).delete() +``` + +Cascading delete is the primary mechanism for: +- **Correcting errors**: Delete incorrect upstream data; downstream results disappear automatically +- **Reprocessing**: Delete computed results to regenerate them with updated code +- **Data lifecycle**: Remove obsolete data and everything derived from it + +### Update: Rare and Deliberate + +The `update` operation modifies existing values **in place**. +In DataJoint, updates are deliberately rare because they can violate computational validity. 
+ +Consider: if you update an upstream value, downstream computed results become inconsistent—they were derived from the old value but now coexist with the new one. +The proper approach is usually **delete and reinsert**: + +1. Delete the incorrect data (cascading removes dependent computations) +2. Insert the corrected data +3. Re-run `populate()` to regenerate downstream results + +The `update1` method exists for cases where in-place correction is truly needed—typically for: +- Fixing typos in descriptive fields that don't affect computations +- Correcting metadata that has no downstream dependencies +- Administrative changes to non-scientific attributes + +```python +# Use sparingly—only for corrections that don't affect downstream data +Subject.update1({"subject_id": "M001", "notes": "Corrected housing info"}) +``` + +## The Workflow Execution Pattern + +A typical DataJoint workflow follows this pattern: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ 1. SCHEMA ACTIVATION │ +│ - Define tables and dependencies │ +│ - Lookup tables are automatically populated (contents) │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ 2. EXTERNAL DATA ENTRY │ +│ - Insert subjects, sessions, trials into Manual tables │ +│ - Each insert is a potential trigger for downstream │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ 3. AUTOMATIC POPULATION │ +│ - Call populate() on Imported tables (data acquisition) │ +│ - Call populate() on Computed tables (analysis) │ +│ - System determines order from dependency graph │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ 4. ITERATION │ +│ - New manual entries trigger new computations │ +│ - Errors corrected via delete + reinsert + repopulate │ +│ - Pipeline grows incrementally │ +└─────────────────────────────────────────────────────────────┘ +``` + +## Transactions and Integrity + +All operations in DataJoint respect **ACID transactions** and **referential integrity**: + +- **Inserts** verify that all referenced foreign keys exist +- **Deletes** cascade to maintain referential integrity +- **Populate** wraps each `make()` call in a transaction + +This ensures that the database always represents a consistent state—there are no orphaned records, no dangling references, and no partially-completed computations visible to other users. 
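+
+For operations issued outside of `populate()`, the same guarantee can extend to a group of statements by opening an explicit transaction. The [Transactions](040-transactions.ipynb) chapter covers this in detail; the sketch below assumes the connection's `transaction` context manager (`dj.conn().transaction`) and the `Subject` and `Session` tables from the examples above, so that neither row becomes visible unless both inserts succeed:
+
+```python
+import datajoint as dj
+
+# Insert a subject and its first session as one atomic unit
+with dj.conn().transaction:
+    Subject.insert1({"subject_id": "M002", "species": "mouse", "sex": "F"})
+    Session.insert1({"subject_id": "M002", "session_date": "2024-02-01"})
+```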
+ +## Chapter Overview + +The following chapters detail each operation: + +- **[Insert](010-insert.ipynb)** — Adding data to Manual tables +- **[Delete](020-delete.ipynb)** — Removing data with cascading dependencies +- **[Updates](030-updates.ipynb)** — Rare in-place modifications +- **[Transactions](040-transactions.ipynb)** — ACID semantics and consistency +- **[Populate](050-populate.ipynb)** — Automatic workflow execution +- **[The `make` Method](055-make.ipynb)** — Defining computational logic +- **[Orchestration](060-orchestration.ipynb)** — Infrastructure for running at scale diff --git a/book/40-operations/010-insert.ipynb b/book/40-operations/010-insert.ipynb index aa82428..0beb080 100644 --- a/book/40-operations/010-insert.ipynb +++ b/book/40-operations/010-insert.ipynb @@ -3,108 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Insert \n", - "\n", - "(This is an AI-generated placeholder -- to be updated soon.)\n", - "\n", - "DataJoint provides two primary commands for adding data to tables: `insert` and `insert1`. Both commands are essential for populating tables while ensuring data integrity, but they are suited for different scenarios depending on the quantity and structure of the data being inserted.\n", - "\n", - "## Overview of `insert1`\n", - "\n", - "The `insert1` command is used for adding a single row of data to a table. It expects a dictionary where each key corresponds to a table attribute and the associated value represents the data to be inserted.\n", - "\n", - "### Syntax\n", - "\n", - "```python\n", - ".insert1(data, ignore_extra_fields=False)\n", - "```\n", - "\n", - "### Parameters\n", - "\n", - "1. **`data`**: A dictionary representing a single row of data, with keys matching the table's attributes.\n", - "2. **`ignore_extra_fields`** *(default: False)*:\n", - " - If `True`, attributes in the dictionary that are not part of the table schema are ignored.\n", - " - If `False`, the presence of extra fields will result in an error.\n", - "\n", - "### Example\n", - "\n", - "```python\n", - "import datajoint as dj\n", - "\n", - "schema = dj.Schema('example_schema')\n", - "\n", - "@schema\n", - "class Animal(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " age: int # Age of the animal in years\n", - " \"\"\"\n", - "\n", - "# Insert a single row into the Animal table\n", - "Animal.insert1({\n", - " 'animal_id': 1,\n", - " 'species': 'Dog',\n", - " 'age': 5\n", - "})\n", - "```\n", - "\n", - "### Key Points\n", - "\n", - "- `insert1` is ideal for inserting a single, well-defined record.\n", - "- It ensures clarity when adding individual entries, reducing ambiguity in debugging.\n", - "\n", - "## Overview of `insert`\n", - "\n", - "The `insert` command is designed for batch insertion, allowing multiple rows to be added in a single operation. It accepts a list of dictionaries, where each dictionary represents a single row of data.\n", - "\n", - "### Syntax\n", - "\n", - "```python\n", - "
.insert(data, ignore_extra_fields=False, skip_duplicates=False)\n", - "```\n", - "\n", - "### Parameters\n", - "\n", - "1. **`data`**: A list of dictionaries, where each dictionary corresponds to a row of data to insert.\n", - "2. **`ignore_extra_fields`** *(default: False)*:\n", - " - If `True`, any extra keys in the dictionaries are ignored.\n", - " - If `False`, extra keys result in an error.\n", - "3. **`skip_duplicates`** *(default: False)*:\n", - " - If `True`, rows with duplicate primary keys are skipped.\n", - " - If `False`, duplicate rows trigger an error.\n", - "\n", - "### Example\n", - "\n", - "```python\n", - "# Insert multiple rows into the Animal table\n", - "Animal.insert([\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3},\n", - " {'animal_id': 3, 'species': 'Rabbit', 'age': 2}\n", - "])\n", - "```\n", - "\n", - "### Key Points\n", - "\n", - "- `insert` is efficient for adding multiple records in a single operation.\n", - "- Use `skip_duplicates=True` to gracefully handle re-insertions of existing data.\n", - "\n", - "## Best Practices\n", - "\n", - "1. **Use ****`insert1`**** for Single Rows**: Prefer `insert1` when working with individual entries to maintain clarity.\n", - "2. **Validate Data Consistency**: Ensure the input data adheres to the schema definition.\n", - "3. **Batch Insert for Performance**: Use `insert` for larger datasets to minimize database interactions.\n", - "4. **Handle Extra Fields Carefully**: Use `ignore_extra_fields=False` to detect unexpected keys.\n", - "5. **Avoid Duplicates**: Use `skip_duplicates=True` when re-inserting known data to avoid errors.\n", - "\n", - "## Summary\n", - "\n", - "- Use `insert1` for single-row insertions and `insert` for batch operations.\n", - "- Both commands enforce schema constraints and maintain the integrity of the database.\n", - "- Proper use of these commands ensures efficient, accurate, and scalable data entry into your DataJoint pi\n" - ] + "source": "# Insert\n\nThe `insert` operation adds new entities to Manual tables.\nIn the context of the [Relational Workflow Model](../20-concepts/05-workflows.md), inserting data is how information enters the pipeline from external sources.\n\n## Insert in the Workflow\n\nThe `insert` operation applies to **Manual tables**—tables that receive data from outside the pipeline:\n\n| Table Tier | How Data Enters |\n|------------|-----------------|\n| **Lookup** | `contents` property (part of schema definition) |\n| **Manual** | `insert` from external sources |\n| **Imported/Computed** | `populate()` mechanism |\n\nFor **Manual tables**, each insert represents new information entering the workflow from an external source.\nThe term \"manual\" refers to the data's origin—*outside the pipeline*—not to how it arrives.\nInserts into Manual tables can come from human data entry, automated scripts parsing instrument files, or integrations with external systems.\nWhat matters is that the pipeline's `populate` mechanism does not create this data—it comes from outside.\n\nEach insert into a Manual table potentially triggers downstream computations: when you insert a new session, all Imported and Computed tables that depend on it become candidates for population.\n\n## The `insert1` Method\n\nUse `insert1` to add a single row:\n\n```python\n
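# General form; call on a Manual table class, e.g. Subject\n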
.insert1(row, ignore_extra_fields=False)\n```\n\n**Parameters:**\n- **`row`**: A dictionary with keys matching table attributes\n- **`ignore_extra_fields`**: If `True`, extra dictionary keys are silently ignored; if `False` (default), extra keys raise an error\n\n**Example:**\n```python\n# Insert a single subject into a Manual table\nSubject.insert1({\n 'subject_id': 'M001',\n 'species': 'mouse',\n 'sex': 'M',\n 'date_of_birth': '2023-06-15'\n})\n```\n\nUse `insert1` when:\n- Adding individual records interactively\n- Processing items one at a time in a loop where you need error handling per item\n- Debugging, where single-row operations provide clearer error messages\n\n## The `insert` Method\n\nUse `insert` for batch insertion of multiple rows:\n\n```python\n
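# General form; call on a Manual table class, e.g. Session\n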
.insert(rows, ignore_extra_fields=False, skip_duplicates=False)\n```\n\n**Parameters:**\n- **`rows`**: A list of dictionaries (or any iterable of dict-like objects)\n- **`ignore_extra_fields`**: If `True`, extra keys are ignored\n- **`skip_duplicates`**: If `True`, rows with existing primary keys are silently skipped; if `False` (default), duplicates raise an error\n\n**Example:**\n```python\n# Batch insert multiple sessions (could be from a script parsing log files)\nSession.insert([\n {'subject_id': 'M001', 'session_date': '2024-01-15', 'session_notes': 'baseline'},\n {'subject_id': 'M001', 'session_date': '2024-01-16', 'session_notes': 'treatment'},\n {'subject_id': 'M001', 'session_date': '2024-01-17', 'session_notes': 'follow-up'},\n])\n```\n\nUse `insert` when:\n- Loading data from files or external sources\n- Importing from external databases or APIs\n- Migrating or synchronizing data between systems\n\n## Referential Integrity\n\nDataJoint enforces referential integrity on insert.\nIf a table has foreign key dependencies, the referenced entities must already exist:\n\n```python\n# This will fail if subject 'M001' doesn't exist in Subject table\nSession.insert1({\n 'subject_id': 'M001', # Must exist in Subject\n 'session_date': '2024-01-15'\n})\n```\n\nThis constraint ensures the dependency graph remains valid—you cannot create downstream entities without their upstream dependencies.\nNote that Lookup table data (defined via `contents`) is automatically available when the schema is activated, so foreign key references to Lookup tables are always satisfied.\n\n## Best Practices\n\n1. **Match insert method to use case**: Use `insert1` for single records, `insert` for batches\n2. **Keep `ignore_extra_fields=False`** (default): Helps catch data mapping errors early\n3. **Insert upstream before downstream**: Respect the dependency order defined by foreign keys\n4. **Use `skip_duplicates=True` for idempotent scripts**: When re-running import scripts, this avoids errors on existing data\n5. **Let `populate()` handle auto-populated tables**: Never insert directly into Imported or Computed tables" } ], "metadata": { @@ -114,4 +13,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/40-operations/020-delete.ipynb b/book/40-operations/020-delete.ipynb index 62033c1..1483fc9 100644 --- a/book/40-operations/020-delete.ipynb +++ b/book/40-operations/020-delete.ipynb @@ -3,112 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Delete\n", - "\n", - "(This is an AI-generated placeholder to be edited.)\n", - "\n", - "The `delete` command in DataJoint provides a robust mechanism for removing data from tables. It ensures that deletions respect the dependency structure defined by the relational schema, preserving the integrity of your database. This command is powerful and should be used with a clear understanding of its effects on downstream dependencies.\n", - "\n", - "## Overview of `delete`\n", - "\n", - "The `delete` command removes entries from a table. When executed, it ensures that all dependent data in downstream tables is also removed, unless explicitly restricted.\n", - "\n", - "### Syntax\n", - "\n", - "```python\n", - "
.delete(safemode=True, quick=False)\n", - "```\n", - "\n", - "### Parameters\n", - "\n", - "1. **`safemode`** *(default: True)*:\n", - " - If `True`, prompts the user for confirmation before deleting any data.\n", - " - If `False`, proceeds with deletion without prompting.\n", - "2. **`quick`** *(default: False)*:\n", - " - If `True`, accelerates deletion by skipping certain checks, such as confirming dependencies.\n", - " - Use this option cautiously as it bypasses safety mechanisms.\n", - "\n", - "## Example Usage\n", - "\n", - "### Deleting Specific Entries\n", - "\n", - "To delete specific rows based on a condition:\n", - "\n", - "```python\n", - "import datajoint as dj\n", - "\n", - "schema = dj.Schema('example_schema')\n", - "\n", - "@schema\n", - "class Animal(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " age: int # Age of the animal in years\n", - " \"\"\"\n", - "\n", - "# Insert example data\n", - "Animal.insert([\n", - " {'animal_id': 1, 'species': 'Dog', 'age': 5},\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3},\n", - "])\n", - "\n", - "# Delete rows where species is 'Cat'\n", - "(Animal & {'species': 'Cat'}).delete()\n", - "```\n", - "\n", - "### Deleting All Entries\n", - "\n", - "To delete all entries from a table:\n", - "\n", - "```python\n", - "Animal.delete()\n", - "```\n", - "\n", - "### Using `safemode`\n", - "\n", - "By default, `safemode=True` will prompt the user for confirmation before deletion. To bypass the prompt:\n", - "\n", - "```python\n", - "Animal.delete(safemode=False)\n", - "```\n", - "\n", - "## Dependency Management\n", - "\n", - "One of the key features of `delete` is its handling of dependencies. When deleting data, DataJoint ensures that:\n", - "\n", - "1. **Downstream Data is Removed**: Any dependent entries in other tables are recursively deleted to maintain referential integrity.\n", - "2. **Deletion is Acyclic**: The dependency graph is traversed in topological order to avoid cyclic deletion issues.\n", - "\n", - "### Restricting Deletions\n", - "\n", - "To delete specific entries while preserving others:\n", - "\n", - "```python\n", - "(Animal & {'animal_id': 1}).delete()\n", - "```\n", - "\n", - "In this example, only the entry with `animal_id=1` is deleted, and other rows remain intact.\n", - "\n", - "## Best Practices\n", - "\n", - "1. **Use `safemode=True`**: Always use `safemode` when testing or in uncertain situations to prevent accidental data loss.\n", - "2. **Test Deletion Queries**: Before running `delete`, test your restrictions with `fetch` to ensure you are targeting the correct data.\n", - "3. **Be Cautious with `quick=True`**: Use the `quick` parameter sparingly, as it skips important safety checks.\n", - "4. **Understand Dependencies**: Review your schema's dependency structure to anticipate the cascading effects of deletions.\n", - "\n", - "## Summary\n", - "\n", - "The `delete` command is a powerful tool for managing data lifecycle in a DataJoint pipeline. By respecting dependencies and offering safety mechanisms, it ensures that data deletions are controlled and consistent. 
Proper use of this command helps maintain the integrity and cleanliness of your database.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] + "source": "# Delete\n\nThe `delete` operation removes entities from the database along with **all their downstream dependents**.\nThis cascading behavior is fundamental to maintaining **computational validity**—the guarantee that derived data remains consistent with its inputs.\n\n## Cascading Delete and Computational Validity\n\nIn the [Relational Workflow Model](../20-concepts/05-workflows.md), every entity in a Computed or Imported table was derived from specific upstream data.\nIf that upstream data is deleted or found to be incorrect, the derived results become meaningless—they are artifacts of inputs that no longer exist or were never valid.\n\nDataJoint enforces this principle through **cascading deletes**:\n\n```\nSubject ← Session ← Recording ← SpikeSort ← UnitAnalysis\n │ │ │ │ │\n └─────────┴──────────┴───────────┴────────────┘\n Deleting a Session removes all of these\n```\n\nWhen you delete an entity:\n1. All entities that reference it (via foreign keys) are identified\n2. Those entities are recursively deleted\n3. The cascade continues through the entire dependency graph\n4. The final state is always referentially consistent\n\nThis is not merely cleanup—it is **enforcing the semantics of the workflow**.\nComputed results only have meaning in relation to their inputs.\n\n## The `delete` Method\n\n```python\n
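# General form; call on a table class or a restricted query, e.g. (Session & key)\n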
.delete(safemode=True, quick=False)\n```\n\n**Parameters:**\n- **`safemode`** (default: `True`): Prompts for confirmation before deleting\n- **`quick`** (default: `False`): If `True`, skips dependency analysis (use with caution)\n\n**Examples:**\n\n```python\n# Delete with confirmation prompt\n(Session & {'subject_id': 'M001', 'session_date': '2024-01-15'}).delete()\n\n# Delete without confirmation (scripted use)\n(Session & {'subject_id': 'M001', 'session_date': '2024-01-15'}).delete(safemode=False)\n\n# Delete all entries in a table (with confirmation)\nSession.delete()\n```\n\n## Use Cases for Delete\n\n### 1. Correcting Upstream Errors\n\nThe most common use of delete is correcting errors in upstream data.\nRather than updating values (which would leave downstream computations inconsistent), you:\n\n1. **Delete** the incorrect upstream data (cascade removes all derived results)\n2. **Insert** the corrected data\n3. **Repopulate** to regenerate downstream computations\n\n```python\n# Discovered an error in session metadata\n(Session & bad_session_key).delete(safemode=False)\n\n# Insert corrected data\nSession.insert1(corrected_session_data)\n\n# Regenerate all downstream analysis\nRecording.populate()\nSpikeSort.populate()\nUnitAnalysis.populate()\n```\n\n### 2. Reprocessing with Updated Code\n\nWhen you update your analysis code, you may want to regenerate computed results:\n\n```python\n# Delete computed results to force recomputation\n(SpikeSort & restriction).delete(safemode=False)\n\n# Repopulate with updated make() method\nSpikeSort.populate()\n```\n\n### 3. Removing Obsolete Data\n\nWhen data is no longer needed:\n\n```python\n# Remove old pilot data\n(Subject & 'subject_id LIKE \"pilot%\"').delete()\n```\n\n### 4. Selective Deletion with Restrictions\n\nUse DataJoint's restriction syntax to target specific subsets:\n\n```python\n# Delete only failed recordings\n(Recording & 'quality < 0.5').delete()\n\n# Delete sessions from a specific date range\n(Session & 'session_date < \"2023-01-01\"').delete()\n\n# Delete based on joined conditions\n(SpikeSort & (Recording & 'brain_region = \"V1\"')).delete()\n```\n\n## The Delete-Reinsert-Repopulate Pattern\n\nThis pattern is the standard way to handle corrections in DataJoint:\n\n```python\ndef correct_session(session_key, corrected_data):\n \"\"\"Correct session data and regenerate all downstream analysis.\"\"\"\n \n # 1. Delete the session (cascades to all downstream)\n (Session & session_key).delete(safemode=False)\n \n # 2. Insert corrected data\n Session.insert1(corrected_data)\n \n # 3. Repopulate downstream tables\n # DataJoint's populate() automatically determines what needs to run\n Recording.populate()\n ProcessedRecording.populate()\n Analysis.populate()\n```\n\nThis pattern ensures:\n- No orphaned or inconsistent computed results\n- Full audit trail (original data is gone, not hidden)\n- All downstream results reflect the corrected inputs\n\n## Preview Before Deleting\n\nAlways verify what will be deleted before executing:\n\n```python\n# First, check what matches your restriction\nSession & {'subject_id': 'M001'}\n\n# Check downstream dependencies that will also be deleted\n(Session & {'subject_id': 'M001'}).descendants()\n\n# Then delete when confident\n(Session & {'subject_id': 'M001'}).delete()\n```\n\n## Safety Mechanisms\n\nDataJoint provides several safeguards:\n\n1. **`safemode=True`** (default): Requires interactive confirmation showing what will be deleted\n2. 
**Dependency preview**: Shows the count of entries in dependent tables that will be affected\n3. **Transaction wrapping**: The entire cascading delete is atomic—it either fully succeeds or fully rolls back\n\n## Best Practices\n\n1. **Trust the cascade**: Don't manually delete downstream tables first—let DataJoint handle dependencies\n2. **Use restrictions**: Target specific subsets rather than deleting entire tables\n3. **Preview first**: Check what matches before deleting, especially with complex restrictions\n4. **Keep `safemode=True`** for interactive work: Only use `safemode=False` in tested scripts\n5. **Think in terms of workflow**: Deleting is not \"cleaning up\"—it's rolling back the workflow to an earlier state\n6. **Follow with repopulate**: After correcting data, run `populate()` to bring the pipeline back to a complete state" } ], "metadata": { @@ -118,4 +13,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/40-operations/030-updates.ipynb b/book/40-operations/030-updates.ipynb index 0f4c61c..bcd951d 100644 --- a/book/40-operations/030-updates.ipynb +++ b/book/40-operations/030-updates.ipynb @@ -3,30 +3,12 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Updates\n", - "\n", - "## Updating existing rows" - ] + "source": "# Updates\n\nIn DataJoint, **updates are deliberately rare**.\nThis reflects a core principle of the [Relational Workflow Model](../20-concepts/05-workflows.md): computed results should remain consistent with their inputs." }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "In DataJoint, the principal way of replacing data is by `delete` and `insert`. This approach observes referential integrity constraints. \n", - "\n", - "In some cases, it becomes necessary to deliberately correct existing values. The `update1` method accomplishes this. The method should only be used to fix problems, and not as part of a regular workflow. When updating an entry, make sure that any information stored in dependent tables that depends on the update values is properly updated as well. \n", - "\n", - "Syntax:\n", - "\n", - "```python\n", - "table.update1(record)\n", - "```\n", - "Here `record` is a `dict` specifying the primary key values for identifying what record to update and the values that should be updated. The entry must already exist.\n", - "\n", - "## Example\n", - "Let's create the `Student` table and populate a few entries." - ] + "source": "## Why Updates Are Discouraged\n\nConsider what happens when you update an upstream value:\n- Downstream computed results were derived from the **old** value\n- After the update, they coexist with the **new** value\n- The relationship between inputs and outputs is broken—**computational validity** is violated\n\nThe proper approach for most corrections is the **delete-reinsert-repopulate** pattern:\n\n1. **Delete** the incorrect data (cascading removes all dependent computations)\n2. **Insert** the corrected data\n3. 
**Repopulate** to regenerate downstream results with the new inputs\n\nThis ensures every computed result accurately reflects its inputs.\n\n## When `update1` Is Appropriate\n\nThe `update1` method exists for cases where in-place correction is truly appropriate:\n\n- **Non-scientific metadata**: Notes, comments, administrative fields that don't affect computations\n- **Corrections without downstream impact**: Fields that no computed table depends on\n- **Fixing typos**: In descriptive text fields\n\nThe key question: *Does any downstream computation depend on this value?*\nIf yes, use delete-reinsert. If no, `update1` may be appropriate.\n\n## Syntax\n\n```python\ntable.update1(record)\n```\n\nThe `record` is a dictionary containing:\n- All primary key values (to identify which row to update)\n- The attribute(s) to update with their new values\n\nThe entry must already exist—`update1` will raise an error if it doesn't." }, { "cell_type": "code", @@ -164,9 +146,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "We can now update some values. Note that you must specify the primary key and the entry must already exist." - ] + "source": "Update specific values by providing the primary key and the fields to change:" }, { "cell_type": "code", @@ -276,9 +256,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "If the entry does not exist or if the primary key value is not specified, `update1` raises errors:" - ] + "source": "## Error Handling\n\n`update1` enforces strict requirements. Attempting to update a non-existent entry raises an error:" }, { "cell_type": "code", @@ -323,6 +301,16 @@ "source": [ "Student.update1(dict(phone=\"(800)555-3377\"))" ] + }, + { + "cell_type": "markdown", + "source": "Similarly, omitting the primary key raises an error because DataJoint cannot identify which row to update.", + "metadata": {} + }, + { + "cell_type": "markdown", + "source": "## Summary\n\n| Scenario | Recommended Approach |\n|----------|---------------------|\n| Correcting data that affects computations | Delete → Insert → Repopulate |\n| Fixing typos in descriptive fields | `update1` |\n| Changing administrative metadata | `update1` |\n| Any doubt about downstream impact | Delete → Insert → Repopulate |\n\nThe conservative approach—delete and reinsert—is almost always safer.\nUse `update1` only when you are certain the change has no computational consequences.", + "metadata": {} } ], "metadata": { @@ -346,4 +334,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/book/40-operations/050-populate.ipynb b/book/40-operations/050-populate.ipynb index b2f3e4a..2137b47 100644 --- a/book/40-operations/050-populate.ipynb +++ b/book/40-operations/050-populate.ipynb @@ -3,12 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. 
**Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## The `make` Method\n\nThe `make()` method defines the computational logic for each entry.\nIt receives a **key** dictionary identifying which entity to compute and must **fetch** inputs, **compute** results, and **insert** them into the table.\n\nSee the dedicated [make Method](055-make.ipynb) chapter for:\n- The three-part anatomy (fetch, compute, insert)\n- Restrictions on auto-populated tables\n- The three-part pattern for long-running computations\n- Transaction handling strategies\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. 
This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [The `make` Method](055-make.ipynb) — Anatomy, constraints, and patterns\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] + "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile [insert](010-insert.ipynb), [delete](020-delete.ipynb), and [update](030-updates.ipynb) are operations for Manual tables, `populate` automates data entry for **Imported** and **Computed** tables based on the dependencies defined in the schema.\n\nAs introduced in [Workflow Operations](000-workflow-operations.md), the distinction between external and automatic data entry maps directly to table tiers:\n\n| Table Tier | Data Entry Method |\n|------------|-------------------|\n| Lookup | `contents` property (part of schema) |\n| Manual | `insert` from external sources |\n| **Imported** | **Automatic `populate`** |\n| **Computed** | **Automatic `populate`** |\n\nThis chapter shows how `populate` transforms the schema's dependency graph into executable computations.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. 
In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Data from external systems or human entry |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## The `make` Method\n\nThe `make()` method defines the computational logic for each entry.\nIt receives a **key** dictionary identifying which entity to compute and must **fetch** inputs, **compute** results, and **insert** them into the table.\n\nSee the dedicated [make Method](055-make.ipynb) chapter for:\n- The three-part anatomy (fetch, compute, insert)\n- Restrictions on auto-populated tables\n- The three-part pattern for long-running computations\n- Transaction handling strategies\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. 
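\n\nAs an illustration of this responsibility (a sketch, not the exact code from the example), a master-part `make()` fetches its inputs, computes, and inserts the master row and its parts in one call; here `detect_blobs` and the exact attribute names are assumptions:\n\n```python\ndef make(self, key):\n    # 1. Fetch inputs identified by this key\n    img = (Image & key).fetch1('image')\n    params = (BlobParamSet & key).fetch1()\n    # 2. Compute (detect_blobs is a hypothetical stand-in for the detector)\n    blobs = detect_blobs(img, params)\n    # 3. Insert the master and its parts within the same make() call (one transaction)\n    self.insert1(dict(key, blob_count=len(blobs)))\n    self.Blob.insert(dict(key, blob_id=i, x=x, y=y) for i, (x, y) in enumerate(blobs))\n```\n\n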
See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations via `contents`\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [The `make` Method](055-make.ipynb) — Anatomy, constraints, and patterns\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" } ], "metadata": {