From 77590edf3d24ed52f017e0e1c89c06ad23dbcb76 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 03:18:50 +0000 Subject: [PATCH 01/18] Add index directives for database and DataJoint terms throughout the book Add {index} directives to key database and DataJoint-specific concepts across all major sections of the book: - Introduction: DataJoint, computational database, data integrity, SQL, table tiers (manual, imported, computed, lookup), provenance, foreign key, schema, query algebra operators - Concepts: database, DBMS, authentication, authorization, server-client architecture, embedded database, SQLite, Edgar F. Codd, relational data model, data model, schema-on-write/read, table, relation, tuple, attribute, JSON, data provenance, relational algebra/calculus, normal forms, Entity-Relationship Model, Peter Chen, functional dependency, table tiers, DAG, cascade delete - Design: domain/entity integrity, consistency, ACID, natural key, composite primary key, surrogate key, UUID - Queries: entity normalization, closure property, data independence These index terms will now appear in the generated index page at 95-reference/index.md via the {show-index} directive. --- book/00-introduction/00-purpose.md | 14 +++++++------- book/00-introduction/05-executive-summary.md | 18 +++++++++--------- book/20-concepts/00-databases.md | 14 +++++++------- book/20-concepts/01-models.md | 12 ++++++------ book/20-concepts/02-relational.md | 8 ++++---- book/20-concepts/04-workflows.md | 8 ++++---- book/30-design/012-integrity.md | 8 ++++---- book/30-design/020-primary-key.md | 8 ++++---- book/50-queries/008-datajoint-in-context.md | 6 +++--- 9 files changed, 48 insertions(+), 48 deletions(-) diff --git a/book/00-introduction/00-purpose.md b/book/00-introduction/00-purpose.md index fcb1c0d..b3fd6a5 100644 --- a/book/00-introduction/00-purpose.md +++ b/book/00-introduction/00-purpose.md @@ -4,7 +4,7 @@ title: Purpose ## What is DataJoint? -**DataJoint is a computational database language and platform that enables scientists to design, implement, and manage data operations for research by unifying data structures and analysis code.** It provides data integrity, automated computation, reproducibility, and seamless collaboration through a relational database approach that coordinates relational databases, code repositories, and object storage. +**{index}`DataJoint` is a {index}`computational database` language and platform that enables scientists to design, implement, and manage data operations for research by unifying data structures and analysis code.** It provides {index}`data integrity`, {index}`automated computation`, {index}`reproducibility`, and seamless collaboration through a {index}`relational database` approach that coordinates relational databases, code repositories, and object storage. ## Who This Book Is For @@ -28,10 +28,10 @@ Here's what makes DataJoint different: **your database schema IS your data proce Traditional databases store and retrieve data. DataJoint does that too, but it also tracks what gets computed from what. 
Each table plays a specific role in your workflow: -- **Manual tables**: Source data entered by researchers -- **Imported tables**: Data acquired from instruments or external sources -- **Computed tables**: Results automatically derived from upstream data -- **Lookup tables**: Reference data and parameters +- **{index}`Manual table`s**: Source data entered by researchers +- **{index}`Imported table`s**: Data acquired from instruments or external sources +- **{index}`Computed table`s**: Results automatically derived from upstream data +- **{index}`Lookup table`s**: Reference data and parameters This workflow perspective shapes everything: @@ -39,7 +39,7 @@ This workflow perspective shapes everything: **Intelligent Diagrams**: Different table types get distinct visual styles. One glance tells you what's manual, what's automatic, and how everything connects. -**Provenance, Not Just Integrity**: Foreign keys mean more than "this ID exists." They mean "this result was computed FROM this input." When upstream data changes, DataJoint ensures you can't accidentally keep stale downstream results. This is why DataJoint emphasizes INSERT and DELETE over UPDATE—changing input data without recomputing outputs breaks your science, even if the database technically remains consistent. +**{index}`Provenance`, Not Just Integrity**: {index}`Foreign key`s mean more than "this ID exists." They mean "this result was computed FROM this input." When upstream data changes, DataJoint ensures you can't accidentally keep stale downstream results. This is why DataJoint emphasizes INSERT and DELETE over UPDATE—changing input data without recomputing outputs breaks your science, even if the database technically remains consistent. For scientific computing, this workflow-centric design is transformative. Your database doesn't just store results—it guarantees they're valid, reproducible, and traceable back to their origins. @@ -62,7 +62,7 @@ This book provides the skills to transform research operations: from fragile scr ## DataJoint and SQL: Two Languages, One Foundation -**SQL (Structured Query Language)** powers virtually every relational database. DataJoint wraps SQL in Pythonic syntax, automatically translating your code into optimized queries. +**{index}`SQL` (Structured Query Language)** powers virtually every relational database. DataJoint wraps SQL in Pythonic syntax, automatically translating your code into optimized queries. You could learn DataJoint without ever seeing SQL. But this book teaches both, side by side. You'll understand not just *what* works but *why*—and you'll be able to work directly with SQL when needed. diff --git a/book/00-introduction/05-executive-summary.md b/book/00-introduction/05-executive-summary.md index 8c4f5d7..1d2db7a 100644 --- a/book/00-introduction/05-executive-summary.md +++ b/book/00-introduction/05-executive-summary.md @@ -11,7 +11,7 @@ Standard database solutions address storage and querying but not computation. Da ## The DataJoint Solution -**DataJoint introduces the Relational Workflow Model**—an extension of classical relational theory that treats computational transformations as first-class citizens of the data model. The database schema becomes an executable specification: it defines not just what data exists, but how data flows through the pipeline and when computations should run. 
+**DataJoint introduces the {index}`Relational Workflow Model`**—an extension of classical relational theory that treats computational transformations as first-class citizens of the data model. The database {index}`schema` becomes an executable specification: it defines not just what data exists, but how data flows through the pipeline and when computations should run. This creates what we call a **Computational Database**: a system where inserting new raw data automatically triggers all downstream analyses in dependency order, maintaining computational validity throughout. Think of it as a spreadsheet that auto-recalculates, but with the rigor of a relational database and the scale of distributed computing. @@ -21,16 +21,16 @@ This creates what we call a **Computational Database**: a system where inserting Unlike Entity-Relationship modeling that requires translation to SQL, DataJoint schemas are directly executable. The diagram *is* the implementation. Schema changes propagate immediately. Documentation cannot drift from reality because the schema is the documentation. **Workflow-Aware Foreign Keys** -Foreign keys in DataJoint do more than enforce referential integrity—they encode computational dependencies. A computed result that references raw data will be automatically deleted if that raw data is removed, preventing stale or orphaned results. This maintains *computational validity*, not just *referential integrity*. +Foreign keys in DataJoint do more than enforce {index}`referential integrity`—they encode computational dependencies. A computed result that references raw data will be automatically deleted if that raw data is removed, preventing stale or orphaned results. This maintains *{index}`computational validity`*, not just *referential integrity*. **Declarative Computation** -Computations are defined declaratively through `make()` methods attached to table definitions. The `populate()` operation identifies all missing results and executes computations in dependency order. Parallelization, error handling, and job distribution are handled automatically. +Computations are defined declaratively through {index}`make() method`s attached to table definitions. The {index}`populate()` operation identifies all missing results and executes computations in dependency order. Parallelization, error handling, and job distribution are handled automatically. -**Immutability by Design** +**{index}`Immutability` by Design** Computed results are immutable. Correcting upstream data requires deleting dependent results and recomputing—ensuring the database always represents a consistent computational state. This naturally provides complete provenance: every result can be traced to its source data and the exact code that produced it. **Hybrid Storage Model** -Structured metadata lives in the relational database (MySQL/PostgreSQL). Large binary objects (images, recordings, arrays) live in scalable object storage (S3, GCS, filesystem) with the database maintaining the mapping. Queries operate on metadata; computation accesses objects transparently. +Structured metadata lives in the relational database ({index}`MySQL`/{index}`PostgreSQL`). Large binary objects (images, recordings, arrays) live in scalable {index}`object storage` (S3, GCS, filesystem) with the database maintaining the mapping. Queries operate on metadata; computation accesses objects transparently. 
## Architecture Overview @@ -70,9 +70,9 @@ This book provides comprehensive coverage of DataJoint from foundations through **Part II: Design** - Schema design principles and table definitions -- Primary keys, foreign keys, and dependency structures -- Master-part relationships for hierarchical data -- Normalization through the lens of workflow entities +- {index}`Primary key`s, foreign keys, and dependency structures +- {index}`Master-part relationship`s for hierarchical data +- {index}`Normalization` through the lens of workflow entities - Schema evolution and migration strategies **Part III: Operations** @@ -80,7 +80,7 @@ This book provides comprehensive coverage of DataJoint from foundations through - Caching strategies for performance optimization **Part IV: Queries** -- DataJoint's five-operator query algebra: restriction, projection, join, aggregation, union +- DataJoint's five-operator {index}`query algebra`: {index}`restriction`, {index}`projection`, {index}`join`, {index}`aggregation`, {index}`union` - Comparison with SQL and when to use each - Complex query patterns and optimization diff --git a/book/20-concepts/00-databases.md b/book/20-concepts/00-databases.md index 93612b3..48edac6 100644 --- a/book/20-concepts/00-databases.md +++ b/book/20-concepts/00-databases.md @@ -5,7 +5,7 @@ title: Databases ## What is a Database? ```{card} Database -A **database** is a dynamic (i.e. *time-varying*), systematically organized collection of data that plays an integral role in the operation of an enterprise. +A **{index}`database`** is a dynamic (i.e. *time-varying*), systematically organized collection of data that plays an integral role in the operation of an enterprise. It supports the enterprise's operations and is accessed by a variety of users in different ways. Examples of enterprises that rely on databases include hotels, airlines, stores, hospitals, universities, banks, and scientific studies. The database not only tracks the current state of the enterprise's processes but also enforces essential *business rules*, ensuring that only valid transactions occur and preventing errors or inconsistencies. It serves as the **system of record**, the **single source of truth**, accurately reflecting the current state and ongoing activities. @@ -25,7 +25,7 @@ Databases are crucial for the smooth and organized operation of various entities ## Database Management Systems (DBMS) ```{card} Database Management System -A Database Management System is a software system that serves as the computational engine powering a database. +A {index}`Database Management System` ({index}`DBMS`) is a software system that serves as the computational engine powering a database. It defines and enforces the structure of the data, ensuring that the organization's rules are consistently applied. A DBMS manages data storage and efficiently executes data updates and queries while safeguarding the data's structure and integrity, particularly in environments with multiple concurrent users. @@ -50,7 +50,7 @@ One of the most critical features distinguishing databases from simple file stor ### Authentication and Authorization -Before you can work with a database, you must **authenticate**—prove your identity with a username and password. Once authenticated, the database enforces **authorization** rules that determine what you can do: +Before you can work with a database, you must **{index}`authentication`**—prove your identity with a username and password. 
Once authenticated, the database enforces **{index}`authorization`** rules that determine what you can do: - **Read**: View specific tables or columns - **Write**: Add new data to certain tables @@ -80,11 +80,11 @@ Modern databases typically separate data management from data use through distin ### Common Architectures -**Server-Client Architecture** (most common): A database server program manages all data operations, while client programs (your scripts, applications, notebooks) connect to request data or submit changes. The server enforces all rules and access permissions consistently for every client. This is like a library where the librarian (server) manages the books and enforces checkout policies, while patrons (clients) request materials. +**{index}`Server-client architecture`** (most common): A database server program manages all data operations, while client programs (your scripts, applications, notebooks) connect to request data or submit changes. The server enforces all rules and access permissions consistently for every client. This is like a library where the librarian (server) manages the books and enforces checkout policies, while patrons (clients) request materials. The two most popular open-source relational database systems: MySQL and PostgreSQL implement a server-client architecture. -**Embedded Databases**: The database engine runs within your application itself—no separate server. This works for single-user applications like mobile apps or desktop software, but doesn't support multiple users accessing shared data simultaneously. -SQLite is a common embedded database @10.14778/3554821.3554842. +**{index}`Embedded database`s**: The database engine runs within your application itself—no separate server. This works for single-user applications like mobile apps or desktop software, but doesn't support multiple users accessing shared data simultaneously. +{index}`SQLite` is a common embedded database @10.14778/3554821.3554842. **Distributed Databases**: Data and processing are spread across multiple servers working together. This provides high availability and can handle massive scale, but adds significant complexity. Systems like Google Spanner, Amazon DynamoDB, and CockroachDB use this approach. @@ -106,7 +106,7 @@ Separating data management from data use provides critical advantages: This book focuses on **DataJoint**, a framework that extends relational databases specifically for scientific workflows. DataJoint builds on the solid foundation of relational theory while adding capabilities essential for research: automated computation, data provenance, and reproducibility. -The relational data model—introduced by Edgar F. Codd in 1970—revolutionized data management by organizing data into tables with well-defined relationships. This model has dominated database systems for over five decades due to its mathematical rigor and versatility. Modern relational databases like MySQL and PostgreSQL continue to evolve, incorporating new capabilities for scalability and security while maintaining the core principles that make them reliable and powerful. +The {index}`relational data model`—introduced by {index}`Edgar F. Codd` in 1970—revolutionized data management by organizing data into tables with well-defined relationships. This model has dominated database systems for over five decades due to its mathematical rigor and versatility. 
Modern relational databases like MySQL and PostgreSQL continue to evolve, incorporating new capabilities for scalability and security while maintaining the core principles that make them reliable and powerful. The following chapters build the conceptual foundation you need to understand DataJoint's approach: - **Data Models**: What data models are and why schemas matter for scientific work diff --git a/book/20-concepts/01-models.md b/book/20-concepts/01-models.md index 0310cb8..fa8b3e4 100644 --- a/book/20-concepts/01-models.md +++ b/book/20-concepts/01-models.md @@ -15,7 +15,7 @@ This chapter introduces data models conceptually, explores the critical distinct ## Definition ```{card} Data Model -A *data model* is a conceptual framework that defines how data is organized, represented, and transformed. It gives us the components for creating blueprints for the structure and operations of data management systems, ensuring consistency and efficiency in data handling. +A *{index}`data model`* is a conceptual framework that defines how data is organized, represented, and transformed. It gives us the components for creating blueprints for the structure and operations of data management systems, ensuring consistency and efficiency in data handling. Data management systems are built to accommodate these models, allowing us to manage data according to the principles laid out by the model. If you're studying data science or engineering, you've likely encountered different data models, each providing a unique approach to organizing and manipulating data. @@ -77,9 +77,9 @@ Both structured and schemaless data formats can be attractive in various scenari These two approaches are sometimes called **schema-on-write** and **schema-on-read**: -- **Schema-on-write** refers to structured data models where the schema is defined and enforced *before* data is stored. Data must conform to the schema at write time, ensuring consistency and integrity from the moment data enters the system. Relational databases exemplify this approach. +- **{index}`Schema-on-write`** refers to structured data models where the schema is defined and enforced *before* data is stored. Data must conform to the schema at write time, ensuring consistency and integrity from the moment data enters the system. Relational databases exemplify this approach. -- **Schema-on-read** refers to schemaless or self-describing data models where structure is interpreted *when* data is read rather than when it is written. Data can be ingested rapidly in its raw form, with structure applied later during analysis. Data lakes and document stores often follow this approach. +- **{index}`Schema-on-read`** refers to schemaless or self-describing data models where structure is interpreted *when* data is read rather than when it is written. Data can be ingested rapidly in its raw form, with structure applied later during analysis. Data lakes and document stores often follow this approach. Each approach has its strengths: @@ -244,7 +244,7 @@ Most importantly, spreadsheets provide no referential integrity. If cell B2 cont The **relational data model**, introduced by Edgar F. Codd in 1970, revolutionized data management by organizing data into tables (relations) with well-defined relationships. This model emphasizes data integrity, consistency, and powerful query capabilities through a formal mathematical foundation. 
-The relational model organizes all data into tables representing mathematical relations, where each table consists of rows (representing mathematical *tuples*) and columns (often called *attributes*). Key principles include data type constraints, uniqueness enforcement through primary keys, referential integrity through foreign keys, and declarative queries. The next chapter explores these principles in depth. +The relational model organizes all data into {index}`table`s representing mathematical {index}`relation`s, where each table consists of rows (representing mathematical *{index}`tuple`s*) and columns (often called *{index}`attribute`s*). Key principles include {index}`data type` constraints, uniqueness enforcement through primary keys, referential integrity through foreign keys, and {index}`declarative query`. The next chapter explores these principles in depth. The most common way to interact with relational databases is through the Structured Query Language (SQL), a language specifically designed to define, manipulate, and query data within relational databases. @@ -261,7 +261,7 @@ The rest of this book focuses on the relational model, but specifically through ### Example: Document Databases (JSON) -The Document Data Model, commonly exemplified by JSON (JavaScript Object Notation), organizes data as key-value pairs within structured documents. This flexible, text-based format is widely used for data interchange between systems, particularly in web applications and APIs. +The Document Data Model, commonly exemplified by {index}`JSON` (JavaScript Object Notation), organizes data as key-value pairs within structured documents. This flexible, text-based format is widely used for data interchange between systems, particularly in web applications and APIs. #### Structure @@ -313,7 +313,7 @@ The key insight: while initial ingestion might be flexible (schema-on-read), the In recent years, concerns about scientific integrity have brought greater attention to proper data management as the foundation for reproducible science and valid findings. As science becomes more complex and interconnected, meticulous data handling—including reproducibility and data provenance—has become critical. -**Data provenance**—the detailed history of how data is collected, processed, and analyzed—provides transparency and accountability. But provenance tracked through metadata alone can break: +**{index}`Data provenance`**—the detailed history of how data is collected, processed, and analyzed—provides transparency and accountability. But provenance tracked through metadata alone can break: - Tags pointing to deleted files - Descriptions of outdated relationships - Manual records that fall out of sync with actual data diff --git a/book/20-concepts/02-relational.md b/book/20-concepts/02-relational.md index 16e265b..7467238 100644 --- a/book/20-concepts/02-relational.md +++ b/book/20-concepts/02-relational.md @@ -92,7 +92,7 @@ The evolution of relational database thinking provides three complementary level Codd's relational algebra and calculus provide formal operations on relations with provable properties. Query optimization relies on these mathematical foundations to prove that two different queries produce equivalent results, allowing the system to choose the most efficient implementation. -**2. Conceptual Modeling (Chen, 1976):** Building upon Codd's foundation, **Peter Chen** introduced the Entity-Relationship Model (ERM). 
While Codd's model provided the rigorous mathematical underpinnings, Chen's ERM offered a more intuitive, conceptual way to think about and design databases, particularly during initial planning stages. +**2. Conceptual Modeling (Chen, 1976):** Building upon Codd's foundation, **{index}`Peter Chen`** introduced the {index}`Entity-Relationship Model` (ERM). While Codd's model provided the rigorous mathematical underpinnings, Chen's ERM offered a more intuitive, conceptual way to think about and design databases, particularly during initial planning stages. *Like architectural blueprints that translate engineering principles into buildable structures.* @@ -128,7 +128,7 @@ The next sections show how this mathematical rigor translates into practical dat ## Relational Algebra and Calculus -**Relational algebra** is a set of operations that can be used to transform relations in a formal way. It provides the foundation for querying relational databases, allowing us to combine, modify, and retrieve data stored in tables (relations). +**{index}`Relational algebra`** is a set of operations that can be used to transform relations in a formal way. It provides the foundation for querying relational databases, allowing us to combine, modify, and retrieve data stored in tables (relations). Examples of relational operators: @@ -165,7 +165,7 @@ This operation effectively merges the connections from both sets of values, prov Relational algebra, with its powerful operators, allows us to query and manipulate data in a structured and efficient way, forming the backbone of modern database systems. By understanding and applying these operators, we can perform complex data analysis and retrieval tasks with precision and clarity. -Another formal language for deriving new relations from scratch or from other relations is **relational calculus**. Rather than using relational operators, it relies on a *set-building notation* to generate relations. +Another formal language for deriving new relations from scratch or from other relations is **{index}`relational calculus`**. Rather than using relational operators, it relies on a *set-building notation* to generate relations. :::{note} The query notation of the SQL programming language combines concepts from both relational algebra and relational calculus. However, DataJoint's query language is based purely on relational algebra. @@ -188,7 +188,7 @@ Codd's model was derived from relational theory but differed sufficiently in its Through the 1970s, before relational databases became practical, theorists derived fundamental rules for rigorous data organization and queries from first principles using mathematical proofs and derivations. For this reason, early work on relational databases has an abstract academic feel to it with rather simple toy examples: the ubiquitous employees/departments, products/orders, and students/courses. -The design principles were defined through the rigorous but rather abstract principles, the **normal forms** [@10.1145/358024.358054]. These normal forms provide mathematically precise rules for organizing data to minimize redundancy and maintain integrity. +The design principles were defined through the rigorous but rather abstract principles, the **{index}`normal form`s** [@10.1145/358024.358054]. These normal forms provide mathematically precise rules for organizing data to minimize redundancy and maintain integrity. The relational data model is one of the most powerful and precise ways to store and manage structured data. 
At its core, this model organizes all data into tables—representing mathematical relations—where each table consists of rows (representing mathematical *tuples*) and columns (often called *attributes*). diff --git a/book/20-concepts/04-workflows.md b/book/20-concepts/04-workflows.md index e22cb94..e654ef0 100644 --- a/book/20-concepts/04-workflows.md +++ b/book/20-concepts/04-workflows.md @@ -36,7 +36,7 @@ The **mathematical view** of the relational model, championed by Edgar F. Codd, **Tuple as Proposition**: Each row (tuple) is a specific set of attribute values that asserts a true proposition for the predicate. For example, if a table's predicate is "Employee $x$ works on Project $y$," the row `(Alice, P1)` asserts the truth: "Employee Alice works on Project P1." -**Functional Dependencies between Attributes**: The core concept is the functional dependency: attribute `A` functionally determines attribute `B` (written `A → B`) if knowing the value of `A` allows you to determine the unique value of `B`. For example, the attribute `department` functionally determines the attribute `department_chair` because knowing the department name allows you to determine the unique name of the department chair. Functional dependencies are helpful for reasoning about the structure of the database and for performing queries. +**{index}`Functional dependency`**: The core concept is the functional dependency: attribute `A` functionally determines attribute `B` (written `A → B`) if knowing the value of `A` allows you to determine the unique value of `B`. For example, the attribute `department` functionally determines the attribute `department_chair` because knowing the department name allows you to determine the unique name of the department chair. Functional dependencies are helpful for reasoning about the structure of the database and for performing queries. Then the database can be viewed as a collection of predicates and a minimal complete set of true propositions from which all other true propositions can be derived. Data queries are viewed as logical inferences using the rules of predicate calculus. *Relational algebra* and *relational calculus* provide set of operations that can be used to perform these inferences. Under the Closed World Assumption (CWA), the database is assumed to contain all true propositions and all other propositions are assumed to be false. CWA is a simplifying assumption that allows us to reason about the data in the database in a more precise way. @@ -168,11 +168,11 @@ This unified approach eliminates the traditional separation between conceptual d ### Table Tiers: Workflow Roles -DataJoint introduces a sophisticated classification system called table tiers that organizes entity sets according to their specific role in the workflow. This classification goes far beyond simple organizational convenience—it fundamentally shapes how you think about data flow and responsibility within your system. +DataJoint introduces a sophisticated classification system called {index}`table tier`s that organizes entity sets according to their specific role in the workflow. This classification goes far beyond simple organizational convenience—it fundamentally shapes how you think about data flow and responsibility within your system. The four table tiers each represent a distinct type of workflow activity. Lookup tables contain reference data and parameters, such as controlled vocabularies and constants that provide the foundational knowledge for your workflow. 
Manual tables capture human-entered data, including observations and decisions that require expert judgment or domain knowledge. Imported tables handle automated data acquisition from instruments, files, or external systems. Finally, computed tables perform automated processing, generating derived results and analyses from the data collected in other tiers. -This tiered structure creates a natural dependency hierarchy that reflects the logical flow of information through your workflow. Computed tables depend on imported or manual tables for their input data, which in turn may depend on lookup tables for reference information. This creates a directed acyclic graph (DAG) that makes the workflow structure explicit and prevents circular dependencies that could lead to infinite loops or logical inconsistencies. +This tiered structure creates a natural dependency hierarchy that reflects the logical flow of information through your workflow. Computed tables depend on imported or manual tables for their input data, which in turn may depend on lookup tables for reference information. This creates a {index}`directed acyclic graph` ({index}`DAG`) that makes the workflow structure explicit and prevents circular dependencies that could lead to infinite loops or logical inconsistencies. The visual representation of this structure through color-coded diagrams provides immediate insight into your workflow. Green represents manual tables where human expertise enters the system, blue shows imported tables where automated data acquisition occurs, red indicates computed tables where algorithmic processing happens, and gray denotes lookup tables containing reference information. At a glance, you can see where data enters your system and trace how it flows through each processing step. @@ -194,7 +194,7 @@ This workflow-centric approach makes relationships implicit rather than explicit The Relational Workflow Model introduces a crucial distinction between transactional consistency and computational validity that fundamentally changes how we think about data integrity. Traditional databases focus primarily on transactional consistency, ensuring that concurrent updates don't corrupt data through mechanisms like locking and isolation. While this is essential for preventing race conditions, it doesn't address a deeper problem that arises in computational workflows: ensuring that downstream results remain consistent with their upstream inputs. -DataJoint addresses this challenge through its approach to immutability and cascade operations. When you delete an entity in DataJoint, the system doesn't simply remove that single record—it cascades the delete to all dependent entities throughout the workflow. This behavior isn't just cleanup; it's enforcing computational validity by recognizing that if the inputs are gone, any results based on those inputs become meaningless and must be removed. +DataJoint addresses this challenge through its approach to immutability and {index}`cascade delete` operations. When you delete an entity in DataJoint, the system doesn't simply remove that single record—it cascades the delete to all dependent entities throughout the workflow. This behavior isn't just cleanup; it's enforcing computational validity by recognizing that if the inputs are gone, any results based on those inputs become meaningless and must be removed. The process of correcting data illustrates this principle beautifully. When you discover an error in upstream data, you don't simply update the problematic record. 
Instead, you delete the entire downstream pipeline that was computed from the incorrect data, reinsert the corrected data, and then recompute the entire dependent chain. This ensures that every result in your database represents a consistent computation from valid inputs. diff --git a/book/30-design/012-integrity.md b/book/30-design/012-integrity.md index 4aaacb7..be3bb06 100644 --- a/book/30-design/012-integrity.md +++ b/book/30-design/012-integrity.md @@ -53,7 +53,7 @@ Relational databases excel at expressing and enforcing such rules through **inte This section introduces seven fundamental types of integrity constraints. Each will be covered in detail in subsequent chapters, with DataJoint implementation examples. -## 1. Domain Integrity +## 1. {index}`Domain integrity` **Ensures values are within valid ranges and types.** Domain integrity restricts attribute values to predefined valid sets using: @@ -82,7 +82,7 @@ Completeness prevents missing values that could invalidate analyses: --- -## 3. Entity Integrity +## 3. {index}`Entity integrity` **Each real-world entity corresponds to exactly one database record, and each database record corresponds to exactly one real-world entity.** Entity integrity ensures a **one-to-one correspondence** between real-world entities and their digital representations in the database. This is not simply about having unique identifiers—it's about establishing a reliable, bidirectional mapping where: @@ -132,13 +132,13 @@ Compositional integrity ensures multi-part entities are never partially stored: --- -## 6. Consistency +## 6. {index}`Consistency` **All users see the same valid data state.** Consistency provides a unified view during concurrent access: - **Isolation levels** control transaction visibility - **Locking mechanisms** prevent conflicting updates -- **ACID properties** guarantee reliable state transitions +- **{index}`ACID` properties** guarantee reliable state transitions **Example:** Two researchers inserting experiments simultaneously don't create duplicates. diff --git a/book/30-design/020-primary-key.md b/book/30-design/020-primary-key.md index c190b1e..bc250ca 100644 --- a/book/30-design/020-primary-key.md +++ b/book/30-design/020-primary-key.md @@ -192,7 +192,7 @@ This flexibility in entity integrity allows businesses to balance strict data ru ## Using Natural Keys -A table can be designed with a **natural primary key**, which is an identifier that exists in the real world. For example, a Social Security Number (SSN) can serve as a natural key for a person because it is a unique number used and recognized in real-world systems. +A table can be designed with a **{index}`natural key`**, which is an identifier that exists in the real world. For example, a Social Security Number (SSN) can serve as a natural key for a person because it is a unique number used and recognized in real-world systems. In some cases, a natural key already exists, or one can be specifically created for data management purposes and then introduced into the real world to be permanently associated with physical entities. @@ -204,7 +204,7 @@ Phone numbers, in particular, have become popular as identifiers as mobile phone # Composite Primary Keys -Sometimes, a single column cannot uniquely identify a record. In these cases, we use a **composite primary key**—a primary key made up of multiple columns that together uniquely identify each row. +Sometimes, a single column cannot uniquely identify a record. 
In these cases, we use a **{index}`composite primary key`**—a primary key made up of multiple columns that together uniquely identify each row. ## Example: U.S. House of Representatives @@ -286,7 +286,7 @@ Use composite primary keys when: # Using Surrogate Keys -In many cases, it makes more sense to use a **surrogate key** as the primary key in a database. A surrogate key has no relationship to the real world and is used solely within the database for identification purposes. These keys are often generated automatically as an **auto-incrementing number** or a **random string** like a UUID (Universally Unique Identifier) or GUID (Globally Unique Identifier). +In many cases, it makes more sense to use a **{index}`surrogate key`** as the primary key in a database. A surrogate key has no relationship to the real world and is used solely within the database for identification purposes. These keys are often generated automatically as an **auto-incrementing number** or a **random string** like a UUID (Universally Unique Identifier) or GUID (Globally Unique Identifier). When using surrogate keys, entity integrity can still be maintained by using other unique attributes (such as secondary unique indexes) to help identify and match entities to their digital representations. @@ -294,7 +294,7 @@ Surrogate keys are especially useful for entities that exist only in digital for ## Universally Unique Identifiers (UUIDs) -**UUIDs** (Universally Unique Identifiers) are 128-bit identifiers that are designed to be globally unique across time and space. They are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and provide a reliable way to generate surrogate keys without coordination between different systems. +**{index}`UUID`s** (Universally Unique Identifiers) are 128-bit identifiers that are designed to be globally unique across time and space. They are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and provide a reliable way to generate surrogate keys without coordination between different systems. ### UUID Format diff --git a/book/50-queries/008-datajoint-in-context.md b/book/50-queries/008-datajoint-in-context.md index c28026d..9214515 100644 --- a/book/50-queries/008-datajoint-in-context.md +++ b/book/50-queries/008-datajoint-in-context.md @@ -48,7 +48,7 @@ Codd identified this inflexibility as a symptom of a deeper problem and articula **Access Path Dependence**: The most significant issue was the reliance on predefined access paths. In a hierarchical or network model, a program's logic for retrieving data was inextricably linked to the specific parent-child or owner-member relationships defined in the schema. If a business requirement changed—for example, altering the relationship between Projects and Parts in a manufacturing database—the application programs that navigated the old structure would become logically impaired and cease to function correctly. -Codd's central argument was that users and applications needed to be "protected from having to know how the data is organized in the machine". This protection, which he termed "data independence," was the foundational goal of his new model. +Codd's central argument was that users and applications needed to be "protected from having to know how the data is organized in the machine". This protection, which he termed "{index}`data independence`," was the foundational goal of his new model. 
### A Relational Model for Data: Core Principles of Relations, Tuples, and Domains @@ -81,7 +81,7 @@ Having defined the data structure, Codd and his colleagues subsequently develope From these primitives, other useful operators like Join (⋈) and Intersection (∩) can be derived. Two properties of this algebra are particularly crucial for understanding its power and elegance. -First is the **Closure Property**. This principle states that the result of any operation in relational algebra is itself a relation. This is a profoundly important feature. Because the output of an operation is the same type of object as the input, operators can be composed and nested to form expressions of arbitrary complexity. A query can be built up from sub-queries, each of which produces a valid relation that can be fed into the next operation. This property is the foundation of modern query languages like SQL. +First is the **{index}`Closure property`**. This principle states that the result of any operation in relational algebra is itself a relation. This is a profoundly important feature. Because the output of an operation is the same type of object as the input, operators can be composed and nested to form expressions of arbitrary complexity. A query can be built up from sub-queries, each of which produces a valid relation that can be fed into the next operation. This property is the foundation of modern query languages like SQL. Second is the concept of **Relational Completeness**. Relational algebra serves as a theoretical benchmark for the expressive power of any database query language. A language is said to be "relationally complete" if it can be used to formulate any query that is expressible in relational algebra (or its declarative equivalent, relational calculus). This provides a formal yardstick to measure whether a language is sufficiently powerful to perform any standard relational query without needing to resort to procedural constructs like loops or branching. @@ -205,7 +205,7 @@ To ensure clarity in the subsequent technical discussion, the following table pr ### Core Properties: Algebraic Closure and Entity Integrity Preservation -The principle of entity normalization is not merely a guideline for schema design; it is a strict constraint on DataJoint's query language. DataJoint implements a complete relational algebra with five primary operators: restrict (`&`), join (`*`), project (`proj`), aggregate (`aggr`), and union (`+`). This algebra is designed around two critical properties that work in concert to maintain semantic cohesion: +The principle of {index}`entity normalization` is not merely a guideline for schema design; it is a strict constraint on DataJoint's query language. DataJoint implements a complete relational algebra with five primary operators: restrict (`&`), join (`*`), project (`proj`), aggregate (`aggr`), and union (`+`). This algebra is designed around two critical properties that work in concert to maintain semantic cohesion: **Algebraic Closure**: Like classical relational algebra, DataJoint's algebra possesses the closure property. All operators take entity sets as input and produce a valid entity set as output. This allows for the seamless composition and nesting of query expressions. 
From 7a9c568b5effd84b7e5ea48ef0dc6f26d08b71d5 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 03:19:48 +0000 Subject: [PATCH 02/18] Add .gitignore for build artifacts and dependencies --- .gitignore | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..909666a --- /dev/null +++ b/.gitignore @@ -0,0 +1,23 @@ +# Build output +_build/ + +# Node.js dependencies +node_modules/ + +# Python +__pycache__/ +*.py[cod] +.ipynb_checkpoints/ + +# Environment +.env +.venv/ +venv/ + +# IDE +.vscode/ +.idea/ + +# OS files +.DS_Store +Thumbs.db From 72db81ee3b316434ae4a8fc2e82bb64030f7891d Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Thu, 11 Dec 2025 21:48:09 -0600 Subject: [PATCH 03/18] clean up --- SIMPLIFICATION_RECOMMENDATIONS.md | 184 ------------------------------ 1 file changed, 184 deletions(-) delete mode 100644 SIMPLIFICATION_RECOMMENDATIONS.md diff --git a/SIMPLIFICATION_RECOMMENDATIONS.md b/SIMPLIFICATION_RECOMMENDATIONS.md deleted file mode 100644 index 73f0af5..0000000 --- a/SIMPLIFICATION_RECOMMENDATIONS.md +++ /dev/null @@ -1,184 +0,0 @@ -# Recommendations for Simplifying Main Text Examples - -This report identifies opportunities to simplify examples in the main text by referencing comprehensive examples in the `book/80-examples/` section. - -## Executive Summary - -After reviewing the main text chapters and the examples section, I identified several opportunities for simplification. However, many examples in the main text serve specific pedagogical purposes and are intentionally minimal to focus on particular concepts. The recommendations below balance simplification with pedagogical effectiveness. - -## Examples Section Inventory - -| Notebook | Domain | Key Features | -|----------|--------|--------------| -| `015-university.ipynb` | Academic administration | Complete schema with Students, Courses, Departments, Terms, Enrollments, Grades; synthetic data generation | -| `016-university-queries.ipynb` | Query patterns | Comprehensive query examples: restriction, joins, aggregation, universal sets | -| `010-classic-sales.ipynb` | E-commerce | MySQL sample database; workflow-centric business operations | -| `070-fractals.ipynb` | Computational pipeline | Table tiers (Manual, Lookup, Computed), populate mechanics, image processing | -| `075-blob-detection.ipynb` | Image analysis | Master-part relationships, parameter sweeps, computational workflows | - ---- - -## Recommendation 1: Queries Chapter - Reference University Queries - -**File**: `book/50-queries/020-restriction.ipynb` - -**Current State**: Creates a standalone languages/fluency database example to demonstrate restriction patterns. - -**Opportunity**: The restriction chapter could be simplified by: -1. Keeping the concise language/fluency example for basic concepts -2. Adding a cross-reference note at the end directing readers to `016-university-queries.ipynb` for more comprehensive query patterns - -**Suggested Addition** (at end of chapter): -```markdown -## Further Practice - -For comprehensive query examples covering all patterns discussed here, -see the [University Queries](../80-examples/016-university-queries.ipynb) example, -which demonstrates these concepts on a realistic academic database. 
-``` - -**Impact**: Low - additive, doesn't require removing existing content - ---- - -## Recommendation 2: Relationships Chapter - Reference Classic Sales - -**File**: `book/30-database-design/050-relationships.ipynb` - -**Current State**: Creates 12 bank schemas (bank1-12) to demonstrate relationship patterns incrementally. - -**Analysis**: The bank examples are intentionally minimal and incremental, which is pedagogically valuable. Each schema builds on the previous to illustrate specific cardinality concepts. - -**Opportunity**: Add a cross-reference after the core patterns are established: - -**Suggested Addition** (after the "Many-to-Many" section): -```markdown -:::{tip} -For a complete business database demonstrating these relationship patterns -in a realistic context, see the [Classic Sales](../80-examples/010-classic-sales.ipynb) -example, which models offices, employees, customers, orders, and products -as an integrated workflow. -::: -``` - -**Impact**: Low - additive only - ---- - -## Recommendation 3: Master-Part Chapter - Reference Blob Detection - -**File**: `book/30-database-design/053-master-part.ipynb` - -**Current State**: Uses polygon/vertex example for master-part relationships. - -**Analysis**: The polygon/vertex example is appropriately minimal for introducing the concept. The chapter already mentions computational workflows. - -**Opportunity**: Add a practical cross-reference: - -**Suggested Addition** (in "Master-Part in Computations" section): -```markdown -For a complete computational example demonstrating master-part relationships -in an image analysis pipeline, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) -example, where `Detection` (master) and `Detection.Blob` (part) capture -aggregate results and per-feature details atomically. -``` - -**Impact**: Low - enhances existing content - ---- - -## Recommendation 4: Computation Chapter - Already Well Cross-Referenced - -**File**: `book/60-computation/010-computation.ipynb` - -**Current State**: Already references `075-blob-detection.ipynb` extensively as a case study. - -**Analysis**: This chapter demonstrates best practice - it explains concepts briefly and directs readers to the comprehensive example for implementation details. - -**Recommendation**: No changes needed. This is a model for other chapters. - ---- - -## Recommendation 5: Normalization Chapter - Potential for E-commerce Simplification - -**File**: `book/30-database-design/055-normalization.ipynb` - -**Current State**: Contains extensive E-commerce Order Processing example (Order → Payment → Shipment → Delivery → DeliveryConfirmation) spanning ~100 lines. - -**Analysis**: This example is integral to explaining workflow normalization principles. It demonstrates how traditional normalization approaches differ from workflow normalization. - -**Opportunity**: Consider adding reference to classic-sales after the e-commerce discussion: - -**Suggested Addition**: -```markdown -:::{seealso} -The [Classic Sales](../80-examples/010-classic-sales.ipynb) example demonstrates -these workflow normalization principles in a complete business database with -offices, employees, customers, orders, and products. -::: -``` - -**Impact**: Low - additive only - ---- - -## Recommendation 6: Concepts Chapter - Reference Fractals Example - -**File**: `book/20-concepts/04-workflows.md` - -**Current State**: Explains Relational Workflow Model concepts theoretically. 
- -**Opportunity**: Add reference to practical implementation: - -**Suggested Addition** (after "Table Tiers: Workflow Roles" section): -```markdown -:::{tip} -For a hands-on demonstration of all table tiers working together in a -computational pipeline, see the [Julia Fractals](../80-examples/070-fractals.ipynb) -example, which shows Manual tables for experimental parameters, Lookup tables -for reference data, and Computed tables for derived results. -::: -``` - -**Impact**: Low - connects theory to practice - ---- - -## Not Recommended for Simplification - -### Bank Examples (050-relationships.ipynb) -The 12 bank schemas serve a clear pedagogical purpose: demonstrating relationship patterns incrementally. Replacing them with references would lose the step-by-step learning progression. - -### Language/Fluency Examples (020-restriction.ipynb) -These are appropriately minimal for teaching restriction concepts. The university queries example is more complex and would overwhelm the focused explanation. - -### Mouse/Cage Examples (055-normalization.ipynb) -These examples are tightly integrated with the normalization discussion and demonstrate the specific points about workflow normalization vs. entity normalization. - -### Polygon/Vertex Example (053-master-part.ipynb) -This minimal example is ideal for introducing master-part concepts without distraction. - ---- - -## Implementation Priority - -| Priority | Recommendation | Effort | Impact | -|----------|---------------|--------|--------| -| 1 | Add blob-detection reference to master-part chapter | Low | High - connects concepts to practical example | -| 2 | Add fractals reference to concepts chapter | Low | Medium - connects theory to practice | -| 3 | Add university-queries reference to restriction chapter | Low | Medium - provides comprehensive practice | -| 4 | Add classic-sales reference to relationships chapter | Low | Low - supplementary | -| 5 | Add classic-sales reference to normalization chapter | Low | Low - supplementary | - ---- - -## Conclusion - -The main text examples are generally well-designed for their pedagogical purposes. The primary opportunity is to **add cross-references** to comprehensive examples rather than remove existing content. This approach: - -1. Preserves the focused, incremental learning in main text chapters -2. Directs motivated readers to comprehensive examples for deeper exploration -3. Demonstrates how concepts apply in realistic, complete systems -4. Reduces duplication of effort for readers who explore multiple chapters - -The computation chapter (`010-computation.ipynb`) already exemplifies best practice by referencing `075-blob-detection.ipynb` as a case study rather than duplicating the full implementation. From 5c2da11ffe4b9902b34aaa2ac2c5d62e1e663078 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 04:03:10 +0000 Subject: [PATCH 04/18] Rename index.md to genindex.md to avoid slug conflict The index page was named 'index.md' which could conflict with the site's main index page. Renamed to 'genindex.md' (following Sphinx convention) and added proper frontmatter title. 
--- book/95-reference/genindex.md | 8 ++++++++ book/95-reference/index.md | 4 ---- 2 files changed, 8 insertions(+), 4 deletions(-) create mode 100644 book/95-reference/genindex.md delete mode 100644 book/95-reference/index.md diff --git a/book/95-reference/genindex.md b/book/95-reference/genindex.md new file mode 100644 index 0000000..4ada334 --- /dev/null +++ b/book/95-reference/genindex.md @@ -0,0 +1,8 @@ +--- +title: Index of Terms +--- + +# Index of Terms + +```{show-index} +``` diff --git a/book/95-reference/index.md b/book/95-reference/index.md deleted file mode 100644 index b9e52d3..0000000 --- a/book/95-reference/index.md +++ /dev/null @@ -1,4 +0,0 @@ -# Index - -```{show-index} -``` From db7c703d82c7dc6daea9732589bb7693bb1613c7 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 04:05:55 +0000 Subject: [PATCH 05/18] Remove index functionality - Delete genindex.md (index of terms page) - Remove all {index} directives from 9 files throughout the book The MyST book-theme has a React bug preventing proper HTML generation, so the index feature is being removed for now. --- book/00-introduction/00-purpose.md | 14 +++++++------- book/00-introduction/05-executive-summary.md | 18 +++++++++--------- book/20-concepts/00-databases.md | 14 +++++++------- book/20-concepts/01-models.md | 12 ++++++------ book/20-concepts/02-relational.md | 8 ++++---- book/20-concepts/04-workflows.md | 10 +++++----- book/30-design/012-integrity.md | 8 ++++---- book/30-design/020-primary-key.md | 8 ++++---- book/50-queries/008-datajoint-in-context.md | 6 +++--- book/95-reference/genindex.md | 8 -------- 10 files changed, 49 insertions(+), 57 deletions(-) delete mode 100644 book/95-reference/genindex.md diff --git a/book/00-introduction/00-purpose.md b/book/00-introduction/00-purpose.md index b3fd6a5..fcb1c0d 100644 --- a/book/00-introduction/00-purpose.md +++ b/book/00-introduction/00-purpose.md @@ -4,7 +4,7 @@ title: Purpose ## What is DataJoint? -**{index}`DataJoint` is a {index}`computational database` language and platform that enables scientists to design, implement, and manage data operations for research by unifying data structures and analysis code.** It provides {index}`data integrity`, {index}`automated computation`, {index}`reproducibility`, and seamless collaboration through a {index}`relational database` approach that coordinates relational databases, code repositories, and object storage. +**DataJoint is a computational database language and platform that enables scientists to design, implement, and manage data operations for research by unifying data structures and analysis code.** It provides data integrity, automated computation, reproducibility, and seamless collaboration through a relational database approach that coordinates relational databases, code repositories, and object storage. ## Who This Book Is For @@ -28,10 +28,10 @@ Here's what makes DataJoint different: **your database schema IS your data proce Traditional databases store and retrieve data. DataJoint does that too, but it also tracks what gets computed from what. 
Each table plays a specific role in your workflow: -- **{index}`Manual table`s**: Source data entered by researchers -- **{index}`Imported table`s**: Data acquired from instruments or external sources -- **{index}`Computed table`s**: Results automatically derived from upstream data -- **{index}`Lookup table`s**: Reference data and parameters +- **Manual tables**: Source data entered by researchers +- **Imported tables**: Data acquired from instruments or external sources +- **Computed tables**: Results automatically derived from upstream data +- **Lookup tables**: Reference data and parameters This workflow perspective shapes everything: @@ -39,7 +39,7 @@ This workflow perspective shapes everything: **Intelligent Diagrams**: Different table types get distinct visual styles. One glance tells you what's manual, what's automatic, and how everything connects. -**{index}`Provenance`, Not Just Integrity**: {index}`Foreign key`s mean more than "this ID exists." They mean "this result was computed FROM this input." When upstream data changes, DataJoint ensures you can't accidentally keep stale downstream results. This is why DataJoint emphasizes INSERT and DELETE over UPDATE—changing input data without recomputing outputs breaks your science, even if the database technically remains consistent. +**Provenance, Not Just Integrity**: Foreign keys mean more than "this ID exists." They mean "this result was computed FROM this input." When upstream data changes, DataJoint ensures you can't accidentally keep stale downstream results. This is why DataJoint emphasizes INSERT and DELETE over UPDATE—changing input data without recomputing outputs breaks your science, even if the database technically remains consistent. For scientific computing, this workflow-centric design is transformative. Your database doesn't just store results—it guarantees they're valid, reproducible, and traceable back to their origins. @@ -62,7 +62,7 @@ This book provides the skills to transform research operations: from fragile scr ## DataJoint and SQL: Two Languages, One Foundation -**{index}`SQL` (Structured Query Language)** powers virtually every relational database. DataJoint wraps SQL in Pythonic syntax, automatically translating your code into optimized queries. +**SQL (Structured Query Language)** powers virtually every relational database. DataJoint wraps SQL in Pythonic syntax, automatically translating your code into optimized queries. You could learn DataJoint without ever seeing SQL. But this book teaches both, side by side. You'll understand not just *what* works but *why*—and you'll be able to work directly with SQL when needed. diff --git a/book/00-introduction/05-executive-summary.md b/book/00-introduction/05-executive-summary.md index 1d2db7a..6d6f05c 100644 --- a/book/00-introduction/05-executive-summary.md +++ b/book/00-introduction/05-executive-summary.md @@ -11,7 +11,7 @@ Standard database solutions address storage and querying but not computation. Da ## The DataJoint Solution -**DataJoint introduces the {index}`Relational Workflow Model`**—an extension of classical relational theory that treats computational transformations as first-class citizens of the data model. The database {index}`schema` becomes an executable specification: it defines not just what data exists, but how data flows through the pipeline and when computations should run. 
+**DataJoint introduces the Relational Workflow Model**—an extension of classical relational theory that treats computational transformations as first-class citizens of the data model. The database schema becomes an executable specification: it defines not just what data exists, but how data flows through the pipeline and when computations should run. This creates what we call a **Computational Database**: a system where inserting new raw data automatically triggers all downstream analyses in dependency order, maintaining computational validity throughout. Think of it as a spreadsheet that auto-recalculates, but with the rigor of a relational database and the scale of distributed computing. @@ -21,16 +21,16 @@ This creates what we call a **Computational Database**: a system where inserting Unlike Entity-Relationship modeling that requires translation to SQL, DataJoint schemas are directly executable. The diagram *is* the implementation. Schema changes propagate immediately. Documentation cannot drift from reality because the schema is the documentation. **Workflow-Aware Foreign Keys** -Foreign keys in DataJoint do more than enforce {index}`referential integrity`—they encode computational dependencies. A computed result that references raw data will be automatically deleted if that raw data is removed, preventing stale or orphaned results. This maintains *{index}`computational validity`*, not just *referential integrity*. +Foreign keys in DataJoint do more than enforce referential integrity—they encode computational dependencies. A computed result that references raw data will be automatically deleted if that raw data is removed, preventing stale or orphaned results. This maintains *computational validity*, not just *referential integrity*. **Declarative Computation** -Computations are defined declaratively through {index}`make() method`s attached to table definitions. The {index}`populate()` operation identifies all missing results and executes computations in dependency order. Parallelization, error handling, and job distribution are handled automatically. +Computations are defined declaratively through make() methods attached to table definitions. The populate() operation identifies all missing results and executes computations in dependency order. Parallelization, error handling, and job distribution are handled automatically. -**{index}`Immutability` by Design** +**Immutability by Design** Computed results are immutable. Correcting upstream data requires deleting dependent results and recomputing—ensuring the database always represents a consistent computational state. This naturally provides complete provenance: every result can be traced to its source data and the exact code that produced it. **Hybrid Storage Model** -Structured metadata lives in the relational database ({index}`MySQL`/{index}`PostgreSQL`). Large binary objects (images, recordings, arrays) live in scalable {index}`object storage` (S3, GCS, filesystem) with the database maintaining the mapping. Queries operate on metadata; computation accesses objects transparently. +Structured metadata lives in the relational database (MySQL/PostgreSQL). Large binary objects (images, recordings, arrays) live in scalable object storage (S3, GCS, filesystem) with the database maintaining the mapping. Queries operate on metadata; computation accesses objects transparently. 
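
As a rough illustration of the `make()`/`populate()` pattern described above, the following minimal sketch uses hypothetical table names (`Recording`, `TraceStats`) that are not taken from the book's examples and assumes a configured database connection:

```python
import datajoint as dj

schema = dj.Schema('demo_pipeline')   # hypothetical schema name


@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int        # unique recording identifier
    ---
    sampling_rate : float     # Hz
    trace : longblob          # raw signal stored as a NumPy array
    """


@schema
class TraceStats(dj.Computed):
    definition = """
    -> Recording              # one result per upstream recording
    ---
    mean_value : float
    max_value : float
    """

    def make(self, key):
        # Fetch the upstream data for this key, derive the result, insert it.
        trace = (Recording & key).fetch1('trace')
        self.insert1(dict(key,
                          mean_value=float(trace.mean()),
                          max_value=float(trace.max())))


# Find every Recording without a TraceStats entry and compute it in dependency order.
TraceStats.populate()
```

Here `populate()` identifies the missing `TraceStats` results and calls `make()` once per missing key, which is the declarative-computation behavior described above.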
## Architecture Overview @@ -70,9 +70,9 @@ This book provides comprehensive coverage of DataJoint from foundations through **Part II: Design** - Schema design principles and table definitions -- {index}`Primary key`s, foreign keys, and dependency structures -- {index}`Master-part relationship`s for hierarchical data -- {index}`Normalization` through the lens of workflow entities +- Primary keys, foreign keys, and dependency structures +- Master-part relationships for hierarchical data +- Normalization through the lens of workflow entities - Schema evolution and migration strategies **Part III: Operations** @@ -80,7 +80,7 @@ This book provides comprehensive coverage of DataJoint from foundations through - Caching strategies for performance optimization **Part IV: Queries** -- DataJoint's five-operator {index}`query algebra`: {index}`restriction`, {index}`projection`, {index}`join`, {index}`aggregation`, {index}`union` +- DataJoint's five-operator query algebra: restriction, projection, join, aggregation, union - Comparison with SQL and when to use each - Complex query patterns and optimization diff --git a/book/20-concepts/00-databases.md b/book/20-concepts/00-databases.md index 48edac6..f2c4f13 100644 --- a/book/20-concepts/00-databases.md +++ b/book/20-concepts/00-databases.md @@ -5,7 +5,7 @@ title: Databases ## What is a Database? ```{card} Database -A **{index}`database`** is a dynamic (i.e. *time-varying*), systematically organized collection of data that plays an integral role in the operation of an enterprise. +A **database** is a dynamic (i.e. *time-varying*), systematically organized collection of data that plays an integral role in the operation of an enterprise. It supports the enterprise's operations and is accessed by a variety of users in different ways. Examples of enterprises that rely on databases include hotels, airlines, stores, hospitals, universities, banks, and scientific studies. The database not only tracks the current state of the enterprise's processes but also enforces essential *business rules*, ensuring that only valid transactions occur and preventing errors or inconsistencies. It serves as the **system of record**, the **single source of truth**, accurately reflecting the current state and ongoing activities. @@ -25,7 +25,7 @@ Databases are crucial for the smooth and organized operation of various entities ## Database Management Systems (DBMS) ```{card} Database Management System -A {index}`Database Management System` ({index}`DBMS`) is a software system that serves as the computational engine powering a database. +A Database Management System (DBMS) is a software system that serves as the computational engine powering a database. It defines and enforces the structure of the data, ensuring that the organization's rules are consistently applied. A DBMS manages data storage and efficiently executes data updates and queries while safeguarding the data's structure and integrity, particularly in environments with multiple concurrent users. @@ -50,7 +50,7 @@ One of the most critical features distinguishing databases from simple file stor ### Authentication and Authorization -Before you can work with a database, you must **{index}`authentication`**—prove your identity with a username and password. Once authenticated, the database enforces **{index}`authorization`** rules that determine what you can do: +Before you can work with a database, you must **authentication**—prove your identity with a username and password. 
Once authenticated, the database enforces **authorization** rules that determine what you can do: - **Read**: View specific tables or columns - **Write**: Add new data to certain tables @@ -80,11 +80,11 @@ Modern databases typically separate data management from data use through distin ### Common Architectures -**{index}`Server-client architecture`** (most common): A database server program manages all data operations, while client programs (your scripts, applications, notebooks) connect to request data or submit changes. The server enforces all rules and access permissions consistently for every client. This is like a library where the librarian (server) manages the books and enforces checkout policies, while patrons (clients) request materials. +**Server-client architecture** (most common): A database server program manages all data operations, while client programs (your scripts, applications, notebooks) connect to request data or submit changes. The server enforces all rules and access permissions consistently for every client. This is like a library where the librarian (server) manages the books and enforces checkout policies, while patrons (clients) request materials. The two most popular open-source relational database systems: MySQL and PostgreSQL implement a server-client architecture. -**{index}`Embedded database`s**: The database engine runs within your application itself—no separate server. This works for single-user applications like mobile apps or desktop software, but doesn't support multiple users accessing shared data simultaneously. -{index}`SQLite` is a common embedded database @10.14778/3554821.3554842. +**Embedded databases**: The database engine runs within your application itself—no separate server. This works for single-user applications like mobile apps or desktop software, but doesn't support multiple users accessing shared data simultaneously. +SQLite is a common embedded database @10.14778/3554821.3554842. **Distributed Databases**: Data and processing are spread across multiple servers working together. This provides high availability and can handle massive scale, but adds significant complexity. Systems like Google Spanner, Amazon DynamoDB, and CockroachDB use this approach. @@ -106,7 +106,7 @@ Separating data management from data use provides critical advantages: This book focuses on **DataJoint**, a framework that extends relational databases specifically for scientific workflows. DataJoint builds on the solid foundation of relational theory while adding capabilities essential for research: automated computation, data provenance, and reproducibility. -The {index}`relational data model`—introduced by {index}`Edgar F. Codd` in 1970—revolutionized data management by organizing data into tables with well-defined relationships. This model has dominated database systems for over five decades due to its mathematical rigor and versatility. Modern relational databases like MySQL and PostgreSQL continue to evolve, incorporating new capabilities for scalability and security while maintaining the core principles that make them reliable and powerful. +The relational data model—introduced by Edgar F. Codd in 1970—revolutionized data management by organizing data into tables with well-defined relationships. This model has dominated database systems for over five decades due to its mathematical rigor and versatility. 
Modern relational databases like MySQL and PostgreSQL continue to evolve, incorporating new capabilities for scalability and security while maintaining the core principles that make them reliable and powerful. The following chapters build the conceptual foundation you need to understand DataJoint's approach: - **Data Models**: What data models are and why schemas matter for scientific work diff --git a/book/20-concepts/01-models.md b/book/20-concepts/01-models.md index fa8b3e4..e39d00f 100644 --- a/book/20-concepts/01-models.md +++ b/book/20-concepts/01-models.md @@ -15,7 +15,7 @@ This chapter introduces data models conceptually, explores the critical distinct ## Definition ```{card} Data Model -A *{index}`data model`* is a conceptual framework that defines how data is organized, represented, and transformed. It gives us the components for creating blueprints for the structure and operations of data management systems, ensuring consistency and efficiency in data handling. +A *data model* is a conceptual framework that defines how data is organized, represented, and transformed. It gives us the components for creating blueprints for the structure and operations of data management systems, ensuring consistency and efficiency in data handling. Data management systems are built to accommodate these models, allowing us to manage data according to the principles laid out by the model. If you're studying data science or engineering, you've likely encountered different data models, each providing a unique approach to organizing and manipulating data. @@ -77,9 +77,9 @@ Both structured and schemaless data formats can be attractive in various scenari These two approaches are sometimes called **schema-on-write** and **schema-on-read**: -- **{index}`Schema-on-write`** refers to structured data models where the schema is defined and enforced *before* data is stored. Data must conform to the schema at write time, ensuring consistency and integrity from the moment data enters the system. Relational databases exemplify this approach. +- **Schema-on-write** refers to structured data models where the schema is defined and enforced *before* data is stored. Data must conform to the schema at write time, ensuring consistency and integrity from the moment data enters the system. Relational databases exemplify this approach. -- **{index}`Schema-on-read`** refers to schemaless or self-describing data models where structure is interpreted *when* data is read rather than when it is written. Data can be ingested rapidly in its raw form, with structure applied later during analysis. Data lakes and document stores often follow this approach. +- **Schema-on-read** refers to schemaless or self-describing data models where structure is interpreted *when* data is read rather than when it is written. Data can be ingested rapidly in its raw form, with structure applied later during analysis. Data lakes and document stores often follow this approach. Each approach has its strengths: @@ -244,7 +244,7 @@ Most importantly, spreadsheets provide no referential integrity. If cell B2 cont The **relational data model**, introduced by Edgar F. Codd in 1970, revolutionized data management by organizing data into tables (relations) with well-defined relationships. This model emphasizes data integrity, consistency, and powerful query capabilities through a formal mathematical foundation. 
-The relational model organizes all data into {index}`table`s representing mathematical {index}`relation`s, where each table consists of rows (representing mathematical *{index}`tuple`s*) and columns (often called *{index}`attribute`s*). Key principles include {index}`data type` constraints, uniqueness enforcement through primary keys, referential integrity through foreign keys, and {index}`declarative query`. The next chapter explores these principles in depth. +The relational model organizes all data into tables representing mathematical relations, where each table consists of rows (representing mathematical *tuples*) and columns (often called *attributes*). Key principles include data type constraints, uniqueness enforcement through primary keys, referential integrity through foreign keys, and declarative query. The next chapter explores these principles in depth. The most common way to interact with relational databases is through the Structured Query Language (SQL), a language specifically designed to define, manipulate, and query data within relational databases. @@ -261,7 +261,7 @@ The rest of this book focuses on the relational model, but specifically through ### Example: Document Databases (JSON) -The Document Data Model, commonly exemplified by {index}`JSON` (JavaScript Object Notation), organizes data as key-value pairs within structured documents. This flexible, text-based format is widely used for data interchange between systems, particularly in web applications and APIs. +The Document Data Model, commonly exemplified by JSON (JavaScript Object Notation), organizes data as key-value pairs within structured documents. This flexible, text-based format is widely used for data interchange between systems, particularly in web applications and APIs. #### Structure @@ -313,7 +313,7 @@ The key insight: while initial ingestion might be flexible (schema-on-read), the In recent years, concerns about scientific integrity have brought greater attention to proper data management as the foundation for reproducible science and valid findings. As science becomes more complex and interconnected, meticulous data handling—including reproducibility and data provenance—has become critical. -**{index}`Data provenance`**—the detailed history of how data is collected, processed, and analyzed—provides transparency and accountability. But provenance tracked through metadata alone can break: +**Data provenance**—the detailed history of how data is collected, processed, and analyzed—provides transparency and accountability. But provenance tracked through metadata alone can break: - Tags pointing to deleted files - Descriptions of outdated relationships - Manual records that fall out of sync with actual data diff --git a/book/20-concepts/02-relational.md b/book/20-concepts/02-relational.md index 7467238..16e265b 100644 --- a/book/20-concepts/02-relational.md +++ b/book/20-concepts/02-relational.md @@ -92,7 +92,7 @@ The evolution of relational database thinking provides three complementary level Codd's relational algebra and calculus provide formal operations on relations with provable properties. Query optimization relies on these mathematical foundations to prove that two different queries produce equivalent results, allowing the system to choose the most efficient implementation. -**2. Conceptual Modeling (Chen, 1976):** Building upon Codd's foundation, **{index}`Peter Chen`** introduced the {index}`Entity-Relationship Model` (ERM). 
While Codd's model provided the rigorous mathematical underpinnings, Chen's ERM offered a more intuitive, conceptual way to think about and design databases, particularly during initial planning stages. +**2. Conceptual Modeling (Chen, 1976):** Building upon Codd's foundation, **Peter Chen** introduced the Entity-Relationship Model (ERM). While Codd's model provided the rigorous mathematical underpinnings, Chen's ERM offered a more intuitive, conceptual way to think about and design databases, particularly during initial planning stages. *Like architectural blueprints that translate engineering principles into buildable structures.* @@ -128,7 +128,7 @@ The next sections show how this mathematical rigor translates into practical dat ## Relational Algebra and Calculus -**{index}`Relational algebra`** is a set of operations that can be used to transform relations in a formal way. It provides the foundation for querying relational databases, allowing us to combine, modify, and retrieve data stored in tables (relations). +**Relational algebra** is a set of operations that can be used to transform relations in a formal way. It provides the foundation for querying relational databases, allowing us to combine, modify, and retrieve data stored in tables (relations). Examples of relational operators: @@ -165,7 +165,7 @@ This operation effectively merges the connections from both sets of values, prov Relational algebra, with its powerful operators, allows us to query and manipulate data in a structured and efficient way, forming the backbone of modern database systems. By understanding and applying these operators, we can perform complex data analysis and retrieval tasks with precision and clarity. -Another formal language for deriving new relations from scratch or from other relations is **{index}`relational calculus`**. Rather than using relational operators, it relies on a *set-building notation* to generate relations. +Another formal language for deriving new relations from scratch or from other relations is **relational calculus**. Rather than using relational operators, it relies on a *set-building notation* to generate relations. :::{note} The query notation of the SQL programming language combines concepts from both relational algebra and relational calculus. However, DataJoint's query language is based purely on relational algebra. @@ -188,7 +188,7 @@ Codd's model was derived from relational theory but differed sufficiently in its Through the 1970s, before relational databases became practical, theorists derived fundamental rules for rigorous data organization and queries from first principles using mathematical proofs and derivations. For this reason, early work on relational databases has an abstract academic feel to it with rather simple toy examples: the ubiquitous employees/departments, products/orders, and students/courses. -The design principles were defined through the rigorous but rather abstract principles, the **{index}`normal form`s** [@10.1145/358024.358054]. These normal forms provide mathematically precise rules for organizing data to minimize redundancy and maintain integrity. +The design principles were defined through the rigorous but rather abstract principles, the **normal forms** [@10.1145/358024.358054]. These normal forms provide mathematically precise rules for organizing data to minimize redundancy and maintain integrity. The relational data model is one of the most powerful and precise ways to store and manage structured data. 
At its core, this model organizes all data into tables—representing mathematical relations—where each table consists of rows (representing mathematical *tuples*) and columns (often called *attributes*). diff --git a/book/20-concepts/04-workflows.md b/book/20-concepts/04-workflows.md index e654ef0..2bc22cb 100644 --- a/book/20-concepts/04-workflows.md +++ b/book/20-concepts/04-workflows.md @@ -6,7 +6,7 @@ The previous chapters established that traditional relational databases excel at **DataJoint solves these problems by treating your database schema as an executable workflow specification.** Your table definitions don't just describe data structure—they prescribe how data flows through your pipeline, when computations run, and how results depend on inputs. -This chapter introduces the **Relational Workflow Model**—a fundamental extension of relational theory that makes databases workflow-aware while preserving all the mathematical rigor of Codd's model. The Relational Workflow Model defines a new class of databases called **{index}`Computational Databases `**, where computational transformations are first-class citizens of the data model. Just as electronic spreadsheets automatically recalculate formulas when you enter new data, computational databases trigger cascades of computations specified by the schema whenever new data enters the system. +This chapter introduces the **Relational Workflow Model**—a fundamental extension of relational theory that makes databases workflow-aware while preserving all the mathematical rigor of Codd's model. The Relational Workflow Model defines a new class of databases called **Computational Databases**, where computational transformations are first-class citizens of the data model. Just as electronic spreadsheets automatically recalculate formulas when you enter new data, computational databases trigger cascades of computations specified by the schema whenever new data enters the system. ## A New Paradigm for Relational Databases @@ -36,7 +36,7 @@ The **mathematical view** of the relational model, championed by Edgar F. Codd, **Tuple as Proposition**: Each row (tuple) is a specific set of attribute values that asserts a true proposition for the predicate. For example, if a table's predicate is "Employee $x$ works on Project $y$," the row `(Alice, P1)` asserts the truth: "Employee Alice works on Project P1." -**{index}`Functional dependency`**: The core concept is the functional dependency: attribute `A` functionally determines attribute `B` (written `A → B`) if knowing the value of `A` allows you to determine the unique value of `B`. For example, the attribute `department` functionally determines the attribute `department_chair` because knowing the department name allows you to determine the unique name of the department chair. Functional dependencies are helpful for reasoning about the structure of the database and for performing queries. +**Functional dependency**: The core concept is the functional dependency: attribute `A` functionally determines attribute `B` (written `A → B`) if knowing the value of `A` allows you to determine the unique value of `B`. For example, the attribute `department` functionally determines the attribute `department_chair` because knowing the department name allows you to determine the unique name of the department chair. Functional dependencies are helpful for reasoning about the structure of the database and for performing queries. 
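
As a small worked example of this idea (with made-up department names), the functional dependency `department → department_chair` simply means that a single-valued lookup keyed on the department is well defined:

```python
# department -> department_chair: each department determines exactly one chair.
# Keying the mapping on department makes the dependency structurally impossible to violate.
department_chair = {
    'Physics': 'A. Chen',
    'Biology': 'R. Patel',
    'Chemistry': 'M. Okafor',
}

def chair_of(department: str) -> str:
    # Unambiguous lookup because department -> department_chair holds.
    return department_chair[department]

assert chair_of('Biology') == 'R. Patel'
# Note: the reverse dependency (chair -> department) is not implied.
```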
Then the database can be viewed as a collection of predicates and a minimal complete set of true propositions from which all other true propositions can be derived. Data queries are viewed as logical inferences using the rules of predicate calculus. *Relational algebra* and *relational calculus* provide set of operations that can be used to perform these inferences. Under the Closed World Assumption (CWA), the database is assumed to contain all true propositions and all other propositions are assumed to be false. CWA is a simplifying assumption that allows us to reason about the data in the database in a more precise way. @@ -168,11 +168,11 @@ This unified approach eliminates the traditional separation between conceptual d ### Table Tiers: Workflow Roles -DataJoint introduces a sophisticated classification system called {index}`table tier`s that organizes entity sets according to their specific role in the workflow. This classification goes far beyond simple organizational convenience—it fundamentally shapes how you think about data flow and responsibility within your system. +DataJoint introduces a sophisticated classification system called table tiers that organizes entity sets according to their specific role in the workflow. This classification goes far beyond simple organizational convenience—it fundamentally shapes how you think about data flow and responsibility within your system. The four table tiers each represent a distinct type of workflow activity. Lookup tables contain reference data and parameters, such as controlled vocabularies and constants that provide the foundational knowledge for your workflow. Manual tables capture human-entered data, including observations and decisions that require expert judgment or domain knowledge. Imported tables handle automated data acquisition from instruments, files, or external systems. Finally, computed tables perform automated processing, generating derived results and analyses from the data collected in other tiers. -This tiered structure creates a natural dependency hierarchy that reflects the logical flow of information through your workflow. Computed tables depend on imported or manual tables for their input data, which in turn may depend on lookup tables for reference information. This creates a {index}`directed acyclic graph` ({index}`DAG`) that makes the workflow structure explicit and prevents circular dependencies that could lead to infinite loops or logical inconsistencies. +This tiered structure creates a natural dependency hierarchy that reflects the logical flow of information through your workflow. Computed tables depend on imported or manual tables for their input data, which in turn may depend on lookup tables for reference information. This creates a directed acyclic graph (DAG) that makes the workflow structure explicit and prevents circular dependencies that could lead to infinite loops or logical inconsistencies. The visual representation of this structure through color-coded diagrams provides immediate insight into your workflow. Green represents manual tables where human expertise enters the system, blue shows imported tables where automated data acquisition occurs, red indicates computed tables where algorithmic processing happens, and gray denotes lookup tables containing reference information. At a glance, you can see where data enters your system and trace how it flows through each processing step. 
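
The sketch below suggests how the four tiers and their dependency DAG might be declared in code; the schema and table names are hypothetical, not drawn from the book's examples, and a configured connection is assumed:

```python
import datajoint as dj

schema = dj.Schema('demo_tiers')      # hypothetical schema name


@schema
class AnalysisParams(dj.Lookup):      # gray: reference data and parameters
    definition = """
    param_id : int
    ---
    threshold : float
    """
    contents = [(1, 0.5), (2, 0.9)]


@schema
class Subject(dj.Manual):             # green: human-entered data
    definition = """
    subject_id : int
    ---
    species : varchar(30)
    """


@schema
class Session(dj.Imported):           # blue: automated data acquisition
    definition = """
    -> Subject
    session_idx : int
    ---
    raw_file : varchar(255)
    """

    def make(self, key):
        ...                           # ingestion from instruments/files omitted in this sketch


@schema
class Detection(dj.Computed):         # red: automated processing
    definition = """
    -> Session
    -> AnalysisParams
    ---
    n_events : int
    """

    def make(self, key):
        ...                           # computation omitted in this sketch


dj.Diagram(schema)                    # renders the color-coded dependency graph (a DAG)
```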
@@ -194,7 +194,7 @@ This workflow-centric approach makes relationships implicit rather than explicit The Relational Workflow Model introduces a crucial distinction between transactional consistency and computational validity that fundamentally changes how we think about data integrity. Traditional databases focus primarily on transactional consistency, ensuring that concurrent updates don't corrupt data through mechanisms like locking and isolation. While this is essential for preventing race conditions, it doesn't address a deeper problem that arises in computational workflows: ensuring that downstream results remain consistent with their upstream inputs. -DataJoint addresses this challenge through its approach to immutability and {index}`cascade delete` operations. When you delete an entity in DataJoint, the system doesn't simply remove that single record—it cascades the delete to all dependent entities throughout the workflow. This behavior isn't just cleanup; it's enforcing computational validity by recognizing that if the inputs are gone, any results based on those inputs become meaningless and must be removed. +DataJoint addresses this challenge through its approach to immutability and cascade delete operations. When you delete an entity in DataJoint, the system doesn't simply remove that single record—it cascades the delete to all dependent entities throughout the workflow. This behavior isn't just cleanup; it's enforcing computational validity by recognizing that if the inputs are gone, any results based on those inputs become meaningless and must be removed. The process of correcting data illustrates this principle beautifully. When you discover an error in upstream data, you don't simply update the problematic record. Instead, you delete the entire downstream pipeline that was computed from the incorrect data, reinsert the corrected data, and then recompute the entire dependent chain. This ensures that every result in your database represents a consistent computation from valid inputs. diff --git a/book/30-design/012-integrity.md b/book/30-design/012-integrity.md index be3bb06..f369bd3 100644 --- a/book/30-design/012-integrity.md +++ b/book/30-design/012-integrity.md @@ -53,7 +53,7 @@ Relational databases excel at expressing and enforcing such rules through **inte This section introduces seven fundamental types of integrity constraints. Each will be covered in detail in subsequent chapters, with DataJoint implementation examples. -## 1. {index}`Domain integrity` +## 1. Domain integrity **Ensures values are within valid ranges and types.** Domain integrity restricts attribute values to predefined valid sets using: @@ -82,7 +82,7 @@ Completeness prevents missing values that could invalidate analyses: --- -## 3. {index}`Entity integrity` +## 3. Entity integrity **Each real-world entity corresponds to exactly one database record, and each database record corresponds to exactly one real-world entity.** Entity integrity ensures a **one-to-one correspondence** between real-world entities and their digital representations in the database. This is not simply about having unique identifiers—it's about establishing a reliable, bidirectional mapping where: @@ -132,13 +132,13 @@ Compositional integrity ensures multi-part entities are never partially stored: --- -## 6. {index}`Consistency` +## 6. 
Consistency **All users see the same valid data state.** Consistency provides a unified view during concurrent access: - **Isolation levels** control transaction visibility - **Locking mechanisms** prevent conflicting updates -- **{index}`ACID` properties** guarantee reliable state transitions +- **ACID properties** guarantee reliable state transitions **Example:** Two researchers inserting experiments simultaneously don't create duplicates. diff --git a/book/30-design/020-primary-key.md b/book/30-design/020-primary-key.md index bc250ca..7a2bc75 100644 --- a/book/30-design/020-primary-key.md +++ b/book/30-design/020-primary-key.md @@ -192,7 +192,7 @@ This flexibility in entity integrity allows businesses to balance strict data ru ## Using Natural Keys -A table can be designed with a **{index}`natural key`**, which is an identifier that exists in the real world. For example, a Social Security Number (SSN) can serve as a natural key for a person because it is a unique number used and recognized in real-world systems. +A table can be designed with a **natural key**, which is an identifier that exists in the real world. For example, a Social Security Number (SSN) can serve as a natural key for a person because it is a unique number used and recognized in real-world systems. In some cases, a natural key already exists, or one can be specifically created for data management purposes and then introduced into the real world to be permanently associated with physical entities. @@ -204,7 +204,7 @@ Phone numbers, in particular, have become popular as identifiers as mobile phone # Composite Primary Keys -Sometimes, a single column cannot uniquely identify a record. In these cases, we use a **{index}`composite primary key`**—a primary key made up of multiple columns that together uniquely identify each row. +Sometimes, a single column cannot uniquely identify a record. In these cases, we use a **composite primary key**—a primary key made up of multiple columns that together uniquely identify each row. ## Example: U.S. House of Representatives @@ -286,7 +286,7 @@ Use composite primary keys when: # Using Surrogate Keys -In many cases, it makes more sense to use a **{index}`surrogate key`** as the primary key in a database. A surrogate key has no relationship to the real world and is used solely within the database for identification purposes. These keys are often generated automatically as an **auto-incrementing number** or a **random string** like a UUID (Universally Unique Identifier) or GUID (Globally Unique Identifier). +In many cases, it makes more sense to use a **surrogate key** as the primary key in a database. A surrogate key has no relationship to the real world and is used solely within the database for identification purposes. These keys are often generated automatically as an **auto-incrementing number** or a **random string** like a UUID (Universally Unique Identifier) or GUID (Globally Unique Identifier). When using surrogate keys, entity integrity can still be maintained by using other unique attributes (such as secondary unique indexes) to help identify and match entities to their digital representations. @@ -294,7 +294,7 @@ Surrogate keys are especially useful for entities that exist only in digital for ## Universally Unique Identifiers (UUIDs) -**{index}`UUID`s** (Universally Unique Identifiers) are 128-bit identifiers that are designed to be globally unique across time and space. 
They are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and provide a reliable way to generate surrogate keys without coordination between different systems. +**UUIDs** (Universally Unique Identifiers) are 128-bit identifiers that are designed to be globally unique across time and space. They are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and provide a reliable way to generate surrogate keys without coordination between different systems. ### UUID Format diff --git a/book/50-queries/008-datajoint-in-context.md b/book/50-queries/008-datajoint-in-context.md index 9214515..2ecc759 100644 --- a/book/50-queries/008-datajoint-in-context.md +++ b/book/50-queries/008-datajoint-in-context.md @@ -48,7 +48,7 @@ Codd identified this inflexibility as a symptom of a deeper problem and articula **Access Path Dependence**: The most significant issue was the reliance on predefined access paths. In a hierarchical or network model, a program's logic for retrieving data was inextricably linked to the specific parent-child or owner-member relationships defined in the schema. If a business requirement changed—for example, altering the relationship between Projects and Parts in a manufacturing database—the application programs that navigated the old structure would become logically impaired and cease to function correctly. -Codd's central argument was that users and applications needed to be "protected from having to know how the data is organized in the machine". This protection, which he termed "{index}`data independence`," was the foundational goal of his new model. +Codd's central argument was that users and applications needed to be "protected from having to know how the data is organized in the machine". This protection, which he termed "data independence," was the foundational goal of his new model. ### A Relational Model for Data: Core Principles of Relations, Tuples, and Domains @@ -81,7 +81,7 @@ Having defined the data structure, Codd and his colleagues subsequently develope From these primitives, other useful operators like Join (⋈) and Intersection (∩) can be derived. Two properties of this algebra are particularly crucial for understanding its power and elegance. -First is the **{index}`Closure property`**. This principle states that the result of any operation in relational algebra is itself a relation. This is a profoundly important feature. Because the output of an operation is the same type of object as the input, operators can be composed and nested to form expressions of arbitrary complexity. A query can be built up from sub-queries, each of which produces a valid relation that can be fed into the next operation. This property is the foundation of modern query languages like SQL. +First is the **Closure property**. This principle states that the result of any operation in relational algebra is itself a relation. This is a profoundly important feature. Because the output of an operation is the same type of object as the input, operators can be composed and nested to form expressions of arbitrary complexity. A query can be built up from sub-queries, each of which produces a valid relation that can be fed into the next operation. This property is the foundation of modern query languages like SQL. Second is the concept of **Relational Completeness**. Relational algebra serves as a theoretical benchmark for the expressive power of any database query language. 
A language is said to be "relationally complete" if it can be used to formulate any query that is expressible in relational algebra (or its declarative equivalent, relational calculus). This provides a formal yardstick to measure whether a language is sufficiently powerful to perform any standard relational query without needing to resort to procedural constructs like loops or branching. @@ -205,7 +205,7 @@ To ensure clarity in the subsequent technical discussion, the following table pr ### Core Properties: Algebraic Closure and Entity Integrity Preservation -The principle of {index}`entity normalization` is not merely a guideline for schema design; it is a strict constraint on DataJoint's query language. DataJoint implements a complete relational algebra with five primary operators: restrict (`&`), join (`*`), project (`proj`), aggregate (`aggr`), and union (`+`). This algebra is designed around two critical properties that work in concert to maintain semantic cohesion: +The principle of entity normalization is not merely a guideline for schema design; it is a strict constraint on DataJoint's query language. DataJoint implements a complete relational algebra with five primary operators: restrict (`&`), join (`*`), project (`proj`), aggregate (`aggr`), and union (`+`). This algebra is designed around two critical properties that work in concert to maintain semantic cohesion: **Algebraic Closure**: Like classical relational algebra, DataJoint's algebra possesses the closure property. All operators take entity sets as input and produce a valid entity set as output. This allows for the seamless composition and nesting of query expressions. diff --git a/book/95-reference/genindex.md b/book/95-reference/genindex.md deleted file mode 100644 index 4ada334..0000000 --- a/book/95-reference/genindex.md +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: Index of Terms ---- - -# Index of Terms - -```{show-index} -``` From 695426b96950de9779c9270cd8d54a2efabda4c8 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 19:39:07 +0000 Subject: [PATCH 06/18] Move Data Integrity chapter to Concepts section MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Relocated 012-integrity.md from 30-design to 20-concepts/04-integrity.md - Renumbered workflows chapter (04 → 05) and pipelines chapter (05 → 06) - Updated all cross-references in the moved chapter to point to Design section - Updated chapter transitions to reflect new flow: - 03-relational-practice.ipynb now points to Data Integrity as next chapter - 05-workflows.md now references Data Integrity chapter - 04-integrity.md Next Steps now references Workflows and Design section - Updated all files referencing renamed workflow/pipeline chapters: - 030-foreign-keys.ipynb - 050-relationships.ipynb - 053-master-part.ipynb - 010-computation.ipynb - 010-classic-sales.ipynb - 015-table.ipynb New Concepts section flow: 1. Databases → Models → Relational → Practice → Integrity → Workflows → Pipelines This organization places Data Integrity as a foundational concept before introducing the Relational Workflow Model, providing better pedagogical flow. 
--- book/20-concepts/03-relational-practice.ipynb | 10 +--- .../04-integrity.md} | 28 +++++----- .../{04-workflows.md => 05-workflows.md} | 2 +- .../{05-pipelines.md => 06-pipelines.md} | 0 book/30-design/015-table.ipynb | 2 +- book/30-design/030-foreign-keys.ipynb | 43 ++-------------- book/30-design/050-relationships.ipynb | 51 +------------------ book/30-design/053-master-part.ipynb | 2 +- book/60-computation/010-computation.ipynb | 2 +- book/80-examples/010-classic-sales.ipynb | 2 +- 10 files changed, 26 insertions(+), 116 deletions(-) rename book/{30-design/012-integrity.md => 20-concepts/04-integrity.md} (86%) rename book/20-concepts/{04-workflows.md => 05-workflows.md} (99%) rename book/20-concepts/{05-pipelines.md => 06-pipelines.md} (100%) diff --git a/book/20-concepts/03-relational-practice.ipynb b/book/20-concepts/03-relational-practice.ipynb index aaee363..12dd920 100644 --- a/book/20-concepts/03-relational-practice.ipynb +++ b/book/20-concepts/03-relational-practice.ipynb @@ -1568,15 +1568,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## The Path Forward: Databases as Workflows\n", - "\n", - "**DataJoint extends relational theory by viewing the schema as a workflow specification.** It preserves all the benefits of relational databases—mathematical rigor, declarative queries, data integrity—while adding workflow semantics that make the database **workflow-aware**.\n", - "\n", - "**Key Insight**: The database schema structure can be identical whether using SQL or DataJoint, although DataJoint imposes some conventions. What's different is the **conceptual view**: SQL sees static entities and relationships; DataJoint sees an executable workflow, where some steps are manual and others are automatic. This workflow view enables automatic execution, provenance tracking, and computational validity—features essential for scientific computing.\n", - "\n", - "The next chapter introduces DataJoint's Relational Workflow Model in detail, showing how Computed tables turn your schema into an executable pipeline specification.\n" - ] + "source": "## The Path Forward: Databases as Workflows\n\n**DataJoint extends relational theory by viewing the schema as a workflow specification.** It preserves all the benefits of relational databases—mathematical rigor, declarative queries, data integrity—while adding workflow semantics that make the database **workflow-aware**.\n\n**Key Insight**: The database schema structure can be identical whether using SQL or DataJoint, although DataJoint imposes some conventions. What's different is the **conceptual view**: SQL sees static entities and relationships; DataJoint sees an executable workflow, where some steps are manual and others are automatic. This workflow view enables automatic execution, provenance tracking, and computational validity—features essential for scientific computing.\n\nThe next chapter explores **Data Integrity**—the fundamental constraints that databases enforce to ensure data remains accurate, consistent, and reliable. Understanding these integrity concepts provides the foundation for DataJoint's Relational Workflow Model, which extends integrity guarantees to include workflow validity and computational consistency." 
}, { "cell_type": "code", diff --git a/book/30-design/012-integrity.md b/book/20-concepts/04-integrity.md similarity index 86% rename from book/30-design/012-integrity.md rename to book/20-concepts/04-integrity.md index f369bd3..c488321 100644 --- a/book/30-design/012-integrity.md +++ b/book/20-concepts/04-integrity.md @@ -63,7 +63,7 @@ Domain integrity restricts attribute values to predefined valid sets using: **Example:** Recording temperature must be between 20-25°C. -**Covered in:** [Tables](015-table.ipynb) — Data type specification +**Covered in:** [Tables](../30-design/015-table.ipynb) — Data type specification --- @@ -76,9 +76,9 @@ Completeness prevents missing values that could invalidate analyses: **Example:** Every experiment must have a start date. -**Covered in:** -- [Tables](015-table.ipynb) — Required vs. optional attributes -- [Default Values](017-default-values.ipynb) — Handling optional data +**Covered in:** +- [Tables](../30-design/015-table.ipynb) — Required vs. optional attributes +- [Default Values](../30-design/017-default-values.ipynb) — Handling optional data --- @@ -95,7 +95,7 @@ Entity integrity ensures a **one-to-one correspondence** between real-world enti **Example:** Each mouse in the lab has exactly one unique ID, and that ID refers to exactly one mouse—never two different mice sharing the same ID, and never one mouse having multiple IDs. **Covered in:** -- [Primary Keys](020-primary-key.md) — Entity integrity and the 1:1 correspondence guarantee (elaborated in detail) +- [Primary Keys](../30-design/020-primary-key.md) — Entity integrity and the 1:1 correspondence guarantee (elaborated in detail) - [UUID](../85-special-topics/025-uuid.ipynb) — Universally unique identifiers --- @@ -111,8 +111,8 @@ Referential integrity maintains logical associations across tables: **Example:** A recording session cannot reference a non-existent mouse. **Covered in:** -- [Foreign Keys](030-foreign-keys.ipynb) — Cross-table relationships -- [Relationships](050-relationships.ipynb) — Dependency patterns +- [Foreign Keys](../30-design/030-foreign-keys.ipynb) — Cross-table relationships +- [Relationships](../30-design/050-relationships.ipynb) — Dependency patterns --- @@ -127,7 +127,7 @@ Compositional integrity ensures multi-part entities are never partially stored: **Example:** An imaging session's metadata and all acquired frames are stored together or not at all. **Covered in:** -- [Master-Part Relationships](053-master-part.ipynb) — Hierarchical compositions +- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Hierarchical compositions - [Transactions](../40-operations/040-transactions.ipynb) — Atomic operations --- @@ -161,7 +161,7 @@ Workflow integrity maintains valid operation sequences through: **Example:** An analysis pipeline cannot compute results before acquiring raw data. If `NeuronAnalysis` depends on `SpikeData`, which depends on `RecordingSession`, the database enforces that recordings are created before spike detection, which occurs before analysis—maintaining the integrity of the entire scientific workflow. 
**Covered in:** -- [Foreign Keys](030-foreign-keys.ipynb) — How foreign keys encode workflow dependencies +- [Foreign Keys](../30-design/030-foreign-keys.ipynb) — How foreign keys encode workflow dependencies - [Computation](../60-computation/010-computation.ipynb) — Automatic workflow execution and dependency resolution --- @@ -207,11 +207,13 @@ As you progress through the following chapters, you'll see how DataJoint impleme ```{admonition} Next Steps :class: tip -Now that you understand *why* integrity matters, the following chapters show *how* to implement each constraint type: +Now that you understand *why* integrity matters, the next chapter introduces how DataJoint's **Relational Workflow Model** builds on these integrity concepts to create computational databases where workflows are first-class citizens. + +The [Design](../30-design/010-schema.ipynb) section then shows *how* to implement each constraint type: -1. **[Tables](015-table.ipynb)** — Basic structure with domain integrity -2. **[Primary Keys](020-primary-key.md)** — Entity integrity through unique identification -3. **[Foreign Keys](030-foreign-keys.ipynb)** — Referential integrity across tables +1. **[Tables](../30-design/015-table.ipynb)** — Basic structure with domain integrity +2. **[Primary Keys](../30-design/020-primary-key.md)** — Entity integrity through unique identification +3. **[Foreign Keys](../30-design/030-foreign-keys.ipynb)** — Referential integrity across tables Each chapter builds on these foundational integrity concepts. ``` diff --git a/book/20-concepts/04-workflows.md b/book/20-concepts/05-workflows.md similarity index 99% rename from book/20-concepts/04-workflows.md rename to book/20-concepts/05-workflows.md index 2bc22cb..fc831ba 100644 --- a/book/20-concepts/04-workflows.md +++ b/book/20-concepts/05-workflows.md @@ -2,7 +2,7 @@ ## From Storage to Workflow -The previous chapters established that traditional relational databases excel at storing and querying data but struggle with the computational workflows central to scientific research. The practical chapter demonstrated these limitations firsthand: you can store inputs and outputs, but the database doesn't understand that outputs were *computed from* inputs, doesn't automatically recompute when inputs change, and doesn't track provenance. +The previous chapters established that traditional relational databases excel at storing and querying data but struggle with the computational workflows central to scientific research. The practical chapter demonstrated these limitations firsthand: you can store inputs and outputs, but the database doesn't understand that outputs were *computed from* inputs, doesn't automatically recompute when inputs change, and doesn't track provenance. The [Data Integrity](04-integrity.md) chapter introduced the seven types of integrity constraints that databases enforce, culminating in **workflow integrity**—the guarantee that operations execute in valid sequences. **DataJoint solves these problems by treating your database schema as an executable workflow specification.** Your table definitions don't just describe data structure—they prescribe how data flows through your pipeline, when computations run, and how results depend on inputs. 
diff --git a/book/20-concepts/05-pipelines.md b/book/20-concepts/06-pipelines.md similarity index 100% rename from book/20-concepts/05-pipelines.md rename to book/20-concepts/06-pipelines.md diff --git a/book/30-design/015-table.ipynb b/book/30-design/015-table.ipynb index 553c5a6..a583e7a 100644 --- a/book/30-design/015-table.ipynb +++ b/book/30-design/015-table.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "---\ntitle: Create Tables\nauthors:\n - name: Dimitri Yatsenko\ndate: 2025-10-31\n---\n\n# Tables: The Foundation of Data Integrity\n\nIn the [previous chapter](012-integrity.md), we learned that relational databases excel at enforcing **data integrity constraints**. Tables are where these constraints come to life through:\n\n1. **Domain Integrity** — Data types restrict values to valid ranges\n2. **Completeness** — Required vs. optional attributes ensure necessary data is present\n3. **Entity Integrity** — Primary keys uniquely identify each record (covered in [Primary Keys](020-primary-key.md))\n4. **Referential Integrity** — Foreign keys enforce relationships (covered in [Foreign Keys](030-foreign-keys.ipynb))\n\nThis chapter shows how to declare tables in DataJoint with proper data types and attribute specifications that enforce these constraints automatically.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Declare tables with the `@schema` decorator\n- Specify attributes with appropriate data types\n- Distinguish between primary key and dependent attributes\n- Insert, view, and delete data\n- Understand how table structure enforces integrity\n```" + "source": "---\ntitle: Create Tables\nauthors:\n - name: Dimitri Yatsenko\ndate: 2025-10-31\n---\n\n# Tables: The Foundation of Data Integrity\n\nIn the [Data Integrity](../20-concepts/04-integrity.md) chapter, we learned that relational databases excel at enforcing **data integrity constraints**. Tables are where these constraints come to life through:\n\n1. **Domain Integrity** — Data types restrict values to valid ranges\n2. **Completeness** — Required vs. optional attributes ensure necessary data is present\n3. **Entity Integrity** — Primary keys uniquely identify each record (covered in [Primary Keys](020-primary-key.md))\n4. **Referential Integrity** — Foreign keys enforce relationships (covered in [Foreign Keys](030-foreign-keys.ipynb))\n\nThis chapter shows how to declare tables in DataJoint with proper data types and attribute specifications that enforce these constraints automatically.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Declare tables with the `@schema` decorator\n- Specify attributes with appropriate data types\n- Distinguish between primary key and dependent attributes\n- Insert, view, and delete data\n- Understand how table structure enforces integrity\n```" }, { "cell_type": "markdown", diff --git a/book/30-design/030-foreign-keys.ipynb b/book/30-design/030-foreign-keys.ipynb index bc634b7..a5d58e0 100644 --- a/book/30-design/030-foreign-keys.ipynb +++ b/book/30-design/030-foreign-keys.ipynb @@ -10,30 +10,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Modeling Relationships with Foreign Keys\n", - "\n", - "While **entity integrity** ensures that each record uniquely represents a real-world entity, **referential integrity** ensures that the *relationships between* these entities are valid and consistent. 
It's a guarantee that you won't have an employee assigned to a non-existent department or a task associated with a deleted project.\n", - "\n", - "Crucially, **referential integrity is impossible without entity integrity**. You must first have a reliable way to identify unique entities before you can define their relationships.\n", - "\n", - "In relational databases, these relationships are established and enforced using **foreign keys**. A foreign key creates a link between a **child table** (the one with the reference) and a **parent table** (the one being referenced). Think of `Employee` as the child and `Title` as the parent; an employee must have a valid, existing title.\n", - "\n", - "A foreign key is a column (or set of columns) in the child table that refers to the primary key of the parent table. In DataJoint, a foreign key *always* references a parent's primary key, which is a highly recommended practice for clarity and consistency.\n", - "\n", - "## Referential Integrity + Workflow Dependencies\n", - "\n", - "In DataJoint, foreign keys serve a **dual role** that extends beyond traditional relational databases:\n", - "\n", - "1. **Referential integrity** (like traditional databases): Ensures that child references must exist in the parent table\n", - "2. **Workflow dependencies** (DataJoint's addition): Prescribes the order of operations—the parent must be created before the child\n", - "\n", - "This transforms the schema into a **directed acyclic graph (DAG)** representing valid workflow execution sequences. The foreign key `-> Title` in `Employee` not only ensures that each employee has a valid title, but also establishes that titles must be created before employees can be assigned to them.\n", - "\n", - "For more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/04-workflows.md).\n", - "\n", - "In the following example, we define the parent table `Title` and the child table `Employee`, which references `Title`." - ] + "source": "# Modeling Relationships with Foreign Keys\n\nWhile **entity integrity** ensures that each record uniquely represents a real-world entity, **referential integrity** ensures that the *relationships between* these entities are valid and consistent. It's a guarantee that you won't have an employee assigned to a non-existent department or a task associated with a deleted project.\n\nCrucially, **referential integrity is impossible without entity integrity**. You must first have a reliable way to identify unique entities before you can define their relationships.\n\nIn relational databases, these relationships are established and enforced using **foreign keys**. A foreign key creates a link between a **child table** (the one with the reference) and a **parent table** (the one being referenced). Think of `Employee` as the child and `Title` as the parent; an employee must have a valid, existing title.\n\nA foreign key is a column (or set of columns) in the child table that refers to the primary key of the parent table. In DataJoint, a foreign key *always* references a parent's primary key, which is a highly recommended practice for clarity and consistency.\n\n## Referential Integrity + Workflow Dependencies\n\nIn DataJoint, foreign keys serve a **dual role** that extends beyond traditional relational databases:\n\n1. **Referential integrity** (like traditional databases): Ensures that child references must exist in the parent table\n2. 
**Workflow dependencies** (DataJoint's addition): Prescribes the order of operations—the parent must be created before the child\n\nThis transforms the schema into a **directed acyclic graph (DAG)** representing valid workflow execution sequences. The foreign key `-> Title` in `Employee` not only ensures that each employee has a valid title, but also establishes that titles must be created before employees can be assigned to them.\n\nFor more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/05-workflows.md).\n\nIn the following example, we define the parent table `Title` and the child table `Employee`, which references `Title`." }, { "cell_type": "code", @@ -506,21 +483,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Summary\n", - "\n", - "Foreign keys ensure referential integrity by linking a child table to a parent table. In DataJoint, they also establish **workflow dependencies** that prescribe the order of operations. This link imposes five key effects:\n", - "\n", - "1. **Schema Embedding**: The parent's primary key is added as columns to the child table.\n", - "2. **Insert Restriction**: A row cannot be added to the **child** if its foreign key doesn't match a primary key in the **parent**. In DataJoint, this enforces workflow order—the parent must be created before the child.\n", - "3. **Delete Restriction**: A row cannot be deleted from the **parent** if it is still referenced by any rows in the **child**. In DataJoint, cascading deletes maintain workflow consistency by removing dependent downstream artifacts.\n", - "4. **Update Restriction**: Updates to the primary and foreign keys are restricted to prevent inconsistencies. In DataJoint, this preserves workflow immutability—workflow artifacts must be re-executed rather than updated.\n", - "5. **Performance Optimization**: An index is automatically created on the foreign key in the child table to speed up searches and joins.\n", - "\n", - "**In DataJoint, foreign keys transform the schema into a directed acyclic graph (DAG)** that represents valid workflow execution sequences. The schema becomes an executable specification of your workflow, where foreign keys not only enforce referential integrity but also prescribe the order of operations and maintain computational validity throughout the workflow.\n", - "\n", - "For more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/04-workflows.md)." - ] + "source": "# Summary\n\nForeign keys ensure referential integrity by linking a child table to a parent table. In DataJoint, they also establish **workflow dependencies** that prescribe the order of operations. This link imposes five key effects:\n\n1. **Schema Embedding**: The parent's primary key is added as columns to the child table.\n2. **Insert Restriction**: A row cannot be added to the **child** if its foreign key doesn't match a primary key in the **parent**. In DataJoint, this enforces workflow order—the parent must be created before the child.\n3. **Delete Restriction**: A row cannot be deleted from the **parent** if it is still referenced by any rows in the **child**. In DataJoint, cascading deletes maintain workflow consistency by removing dependent downstream artifacts.\n4. **Update Restriction**: Updates to the primary and foreign keys are restricted to prevent inconsistencies. In DataJoint, this preserves workflow immutability—workflow artifacts must be re-executed rather than updated.\n5. 
**Performance Optimization**: An index is automatically created on the foreign key in the child table to speed up searches and joins.\n\n**In DataJoint, foreign keys transform the schema into a directed acyclic graph (DAG)** that represents valid workflow execution sequences. The schema becomes an executable specification of your workflow, where foreign keys not only enforce referential integrity but also prescribe the order of operations and maintain computational validity throughout the workflow.\n\nFor more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/05-workflows.md)." } ], "metadata": { @@ -545,4 +508,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/30-design/050-relationships.ipynb b/book/30-design/050-relationships.ipynb index a969829..ecc0119 100644 --- a/book/30-design/050-relationships.ipynb +++ b/book/30-design/050-relationships.ipynb @@ -3,46 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Modeling Relationships\n", - "\n", - "In this chapter, we'll explore how to build complex relationships between entities using a combination of uniqueness constraints and referential constraints. Understanding these patterns is essential for designing schemas that accurately represent business rules and data dependencies.\n", - "\n", - "## Uniqueness Constraints\n", - "\n", - "Uniqueness constraints are typically set through primary keys, but tables can also have additional unique indexes beyond the primary key. These constraints ensure that specific combinations of attributes remain unique across all rows in a table.\n", - "\n", - "## Referential Constraints\n", - "\n", - "Referential constraints establish relationships between tables and are enforced by [foreign keys](030-foreign-keys.ipynb). They ensure that references between tables remain valid and prevent orphaned records.\n", - "\n", - "In DataJoint, foreign keys also participate in the **relational workflow model** introduced earlier: each dependency not only enforces referential integrity but also prescribes the order of operations in a workflow. When table `B` references table `A`, `A` must be populated before `B`, and deleting from `A` cascades through all dependent workflow steps. The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order (see [Relational Workflows](../20-concepts/04-workflows.md)).\n", - "\n", - "## Foreign Keys Establish 1:N or 1:1 Relationships\n", - "\n", - "When a child table defines a foreign key constraint to a parent table, it creates a relationship between the entities in the parent and child tables. 
The cardinality of this relationship is always **1 on the parent side**: each entry in the child table must refer to a single entity in the parent table.\n", - "\n", - "On the child side, the relationship can have different cardinalities:\n", - "\n", - "- **0–1 (optional one-to-one)**: if the foreign key field in the child table has a unique constraint\n", - "- **1 (mandatory one-to-one)**: if the foreign key is the entire primary key of the child table\n", - "- **N (one-to-many)**: if no uniqueness constraint is applied to the foreign key field\n", - "\n", - "## What We'll Cover\n", - "\n", - "This chapter explores these key relationship patterns:\n", - "\n", - "* **One-to-Many Relationships**: The most common pattern, using foreign keys in secondary attributes\n", - "* **One-to-One Relationships**: Using primary key foreign keys or unique constraints\n", - "* **Many-to-Many Relationships**: Using association tables with composite primary keys\n", - "* **Sequences**: Cascading one-to-one relationships for workflows\n", - "* **Hierarchies**: Cascading one-to-many relationships for nested data structures\n", - "* **Parameterization**: Association tables where the association itself is the primary entity\n", - "* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n", - "* **Complex Constraints**: Using nullable enums with unique indexes for special requirements\n", - "\n", - "Let's begin by illustrating these possibilities with examples of bank customers and their accounts." - ] + "source": "# Modeling Relationships\n\nIn this chapter, we'll explore how to build complex relationships between entities using a combination of uniqueness constraints and referential constraints. Understanding these patterns is essential for designing schemas that accurately represent business rules and data dependencies.\n\n## Uniqueness Constraints\n\nUniqueness constraints are typically set through primary keys, but tables can also have additional unique indexes beyond the primary key. These constraints ensure that specific combinations of attributes remain unique across all rows in a table.\n\n## Referential Constraints\n\nReferential constraints establish relationships between tables and are enforced by [foreign keys](030-foreign-keys.ipynb). They ensure that references between tables remain valid and prevent orphaned records.\n\nIn DataJoint, foreign keys also participate in the **relational workflow model** introduced earlier: each dependency not only enforces referential integrity but also prescribes the order of operations in a workflow. When table `B` references table `A`, `A` must be populated before `B`, and deleting from `A` cascades through all dependent workflow steps. The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order (see [Relational Workflows](../20-concepts/05-workflows.md)).\n\n## Foreign Keys Establish 1:N or 1:1 Relationships\n\nWhen a child table defines a foreign key constraint to a parent table, it creates a relationship between the entities in the parent and child tables. 
The cardinality of this relationship is always **1 on the parent side**: each entry in the child table must refer to a single entity in the parent table.\n\nOn the child side, the relationship can have different cardinalities:\n\n- **0–1 (optional one-to-one)**: if the foreign key field in the child table has a unique constraint\n- **1 (mandatory one-to-one)**: if the foreign key is the entire primary key of the child table\n- **N (one-to-many)**: if no uniqueness constraint is applied to the foreign key field\n\n## What We'll Cover\n\nThis chapter explores these key relationship patterns:\n\n* **One-to-Many Relationships**: The most common pattern, using foreign keys in secondary attributes\n* **One-to-One Relationships**: Using primary key foreign keys or unique constraints\n* **Many-to-Many Relationships**: Using association tables with composite primary keys\n* **Sequences**: Cascading one-to-one relationships for workflows\n* **Hierarchies**: Cascading one-to-many relationships for nested data structures\n* **Parameterization**: Association tables where the association itself is the primary entity\n* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n* **Complex Constraints**: Using nullable enums with unique indexes for special requirements\n\nLet's begin by illustrating these possibilities with examples of bank customers and their accounts." }, { "cell_type": "code", @@ -1252,15 +1213,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Sequences\n", - "\n", - "As discussed in the [Relational Workflows](../20-concepts/04-workflows.md) chapter, DataJoint schemas are directional: dependencies form a *directed-acyclic graph* (DAG) representing sequences of steps or operations.\n", - "The diagrams are plotted with all the dependencies pointing in the same direction (top-to-bottom or left-to-right), so that a schema diagram can be understood as an operational workflow.\n", - "\n", - "Let's model a simple sequence of operations such as placing an order, shipping, and delivery.\n", - "The three entities: `Order`, `Shipment`, and `Delivery` form a sequence of one-to-one relationships:" - ] + "source": "## Sequences\n\nAs discussed in the [Relational Workflows](../20-concepts/05-workflows.md) chapter, DataJoint schemas are directional: dependencies form a *directed-acyclic graph* (DAG) representing sequences of steps or operations.\nThe diagrams are plotted with all the dependencies pointing in the same direction (top-to-bottom or left-to-right), so that a schema diagram can be understood as an operational workflow.\n\nLet's model a simple sequence of operations such as placing an order, shipping, and delivery.\nThe three entities: `Order`, `Shipment`, and `Delivery` form a sequence of one-to-one relationships:" }, { "cell_type": "code", diff --git a/book/30-design/053-master-part.ipynb b/book/30-design/053-master-part.ipynb index de4d56e..070023d 100644 --- a/book/30-design/053-master-part.ipynb +++ b/book/30-design/053-master-part.ipynb @@ -147,7 +147,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Workflow semantics\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). When a master is populated, DataJoint executes the master's `make()` method inside an atomic transaction. This guarantees that all parts belonging to the master are inserted (or updated) together with the master row. 
If any part fails to insert, the entire transaction is rolled back, preserving consistency.\n\nThis workflow semantics offers two key benefits:\n\n- **Atomicity** – From the points of view of other agents accessing the database, every master row is all-or-nothing. Either the master and all its parts appear together, or none of them do. Partially populated entries are only visible from inside transactions that create them. Downstream tables can safely depend on the master row, knowing that its parts are already in place.\n- **Isolation** – Any transaction that populates a master and its parts operates on a consistent isolated state of the database. Any changes that ocurr during this transaction will not affect computations of the individual parts. In DataJoint, this is typically already the case if users eschew updates and computations from data outside direct upstream dependencies. \n\nBecause master/part tables are created together, they naturally fit into the [relational workflow model](../20-concepts/04-workflows.md). The master participates in the DAG as a normal node, while the parts are executed inside the same workflow step.\n\n:::{seealso}\nFor a complete computational example demonstrating master-part relationships in an image analysis pipeline, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) example, where `Detection` (master) and `Detection.Blob` (part) capture aggregate blob counts and per-feature coordinates atomically.\n:::" + "source": "## Workflow semantics\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). When a master is populated, DataJoint executes the master's `make()` method inside an atomic transaction. This guarantees that all parts belonging to the master are inserted (or updated) together with the master row. If any part fails to insert, the entire transaction is rolled back, preserving consistency.\n\nThis workflow semantics offers two key benefits:\n\n- **Atomicity** – From the points of view of other agents accessing the database, every master row is all-or-nothing. Either the master and all its parts appear together, or none of them do. Partially populated entries are only visible from inside transactions that create them. Downstream tables can safely depend on the master row, knowing that its parts are already in place.\n- **Isolation** – Any transaction that populates a master and its parts operates on a consistent isolated state of the database. Any changes that ocurr during this transaction will not affect computations of the individual parts. In DataJoint, this is typically already the case if users eschew updates and computations from data outside direct upstream dependencies. \n\nBecause master/part tables are created together, they naturally fit into the [relational workflow model](../20-concepts/05-workflows.md). 
The master participates in the DAG as a normal node, while the parts are executed inside the same workflow step.\n\n:::{seealso}\nFor a complete computational example demonstrating master-part relationships in an image analysis pipeline, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) example, where `Detection` (master) and `Detection.Blob` (part) capture aggregate blob counts and per-feature coordinates atomically.\n:::" }, { "cell_type": "markdown", diff --git a/book/60-computation/010-computation.ipynb b/book/60-computation/010-computation.ipynb index aba777f..86b2fdb 100644 --- a/book/60-computation/010-computation.ipynb +++ b/book/60-computation/010-computation.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/04-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing. The [Blob-detection Pipeline](../80-examples/075-blob-detection.ipynb) from the examples chapter demonstrates how this plays out in practice.\n\n## Case Study: Blob Detection\n\nThe notebook `075-blob-detection.ipynb` assembles a compact image-analysis workflow:\n\n1. **Store source imagery** – `Image` is a manual table with a `longblob` field. NumPy arrays fetched from `skimage` are serialized automatically, illustrating that binary payloads need a serializer when stored in a relational database.\n2. 
**Scan parameter space** – `BlobParamSet` is a lookup table of min/max sigma and threshold values for `skimage.feature.blob_doh`. Each combination represents an alternative experiment configuration.\n3. **Compute detections** – `Detection` depends on both upstream tables. Its part table `Detection.Blob` holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n blobs = blob_doh(img,\n min_sigma=params['min_sigma'],\n max_sigma=params['max_sigma'],\n threshold=params['threshold'])\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs))\n```\n\nRunning `Detection.populate(display_progress=True)` fans out over every `(image, paramset)` pair, creating six jobs in the demo notebook. Because each job lives in an atomic transaction, half-written results never leak—this is the **isolation** guarantee that maintains workflow integrity.\n\n## Curate the Preferred Result\n\nAfter inspecting the plots, a small manual table `SelectDetection` records the \"best\" parameter set for each image. That drives a final visualization that renders only the chosen detections. This illustrates a common pattern: let automation explore the combinatorics, then capture human judgment in a concise manual table.\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n- **Reproducibility** — Rerunning `populate()` regenerates every derived table from raw inputs, providing a clear path from primary data to results\n- **Dependency-aware scheduling** — You do not need to script job order; DataJoint infers it from foreign keys (the DAG structure)\n- **Computational validity** — Foreign key constraints combined with immutable workflow artifacts ensure downstream results remain consistent with upstream inputs\n- **Provenance tracking** — The schema itself documents what was computed from what\n\n## Practical Tips\n\n- Develop `make()` logic with restrictions (e.g., `Detection.populate(key)`) before unlocking the entire pipeline\n- Use `display_progress=True` when you need visibility; use `reserve_jobs=True` when distributing work across multiple machines\n- If your computed table writes both summary and detail rows, keep them in a part table so the transaction boundary protects them together (see [Master-Part Relationships](../30-design/053-master-part.ipynb))\n\nThe blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready for scientific workflows." + "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. 
**Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing. The [Blob-detection Pipeline](../80-examples/075-blob-detection.ipynb) from the examples chapter demonstrates how this plays out in practice.\n\n## Case Study: Blob Detection\n\nThe notebook `075-blob-detection.ipynb` assembles a compact image-analysis workflow:\n\n1. **Store source imagery** – `Image` is a manual table with a `longblob` field. NumPy arrays fetched from `skimage` are serialized automatically, illustrating that binary payloads need a serializer when stored in a relational database.\n2. **Scan parameter space** – `BlobParamSet` is a lookup table of min/max sigma and threshold values for `skimage.feature.blob_doh`. Each combination represents an alternative experiment configuration.\n3. **Compute detections** – `Detection` depends on both upstream tables. Its part table `Detection.Blob` holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n blobs = blob_doh(img,\n min_sigma=params['min_sigma'],\n max_sigma=params['max_sigma'],\n threshold=params['threshold'])\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs))\n```\n\nRunning `Detection.populate(display_progress=True)` fans out over every `(image, paramset)` pair, creating six jobs in the demo notebook. 
Because each job lives in an atomic transaction, half-written results never leak—this is the **isolation** guarantee that maintains workflow integrity.\n\n## Curate the Preferred Result\n\nAfter inspecting the plots, a small manual table `SelectDetection` records the \"best\" parameter set for each image. That drives a final visualization that renders only the chosen detections. This illustrates a common pattern: let automation explore the combinatorics, then capture human judgment in a concise manual table.\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n- **Reproducibility** — Rerunning `populate()` regenerates every derived table from raw inputs, providing a clear path from primary data to results\n- **Dependency-aware scheduling** — You do not need to script job order; DataJoint infers it from foreign keys (the DAG structure)\n- **Computational validity** — Foreign key constraints combined with immutable workflow artifacts ensure downstream results remain consistent with upstream inputs\n- **Provenance tracking** — The schema itself documents what was computed from what\n\n## Practical Tips\n\n- Develop `make()` logic with restrictions (e.g., `Detection.populate(key)`) before unlocking the entire pipeline\n- Use `display_progress=True` when you need visibility; use `reserve_jobs=True` when distributing work across multiple machines\n- If your computed table writes both summary and detail rows, keep them in a part table so the transaction boundary protects them together (see [Master-Part Relationships](../30-design/053-master-part.ipynb))\n\nThe blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready for scientific workflows." }, { "cell_type": "markdown", diff --git a/book/80-examples/010-classic-sales.ipynb b/book/80-examples/010-classic-sales.ipynb index a1b4849..a657967 100644 --- a/book/80-examples/010-classic-sales.ipynb +++ b/book/80-examples/010-classic-sales.ipynb @@ -14,7 +14,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Background\n\nThe Classic Models dataset is a well-known sample schema that ships with many SQL tutorials. It captures the operations of a miniature wholesaler: offices, employees, customers, orders, order line items, product lines, and payments. Because the design is already normalized into clean entity sets, it is a convenient playground for illustrating DataJoint's relational workflow concepts.\n\nIn this notebook we:\n\n- **Load the canonical schema** exactly as published by the MySQL team so you can compare the DataJoint rendition with the original SQL definitions.\n- **Highlight the workflow perspective**: foreign keys organize the tables into a directed acyclic graph (customers → orders → payments, product lines → products → order details, etc.). This makes it easy to trace the sequence of business operations.\n- **Demonstrate interoperability**: we ingest the SQL dump with `%sql`, then use `schema.spawn_missing_classes()` to materialize DataJoint table classes directly from the existing relational structure.\n\nKeep the [Relational Workflows](../20-concepts/04-workflows.md) and [Modeling Relationships](../30-design/050-relationships.ipynb) chapters in mind as you work through this example; you will see the same principles—normalized entity tables, association tables, and workflow-directed foreign keys—applied to a realistic business domain." 
+ "source": "## Background\n\nThe Classic Models dataset is a well-known sample schema that ships with many SQL tutorials. It captures the operations of a miniature wholesaler: offices, employees, customers, orders, order line items, product lines, and payments. Because the design is already normalized into clean entity sets, it is a convenient playground for illustrating DataJoint's relational workflow concepts.\n\nIn this notebook we:\n\n- **Load the canonical schema** exactly as published by the MySQL team so you can compare the DataJoint rendition with the original SQL definitions.\n- **Highlight the workflow perspective**: foreign keys organize the tables into a directed acyclic graph (customers → orders → payments, product lines → products → order details, etc.). This makes it easy to trace the sequence of business operations.\n- **Demonstrate interoperability**: we ingest the SQL dump with `%sql`, then use `schema.spawn_missing_classes()` to materialize DataJoint table classes directly from the existing relational structure.\n\nKeep the [Relational Workflows](../20-concepts/05-workflows.md) and [Modeling Relationships](../30-design/050-relationships.ipynb) chapters in mind as you work through this example; you will see the same principles—normalized entity tables, association tables, and workflow-directed foreign keys—applied to a realistic business domain." }, { "cell_type": "markdown", From 184f608f05aff0a38b594aad549cd92f54fa842a Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 20:46:24 +0000 Subject: [PATCH 07/18] Expand Master-Part Relationship chapter with ACID transactions and examples - Add detailed ACID transaction semantics for make calls (atomicity, consistency, isolation, durability) - Include Blob Detection example showing master's responsibility to populate all parts - Add section explaining how dependency on master implies dependency on all parts - Include SelectDetection example demonstrating downstream table access to parts - Improve practical guidelines with when to use and best practices - Expand summary with key principles --- book/30-design/053-master-part.ipynb | 50 ++++++---------------------- 1 file changed, 11 insertions(+), 39 deletions(-) diff --git a/book/30-design/053-master-part.ipynb b/book/30-design/053-master-part.ipynb index 070023d..c13ec90 100644 --- a/book/30-design/053-master-part.ipynb +++ b/book/30-design/053-master-part.ipynb @@ -10,18 +10,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "Master-part relationships express the idea that a given entity (*master*) may include several tightly-coupled component entities (*parts*) spread across multiple tables. \n", - "This notion is also described as *compositional integrity*.\n", - "\n", - "In DataJoint's relational workflow philosophy, a master part relationship expresses the notion of populated multiple related data artifacts in a single workflow step.\n", - "\n", - "For example, a purchase order may include several items that should be treated as indivisible components of the purchase order.\n", - "Another example is a measurement from several channels: all must be recorded jointly before any downstream processing can begin.\n", - "\n", - "When inserting or deleting a master entity with all its parts, the database client must do so as a single all-or-nothing (*atomic*) transaction so that the master entity always appears with all its parts.\n", - "Creating the master with any of its parts missing would constitute violation of compositional integrity." 
- ] + "source": "Master-part relationships express the idea that a given entity (*master*) may include several tightly-coupled component entities (*parts*) spread across multiple tables. \nThis notion is also described as *compositional integrity*.\n\nIn DataJoint's relational workflow philosophy, a master-part relationship expresses the notion of populating multiple related data artifacts in a single workflow step.\n\nFor example, a purchase order may include several items that should be treated as indivisible components of the purchase order.\nAnother example is a measurement from several channels: all must be recorded jointly before any downstream processing can begin.\n\nWhen inserting or deleting a master entity with all its parts, the database client must do so as a single all-or-nothing (*atomic*) transaction so that the master entity always appears with all its parts.\nCreating the master with any of its parts missing would constitute a violation of compositional integrity." }, { "cell_type": "markdown", @@ -125,49 +114,32 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "As seen in this example, DataJoint provides special syntax for defining master-part relationships:\n", - "\n", - "1. **Master tables are declared normally** The master entity is declared as any regular table by subclassing `dj.Manual`/`dj.Lookup`/`dj.Imported`/`dj.Computed`. Thus a table becomes a master table by virtue of having part tables.\n", - "\n", - "2. **Nested class definition** – Parts are declared as a nested class inside its master class, subclassing `dj.Part`. Thus the part tables are referred to by their full class name such as `Polygon.Vertex`. Their classes do not need to be wrapped the the `@schema` decorator: the decorated of the master class is responsible for declaring all of its parts. \n", - "\n", - "3. **Foreign key from part to master** The part tables declare a foreign key to its master directly or transitively through other parts. Inside the namespace of the master class a special object named `master` can be used to reference to the master table. Thus the definition of the `Vertex` table can declare the foreign key `-> master` as an equivalent alias to `-> Polygon` --- either will form a valid foreign key. \n", - "\n", - "4. **Diagram notation** – In schema diagrams, part tables are rendered without colored blocks. They appear as labels attached to the master node, emphasizing that they do not stand on their own. The absence of color also highlights that other tables rarely reference parts directly; the master represents the entity identity.\n", - "\n", - "5. **Workflow semantics**" - ] + "source": "As seen in this example, DataJoint provides special syntax for defining master-part relationships:\n\n1. **Master tables are declared normally** – The master entity is declared as any regular table by subclassing `dj.Manual`/`dj.Lookup`/`dj.Imported`/`dj.Computed`. Thus a table becomes a master table by virtue of having part tables.\n\n2. **Nested class definition** – Parts are declared as a nested class inside its master class, subclassing `dj.Part`. Thus the part tables are referred to by their full class name such as `Polygon.Vertex`. Their classes do not need to be wrapped with the `@schema` decorator: the decorator of the master class is responsible for declaring all of its parts. \n\n3. **Foreign key from part to master** – The part tables declare a foreign key to its master directly or transitively through other parts. 
Inside the namespace of the master class, a special object named `master` can be used to reference the master table. Thus the definition of the `Vertex` table can declare the foreign key `-> master` as an equivalent alias to `-> Polygon`—either will form a valid foreign key. \n\n4. **Diagram notation** – In schema diagrams, part tables are rendered without colored blocks. They appear as labels attached to the master node, emphasizing that they do not stand on their own. The absence of color also highlights that other tables rarely reference parts directly; the master represents the entity identity.\n\n5. **Workflow semantics** – For computed and imported tables, the master's `make()` method is responsible for inserting both the master row and all its parts within a single ACID transaction. This ensures compositional integrity is maintained automatically." }, { "cell_type": "markdown", "metadata": {}, - "source": "## Master-Part Semantics\n\nThe Master-Part relationship indicates to all client applications that inserts into the master and its parts must be done inside a dedicated transactions.\n\nTransactions cannot be nested and neither can master-part relationships. \nA part table cannot be a master table in another relationship.\n\nParts can only have one master. However, a master table can have \n\nAll parts must declare a foreign key to the master, although they can do so transitively through other parts.\n\nA foreign key made by a downstream table to the master signifies a dependency on the entire collection of all its parts. Therefore, deleting a master entry will naturally cascade to all its parts.\n\nParts cannot be deleted without deleting their master. Direct deletes of the parts are prohibited. Deleting \n\nAt insert time, DataJoint does not enforce the master-part semantics for manual and lookup tables. The master-part notation only signals to the client applications that they must use transactions when inserting records into masters and their parts.\n\n\n## Master-Part in Computations\n\nFor autopopulated tables, the `populate` mechanism properly handles transactions: each master `make` method is called within an atomic isolated transaction and it populates both the master and the parts. Any error ocurring inside the make method, rolls back the master and the parts already populated within that `make` call." + "source": "## Master-Part Semantics\n\nThe Master-Part relationship indicates to all client applications that inserts into the master and its parts must be done inside a dedicated transaction.\n\n**Structural rules:**\n\n- Transactions cannot be nested and neither can master-part relationships. A part table cannot be a master table in another relationship.\n- Parts can only have one master. However, a master table can have multiple part tables.\n- All parts must declare a foreign key to the master, although they can do so transitively through other parts.\n\n**Dependency semantics:**\n\nA foreign key made by a downstream table to the master signifies a dependency on the entire collection of all its parts. \nThis is a crucial property: when a table depends on a master, it implicitly depends on all the master's parts as well.\nThe downstream table can safely assume that whenever the master entry exists, all its associated parts are present and complete.\n\n**Deletion behavior:**\n\n- Deleting a master entry naturally cascades to all its parts due to the foreign key constraint.\n- Parts cannot be deleted without deleting their master. 
Direct deletes of the parts are prohibited.\n\n**For manual and lookup tables:**\n\nAt insert time, DataJoint does not enforce the master-part semantics for manual and lookup tables. \nThe master-part notation only signals to the client applications that they must use transactions when inserting records into masters and their parts." }, { "cell_type": "markdown", "metadata": {}, - "source": "## Workflow semantics\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). When a master is populated, DataJoint executes the master's `make()` method inside an atomic transaction. This guarantees that all parts belonging to the master are inserted (or updated) together with the master row. If any part fails to insert, the entire transaction is rolled back, preserving consistency.\n\nThis workflow semantics offers two key benefits:\n\n- **Atomicity** – From the points of view of other agents accessing the database, every master row is all-or-nothing. Either the master and all its parts appear together, or none of them do. Partially populated entries are only visible from inside transactions that create them. Downstream tables can safely depend on the master row, knowing that its parts are already in place.\n- **Isolation** – Any transaction that populates a master and its parts operates on a consistent isolated state of the database. Any changes that ocurr during this transaction will not affect computations of the individual parts. In DataJoint, this is typically already the case if users eschew updates and computations from data outside direct upstream dependencies. \n\nBecause master/part tables are created together, they naturally fit into the [relational workflow model](../20-concepts/05-workflows.md). The master participates in the DAG as a normal node, while the parts are executed inside the same workflow step.\n\n:::{seealso}\nFor a complete computational example demonstrating master-part relationships in an image analysis pipeline, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) example, where `Detection` (master) and `Detection.Blob` (part) capture aggregate blob counts and per-feature coordinates atomically.\n:::" + "source": "## Master-Part in Computed Tables\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). \nThe master is responsible for populating all its parts within a single `make` call.\n\n### ACID Transactions\n\nWhen `populate` is called, DataJoint executes each `make()` method inside an **ACID transaction**:\n\n- **Atomicity** – The entire `make` call is all-or-nothing. Either the master row and all its parts are inserted together, or none of them are. If any error occurs—whether in computing results, inserting the master, or inserting any part—the entire transaction is rolled back. No partial results are ever committed to the database.\n\n- **Consistency** – The transaction moves the database from one valid state to another. The master-part relationship ensures that every master entry has its complete set of parts. Referential integrity constraints are satisfied at commit time.\n\n- **Isolation** – The transaction operates on a consistent snapshot of the database. Other concurrent transactions cannot see the partially inserted data until the transaction commits. This means other processes querying the database will never observe a master without its parts.\n\n- **Durability** – Once the transaction commits successfully, the data is permanently stored. 
Even if the system crashes immediately after, the master and all its parts will be present when the database restarts.\n\n### The Master's Responsibility\n\nThe master's `make` method is responsible for:\n1. Fetching all necessary input data\n2. Performing all computations\n3. Inserting the master row\n4. Inserting all part rows\n\nThis design ensures that the entire computation for one entity is self-contained within a single transactional boundary.\n\n### Example: Blob Detection\n\nConsider the [Blob Detection](../80-examples/075-blob-detection.ipynb) pipeline where `Detection` (master) and `Detection.Blob` (part) work together:\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n # fetch inputs\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n\n # compute results\n blobs = blob_doh(\n img, \n min_sigma=params['min_sigma'], \n max_sigma=params['max_sigma'], \n threshold=params['threshold'])\n\n # insert master and parts (within one transaction)\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(\n (dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs)))\n```\n\nIn this example:\n- The `make` method is called once per `(image_id, blob_paramset)` combination\n- Each call runs inside its own ACID transaction\n- The master row (`Detection`) stores the aggregate blob count\n- The part rows (`Detection.Blob`) store the coordinates of each detected blob\n- If `blob_doh` raises an exception or any insert fails, nothing is committed\n- An image with 200 detected blobs results in 1 master row + 200 part rows, all inserted atomically\n\nThis transactional guarantee means that any downstream table depending on `Detection` can trust that all `Detection.Blob` rows for that detection are present." + }, + { + "cell_type": "markdown", + "source": "## Dependency on Master Implies Dependency on Parts\n\nA key property of master-part relationships is that **a dependency on the master is also a dependency on all its parts**.\n\nWhen a downstream table declares a foreign key to a master table, it can safely assume that all the master's parts are present and complete. \nThis is guaranteed by the transactional semantics: the master and its parts are always inserted together atomically.\n\n### Example: SelectDetection\n\nIn the [Blob Detection](../80-examples/075-blob-detection.ipynb) example, `SelectDetection` allows users to mark the optimal parameter set for each image:\n\n```python\n@schema\nclass SelectDetection(dj.Manual):\n definition = \"\"\"\n -> Image\n ---\n -> Detection\n \"\"\"\n```\n\nThe foreign key `-> Detection` establishes a dependency on the `Detection` master table.\nAlthough `SelectDetection` does not explicitly reference `Detection.Blob`, it **implicitly depends on all blobs** for the referenced detection.\n\nThis has important implications:\n\n1. **Data availability** – When querying `SelectDetection`, you can join with `Detection.Blob` knowing that all blob coordinates for the selected detection exist:\n ```python\n # Get all blobs for selected detections\n Detection.Blob & SelectDetection\n ```\n\n2. **Cascading deletes** – If a `Detection` entry is deleted, its `Detection.Blob` parts are automatically deleted (due to the part's foreign key to master). 
The `SelectDetection` entry referencing that detection is also deleted (due to `SelectDetection`'s foreign key to `Detection`). The entire dependency chain is maintained.\n\n3. **Workflow integrity** – Downstream computed tables can depend on the master and freely access both master attributes and part details. The workflow guarantees that if the master exists, all its computational results (stored in parts) are complete.\n\n### Why This Matters\n\nThis design pattern enables clean separation of concerns:\n\n- The **master row** stores aggregate or summary information (e.g., total blob count)\n- The **part rows** store detailed, per-item information (e.g., coordinates of each blob)\n- **Downstream tables** reference only the master, keeping their definitions simple\n- **Queries** can access part details through joins when needed\n\nThe transactional guarantee ensures this separation never leads to inconsistent states where a master exists without its parts.", + "metadata": {} }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Practical guidelines\n", - "\n", - "- **Use parts for tightly-coupled detail rows**: If the part data never exists without the master, implement it as a nested part rather than a separate table. Examples include waveform channels, spike units per recording, order lines for a purchase order, or parameter sweeps attached to a model fit.\n", - "- **Keep part logic inside `make()`**: Populate the master and insert all parts from within the master’s `make()` method. Do not create separate processes that attempt to fill the part tables independently—the transactional guarantees rely on the nesting.\n", - "- **Diagram awareness**: Remember that part nodes appear without colored blocks in the diagram. Use this visual cue to distinguish between independent entity tables and parts. Diagramming utilities provide the option of hiding all parts for a simplified view.\n" - ] + "source": "## Practical Guidelines\n\n**When to use master-part:**\n\n- **Tightly-coupled detail rows** – If the part data never exists without the master, implement it as a nested part rather than a separate table. Examples include waveform channels, spike units per recording, order lines for a purchase order, detected features in an image, or parameter sweeps attached to a model fit.\n- **Aggregate + detail pattern** – When computations produce both summary statistics (master) and per-item details (parts), master-part is the natural choice.\n- **Atomic multi-row results** – When a single computation produces multiple rows that must appear together or not at all.\n\n**Implementation best practices:**\n\n- **Keep all logic inside `make()`** – Populate the master and insert all parts from within the master's `make()` method. Do not create separate processes that attempt to fill the part tables independently—the transactional guarantees rely on this pattern.\n- **Insert master before parts** – While both are within the same transaction, inserting the master first ensures the foreign key constraint from parts to master is satisfied.\n- **Downstream tables reference the master** – Declare foreign keys to the master table, not to individual parts. 
This keeps definitions clean and leverages the implicit dependency on all parts.\n\n**Diagram awareness:**\n\n- Part nodes appear without colored blocks in the diagram, rendered as labels attached to the master node.\n- Use this visual cue to distinguish between independent entity tables and parts.\n- Diagramming utilities provide the option of hiding all parts for a simplified view." }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "Master-part relationships provide a concise and expressive way to model a master record that owns several subordinate rows. DataJoint’s nested class syntax, simplified foreign key, and transactional populate make this pattern easy to express and safe to use. In the context of relational workflows, masters act as workflow nodes, while parts capture the detailed payload generated in the same workflow step.\n" - ] + "source": "## Summary\n\nMaster-part relationships provide a structured way to model entities that own subordinate detail rows. Key principles:\n\n1. **Compositional integrity** – A master and its parts form an indivisible unit. They are inserted and deleted together, never partially.\n\n2. **ACID transactions** – Each `make()` call runs inside a transaction guaranteeing atomicity, consistency, isolation, and durability. If any step fails, the entire operation is rolled back.\n\n3. **Master's responsibility** – The master's `make()` method is solely responsible for populating itself and all its parts. This keeps the transactional boundary clear and self-contained.\n\n4. **Implicit part dependency** – A foreign key to the master implies a dependency on all its parts. Downstream tables can safely assume that when the master exists, all its parts are present and complete.\n\n5. **Clean separation** – Masters hold aggregate/summary data while parts hold per-item details. 
Downstream tables reference the master; queries join with parts when details are needed.\n\nDataJoint's nested class syntax and transactional populate mechanism make this pattern easy to express and safe to use in relational workflows.\n\n:::{seealso}\nFor a complete working example demonstrating these concepts, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) pipeline, where `Detection` (master) and `Detection.Blob` (part) illustrate atomic population and downstream dependency through `SelectDetection`.\n:::" } ], "metadata": { From 02a012ecb87ba924164621bde41f983997719530 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 20:51:23 +0000 Subject: [PATCH 08/18] Harmonize Computation chapter with Master-Part chapter - Streamline ACID transaction explanation with reference to master-part chapter - Add dedicated section on the populate method with examples - Simplify Blob Detection case study to avoid code duplication - Add cross-references between chapters using seealso blocks - Convert benefits list to table format for clarity - Focus on workflow automation concepts unique to this chapter --- book/60-computation/010-computation.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/60-computation/010-computation.ipynb b/book/60-computation/010-computation.ipynb index 86b2fdb..1191711 100644 --- a/book/60-computation/010-computation.ipynb +++ b/book/60-computation/010-computation.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. 
If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing. The [Blob-detection Pipeline](../80-examples/075-blob-detection.ipynb) from the examples chapter demonstrates how this plays out in practice.\n\n## Case Study: Blob Detection\n\nThe notebook `075-blob-detection.ipynb` assembles a compact image-analysis workflow:\n\n1. **Store source imagery** – `Image` is a manual table with a `longblob` field. NumPy arrays fetched from `skimage` are serialized automatically, illustrating that binary payloads need a serializer when stored in a relational database.\n2. **Scan parameter space** – `BlobParamSet` is a lookup table of min/max sigma and threshold values for `skimage.feature.blob_doh`. Each combination represents an alternative experiment configuration.\n3. **Compute detections** – `Detection` depends on both upstream tables. Its part table `Detection.Blob` holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n blobs = blob_doh(img,\n min_sigma=params['min_sigma'],\n max_sigma=params['max_sigma'],\n threshold=params['threshold'])\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs))\n```\n\nRunning `Detection.populate(display_progress=True)` fans out over every `(image, paramset)` pair, creating six jobs in the demo notebook. Because each job lives in an atomic transaction, half-written results never leak—this is the **isolation** guarantee that maintains workflow integrity.\n\n## Curate the Preferred Result\n\nAfter inspecting the plots, a small manual table `SelectDetection` records the \"best\" parameter set for each image. That drives a final visualization that renders only the chosen detections. 
This illustrates a common pattern: let automation explore the combinatorics, then capture human judgment in a concise manual table.\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n- **Reproducibility** — Rerunning `populate()` regenerates every derived table from raw inputs, providing a clear path from primary data to results\n- **Dependency-aware scheduling** — You do not need to script job order; DataJoint infers it from foreign keys (the DAG structure)\n- **Computational validity** — Foreign key constraints combined with immutable workflow artifacts ensure downstream results remain consistent with upstream inputs\n- **Provenance tracking** — The schema itself documents what was computed from what\n\n## Practical Tips\n\n- Develop `make()` logic with restrictions (e.g., `Detection.populate(key)`) before unlocking the entire pipeline\n- Use `display_progress=True` when you need visibility; use `reserve_jobs=True` when distributing work across multiple machines\n- If your computed table writes both summary and detail rows, keep them in a part table so the transaction boundary protects them together (see [Master-Part Relationships](../30-design/053-master-part.ipynb))\n\nThe blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready for scientific workflows." + "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. 
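To make the mechanics concrete, here is a minimal sketch of a computed table and its `make()` method. The schema, table, and attribute names (`demo_workflow`, `Recording`, `SpikeRate`, `trace`) are hypothetical and not part of any example in this book:

```python
import datajoint as dj
import numpy as np

schema = dj.Schema('demo_workflow')  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    trace : longblob   # raw signal
    """

@schema
class SpikeRate(dj.Computed):
    definition = """
    -> Recording
    ---
    spike_rate : float   # derived summary value
    """

    def make(self, key):
        # fetch the upstream input identified by this key
        trace = (Recording & key).fetch1('trace')
        # placeholder "analysis": reduce the trace to a single number
        rate = float(np.mean(trace))
        # insert exactly one row for this key
        self.insert1(dict(key, spike_rate=rate))

# run every make() call that is still missing, one transaction per key
SpikeRate.populate(display_progress=True)
```

Each `make()` call receives one primary-key value from the key source and is responsible for inserting the corresponding row before its transaction commits.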
If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part Relationships](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection Pipeline](../80-examples/075-blob-detection.ipynb) demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
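The exact definition of `SelectDetection` lives in the example notebook; one plausible shape, assuming `Detection`'s primary key combines the image and parameter-set keys, is:

```python
@schema
class SelectDetection(dj.Manual):
    definition = """
    # curator's choice of the preferred detection for each image
    -> Image
    ---
    -> Detection
    """

# record one choice; the attribute names here are illustrative only
SelectDetection.insert1(dict(image_id=1, paramset_id=2))
```

Because the image key forms the primary key, each image can have at most one selected detection, while the secondary foreign key ties that choice to a specific parameter set.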
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection Pipeline](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" }, { "cell_type": "markdown", From a520791e7df68635153dbcf1c4c7ab52a3f91f32 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 20:54:42 +0000 Subject: [PATCH 09/18] Rename Blob Detection chapter and update references - Change title from "Blob Detection Workflow" to "Blob Detection" - Update references in Computation chapter to match --- book/60-computation/010-computation.ipynb | 2 +- book/80-examples/075-blob-detection.ipynb | 17 ++--------------- 2 files changed, 3 insertions(+), 16 deletions(-) diff --git a/book/60-computation/010-computation.ipynb b/book/60-computation/010-computation.ipynb index 1191711..ce6ad58 100644 --- a/book/60-computation/010-computation.ipynb +++ b/book/60-computation/010-computation.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. 
In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part Relationships](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection Pipeline](../80-examples/075-blob-detection.ipynb) demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. 
**Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection Pipeline](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. 
In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part Relationships](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. 
**Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" }, { "cell_type": "markdown", diff --git a/book/80-examples/075-blob-detection.ipynb b/book/80-examples/075-blob-detection.ipynb index 0f779ca..c187aad 100644 --- a/book/80-examples/075-blob-detection.ipynb +++ b/book/80-examples/075-blob-detection.ipynb @@ -3,20 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Blob Detection Workflow\n", - "\n", - "This example shows a compact image-analysis pipeline that detects bright blobs in two sample images using DataJoint. 
It demonstrates:\n", - "\n", - "- Seeding a small `Image` manual table with two entries of standard images from `skimage.data`.\n", - "- Defining multiple parameter sets for blob detection in a lookup table `BlobParamSet`\n", - "- Defining a computed master table `Detection` together with its nested part table `Detection.Blob`.\n", - "- Populating the master, which automatically inserts all part rows inside the same transaction.\n", - "- Visualizing the results by drawing detection circles on the images.\n", - "- Visually selecting the optimal parameter set for each image and saving the selection in a manual table `SelectDetection`.\n", - "\n", - "Along the way we illustrate why master-part relationships are ideal for computational workflows: the master stores aggregate results and the parts hold per-feature detail, all created atomically.\n" - ] + "source": "# Blob Detection\n\nThis example shows a compact image-analysis pipeline that detects bright blobs in two sample images using DataJoint. It demonstrates:\n\n- Seeding a small `Image` manual table with two entries of standard images from `skimage.data`.\n- Defining multiple parameter sets for blob detection in a lookup table `BlobParamSet`\n- Defining a computed master table `Detection` together with its nested part table `Detection.Blob`.\n- Populating the master, which automatically inserts all part rows inside the same transaction.\n- Visualizing the results by drawing detection circles on the images.\n- Visually selecting the optimal parameter set for each image and saving the selection in a manual table `SelectDetection`.\n\nAlong the way we illustrate why master-part relationships are ideal for computational workflows: the master stores aggregate results and the parts hold per-feature detail, all created atomically." }, { "cell_type": "markdown", @@ -653,4 +640,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file From 3013eaf3d73ecadc4c262439ebfb5f77080903d1 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 21:07:22 +0000 Subject: [PATCH 10/18] Rename Master-Part Relationships chapter to Master-Part - Shorten chapter title from "Master-Part Relationships" to "Master-Part" - Update references in Computation chapter and Data Integrity chapter --- book/20-concepts/04-integrity.md | 2 +- book/30-design/053-master-part.ipynb | 4 +--- book/60-computation/010-computation.ipynb | 2 +- 3 files changed, 3 insertions(+), 5 deletions(-) diff --git a/book/20-concepts/04-integrity.md b/book/20-concepts/04-integrity.md index c488321..e004b7b 100644 --- a/book/20-concepts/04-integrity.md +++ b/book/20-concepts/04-integrity.md @@ -127,7 +127,7 @@ Compositional integrity ensures multi-part entities are never partially stored: **Example:** An imaging session's metadata and all acquired frames are stored together or not at all. 
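As a sketch—with hypothetical `Session` and `Session.Frame` tables in a hypothetical `demo_acquisition` schema—"together or not at all" means inserting the master and all of its parts inside one transaction:

```python
import numpy as np
import datajoint as dj

schema = dj.Schema('demo_acquisition')  # hypothetical schema name

@schema
class Session(dj.Manual):
    definition = """
    session_id : int
    ---
    session_note : varchar(255)
    """

    class Frame(dj.Part):
        definition = """
        -> master
        frame_id : int
        ---
        frame : longblob
        """

frames = [np.zeros((16, 16)) for _ in range(3)]  # stand-in acquisition data

# the session and all of its frames are committed together or not at all
with schema.connection.transaction:
    Session.insert1(dict(session_id=1, session_note='example acquisition'))
    Session.Frame.insert(
        dict(session_id=1, frame_id=i, frame=f) for i, f in enumerate(frames)
    )
```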
**Covered in:** -- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Hierarchical compositions +- [Master-Part](../30-design/053-master-part.ipynb) — Hierarchical compositions - [Transactions](../40-operations/040-transactions.ipynb) — Atomic operations --- diff --git a/book/30-design/053-master-part.ipynb b/book/30-design/053-master-part.ipynb index c13ec90..91283f9 100644 --- a/book/30-design/053-master-part.ipynb +++ b/book/30-design/053-master-part.ipynb @@ -3,9 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Master-Part Relationships\n" - ] + "source": "# Master-Part" }, { "cell_type": "markdown", diff --git a/book/60-computation/010-computation.ipynb b/book/60-computation/010-computation.ipynb index ce6ad58..b318b45 100644 --- a/book/60-computation/010-computation.ipynb +++ b/book/60-computation/010-computation.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. 
**Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part Relationships](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part Relationships](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. 
If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" }, { "cell_type": "markdown", From 0d7946a1b4ad1ef9fe6d03aafe53e46506cec36d Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 21:09:01 +0000 Subject: [PATCH 11/18] Move Caching chapter to Special Topics section --- .../80-caching.ipynb => 85-special-topics/080-caching.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename book/{40-operations/80-caching.ipynb => 85-special-topics/080-caching.ipynb} (100%) diff --git a/book/40-operations/80-caching.ipynb b/book/85-special-topics/080-caching.ipynb similarity index 100% rename from book/40-operations/80-caching.ipynb rename to book/85-special-topics/080-caching.ipynb From 3b832c5e36633dea465ea70e429f2bc02adeb23b Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 22:06:20 +0000 Subject: [PATCH 12/18] Reorganize documentation structure - Move Design Alterations chapter to Special Topics section - Rename "Modeling Relationships" to "Relationships" - Incorporate Reverse Engineering content into Create Schemas chapter - Update cross-references in Classic Sales example --- book/30-design/010-schema.ipynb | 5 + book/30-design/050-relationships.ipynb | 2 +- book/30-design/095-reverse-engineer.ipynb | 1749 ----------------- book/80-examples/010-classic-sales.ipynb | 2 +- .../091-alter.ipynb | 0 5 files changed, 7 insertions(+), 1751 deletions(-) delete mode 100644 book/30-design/095-reverse-engineer.ipynb rename book/{30-design => 85-special-topics}/091-alter.ipynb (100%) diff --git a/book/30-design/010-schema.ipynb b/book/30-design/010-schema.ipynb index c224e00..7393534 100644 --- a/book/30-design/010-schema.ipynb +++ b/book/30-design/010-schema.ipynb @@ -177,6 +177,11 @@ "%pycat code/subject.py" ] }, + { + "cell_type": "markdown", + "source": "# Working with Existing Schemas\n\nThis section describes how to work with database schemas without access to the original code that generated the schema. 
These situations often arise when:\n- The database is created by another user who has not shared the generating code yet\n- The database schema is created from a programming language other than Python\n- You need to explore an existing database before writing new code\n\n## Listing Available Schemas\n\nYou can use the `dj.list_schemas` function to list the names of database schemas available to you:\n\n```python\nimport datajoint as dj\ndj.list_schemas()\n```\n\n## Connecting to an Existing Schema\n\nJust as with a new schema, you start by creating a `schema` object to connect to the chosen database schema:\n\n```python\nschema = dj.Schema('existing_schema_name')\n```\n\nIf the schema already exists, `dj.Schema` is initialized as usual and you may plot the schema diagram. But instead of seeing class names, you will see the raw table names as they appear in the database:\n\n```python\ndj.Diagram(schema)\n```\n\n## Spawning Missing Classes\n\nWhen you connect to an existing schema without the original Python code, you can view the diagram but cannot interact with the tables. A similar situation arises when another developer has added new tables to the schema but has not yet shared the updated module code with you.\n\nYou may use the `schema.spawn_missing_classes` method to *spawn* classes into the local namespace for any tables missing their classes:\n\n```python\nschema.spawn_missing_classes()\n```\n\nNow you may interact with these tables as if they were declared right here in your namespace.\n\n## Creating a Virtual Module\n\nThe `spawn_missing_classes` method creates new classes in the local namespace. However, it is often more convenient to import a schema with its Python module, equivalent to:\n```python\nimport university as uni\n```\n\nWe can mimic this import without having access to `university.py` using the `create_virtual_module` function:\n\n```python\nimport datajoint as dj\n\nuni = dj.create_virtual_module('university', 'existing_schema_name')\n```\n\nNow `uni` behaves as an imported module complete with the `schema` object and all the table classes. You can use it like any other module:\n\n```python\ndj.Diagram(uni)\nuni.Student - uni.StudentMajor\n```\n\n## Virtual Module Options\n\n`dj.create_virtual_module` takes optional arguments:\n\n### create_schema\nThe `create_schema=False` argument (default) assures that an error is raised when the schema does not already exist. Set it to `True` if you want to create an empty schema:\n\n```python\n# This will raise an error if 'nonexistent' schema doesn't exist\ndj.create_virtual_module('what', 'nonexistent')\n\n# This will create the schema if it doesn't exist\ndj.create_virtual_module('what', 'nonexistent', create_schema=True)\n```\n\n### create_tables\nThe `create_tables=False` argument is passed to the schema object. It prevents the use of the schema object of the virtual module for creating new tables in the existing schema. 
This is a precautionary measure since virtual modules are often used for completed schemas.\n\nYou may set this argument to `True` if you wish to add new tables to the existing schema:\n\n```python\nuni = dj.create_virtual_module('university', 'existing_schema_name', create_tables=True)\n\n@uni.schema\nclass NewTable(dj.Manual):\n definition = \"\"\"\n -> uni.Student \n ---\n example : varchar(255)\n \"\"\"\n```\n\nA more common approach when you need to add tables is to create a new `schema` object and use the `spawn_missing_classes` function to make the existing classes available.", + "metadata": {} + }, { "cell_type": "markdown", "metadata": {}, diff --git a/book/30-design/050-relationships.ipynb b/book/30-design/050-relationships.ipynb index ecc0119..affb482 100644 --- a/book/30-design/050-relationships.ipynb +++ b/book/30-design/050-relationships.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Modeling Relationships\n\nIn this chapter, we'll explore how to build complex relationships between entities using a combination of uniqueness constraints and referential constraints. Understanding these patterns is essential for designing schemas that accurately represent business rules and data dependencies.\n\n## Uniqueness Constraints\n\nUniqueness constraints are typically set through primary keys, but tables can also have additional unique indexes beyond the primary key. These constraints ensure that specific combinations of attributes remain unique across all rows in a table.\n\n## Referential Constraints\n\nReferential constraints establish relationships between tables and are enforced by [foreign keys](030-foreign-keys.ipynb). They ensure that references between tables remain valid and prevent orphaned records.\n\nIn DataJoint, foreign keys also participate in the **relational workflow model** introduced earlier: each dependency not only enforces referential integrity but also prescribes the order of operations in a workflow. When table `B` references table `A`, `A` must be populated before `B`, and deleting from `A` cascades through all dependent workflow steps. The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order (see [Relational Workflows](../20-concepts/05-workflows.md)).\n\n## Foreign Keys Establish 1:N or 1:1 Relationships\n\nWhen a child table defines a foreign key constraint to a parent table, it creates a relationship between the entities in the parent and child tables. 
The cardinality of this relationship is always **1 on the parent side**: each entry in the child table must refer to a single entity in the parent table.\n\nOn the child side, the relationship can have different cardinalities:\n\n- **0–1 (optional one-to-one)**: if the foreign key field in the child table has a unique constraint\n- **1 (mandatory one-to-one)**: if the foreign key is the entire primary key of the child table\n- **N (one-to-many)**: if no uniqueness constraint is applied to the foreign key field\n\n## What We'll Cover\n\nThis chapter explores these key relationship patterns:\n\n* **One-to-Many Relationships**: The most common pattern, using foreign keys in secondary attributes\n* **One-to-One Relationships**: Using primary key foreign keys or unique constraints\n* **Many-to-Many Relationships**: Using association tables with composite primary keys\n* **Sequences**: Cascading one-to-one relationships for workflows\n* **Hierarchies**: Cascading one-to-many relationships for nested data structures\n* **Parameterization**: Association tables where the association itself is the primary entity\n* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n* **Complex Constraints**: Using nullable enums with unique indexes for special requirements\n\nLet's begin by illustrating these possibilities with examples of bank customers and their accounts." + "source": "# Relationships\n\nIn this chapter, we'll explore how to build complex relationships between entities using a combination of uniqueness constraints and referential constraints. Understanding these patterns is essential for designing schemas that accurately represent business rules and data dependencies.\n\n## Uniqueness Constraints\n\nUniqueness constraints are typically set through primary keys, but tables can also have additional unique indexes beyond the primary key. These constraints ensure that specific combinations of attributes remain unique across all rows in a table.\n\n## Referential Constraints\n\nReferential constraints establish relationships between tables and are enforced by [foreign keys](030-foreign-keys.ipynb). They ensure that references between tables remain valid and prevent orphaned records.\n\nIn DataJoint, foreign keys also participate in the **relational workflow model** introduced earlier: each dependency not only enforces referential integrity but also prescribes the order of operations in a workflow. When table `B` references table `A`, `A` must be populated before `B`, and deleting from `A` cascades through all dependent workflow steps. The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order (see [Relational Workflows](../20-concepts/05-workflows.md)).\n\n## Foreign Keys Establish 1:N or 1:1 Relationships\n\nWhen a child table defines a foreign key constraint to a parent table, it creates a relationship between the entities in the parent and child tables. 
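As a sketch with illustrative table names, the placement of the foreign key in the child's definition is what sets the cardinality on the child side:

```python
@schema
class Customer(dj.Manual):
    definition = """
    customer_id : int
    ---
    customer_name : varchar(60)
    """

# N (one-to-many): the customer key is only part of the child's primary key
@schema
class Account(dj.Manual):
    definition = """
    -> Customer
    account_id : int
    ---
    balance : decimal(12, 2)
    """

# 1 (mandatory one-to-one): the foreign key is the child's entire primary key
@schema
class CreditProfile(dj.Manual):
    definition = """
    -> Customer
    ---
    credit_score : int
    """

# 0-1 (optional one-to-one): a secondary foreign key declared unique
@schema
class PreferredCard(dj.Manual):
    definition = """
    card_id : int
    ---
    -> [unique] Customer
    card_label : varchar(30)
    """
```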
The cardinality of this relationship is always **1 on the parent side**: each entry in the child table must refer to a single entity in the parent table.\n\nOn the child side, the relationship can have different cardinalities:\n\n- **0–1 (optional one-to-one)**: if the foreign key field in the child table has a unique constraint\n- **1 (mandatory one-to-one)**: if the foreign key is the entire primary key of the child table\n- **N (one-to-many)**: if no uniqueness constraint is applied to the foreign key field\n\n## What We'll Cover\n\nThis chapter explores these key relationship patterns:\n\n* **One-to-Many Relationships**: The most common pattern, using foreign keys in secondary attributes\n* **One-to-One Relationships**: Using primary key foreign keys or unique constraints\n* **Many-to-Many Relationships**: Using association tables with composite primary keys\n* **Sequences**: Cascading one-to-one relationships for workflows\n* **Hierarchies**: Cascading one-to-many relationships for nested data structures\n* **Parameterization**: Association tables where the association itself is the primary entity\n* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n* **Complex Constraints**: Using nullable enums with unique indexes for special requirements\n\nLet's begin by illustrating these possibilities with examples of bank customers and their accounts." }, { "cell_type": "code", diff --git a/book/30-design/095-reverse-engineer.ipynb b/book/30-design/095-reverse-engineer.ipynb deleted file mode 100644 index 5505369..0000000 --- a/book/30-design/095-reverse-engineer.ipynb +++ /dev/null @@ -1,1749 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": "# Reverse Engineering\n\nThis chapter describes how to work with database schemas without access to the original code that generated the schema. These situations often arise when:\n- The database is created by another user who has not shared the generating code yet\n- The database schema is created from a programming language other than Python\n- You need to explore an existing database before writing new code" - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import datajoint as dj" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Working with schemas and their modules\n", - "\n", - "Typically a DataJoint schema is created as a dedicated Python module. This module defines a `schema` object that is used to link classes declared in the module to tables in the database schema. 
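A minimal, abridged sketch of the shape such a module takes (the real `university.py` defines many more tables):

```python
# university.py -- abridged sketch of a schema module
import datajoint as dj

schema = dj.Schema('university')

@schema
class Student(dj.Manual):
    definition = """
    student_id : int          # university-wide ID number
    ---
    first_name : varchar(40)
    last_name  : varchar(40)
    """
```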
As an example, examine the `university` module in this folder (`./university.py`).\n", - "\n", - "You may then import the module to interact with its tables:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Connecting dimitri@localhost:3306\n" - ] - } - ], - "source": [ - "import university as uni" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.LetterGrade->uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department->uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll->uni.Grade\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Diagram(uni)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that `dj.Diagram` can extract the diagram from a schema object or from a python module containing its `schema` object, lending further support to the convention of one-to-one correspondence between database schemas and Python modules in a datajoint project:\n", - "```python\n", - "dj.Diagram(uni)\n", - "```\n", - "is equvalent to \n", - "```python\n", - "dj.Diagram(uni.schema)\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*student_id first_name last_name sex date_of_birth home_address home_city home_state home_zip home_phone \n", - "+------------+ +------------+ +-----------+ +-----+ +------------+ +------------+ +------------+ +------------+ +----------+ +------------+\n", - "1003 Jonathan Wilson M 2002-10-19 91101 Summer P Port Jacquelin VT 75616 982.251.1567x2\n", - "1005 Richard Lopez M 1993-10-16 86210 Brooks S West David IL 96184 +1-385-481-676\n", - "1006 Laura Hammond F 1984-12-03 0346 Shannon M East David ME 22113 (189)406-2652x\n", - " ...\n", - " (Total: 751)" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# students without majors \n", - "uni.Student - uni.StudentMajor" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Spawning missing classes\n", - "Now imagine that you do not have access to `university.py` or you do not have its latest version. You can still connect to the database schema but you will not have classes declared to interact with it.\n", - "\n", - "So let's start over in this scenario.\n", - "\n", - "### **!!! Restart the kernel here to remove the previous class definitions !!!**" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can may use the `dj.list_schemas` function (new in `datajoint 0.12.0`) to list the names of database schemas available to you." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Connecting dimitri@localhost:3306\n" - ] - }, - { - "data": { - "text/plain": [ - "['dimitri_alter',\n", - " 'dimitri_attach',\n", - " 'dimitri_blob',\n", - " 'dimitri_blobs',\n", - " 'dimitri_nphoton',\n", - " 'dimitri_schema',\n", - " 'dimitri_university',\n", - " 'dimitri_uuid',\n", - " 'university']" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import datajoint as dj\n", - "dj.list_schemas()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Just as with a new schema, we start by creating a `schema` object to connect to the chosen database schema:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "schema = dj.schema('dimitri_university')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the schema already exists, `dj.schema` is initialized as usual and you may plot the schema diagram. But instead of seeing class names, you will see the raw table names as they appear in the database." 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`course`\n", - "\n", - "`dimitri_university`.`course`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`section`\n", - "\n", - "`dimitri_university`.`section`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`course`->`dimitri_university`.`section`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`current_term`\n", - "\n", - "`dimitri_university`.`current_term`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`department`\n", - "\n", - "`dimitri_university`.`department`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`department`->`dimitri_university`.`course`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`student_major`\n", - "\n", - "`dimitri_university`.`student_major`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`department`->`dimitri_university`.`student_major`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`enroll`\n", - "\n", - "`dimitri_university`.`enroll`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`grade`\n", - "\n", - "`dimitri_university`.`grade`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`enroll`->`dimitri_university`.`grade`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`letter_grade`\n", - "\n", - "`dimitri_university`.`letter_grade`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`letter_grade`->`dimitri_university`.`grade`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`section`->`dimitri_university`.`enroll`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`student`\n", - "\n", - "`dimitri_university`.`student`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`student`->`dimitri_university`.`enroll`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`student`->`dimitri_university`.`student_major`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`term`\n", - "\n", - "`dimitri_university`.`term`\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`term`->`dimitri_university`.`current_term`\n", - "\n", - "\n", - "\n", - "\n", - "`dimitri_university`.`term`->`dimitri_university`.`section`\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# let's plot its diagram\n", - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You may view the diagram but, at this point, there is now way to interact with these tables. A similar situation arises when another developer has added new tables to the schema but has not yet shared the updated module code with you. 
Then the diagram will show a mixture of class names and database table names.\n", - "\n", - "Now you may use the `schema.spawn_missing_classes` method to *spawn* classes into the local namespace for any tables missing their classes:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "Course\n", - "\n", - "\n", - "Course\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Section\n", - "\n", - "\n", - "Section\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Course->Section\n", - "\n", - "\n", - "\n", - "\n", - "Grade\n", - "\n", - "\n", - "Grade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Department\n", - "\n", - "\n", - "Department\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Department->Course\n", - "\n", - "\n", - "\n", - "\n", - "StudentMajor\n", - "\n", - "\n", - "StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Department->StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "Enroll\n", - "\n", - "\n", - "Enroll\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Enroll->Grade\n", - "\n", - "\n", - "\n", - "\n", - "Student\n", - "\n", - "\n", - "Student\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Student->Enroll\n", - "\n", - "\n", - "\n", - "\n", - "Student->StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "LetterGrade\n", - "\n", - "\n", - "LetterGrade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "LetterGrade->Grade\n", - "\n", - "\n", - "\n", - "\n", - "Term\n", - "\n", - "\n", - "Term\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Term->Section\n", - "\n", - "\n", - "\n", - "\n", - "CurrentTerm\n", - "\n", - "\n", - "CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Term->CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "Section->Enroll\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "schema.spawn_missing_classes()\n", - "dj.Di(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now you may interact with these tables as if they were declared right here in this namespace:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*student_id first_name last_name sex date_of_birth home_address home_city home_state home_zip home_phone \n", - "+------------+ +------------+ +-----------+ +-----+ +------------+ +------------+ +------------+ +------------+ +----------+ +------------+\n", - "1003 Jonathan Wilson M 2002-10-19 91101 Summer P Port Jacquelin VT 75616 982.251.1567x2\n", - "1005 Richard Lopez M 1993-10-16 86210 Brooks S West David IL 96184 +1-385-481-676\n", - "1006 Laura Hammond F 1984-12-03 0346 Shannon M East David ME 22113 (189)406-2652x\n", - " ...\n", - " (Total: 751)" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# students without majors \n", - "Student - StudentMajor" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### **!!! Restart the kernel here to remove the previous class definitions !!!**\n", - "\n", - "# Creating a virtual module\n", - "Now `spawn_missing_classes` creates the new classes in the local namespace. However, it is often more convenient to import a schema with its python module, equivalent to the python the python command \n", - "```python\n", - "import university as uni\n", - "```\n", - "\n", - "We can mimick this import without having access to `university.py` using the `create_virtual_module` function:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Connecting dimitri@localhost:3306\n" - ] - } - ], - "source": [ - "import datajoint as dj\n", - "\n", - "uni = dj.create_virtual_module('university', 'dimitri_university')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now `uni` behaves as an imported module complete with the `schema` object and all the table classes." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll->uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department->uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.LetterGrade->uni.Grade\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Di(uni)" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*student_id first_name last_name sex date_of_birth home_address home_city home_state home_zip home_phone \n", - "+------------+ +------------+ +-----------+ +-----+ +------------+ +------------+ +------------+ +------------+ +----------+ +------------+\n", - "1003 Jonathan Wilson M 2002-10-19 91101 Summer P Port Jacquelin VT 75616 982.251.1567x2\n", - "1005 Richard Lopez M 1993-10-16 86210 Brooks S West David IL 96184 +1-385-481-676\n", - "1006 Laura Hammond F 1984-12-03 0346 Shannon M East David ME 22113 (189)406-2652x\n", - " ...\n", - " (Total: 751)" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "uni.Student - uni.StudentMajor" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`dj.create_virtual_module` takes optional arguments. \n", - "\n", - "First, `create_schema=False` assures that an error is raised when the schema does not already exist. Set it to `True` if you want to create an empty schema." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "ename": "DataJointError", - "evalue": "Database named `nonexistent` was not defined. Set argument create_schema=True to create it.", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mDataJointError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate_virtual_module\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'what'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'nonexistent'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/dev/datajoint-python/datajoint/schema.py\u001b[0m in \u001b[0;36mcreate_virtual_module\u001b[0;34m(module_name, schema_name, create_schema, create_tables, connection)\u001b[0m\n\u001b[1;32m 242\u001b[0m \"\"\"\n\u001b[1;32m 243\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mModuleType\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodule_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 244\u001b[0;31m \u001b[0m_schema\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSchema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mschema_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_schema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcreate_schema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_tables\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcreate_tables\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconnection\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconnection\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 245\u001b[0m \u001b[0m_schema\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspawn_missing_classes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcontext\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmodule\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__dict__\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 246\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__dict__\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'schema'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_schema\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - 
"\u001b[0;32m~/dev/datajoint-python/datajoint/schema.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, schema_name, context, connection, create_schema, create_tables)\u001b[0m\n\u001b[1;32m 65\u001b[0m raise DataJointError(\n\u001b[1;32m 66\u001b[0m \u001b[0;34m\"Database named `{name}` was not defined. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 67\u001b[0;31m \"Set argument create_schema=True to create it.\".format(name=schema_name))\n\u001b[0m\u001b[1;32m 68\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;31m# create database\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mDataJointError\u001b[0m: Database named `nonexistent` was not defined. Set argument create_schema=True to create it." - ] - } - ], - "source": [ - "dj.create_virtual_module('what', 'nonexistent')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The other optional argument, `create_tables=False` is passed to the `schema` object. It prevents the use of the `schema` obect of the virtual module for creating new tables in the existing schema. This is a precautionary measure since virtual modules are often used for completed schemas. You may set this argument to `True` if you wish to add new tables to the existing schema. A more common approach in this scenario would be to create a new `schema` object and to use the `spawn_missing_classes` function to make the classes available.\n", - "\n", - "However, you if do decide to create new tables in an existing tables using the virtual module, you may do so by using the schema object from the module as the decorator for declaring new tables:" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "uni = dj.create_virtual_module('university.py', 'dimitri_university', create_tables=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "@uni.schema\n", - "class Example(dj.Manual):\n", - " definition = \"\"\"\n", - " -> uni.Student \n", - " ---\n", - " example : varchar(255)\n", - " \"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "uni.Term\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.CurrentTerm\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Term->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Section->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Enroll->uni.Grade\n", - "\n", - "\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "uni.Department\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Department->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "\n", - 
"uni.Department->uni.Course\n", - "\n", - "\n", - "\n", - "\n", - "uni.Course->uni.Section\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "uni.Student\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.Enroll\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->uni.StudentMajor\n", - "\n", - "\n", - "\n", - "\n", - "Example\n", - "\n", - "\n", - "Example\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.Student->Example\n", - "\n", - "\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "uni.LetterGrade\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "uni.LetterGrade->uni.Grade\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Di(uni)" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-> uni.Student\n", - "-> uni.Section\n", - "\n" - ] - } - ], - "source": [ - "uni.Enroll.describe();" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "uni.save()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.7" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file diff --git a/book/80-examples/010-classic-sales.ipynb b/book/80-examples/010-classic-sales.ipynb index a657967..885b8b7 100644 --- a/book/80-examples/010-classic-sales.ipynb +++ b/book/80-examples/010-classic-sales.ipynb @@ -14,7 +14,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Background\n\nThe Classic Models dataset is a well-known sample schema that ships with many SQL tutorials. It captures the operations of a miniature wholesaler: offices, employees, customers, orders, order line items, product lines, and payments. Because the design is already normalized into clean entity sets, it is a convenient playground for illustrating DataJoint's relational workflow concepts.\n\nIn this notebook we:\n\n- **Load the canonical schema** exactly as published by the MySQL team so you can compare the DataJoint rendition with the original SQL definitions.\n- **Highlight the workflow perspective**: foreign keys organize the tables into a directed acyclic graph (customers → orders → payments, product lines → products → order details, etc.). This makes it easy to trace the sequence of business operations.\n- **Demonstrate interoperability**: we ingest the SQL dump with `%sql`, then use `schema.spawn_missing_classes()` to materialize DataJoint table classes directly from the existing relational structure.\n\nKeep the [Relational Workflows](../20-concepts/05-workflows.md) and [Modeling Relationships](../30-design/050-relationships.ipynb) chapters in mind as you work through this example; you will see the same principles—normalized entity tables, association tables, and workflow-directed foreign keys—applied to a realistic business domain." + "source": "## Background\n\nThe Classic Models dataset is a well-known sample schema that ships with many SQL tutorials. 
It captures the operations of a miniature wholesaler: offices, employees, customers, orders, order line items, product lines, and payments. Because the design is already normalized into clean entity sets, it is a convenient playground for illustrating DataJoint's relational workflow concepts.\n\nIn this notebook we:\n\n- **Load the canonical schema** exactly as published by the MySQL team so you can compare the DataJoint rendition with the original SQL definitions.\n- **Highlight the workflow perspective**: foreign keys organize the tables into a directed acyclic graph (customers → orders → payments, product lines → products → order details, etc.). This makes it easy to trace the sequence of business operations.\n- **Demonstrate interoperability**: we ingest the SQL dump with `%sql`, then use `schema.spawn_missing_classes()` to materialize DataJoint table classes directly from the existing relational structure.\n\nKeep the [Relational Workflows](../20-concepts/05-workflows.md) and [Relationships](../30-design/050-relationships.ipynb) chapters in mind as you work through this example; you will see the same principles—normalized entity tables, association tables, and workflow-directed foreign keys—applied to a realistic business domain." }, { "cell_type": "markdown", diff --git a/book/30-design/091-alter.ipynb b/book/85-special-topics/091-alter.ipynb similarity index 100% rename from book/30-design/091-alter.ipynb rename to book/85-special-topics/091-alter.ipynb From 69c5703f330bf6b12998f672bd811ec208767702 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 22:36:50 +0000 Subject: [PATCH 13/18] Merge Computations section into Operations - Move computation chapter to 40-operations/050-populate.ipynb - Add new 060-orchestration.ipynb covering infrastructure concerns - Update cross-references in 20-concepts/04-integrity.md - Update cross-references in 30-design/015-table.ipynb - Remove 60-computation directory Operations section now contains: 1. Insert 2. Delete 3. Updates 4. Transactions 5. Populate (automated computation via make/populate) 6. 
Orchestration (infrastructure, containerization, monitoring) --- book/20-concepts/04-integrity.md | 2 +- book/30-design/015-table.ipynb | 2 +- book/40-operations/050-populate.ipynb | 21 ++++ book/40-operations/060-orchestration.ipynb | 126 +++++++++++++++++++++ book/60-computation/010-computation.ipynb | 21 ---- 5 files changed, 149 insertions(+), 23 deletions(-) create mode 100644 book/40-operations/050-populate.ipynb create mode 100644 book/40-operations/060-orchestration.ipynb delete mode 100644 book/60-computation/010-computation.ipynb diff --git a/book/20-concepts/04-integrity.md b/book/20-concepts/04-integrity.md index e004b7b..32f4877 100644 --- a/book/20-concepts/04-integrity.md +++ b/book/20-concepts/04-integrity.md @@ -162,7 +162,7 @@ Workflow integrity maintains valid operation sequences through: **Covered in:** - [Foreign Keys](../30-design/030-foreign-keys.ipynb) — How foreign keys encode workflow dependencies -- [Computation](../60-computation/010-computation.ipynb) — Automatic workflow execution and dependency resolution +- [Populate](../40-operations/050-populate.ipynb) — Automatic workflow execution and dependency resolution --- diff --git a/book/30-design/015-table.ipynb b/book/30-design/015-table.ipynb index a583e7a..dabe22f 100644 --- a/book/30-design/015-table.ipynb +++ b/book/30-design/015-table.ipynb @@ -352,7 +352,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Table Base Classes\n\nDataJoint provides four base classes for different data management patterns:\n\n| Base Class | Purpose | When to Use |\n|------------|---------|-------------|\n| `dj.Manual` | Manually entered data | Subject info, experimental protocols |\n| `dj.Lookup` | Reference data, rarely changes | Equipment lists, parameter sets |\n| `dj.Imported` | Data imported from external files | Raw recordings, behavioral videos |\n| `dj.Computed` | Derived from other tables | Spike sorting results, analyses |\n\nWe'll explore `Imported` and `Computed` tables in the [Computation](../60-computation/) section.\n\n```{seealso}\n- [Lookup Tables](018-lookup-tables.ipynb) — Managing reference data\n- [Computation](../60-computation/010-computation.ipynb) — Automated data processing\n```" + "source": "# Table Base Classes\n\nDataJoint provides four base classes for different data management patterns:\n\n| Base Class | Purpose | When to Use |\n|------------|---------|-------------|\n| `dj.Manual` | Manually entered data | Subject info, experimental protocols |\n| `dj.Lookup` | Reference data, rarely changes | Equipment lists, parameter sets |\n| `dj.Imported` | Data imported from external files | Raw recordings, behavioral videos |\n| `dj.Computed` | Derived from other tables | Spike sorting results, analyses |\n\nWe'll explore `Imported` and `Computed` tables in the [Populate](050-populate.ipynb) chapter.\n\n```{seealso}\n- [Lookup Tables](018-lookup-tables.ipynb) — Managing reference data\n- [Populate](050-populate.ipynb) — Automated data processing\n```" }, { "cell_type": "markdown", diff --git a/book/40-operations/050-populate.ipynb b/book/40-operations/050-populate.ipynb new file mode 100644 index 0000000..ebf2d45 --- /dev/null +++ b/book/40-operations/050-populate.ipynb @@ -0,0 +1,21 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on 
dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. 
This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/book/40-operations/060-orchestration.ipynb b/book/40-operations/060-orchestration.ipynb new file mode 100644 index 0000000..84083d1 --- /dev/null +++ b/book/40-operations/060-orchestration.ipynb @@ -0,0 +1,126 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell-0", + "metadata": {}, + "source": [ + "# Orchestration\n", + "\n", + "While the `populate` operation provides the logic for automated computation, **orchestration** addresses the infrastructure and operational concerns of running these computations at scale:\n", + "\n", + "- **Infrastructure provisioning** — Allocating compute resources (servers, containers, cloud instances)\n", + "- **Dependency management** — Ensuring consistent runtime environments across workers\n", + "- **Automated execution** — Scheduling and triggering `populate` calls\n", + "- **Observability** — Monitoring job progress, failures, and system health\n", + "- **Performance and cost tracking** — Understanding resource utilization and expenses\n", + "\n", + "These concerns are **outside the scope of the core DataJoint library** (`datajoint-python`), which focuses on the data model and workflow logic. Orchestration is solved through complementary infrastructure.\n", + "\n", + "## The Orchestration Challenge\n", + "\n", + "A typical DataJoint workflow requires:\n", + "\n", + "1. **Database server** — MySQL/MariaDB instance with appropriate configuration\n", + "2. **Worker processes** — Python environments with DataJoint and domain-specific packages\n", + "3. **File storage** — For external blob storage (if using `dj.config['stores']`)\n", + "4. **Job coordination** — Managing which workers process which jobs\n", + "5. **Error handling** — Retrying failed jobs, alerting on persistent failures\n", + "6. 
**Scaling** — Adding workers during high-demand periods\n", + "\n", + "The `populate(reserve_jobs=True)` option handles job coordination at the database level, but provisioning and managing the workers themselves requires additional infrastructure.\n", + "\n", + "## Commercial Solution: DataJoint Works\n", + "\n", + "[DataJoint Works](https://datajoint.com) is a managed platform that provides comprehensive orchestration:\n", + "\n", + "| Feature | Description |\n", + "|---------|-------------|\n", + "| **Managed databases** | Provisioned and configured MySQL instances |\n", + "| **Container registry** | Store and version workflow container images |\n", + "| **Compute clusters** | Auto-scaling worker pools (cloud or on-premise) |\n", + "| **Job scheduler** | Automated triggering of `populate` operations |\n", + "| **Monitoring dashboard** | Real-time visibility into job status and errors |\n", + "| **Cost analytics** | Track compute and storage costs per workflow |\n", + "\n", + "This platform integrates directly with DataJoint schemas, providing a turnkey solution for teams that prefer managed infrastructure.\n", + "\n", + "## DIY Solutions\n", + "\n", + "Many teams build custom orchestration using standard DevOps tools. Common approaches include:\n", + "\n", + "### Containerization\n", + "\n", + "- **Docker** — Package DataJoint workflows with all dependencies\n", + "- **Singularity/Apptainer** — Container runtime for HPC environments\n", + "- **Conda environments** — Dependency management without full containerization\n", + "\n", + "### Container Orchestration\n", + "\n", + "- **Kubernetes** — Production-grade container orchestration\n", + "- **Docker Swarm** — Simpler container clustering\n", + "- **Nomad** — HashiCorp's workload orchestrator\n", + "\n", + "### Job Schedulers\n", + "\n", + "- **SLURM** — Common in academic HPC clusters\n", + "- **PBS/Torque** — Traditional batch scheduling\n", + "- **HTCondor** — High-throughput computing scheduler\n", + "- **Apache Airflow** — DAG-based workflow orchestration\n", + "- **Prefect** — Modern Python-native orchestration\n", + "- **Celery** — Distributed task queue\n", + "\n", + "### Cloud Infrastructure\n", + "\n", + "- **AWS Batch** — Managed batch computing on AWS\n", + "- **Google Cloud Run Jobs** — Serverless container execution\n", + "- **Azure Container Instances** — On-demand container execution\n", + "\n", + "### Monitoring and Observability\n", + "\n", + "- **Prometheus + Grafana** — Metrics collection and visualization\n", + "- **DataDog** — Commercial observability platform\n", + "- **CloudWatch / Stackdriver** — Cloud-native monitoring\n", + "\n", + "### Database Hosting\n", + "\n", + "- **Amazon RDS** — Managed MySQL on AWS\n", + "- **Google Cloud SQL** — Managed MySQL on GCP\n", + "- **Self-hosted MySQL/MariaDB** — On-premise or VM-based\n", + "\n", + "## Choosing an Approach\n", + "\n", + "The right orchestration strategy depends on your team's context:\n", + "\n", + "| Factor | Managed Platform | DIY |\n", + "|--------|-----------------|-----|\n", + "| **Setup time** | Hours | Days to weeks |\n", + "| **Maintenance** | Included | Team responsibility |\n", + "| **Customization** | Platform constraints | Full flexibility |\n", + "| **Cost model** | Subscription | Infrastructure costs |\n", + "| **Existing infrastructure** | May duplicate | Leverages investments |\n", + "| **Compliance requirements** | Check with vendor | Full control |\n", + "\n", + "Many teams start with DIY solutions using familiar tools, then evaluate 
managed platforms as workflows scale and operational overhead increases.\n", + "\n", + ":::{seealso}\n", + "- [DataJoint Works](https://datajoint.com) — Managed orchestration platform\n", + "- [Populate](050-populate.ipynb) — The underlying automation mechanism\n", + ":::" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/book/60-computation/010-computation.ipynb b/book/60-computation/010-computation.ipynb deleted file mode 100644 index b318b45..0000000 --- a/book/60-computation/010-computation.ipynb +++ /dev/null @@ -1,21 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": "# Computation as Workflow\n\nDataJoint's central innovation is to recast relational databases as executable workflow specifications comprising a mixture of manual and automated steps. This chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to practical computation patterns.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. 
**Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - } - ], - "metadata": { - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file From 0ce5301ef9b9b7a95ee47c40797846552e770590 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 22:44:57 +0000 Subject: [PATCH 14/18] Document the anatomy of a make function Add new section explaining the three-part structure of make methods: 1. Fetch - retrieve data from upstream tables using the key 2. Compute - perform the transformation/computation 3. Insert - store results in the table and any part tables Includes code examples and a complete example showing the pattern. --- book/40-operations/050-populate.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/40-operations/050-populate.ipynb b/book/40-operations/050-populate.ipynb index ebf2d45..b65b0dc 100644 --- a/book/40-operations/050-populate.ipynb +++ b/book/40-operations/050-populate.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. 
**Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. 
**Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. 
In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Anatomy of a `make` Method\n\nThe `make()` method is where the actual computation happens.\nIts input is a single argument: the **key** dict identifying which entity to compute.\nThis key contains the primary key attributes from the key source—the join of all upstream dependencies.\n\nA well-structured `make()` method has three distinct parts:\n\n### 1. Fetch\n\nRetrieve the necessary data from upstream tables using the provided key:\n\n```python\ndef make(self, key):\n # 1. FETCH: Get data from upstream tables\n image = (Image & key).fetch1(\"image_data\")\n params = (BlobParamSet & key).fetch1()\n```\n\nThe key restricts each upstream table to exactly the relevant row(s).\nUse `fetch1()` when expecting a single row, `fetch()` for multiple rows.\n\n### 2. Compute\n\nPerform the actual computation or data transformation:\n\n```python\n # 2. COMPUTE: Perform the transformation\n blobs = detect_blobs(\n image,\n min_sigma=params[\"min_sigma\"],\n max_sigma=params[\"max_sigma\"],\n threshold=params[\"threshold\"],\n )\n```\n\nThis is the scientific or business logic—image processing, statistical analysis, simulation, or any transformation that produces derived data.\n\n### 3. Insert\n\nStore the results in the table (and any part tables):\n\n```python\n # 3. 
INSERT: Store results\n self.insert1({**key, \"blob_count\": len(blobs)})\n self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n```\n\nThe key must be included in the inserted row to maintain referential integrity.\nFor master-part structures, insert both the master row and all part rows within the same `make()` call.\n\n### Complete Example\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n blob_count : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n sigma : float\n \"\"\"\n\n def make(self, key):\n # 1. FETCH\n image = (Image & key).fetch1(\"image_data\")\n params = (BlobParamSet & key).fetch1()\n\n # 2. COMPUTE\n blobs = detect_blobs(\n image,\n min_sigma=params[\"min_sigma\"],\n max_sigma=params[\"max_sigma\"],\n threshold=params[\"threshold\"],\n )\n\n # 3. INSERT\n self.insert1({**key, \"blob_count\": len(blobs)})\n self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n```\n\nThis three-part structure—**fetch, compute, insert**—keeps `make()` methods readable and maintainable.\nEach part has a clear responsibility, making it easy to debug and extend.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
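(One possible shape for such a curation table, sketched for illustration only; the authoritative definition lives in the Blob Detection example. The secondary foreign key records which detection was selected for each image.)

```python
@schema
class SelectDetection(dj.Manual):
    definition = """
    # Preferred detection for each image (illustrative sketch)
    -> Image
    ---
    -> Detection
    """
```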
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" }, { "cell_type": "markdown", From f6d93a0bf47dfa63b3eef2eacc53c1cb2ae0a34c Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 22:52:01 +0000 Subject: [PATCH 15/18] Add dedicated chapter for make method anatomy - Create new 055-make.ipynb chapter covering: - Input: the key dictionary from the key source - Three-part anatomy: fetch, compute, insert - Restrictions on auto-populated tables (no manual insertion) - Upstream-only fetching constraint (foreign key chain) - Three-part pattern for long-running computations - Transaction handling strategies - Update 050-populate.ipynb to reference the new chapter instead of containing the full make anatomy inline --- book/40-operations/050-populate.ipynb | 2 +- book/40-operations/055-make.ipynb | 268 ++++++++++++++++++++++++++ 2 files changed, 269 insertions(+), 1 deletion(-) create mode 100644 book/40-operations/055-make.ipynb diff --git a/book/40-operations/050-populate.ipynb b/book/40-operations/050-populate.ipynb index b65b0dc..b2f3e4a 100644 --- a/book/40-operations/050-populate.ipynb +++ b/book/40-operations/050-populate.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. 
**Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## Anatomy of a `make` Method\n\nThe `make()` method is where the actual computation happens.\nIts input is a single argument: the **key** dict identifying which entity to compute.\nThis key contains the primary key attributes from the key source—the join of all upstream dependencies.\n\nA well-structured `make()` method has three distinct parts:\n\n### 1. Fetch\n\nRetrieve the necessary data from upstream tables using the provided key:\n\n```python\ndef make(self, key):\n # 1. FETCH: Get data from upstream tables\n image = (Image & key).fetch1(\"image_data\")\n params = (BlobParamSet & key).fetch1()\n```\n\nThe key restricts each upstream table to exactly the relevant row(s).\nUse `fetch1()` when expecting a single row, `fetch()` for multiple rows.\n\n### 2. Compute\n\nPerform the actual computation or data transformation:\n\n```python\n # 2. 
COMPUTE: Perform the transformation\n blobs = detect_blobs(\n image,\n min_sigma=params[\"min_sigma\"],\n max_sigma=params[\"max_sigma\"],\n threshold=params[\"threshold\"],\n )\n```\n\nThis is the scientific or business logic—image processing, statistical analysis, simulation, or any transformation that produces derived data.\n\n### 3. Insert\n\nStore the results in the table (and any part tables):\n\n```python\n # 3. INSERT: Store results\n self.insert1({**key, \"blob_count\": len(blobs)})\n self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n```\n\nThe key must be included in the inserted row to maintain referential integrity.\nFor master-part structures, insert both the master row and all part rows within the same `make()` call.\n\n### Complete Example\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n blob_count : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n sigma : float\n \"\"\"\n\n def make(self, key):\n # 1. FETCH\n image = (Image & key).fetch1(\"image_data\")\n params = (BlobParamSet & key).fetch1()\n\n # 2. COMPUTE\n blobs = detect_blobs(\n image,\n min_sigma=params[\"min_sigma\"],\n max_sigma=params[\"max_sigma\"],\n threshold=params[\"threshold\"],\n )\n\n # 3. INSERT\n self.insert1({**key, \"blob_count\": len(blobs)})\n self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n```\n\nThis three-part structure—**fetch, compute, insert**—keeps `make()` methods readable and maintainable.\nEach part has a clear responsibility, making it easy to debug and extend.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. 
After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile `insert`, `delete`, and `update` are manual operations, `populate` automates data entry for **Imported** and **Computed** tables based on dependencies defined in the schema.\n\nThis chapter connects the theoretical foundations of the [Relational Workflow Model](../20-concepts/05-workflows.md) to the practical `populate` operation.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Human-entered data or data from external systems |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. 
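(As an illustrative aside, the set of unprocessed upstream keys mentioned here can be inspected directly. A sketch, assuming the `Detection` table from the case study below.)

```python
# Keys that populate() would still process: the key source minus existing rows
pending = Detection.key_source - Detection.proj()
print(f"{len(pending)} jobs remaining")

# Built-in progress summary for auto-populated tables
Detection.progress()
```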
If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## The `make` Method\n\nThe `make()` method defines the computational logic for each entry.\nIt receives a **key** dictionary identifying which entity to compute and must **fetch** inputs, **compute** results, and **insert** them into the table.\n\nSee the dedicated [make Method](055-make.ipynb) chapter for:\n- The three-part anatomy (fetch, compute, insert)\n- Restrictions on auto-populated tables\n- The three-part pattern for long-running computations\n- Transaction handling strategies\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. 
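(A brief sketch of retrieving a master row together with its parts after population, assuming the tables above; the key values are hypothetical.)

```python
key = {"image_id": 1, "blob_paramset": 0}  # hypothetical key values

count = (Detection & key).fetch1("blob_count")
blobs = (Detection.Blob & key).fetch(format="frame")
assert len(blobs) == count  # parts were inserted atomically with their master
```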
When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [The `make` Method](055-make.ipynb) — Anatomy, constraints, and patterns\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" }, { "cell_type": "markdown", diff --git a/book/40-operations/055-make.ipynb b/book/40-operations/055-make.ipynb new file mode 100644 index 0000000..5e4f255 --- /dev/null +++ b/book/40-operations/055-make.ipynb @@ -0,0 +1,268 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The `make` Method\n", + "\n", + "The `make()` method defines the computational logic for auto-populated tables (`dj.Imported` and `dj.Computed`).\n", + "This chapter describes its anatomy, constraints, and the three-part pattern that enables long-running computations while preserving transactional integrity.\n", + "\n", + "## Input: The Key\n", + "\n", + "The `make()` method receives a single argument: the **key** dictionary.\n", + "This key identifies which entity to compute—it contains the primary key attributes from the table's *key source*.\n", + "\n", + "The key source is automatically determined by DataJoint as the join of all parent tables referenced by foreign keys in the auto-populated table's primary key, minus entries that already exist:\n", + "\n", + "```python\n", + "# For a table with dependencies -> Image and -> BlobParamSet,\n", + "# the key source is effectively:\n", + "Image.proj() * BlobParamSet.proj() - Detection\n", + "```\n", + "\n", + "Each call to `make()` processes exactly one key from this source.\n", + "\n", + "## The Three Parts\n", + "\n", + "A well-structured `make()` method has three distinct parts:\n", + 
"\n", + "### 1. Fetch\n", + "\n", + "Retrieve the necessary data from **upstream tables** using the provided key:\n", + "\n", + "```python\n", + "def make(self, key):\n", + " # 1. FETCH: Get data from upstream tables\n", + " image = (Image & key).fetch1(\"image_data\")\n", + " params = (BlobParamSet & key).fetch1()\n", + "```\n", + "\n", + "The key restricts each upstream table to exactly the relevant row(s).\n", + "Use `fetch1()` when expecting a single row, `fetch()` for multiple rows.\n", + "\n", + "**Upstream tables** are those reachable from the current table by following foreign key references upward through the dependency graph.\n", + "The fetch step should only access:\n", + "- Tables that are upstream dependencies (directly or transitively via foreign keys)\n", + "- Part tables of those upstream tables\n", + "\n", + "This constraint ensures computational reproducibility—the computation depends only on data that logically precedes it in the pipeline.\n", + "\n", + "### 2. Compute\n", + "\n", + "Perform the actual computation or data transformation:\n", + "\n", + "```python\n", + " # 2. COMPUTE: Perform the transformation\n", + " blobs = detect_blobs(\n", + " image,\n", + " min_sigma=params[\"min_sigma\"],\n", + " max_sigma=params[\"max_sigma\"],\n", + " threshold=params[\"threshold\"],\n", + " )\n", + "```\n", + "\n", + "This is the scientific or business logic—image processing, statistical analysis, simulation, or any transformation that produces derived data.\n", + "The compute step should be a pure function of the fetched data.\n", + "\n", + "### 3. Insert\n", + "\n", + "Store the results in the table (and any part tables):\n", + "\n", + "```python\n", + " # 3. INSERT: Store results\n", + " self.insert1({**key, \"blob_count\": len(blobs)})\n", + " self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n", + "```\n", + "\n", + "The key must be included in the inserted row to maintain referential integrity.\n", + "For master-part structures, insert both the master row and all part rows within the same `make()` call.\n", + "\n", + "## Restrictions on Auto-Populated Tables\n", + "\n", + "Auto-populated tables (`dj.Imported` and `dj.Computed`) enforce important constraints:\n", + "\n", + "1. **No manual insertion**: Users cannot insert data into auto-populated tables outside of the `make()` method. All data must come through the `populate()` mechanism.\n", + "\n", + "2. **Upstream-only fetching**: The fetch step should only access tables that are *upstream* in the pipeline—reachable by following foreign key references from the current table toward its dependencies.\n", + "\n", + "3. 
**Complete key inclusion**: Inserted rows must include the full primary key (the input `key` plus any additional primary key attributes defined in the table).\n", + "\n", + "These constraints ensure:\n", + "- **Reproducibility**: Results can be regenerated by re-running `populate()`\n", + "- **Provenance**: Every row traces back to specific upstream data\n", + "- **Consistency**: The dependency graph accurately reflects data flow\n", + "\n", + "## Complete Example\n", + "\n", + "```python\n", + "@schema\n", + "class Detection(dj.Computed):\n", + " definition = \"\"\"\n", + " -> Image\n", + " -> BlobParamSet\n", + " ---\n", + " blob_count : int\n", + " \"\"\"\n", + "\n", + " class Blob(dj.Part):\n", + " definition = \"\"\"\n", + " -> master\n", + " blob_id : int\n", + " ---\n", + " x : float\n", + " y : float\n", + " sigma : float\n", + " \"\"\"\n", + "\n", + " def make(self, key):\n", + " # 1. FETCH\n", + " image = (Image & key).fetch1(\"image_data\")\n", + " params = (BlobParamSet & key).fetch1()\n", + "\n", + " # 2. COMPUTE\n", + " blobs = detect_blobs(\n", + " image,\n", + " min_sigma=params[\"min_sigma\"],\n", + " max_sigma=params[\"max_sigma\"],\n", + " threshold=params[\"threshold\"],\n", + " )\n", + "\n", + " # 3. INSERT\n", + " self.insert1({**key, \"blob_count\": len(blobs)})\n", + " self.Blob.insert([{**key, \"blob_id\": i, **b} for i, b in enumerate(blobs)])\n", + "```\n", + "\n", + "## Transactional Integrity\n", + "\n", + "By default, each `make()` call executes inside an **ACID transaction**:\n", + "\n", + "- **Atomicity** — The entire computation either commits or rolls back as a unit\n", + "- **Isolation** — Partial results are never visible to other processes\n", + "- **Consistency** — The database moves from one valid state to another\n", + "\n", + "The transaction wraps the entire `make()` execution, including all fetches and inserts.\n", + "This guarantees that computed results are correctly associated with their specific inputs.\n", + "\n", + "## The Three-Part Pattern for Long Computations\n", + "\n", + "For long-running computations (hours or days), holding a database transaction open for the entire duration causes problems:\n", + "- Database locks block other operations\n", + "- Transaction timeouts may occur\n", + "- Resources are held unnecessarily\n", + "\n", + "The **three-part `make` pattern** solves this by separating the computation from the transaction:\n", + "\n", + "```python\n", + "@schema\n", + "class SignalAverage(dj.Computed):\n", + " definition = \"\"\"\n", + " -> RawSignal\n", + " ---\n", + " avg_signal : float\n", + " \"\"\"\n", + "\n", + " def make_fetch(self, key):\n", + " \"\"\"Step 1: Fetch input data (outside transaction)\"\"\"\n", + " raw_signal = (RawSignal & key).fetch1(\"signal\")\n", + " return (raw_signal,)\n", + "\n", + " def make_compute(self, key, fetched):\n", + " \"\"\"Step 2: Perform computation (outside transaction)\"\"\"\n", + " (raw_signal,) = fetched\n", + " avg = raw_signal.mean()\n", + " return (avg,)\n", + "\n", + " def make_insert(self, key, fetched, computed):\n", + " \"\"\"Step 3: Insert results (inside brief transaction)\"\"\"\n", + " (avg,) = computed\n", + " self.insert1({**key, \"avg_signal\": avg})\n", + "```\n", + "\n", + "### How It Works\n", + "\n", + "DataJoint executes the three parts with verification:\n", + "\n", + "```\n", + "fetched = make_fetch(key) # Outside transaction\n", + "computed = make_compute(key, fetched) # Outside transaction\n", + "\n", + "\n", + "fetched_again = make_fetch(key) # Re-fetch to 
verify\n", + "if fetched != fetched_again:\n", + " # Inputs changed—abort\n", + "else:\n", + " make_insert(key, fetched, computed)\n", + " \n", + "```\n", + "\n", + "The key insight: **the computation runs outside any transaction**, but referential integrity is preserved by re-fetching and verifying inputs before insertion.\n", + "If upstream data changed during computation, the job is cancelled rather than inserting inconsistent results.\n", + "\n", + "### Benefits\n", + "\n", + "| Aspect | Standard `make()` | Three-Part Pattern |\n", + "|--------|-------------------|--------------------|\n", + "| Transaction duration | Entire computation | Only final insert |\n", + "| Database locks | Held throughout | Minimal |\n", + "| Suitable for | Short computations | Hours/days |\n", + "| Integrity guarantee | Transaction | Re-fetch verification |\n", + "\n", + "### Generator Syntax Alternative\n", + "\n", + "The three-part pattern can also be expressed as a generator, which is more concise:\n", + "\n", + "```python\n", + "def make(self, key):\n", + " # 1. FETCH\n", + " raw_signal = (RawSignal & key).fetch1(\"signal\")\n", + " computed = yield (raw_signal,) # Yield fetched data\n", + "\n", + " if computed is None:\n", + " # 2. COMPUTE\n", + " avg = raw_signal.mean()\n", + " computed = (avg,)\n", + " yield computed # Yield computed results\n", + "\n", + " # 3. INSERT\n", + " (avg,) = computed\n", + " self.insert1({**key, \"avg_signal\": avg})\n", + " yield # Signal completion\n", + "```\n", + "\n", + "DataJoint automatically detects the generator pattern and handles the three-part execution.\n", + "\n", + "## When to Use Each Pattern\n", + "\n", + "| Computation Time | Pattern | Rationale |\n", + "|------------------|---------|----------|\n", + "| Seconds to minutes | Standard `make()` | Simple, transaction overhead acceptable |\n", + "| Minutes to hours | Three-part | Avoid long transactions |\n", + "| Hours to days | Three-part | Essential for stability |\n", + "\n", + "The three-part pattern trades off fetching data twice for dramatically reduced transaction duration.\n", + "Use it when computation time significantly exceeds fetch time.\n", + "\n", + ":::{seealso}\n", + "- [Populate](050-populate.ipynb) — The `populate()` method that calls `make()`\n", + "- [Transactions](040-transactions.ipynb) — ACID semantics in DataJoint\n", + "- [Master-Part](../30-design/053-master-part.ipynb) — Inserting master and part rows together\n", + ":::" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 5abba57f1a616ea9d4a98e09abf57dc009f26017 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 23:12:00 +0000 Subject: [PATCH 16/18] Complete Queries section with all query operators - Complete join chapter with semantic matching, foreign key relationships, left joins, and SQL translations - Complete union chapter with entity type compatibility, OR logic patterns, and best practices - Complete universal sets chapter covering dj.U() for extracting unique values, universal aggregation, and arbitrary groupings - Complete subqueries chapter with common patterns (existence, non-existence, AND/OR conditions, universal quantification) - Rewrite aggregation chapter with consistent style, clear examples using university database, and comprehensive SQL translations All chapters follow consistent format with: - Clear definitions 
without "powerful/fundamental" language - Examples from university database - SQL equivalents - Best practices - Practice exercises with solutions - Cross-references to Examples section --- book/50-queries/040-join.ipynb | 469 ++++++- book/50-queries/050-union.ipynb | 443 +++++- book/50-queries/055-aggregation.ipynb | 1823 ++++++------------------- book/50-queries/060-universal.ipynb | 415 +++++- book/50-queries/080-subqueries.ipynb | 866 ++++++++---- 5 files changed, 2182 insertions(+), 1834 deletions(-) diff --git a/book/50-queries/040-join.ipynb b/book/50-queries/040-join.ipynb index 82f303a..3bdb1bb 100644 --- a/book/50-queries/040-join.ipynb +++ b/book/50-queries/040-join.ipynb @@ -6,127 +6,448 @@ "source": [ "# Operator: Join\n", "\n", - "(This is an AI-generated template containing several mistakes. Work in progress)\n", + "The **join operator** combines data from two tables based on their shared attributes. It produces a new table containing all attributes from both input tables, with rows matched according to DataJoint's semantic matching rules.\n", "\n", - "The join operator in DataJoint is a powerful tool for combining data from multiple tables. It allows users to link related data based on shared attributes, enabling seamless integration of information across a pipeline.\n", + "## Understanding Join\n", "\n", - "## Overview of the Join Operator\n", + "Join **combines attributes** from two tables, matching rows where shared attributes have equal values. The result contains all columns from both tables (with shared columns appearing once) and only the rows where the matching attributes align.\n", "\n", - "The join operator, represented by the `*` symbol, merges two or more tables or queries into a single result set. It operates by matching rows based on their shared attributes (foreign key relationships) or combining all rows in the absence of a direct relationship.\n", + "### Key Concepts\n", "\n", - "### Syntax\n", + "- **Semantic matching**: Rows are matched on attributes that share both the same name and the same lineage through foreign keys\n", + "- **Algebraic closure**: The result is a valid relation with a well-defined primary key\n", + "- **Attribute combination**: The result contains all attributes from both tables\n", + "\n", + "### Basic Syntax\n", + "\n", + "```python\n", + "# Join two tables\n", + "result = TableA * TableB\n", + "```\n", + "\n", + "The `*` operator performs a natural join on semantically matched attributes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Semantic Matching\n", + "\n", + "DataJoint's join differs from SQL's `NATURAL JOIN` in an important way. Two attributes are matched only when they satisfy **both** conditions:\n", + "\n", + "1. They have the **same name** in both tables\n", + "2. 
They trace to the **same original definition** through an uninterrupted chain of foreign keys\n", + "\n", + "This prevents accidental joins on attributes that happen to share the same name but have different meanings.\n", + "\n", + "### Example: Semantic Matching in Action\n", + "\n", + "Consider tables `Student(student_id, name)` and `Course(course_id, name)`:\n", + "- Both have an attribute called `name`\n", + "- But `Student.name` refers to a person's name while `Course.name` refers to a course title\n", + "- These attributes do **not** share lineage through foreign keys\n", + "- DataJoint will raise an error if you attempt `Student * Course` because `name` collides\n", + "\n", + "**Resolution**: Use projection to rename the colliding attribute:\n", + "\n", + "```python\n", + "# Rename 'name' in Course before joining\n", + "Student * Course.proj(course_name='name')\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Types of Joins\n", + "\n", + "### 1. Join with Foreign Key Relationship\n", + "\n", + "The most common join connects tables linked by foreign keys. When table B has a foreign key referencing table A, joining them combines their attributes for each matching pair.\n", + "\n", + "```python\n", + "# Join students with their enrollments\n", + "# Enroll has a foreign key -> Student\n", + "Student * Enroll\n", + "```\n", + "\n", + "**Result structure**:\n", + "- Primary key: The union of primary keys from both tables (with shared attributes appearing once)\n", + "- Attributes: All attributes from both tables\n", + "\n", + "### 2. Join without Direct Relationship (Cartesian Product)\n", + "\n", + "When two tables share no common attributes, the join produces a **Cartesian product**—every row from the first table paired with every row from the second.\n", + "\n", + "```python\n", + "# All combinations of students and departments\n", + "Student.proj() * Department.proj()\n", + "```\n", + "\n", + "Use Cartesian products deliberately and with caution, as they can produce very large result sets.\n", + "\n", + "### 3. Chained Joins\n", + "\n", + "Multiple tables can be joined in sequence:\n", + "\n", + "```python\n", + "# Join students with enrollments and course information\n", + "Student * Enroll * Course\n", + "```\n", + "\n", + "Join is associative: `(A * B) * C` produces the same result as `A * (B * C)`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Primary Key of Join Results\n", + "\n", + "The primary key of a join result depends on the relationship between the tables:\n", + "\n", + "### Case 1: One-to-Many Relationship\n", + "\n", + "When joining a parent table with a child table (child has foreign key to parent):\n", + "- The result's primary key is the child table's primary key\n", + "- Each child row appears with its matching parent's attributes\n", "\n", "```python\n", - " * \n", + "# Student (parent) * Enroll (child with FK to Student)\n", + "# Result primary key: (student_id, course_id, section_id) from Enroll\n", + "Student * Enroll\n", "```\n", "\n", - "### Components\n", - "1. 
**`Table1` and `Table2`**:\n", - " - The tables or queries to be joined.\n", - " - These must share attributes if the join is to be based on a relationship.\n", + "### Case 2: Independent Tables\n", + "\n", + "When joining tables with no shared attributes:\n", + "- The result's primary key is the union of both primary keys\n", + "- Every combination is included\n", + "\n", + "```python\n", + "# Result primary key: (student_id, dept)\n", + "Student.proj() * Department.proj()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examples from the University Database\n", + "\n", + "The following examples use the university database schema with `Student`, `Department`, `Course`, `Section`, `Enroll`, and `Grade` tables.\n", "\n", - "## Types of Joins in DataJoint\n", + "### Example 1: Students with Their Majors\n", "\n", - "### 1. Natural Join (Default Behavior)\n", + "```python\n", + "# Join Student with StudentMajor to see each student's declared major\n", + "Student.proj('first_name', 'last_name') * StudentMajor\n", + "```\n", "\n", - "By default, the join operator in DataJoint performs a **natural join**, combining rows from both tables where their shared attributes match.\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT s.student_id, s.first_name, s.last_name, m.dept, m.declare_date\n", + "FROM student s\n", + "JOIN student_major m ON s.student_id = m.student_id;\n", + "```\n", "\n", - "#### Example\n", + "### Example 2: Enrollment Details with Course Names\n", "\n", "```python\n", - "import datajoint as dj\n", + "# Combine enrollment records with course information\n", + "Enroll * Course\n", + "```\n", "\n", - "schema = dj.Schema('example_schema')\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT e.*, c.course_name, c.credits\n", + "FROM enroll e\n", + "JOIN course c ON e.dept = c.dept AND e.course = c.course;\n", + "```\n", "\n", - "@schema\n", - "class Animal(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " \"\"\"\n", + "### Example 3: Complete Grade Report\n", "\n", - "@schema\n", - "class Experiment(dj.Manual):\n", - " definition = \"\"\"\n", - " experiment_id: int # Unique experiment identifier\n", - " ---\n", - " animal_id: int # ID of the animal used in the experiment\n", - " description: varchar(255)\n", - " \"\"\"\n", + "```python\n", + "# Join multiple tables to create a comprehensive grade report\n", + "Student.proj('first_name', 'last_name') * Grade * Course * LetterGrade\n", + "```\n", "\n", - "# Insert example data\n", - "Animal.insert([\n", - " {'animal_id': 1, 'species': 'Dog'},\n", - " {'animal_id': 2, 'species': 'Cat'}\n", - "])\n", + "This produces a table with student names, course details, grades, and grade point values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Join vs. 
Restriction by Subquery\n", + "\n", + "Join and restriction serve different purposes:\n", + "\n", + "| Operation | Purpose | Result Attributes |\n", + "|-----------|---------|-------------------|\n", + "| `A * B` (Join) | Combine data from both tables | All attributes from A and B |\n", + "| `A & B` (Restriction) | Filter A based on matching keys in B | Only attributes from A |\n", + "\n", + "### Example Comparison\n", + "\n", + "```python\n", + "# Join: Get student info WITH their enrollment details\n", + "Student * Enroll # Result has student AND enrollment attributes\n", + "\n", + "# Restriction: Get students WHO have enrollments\n", + "Student & Enroll # Result has only student attributes\n", + "```\n", + "\n", + "Use **join** when you need data from both tables. \n", + "Use **restriction** when you only need to filter one table based on another." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Left Join\n", + "\n", + "DataJoint's standard join (`*`) is an inner join—only rows with matches in both tables appear in the result. For cases where you need to include all rows from the left table regardless of matches, use the `.join()` method with `left=True`.\n", + "\n", + "```python\n", + "# Include all students, even those without declared majors\n", + "Student.proj('first_name', 'last_name').join(StudentMajor, left=True)\n", + "```\n", "\n", - "Experiment.insert([\n", - " {'experiment_id': 101, 'animal_id': 1, 'description': 'Behavioral test'},\n", - " {'experiment_id': 102, 'animal_id': 2, 'description': 'Cognitive test'}\n", - "])\n", + "**Result**: All students appear in the result. For students without majors, the major-related attributes contain `None`.\n", "\n", - "# Perform a natural join\n", - "joined_data = Animal * Experiment\n", - "print(joined_data.fetch())\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT s.student_id, s.first_name, s.last_name, m.dept, m.declare_date\n", + "FROM student s\n", + "LEFT JOIN student_major m ON s.student_id = m.student_id;\n", "```\n", "\n", - "### 2. Cartesian Product\n", + "**Note**: Left joins can produce results that don't represent a single well-defined entity type. Use them when necessary, but prefer inner joins when the semantics fit your query." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Combining Join with Other Operators\n", + "\n", + "Join works seamlessly with restriction and projection in query expressions.\n", + "\n", + "### Restrict Before Joining\n", + "\n", + "For efficiency, apply restrictions before joins to reduce the data being combined:\n", + "\n", + "```python\n", + "# Find enrollments for math courses only\n", + "(Course & {'dept': 'MATH'}) * Enroll\n", + "\n", + "# Find grades for current term only\n", + "Student * (Grade & CurrentTerm)\n", + "```\n", "\n", - "If the tables being joined do not share attributes, the join operator produces a **cartesian product**, combining every row from the first table with every row from the second.\n", + "### Project After Joining\n", "\n", - "#### Example\n", + "Use projection to select only the attributes you need from the combined result:\n", "\n", "```python\n", - "# Cartesian product of unrelated tables\n", - "unrelated_join = Animal * Experiment\n", - "print(unrelated_join.fetch())\n", + "# Get student names with their enrolled course names\n", + "(Student * Enroll * Course).proj('first_name', 'last_name', 'course_name')\n", "```\n", "\n", - "### 3. 
Combining with Restrictions\n", + "### Complex Query Example\n", "\n", - "The join operator can be combined with restrictions to filter the result set further.\n", + "```python\n", + "# Find students enrolled in MATH courses during the current term,\n", + "# showing their names and course details\n", + "(\n", + " Student.proj('first_name', 'last_name') \n", + " * (Enroll & CurrentTerm & {'dept': 'MATH'}) \n", + " * Course.proj('course_name', 'credits')\n", + ")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SQL Translation\n", "\n", - "#### Example\n", + "DataJoint's join translates to SQL `JOIN` operations:\n", "\n", + "### Basic Join\n", "```python\n", - "# Join and restrict\n", - "filtered_join = (Animal * Experiment) & {'species': 'Dog'}\n", - "print(filtered_join.fetch())\n", + "# DataJoint\n", + "Student * Enroll\n", "```\n", "\n", - "## Use Cases for Joins\n", + "```sql\n", + "-- SQL\n", + "SELECT s.*, e.dept, e.course, e.section_id\n", + "FROM student s\n", + "JOIN enroll e ON s.student_id = e.student_id;\n", + "```\n", "\n", - "1. **Linking Related Data**:\n", - " - Combine data from tables with foreign key relationships, such as linking experimental results to the animals used.\n", - "2. **Cross-Referencing**:\n", - " - Perform cross-references between independent datasets.\n", - "3. **Data Exploration**:\n", - " - Merge tables to explore combined attributes for analysis.\n", + "### Multi-Table Join\n", + "```python\n", + "# DataJoint\n", + "Student * Enroll * Course\n", + "```\n", "\n", + "```sql\n", + "-- SQL\n", + "SELECT s.*, e.section_id, c.course_name, c.credits\n", + "FROM student s\n", + "JOIN enroll e ON s.student_id = e.student_id\n", + "JOIN course c ON e.dept = c.dept AND e.course = c.course;\n", + "```\n", + "\n", + "### Left Join\n", + "```python\n", + "# DataJoint\n", + "Student.join(StudentMajor, left=True)\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT s.*, m.dept, m.declare_date\n", + "FROM student s\n", + "LEFT JOIN student_major m ON s.student_id = m.student_id;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Best Practices\n", "\n", - "1. **Ensure Attribute Compatibility**:\n", - " - Verify that the tables being joined share appropriate attributes for natural joins.\n", - "2. **Restrict Before Joining**:\n", - " - Apply restrictions before performing joins to minimize the size of intermediate results.\n", - "3. **Use Cartesian Products Judiciously**:\n", - " - Avoid cartesian products unless explicitly required, as they can produce very large result sets.\n", - "4. **Test Queries**:\n", - " - Test join queries incrementally to ensure correctness and efficiency.\n", + "### 1. Understand Foreign Key Relationships\n", + "\n", + "Before joining tables, understand how they're connected:\n", + "- Check the schema diagram (`dj.Diagram(schema)`)\n", + "- Identify which attributes will be matched\n", + "- Predict the primary key of the result\n", + "\n", + "### 2. Restrict Before Joining\n", + "\n", + "Apply restrictions early to minimize intermediate result sizes:\n", + "\n", + "```python\n", + "# Better: restrict first, then join\n", + "(Student & {'home_state': 'CA'}) * Enroll\n", + "\n", + "# Less efficient: join first, then restrict\n", + "(Student * Enroll) & {'home_state': 'CA'}\n", + "```\n", + "\n", + "### 3. 
Resolve Name Collisions\n", + "\n", + "If two tables have attributes with the same name but different meanings, rename them before joining:\n", + "\n", + "```python\n", + "# If TableA and TableB both have 'name' with different meanings\n", + "TableA * TableB.proj(b_name='name')\n", + "```\n", + "\n", + "### 4. Use Projection to Keep Results Clean\n", + "\n", + "Project after joining to select only the attributes you need:\n", + "\n", + "```python\n", + "(Student * Enroll).proj('first_name', 'last_name', 'dept', 'course')\n", + "```\n", "\n", + "### 5. Be Cautious with Cartesian Products\n", + "\n", + "Joining tables with no shared attributes creates a Cartesian product. This is occasionally useful but can produce very large results:\n", + "\n", + "```python\n", + "# This creates 2000 students × 4 departments = 8000 rows\n", + "Student.proj() * Department.proj()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Summary\n", "\n", - "The join operator in DataJoint is a versatile tool for merging data across tables. Its ability to perform natural joins, cartesian products, and restricted joins makes it indispensable for building complex queries in relational pipelines. By mastering the join operator, users can unlock the full potential of their DataJoint schemas.\n", - "\n" + "The join operator combines data from multiple tables:\n", + "\n", + "1. **Syntax**: `TableA * TableB` performs a natural join on semantically matched attributes\n", + "2. **Semantic matching**: Attributes must share both name and lineage through foreign keys\n", + "3. **Result**: Contains all attributes from both tables with matching rows combined\n", + "4. **Primary key**: Determined by the relationship between the joined tables\n", + "5. **Left join**: Use `.join(other, left=True)` to include all rows from the left table\n", + "6. **Composition**: Join works with restriction and projection to build complex queries\n", + "\n", + "Join is essential for combining related data across tables. Use it when you need attributes from multiple tables in your result." 
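(A small habit worth adding here, sketched under the assumption that the university tables above are available: inspect a join expression before fetching to confirm which attributes were matched and what primary key the result carries.)

```python
# Build the query expression; nothing is fetched yet (queries are lazy)
q = Student.proj("first_name", "last_name") * Enroll * Course.proj("course_name")

print(q.primary_key)  # primary key of the joined result
print(q.heading)      # all attributes in the result
print(len(q))         # number of matching rows

df = q.fetch(format="frame")  # materialize as a pandas DataFrame
```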
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practice Exercises\n", + "\n", + "Using the university database, try these exercises:\n", + "\n", + "### Exercise 1: Basic Join\n", + "\n", + "**Task**: Get all students with their enrolled courses, showing student name and course name.\n", + "\n", + "```python\n", + "(Student.proj('first_name', 'last_name') * Enroll * Course.proj('course_name'))\n", + "```\n", + "\n", + "### Exercise 2: Join with Restriction\n", + "\n", + "**Task**: Find all CS majors enrolled in current term courses.\n", + "\n", + "```python\n", + "(StudentMajor & {'dept': 'CS'}) * Student.proj('first_name', 'last_name') * (Enroll & CurrentTerm)\n", + "```\n", + "\n", + "### Exercise 3: Multi-Table Join\n", + "\n", + "**Task**: Create a complete transcript showing student name, course name, credits, and grade.\n", + "\n", + "```python\n", + "Student.proj('first_name', 'last_name') * Grade * Course.proj('course_name', 'credits')\n", + "```\n", + "\n", + "### Exercise 4: Left Join\n", + "\n", + "**Task**: List all students with their majors, including students who haven't declared a major.\n", + "\n", + "```python\n", + "Student.proj('first_name', 'last_name').join(StudentMajor, left=True)\n", + "```\n", + "\n", + ":::{seealso}\n", + "For more join examples with the university database, see the [University Queries](../80-examples/016-university-queries.ipynb) example.\n", + ":::" ] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "name": "python", + "version": "3.11.0" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/book/50-queries/050-union.ipynb b/book/50-queries/050-union.ipynb index 7171f35..74ce22e 100644 --- a/book/50-queries/050-union.ipynb +++ b/book/50-queries/050-union.ipynb @@ -6,121 +6,418 @@ "source": [ "# Operator: Union\n", "\n", - "(This is an AI-generated template, work in progress.)\n", + "The **union operator** combines rows from two tables that represent the same entity type. It produces a table containing all unique rows from both input tables.\n", "\n", - "The union operator in DataJoint allows users to combine the results of multiple tables or queries into a single unified result set. This operator is particularly useful when dealing with data spread across similar tables or queries with compatible schemas.\n", + "## Understanding Union\n", "\n", - "## Overview of the Union Operator\n", + "Union **combines rows** from two tables into a single result. Unlike join (which combines columns), union stacks the rows of compatible tables.\n", "\n", - "The union operator, represented by the `+` symbol, merges the rows of two or more tables or queries. The resulting dataset includes all rows from the input sources, with duplicates automatically removed.\n", + "### Key Concepts\n", "\n", - "### Syntax\n", + "- **Same entity type**: Both tables must represent the same kind of entity with the same primary key\n", + "- **Semantic compatibility**: All shared attributes must be semantically matched\n", + "- **Deduplication**: Duplicate rows (based on primary key) are included only once\n", + "- **Algebraic closure**: The result has the same primary key as the input tables\n", + "\n", + "### Basic Syntax\n", + "\n", + "```python\n", + "# Combine rows from two tables\n", + "result = TableA + TableB\n", + "```\n", + "\n", + "The `+` operator performs a union on tables with compatible schemas." 
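(A brief illustration, assuming the university schema used throughout this section: a union collapses duplicate primary keys, and for a simple OR condition on one table it returns the same rows as a list restriction.)

```python
ca = Student & {"home_state": "CA"}
ny = Student & {"home_state": "NY"}

combined = ca + ny  # same entity type, same primary key

# Duplicate primary keys are counted only once in a union
assert len(combined) <= len(ca) + len(ny)

# For a simple OR on a single table, a list restriction yields the same rows
assert len(Student & [{"home_state": "CA"}, {"home_state": "NY"}]) == len(combined)
```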
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Requirements for Union\n", + "\n", + "For a union to be valid, the two operands must satisfy these conditions:\n", + "\n", + "### 1. Same Primary Key\n", + "\n", + "Both tables must have the same primary key attributes—identical names, types, and semantic meaning. They must represent the same entity type.\n", + "\n", + "```python\n", + "# Valid: Both represent students with student_id as primary key\n", + "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", + "physics_majors = Student & (StudentMajor & {'dept': 'PHYS'})\n", + "stem_majors = math_majors + physics_majors\n", + "```\n", + "\n", + "### 2. Semantic Compatibility\n", + "\n", + "All attributes shared between the two tables must be semantically compatible—they must trace to the same original definition through foreign keys.\n", + "\n", + "```python\n", + "# Invalid: Cannot union Student and Course—different entity types\n", + "# Student + Course # This would raise an error\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How Union Handles Attributes\n", + "\n", + "The result of a union includes:\n", + "\n", + "### Primary Key Attributes\n", + "\n", + "The result's primary key is identical to that of both operands. All primary key entries from either table are included.\n", + "\n", + "### Secondary Attributes\n", + "\n", + "| Scenario | Result |\n", + "|----------|--------|\n", + "| Attribute in both tables | Included; value from left operand takes precedence for overlapping keys |\n", + "| Attribute only in left table | Included; `NULL` for rows from right table |\n", + "| Attribute only in right table | Included; `NULL` for rows from left table |\n", + "\n", + "### Handling Overlapping Keys\n", + "\n", + "When the same primary key exists in both tables:\n", + "- The row appears once in the result\n", + "- Secondary attribute values come from the **left operand** (the first table in `A + B`)\n", + "\n", + "```python\n", + "# If student 1000 exists in both math_majors and physics_majors,\n", + "# the secondary attributes will come from math_majors\n", + "result = math_majors + physics_majors\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Common Use Cases\n", + "\n", + "### 1. Combining Query Results with OR Logic\n", + "\n", + "Union is useful when you need entities that satisfy one condition OR another, especially when those conditions involve different related tables.\n", + "\n", + "```python\n", + "# Students who speak English OR Spanish\n", + "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", + "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", + "bilingual_candidates = english_speakers + spanish_speakers\n", + "```\n", + "\n", + "### 2. 
Merging Subsets of the Same Table\n", + "\n", + "When you've created different filtered views of the same table, union combines them:\n", + "\n", + "```python\n", + "# Students from California or New York\n", + "ca_students = Student & {'home_state': 'CA'}\n", + "ny_students = Student & {'home_state': 'NY'}\n", + "coastal_students = ca_students + ny_students\n", + "```\n", + "\n", + "**Note**: For simple OR conditions on the same table, restriction with a list is often cleaner:\n", + "\n", + "```python\n", + "# Equivalent and more concise\n", + "coastal_students = Student & [{'home_state': 'CA'}, {'home_state': 'NY'}]\n", + "# Or using SQL syntax\n", + "coastal_students = Student & 'home_state IN (\"CA\", \"NY\")'\n", + "```\n", + "\n", + "### 3. Combining Results from Different Foreign Key Paths\n", + "\n", + "Union shines when the OR conditions involve different relationship paths:\n", + "\n", + "```python\n", + "# Students who either major in CS or are enrolled in a CS course\n", + "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", + "cs_enrolled = Student & (Enroll & {'dept': 'CS'})\n", + "cs_students = cs_majors + cs_enrolled\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examples from the University Database\n", + "\n", + "### Example 1: STEM Majors\n", + "\n", + "Find all students majoring in any STEM field:\n", "\n", "```python\n", - " + \n", + "# Students in STEM departments\n", + "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", + "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", + "physics_majors = Student & (StudentMajor & {'dept': 'PHYS'})\n", + "bio_majors = Student & (StudentMajor & {'dept': 'BIOL'})\n", + "\n", + "stem_students = math_majors + cs_majors + physics_majors + bio_majors\n", "```\n", "\n", - "### Components\n", - "1. 
**`Table1` and `Table2`**:\n", - " - The tables or queries to be combined.\n", - " - These must have compatible schemas (i.e., the same set of attributes).\n", + "### Example 2: Students with Academic Activity\n", "\n", - "## Combining Tables with Union\n", + "Find students who are either currently enrolled or have received grades:\n", "\n", - "The union operator consolidates rows from multiple sources while maintaining data integrity by removing duplicates.\n", + "```python\n", + "# Students with enrollments in current term\n", + "currently_enrolled = Student & (Enroll & CurrentTerm)\n", "\n", - "### Example\n", + "# Students with any grades on record\n", + "students_with_grades = Student & Grade\n", + "\n", + "# All academically active students\n", + "active_students = currently_enrolled + students_with_grades\n", + "```\n", + "\n", + "### Example 3: Honor Students\n", + "\n", + "Find students who either have high GPAs or are in the honors program:\n", "\n", "```python\n", - "import datajoint as dj\n", + "# High GPA students (3.5+)\n", + "high_gpa = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ") & 'gpa >= 3.5'\n", "\n", - "schema = dj.Schema('example_schema')\n", + "# Students in honors program (assuming an HonorsStudent table)\n", + "honors_enrolled = Student & HonorsStudent\n", "\n", - "@schema\n", - "class AnimalA(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal in Table A\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " age: int # Age of the animal in years\n", - " \"\"\"\n", + "# All honor students\n", + "all_honors = (Student & high_gpa) + honors_enrolled\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Union with Projection\n", "\n", - "@schema\n", - "class AnimalB(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal in Table B\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " age: int # Age of the animal in years\n", - " \"\"\"\n", + "When unioning query expressions, often you'll work with projections to ensure the tables have compatible structures:\n", "\n", - "# Insert example data\n", - "AnimalA.insert([\n", - " {'animal_id': 1, 'species': 'Dog', 'age': 5},\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3}\n", - "])\n", + "### Projecting to Primary Key Only\n", "\n", - "AnimalB.insert([\n", - " {'animal_id': 3, 'species': 'Rabbit', 'age': 2},\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3}\n", - "])\n", + "The simplest union uses only primary keys:\n", "\n", - "# Perform a union operation\n", - "combined_animals = AnimalA + AnimalB\n", - "print(combined_animals.fetch())\n", + "```python\n", + "# Get unique student IDs from multiple sources\n", + "math_students = (Student & (StudentMajor & {'dept': 'MATH'})).proj()\n", + "enrolled_students = (Student & Enroll).proj()\n", + "all_relevant = math_students + enrolled_students\n", "```\n", "\n", - "### Output\n", - "The result will include all unique rows from `AnimalA` and `AnimalB`:\n", + "### Ensuring Attribute Compatibility\n", "\n", - "```plaintext\n", - "[{'animal_id': 1, 'species': 'Dog', 'age': 5},\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3},\n", - " {'animal_id': 3, 'species': 'Rabbit', 'age': 2}]\n", + "If the queries have different secondary attributes, project to a common set:\n", + "\n", + "```python\n", + "# Both restricted to same attributes for clean 
union\n", + "ca_names = (Student & {'home_state': 'CA'}).proj('first_name', 'last_name')\n", + "ny_names = (Student & {'home_state': 'NY'}).proj('first_name', 'last_name')\n", + "coastal_names = ca_names + ny_names\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SQL Translation\n", + "\n", + "DataJoint's union translates to SQL `UNION` operations:\n", + "\n", + "### Basic Union\n", + "\n", + "```python\n", + "# DataJoint\n", + "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", + "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", + "stem_students = math_majors.proj() + cs_majors.proj()\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT student_id FROM student\n", + "WHERE student_id IN (SELECT student_id FROM student_major WHERE dept = 'MATH')\n", + "UNION\n", + "SELECT student_id FROM student\n", + "WHERE student_id IN (SELECT student_id FROM student_major WHERE dept = 'CS');\n", "```\n", "\n", - "## Use Cases for the Union Operator\n", + "### Union with Attributes\n", "\n", - "1. **Merging Similar Tables**:\n", - " - Combine data from tables with identical schemas that represent similar entities.\n", - "2. **Integrating Subsets**:\n", - " - Merge query results that filter different subsets of data from the same table.\n", - "3. **Building Comprehensive Results**:\n", - " - Consolidate data from different sources into a single dataset for analysis.\n", + "```python\n", + "# DataJoint\n", + "ca_students = (Student & {'home_state': 'CA'}).proj('first_name', 'last_name')\n", + "ny_students = (Student & {'home_state': 'NY'}).proj('first_name', 'last_name')\n", + "result = ca_students + ny_students\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT student_id, first_name, last_name FROM student WHERE home_state = 'CA'\n", + "UNION\n", + "SELECT student_id, first_name, last_name FROM student WHERE home_state = 'NY';\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Union vs. Other Approaches\n", "\n", - "### Example: Union with Restrictions\n", + "### Union vs. OR in Restriction\n", "\n", - "The union operator can also be used with restricted queries:\n", + "For simple conditions on the same table, use OR (list restriction):\n", "\n", "```python\n", - "# Restrict and combine subsets from both tables\n", - "restricted_union = (AnimalA & 'age > 4') + (AnimalB & {'species': 'Rabbit'})\n", - "print(restricted_union.fetch())\n", + "# Using OR (preferred for simple cases)\n", + "coastal = Student & [{'home_state': 'CA'}, {'home_state': 'NY'}]\n", + "\n", + "# Using union (equivalent but more verbose)\n", + "coastal = (Student & {'home_state': 'CA'}) + (Student & {'home_state': 'NY'})\n", "```\n", "\n", + "### When Union is Necessary\n", + "\n", + "Use union when:\n", + "1. Conditions involve different related tables\n", + "2. Queries have different computation paths\n", + "3. You're combining results from separate query expressions\n", + "\n", + "```python\n", + "# This requires union—can't express with simple OR\n", + "honors_or_dean_list = (Student & HonorsProgram) + (Student & DeansList)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Best Practices\n", "\n", - "1. **Ensure Schema Compatibility**:\n", - " - Verify that the tables or queries being combined have the same attributes.\n", - "2. **Use Restrictions**:\n", - " - Restrict the tables before applying the union operator to avoid unnecessary data processing.\n", - "3. 
**Understand Deduplication**:\n", - " - Be aware that duplicates are automatically removed in the resulting dataset.\n", - "4. **Test Results Incrementally**:\n", - " - Test individual queries before combining them to ensure accuracy.\n", + "### 1. Verify Entity Type Compatibility\n", + "\n", + "Before unioning, confirm both operands represent the same entity:\n", + "\n", + "```python\n", + "# Check primary keys match\n", + "print(query_a.primary_key)\n", + "print(query_b.primary_key)\n", + "```\n", + "\n", + "### 2. Use Projection for Cleaner Results\n", + "\n", + "Project to common attributes when operands have different secondary attributes:\n", + "\n", + "```python\n", + "# Project both to same structure\n", + "result = query_a.proj('name') + query_b.proj('name')\n", + "```\n", + "\n", + "### 3. Consider Alternatives for Simple Cases\n", + "\n", + "For simple OR conditions, restriction with a list is cleaner:\n", + "\n", + "```python\n", + "# Instead of union for simple cases\n", + "Student & 'home_state IN (\"CA\", \"NY\", \"TX\")'\n", + "```\n", + "\n", + "### 4. Be Aware of Left Precedence\n", + "\n", + "Remember that for overlapping primary keys, secondary attributes come from the left operand:\n", "\n", + "```python\n", + "# order_a's attributes take precedence for shared keys\n", + "result = order_a + order_b\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Summary\n", "\n", - "The union operator in DataJoint is a simple yet powerful tool for combining data across tables or queries. By unifying datasets with compatible schemas, it facilitates comprehensive data retrieval while ensuring integrity through automatic deduplication. Mastery of the union operator enables users to streamline data integration workflows in complex pipelines.\n", - "\n" + "The union operator combines rows from compatible tables:\n", + "\n", + "1. **Syntax**: `TableA + TableB` combines rows from both tables\n", + "2. **Requirements**: Same primary key and entity type; semantically compatible attributes\n", + "3. **Deduplication**: Each primary key appears once; left operand takes precedence\n", + "4. **Use cases**: OR logic across different relationships, combining filtered subsets\n", + "5. **Alternatives**: For simple OR on the same table, use list restriction instead\n", + "\n", + "Union is useful for combining query results that represent the same entity type from different filtering paths." 
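+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A small sanity check on deduplication, sketched with the same tables: because the union keys on the primary key, its size never exceeds the combined size of its operands, and equals it only when the operands are disjoint:\n",
+    "\n",
+    "```python\n",
+    "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n",
+    "cs_enrolled = Student & (Enroll & {'dept': 'CS'})\n",
+    "cs_students = cs_majors + cs_enrolled\n",
+    "\n",
+    "# Students appearing in both operands are counted only once\n",
+    "assert len(cs_students) <= len(cs_majors) + len(cs_enrolled)\n",
+    "```"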
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practice Exercises\n", + "\n", + "### Exercise 1: Simple Union\n", + "\n", + "**Task**: Find all students majoring in either Math or CS.\n", + "\n", + "```python\n", + "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", + "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", + "math_or_cs = math_majors + cs_majors\n", + "```\n", + "\n", + "### Exercise 2: Union Across Relationships\n", + "\n", + "**Task**: Find students who either have a declared major or are enrolled in at least one course.\n", + "\n", + "```python\n", + "students_with_major = Student & StudentMajor\n", + "students_enrolled = Student & Enroll\n", + "active_students = students_with_major + students_enrolled\n", + "```\n", + "\n", + "### Exercise 3: Union with Projection\n", + "\n", + "**Task**: Get names of students from western states (CA, OR, WA).\n", + "\n", + "```python\n", + "western_students = (\n", + " (Student & {'home_state': 'CA'}).proj('first_name', 'last_name') +\n", + " (Student & {'home_state': 'OR'}).proj('first_name', 'last_name') +\n", + " (Student & {'home_state': 'WA'}).proj('first_name', 'last_name')\n", + ")\n", + "\n", + "# Or more simply:\n", + "western_students = (Student & 'home_state IN (\"CA\", \"OR\", \"WA\")').proj('first_name', 'last_name')\n", + "```\n", + "\n", + ":::{seealso}\n", + "For more query examples, see the [University Queries](../80-examples/016-university-queries.ipynb) example.\n", + ":::" ] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "name": "python", + "version": "3.11.0" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/book/50-queries/055-aggregation.ipynb b/book/50-queries/055-aggregation.ipynb index ead2cb0..860fba0 100644 --- a/book/50-queries/055-aggregation.ipynb +++ b/book/50-queries/055-aggregation.ipynb @@ -5,942 +5,421 @@ "metadata": {}, "source": [ "# Operator: Aggregation\n", - "\n" + "\n", + "The **aggregation operator** computes summary statistics from related entities. It augments each entity in one table with values computed from matching entities in another table.\n", + "\n", + "## Understanding Aggregation\n", + "\n", + "Aggregation answers questions like:\n", + "- How many experiments has each animal participated in?\n", + "- What is each student's GPA?\n", + "- How many direct reports does each manager have?\n", + "\n", + "The result preserves the primary key of the grouping table while adding computed attributes.\n", + "\n", + "### Key Concepts\n", + "\n", + "- **Grouping entity**: The table whose entities you're augmenting (e.g., Student)\n", + "- **Aggregated entity**: The table whose data is being summarized (e.g., Grade)\n", + "- **Algebraic closure**: The result has the same primary key as the grouping entity\n", + "- **Left join semantics**: Entities without matches still appear (with NULL or default values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "(This is an AI-generated template and work in progress.)\n", - "\n", - "The aggregation operator in DataJoint enables users to compute summary statistics or aggregate data from tables. It is a powerful tool for extracting meaningful insights by grouping and summarizing data directly within the database.\n", - "\n", - "## Overview of the Aggregation Operator\n", - "\n", - "The aggregation operator in DataJoint is implemented using the `.aggr()` method. 
This operator allows users to group data by specific attributes and compute aggregate functions such as sums, averages, counts, and more on the grouped data.\n", - "\n", - "### Syntax\n", + "## Basic Syntax\n", "\n", "```python\n", - ".aggr(, *aggregates, **renamed_aggregates)\n", + "# Aggregate related entities\n", + "result = GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(expression)', ...)\n", "```\n", "\n", "### Components\n", - "1. **``**:\n", - " - The table or query whose data is being aggregated.\n", - "2. **`*aggregates`**:\n", - " - A list of aggregate functions to compute.\n", - "3. **`**renamed_aggregates`**:\n", - " - Key-value pairs for creating new aggregated attributes, where the key is the new attribute name and the value is the aggregate function.\n", "\n", - "## Using Aggregation\n", + "| Component | Description |\n", + "|-----------|-------------|\n", + "| `GroupingTable` | The table whose entities define the groups |\n", + "| `AggregatedTable` | The table (or query) whose data is being summarized |\n", + "| `new_attr='...'` | Named aggregate expressions using SQL aggregate functions |\n", "\n", - "### Example: Counting Rows\n", + "### Aggregate Functions\n", "\n", - "The simplest aggregation operation is counting the number of rows for each group.\n", + "Common SQL aggregate functions available in expressions:\n", "\n", - "```python\n", - "import datajoint as dj\n", - "\n", - "schema = dj.Schema('example_schema')\n", - "\n", - "@schema\n", - "class Animal(dj.Manual):\n", - " definition = \"\"\"\n", - " animal_id: int # Unique identifier for the animal\n", - " ---\n", - " species: varchar(64) # Species of the animal\n", - " age: int # Age of the animal in years\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Experiment(dj.Manual):\n", - " definition = \"\"\"\n", - " experiment_id: int # Unique experiment identifier\n", - " ---\n", - " animal_id: int # ID of the animal used in the experiment\n", - " result: float # Result of the experiment\n", - " \"\"\"\n", - "\n", - "# Insert example data\n", - "Animal.insert([\n", - " {'animal_id': 1, 'species': 'Dog', 'age': 5},\n", - " {'animal_id': 2, 'species': 'Cat', 'age': 3},\n", - " {'animal_id': 3, 'species': 'Rabbit', 'age': 2}\n", - "])\n", - "\n", - "Experiment.insert([\n", - " {'experiment_id': 101, 'animal_id': 1, 'result': 75.0},\n", - " {'experiment_id': 102, 'animal_id': 1, 'result': 82.5},\n", - " {'experiment_id': 103, 'animal_id': 2, 'result': 90.0}\n", - "])\n", - "\n", - "# Aggregate experiments by animal_id, counting rows\n", - "experiment_counts = Animal.aggr(Experiment, count='count(*)')\n", - "print(experiment_counts.fetch())\n", - "```\n", + "| Function | Description |\n", + "|----------|-------------|\n", + "| `COUNT(*)` | Count of matching rows |\n", + "| `COUNT(attr)` | Count of non-NULL values |\n", + "| `SUM(attr)` | Sum of values |\n", + "| `AVG(attr)` | Average of values |\n", + "| `MIN(attr)` | Minimum value |\n", + "| `MAX(attr)` | Maximum value |\n", + "| `GROUP_CONCAT(attr)` | Concatenate values into a string |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Counting Related Entities\n", "\n", - "### Example: Computing Summary Statistics\n", + "The most common aggregation counts how many related entities exist for each grouping entity.\n", "\n", - "Aggregation can also compute summary statistics such as sums, averages, and maximums.\n", + "### Example: Count Enrollments per Student\n", "\n", "```python\n", - "# Aggregate experiments by animal_id, computing 
the average result\n", - "average_results = Animal.aggr(Experiment, avg_result='avg(result)')\n", - "print(average_results.fetch())\n", + "# How many courses is each student enrolled in?\n", + "enrollment_counts = Student.aggr(Enroll, n_courses='COUNT(*)')\n", "```\n", "\n", - "## Combining Aggregation with Restrictions\n", + "**Result structure**:\n", + "- Primary key: `student_id` (from Student)\n", + "- New attribute: `n_courses` (count of enrollments)\n", "\n", - "Aggregation can be combined with restrictions to focus on specific subsets of data.\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT s.*, COUNT(e.student_id) AS n_courses\n", + "FROM student s\n", + "LEFT JOIN enroll e ON s.student_id = e.student_id\n", + "GROUP BY s.student_id;\n", + "```\n", "\n", - "### Example\n", + "### Example: Count Students per Section\n", "\n", "```python\n", - "# Compute the average result for animals older than 3 years\n", - "restricted_avg = (Animal & 'age > 3').aggr(Experiment, avg_result='avg(result)')\n", - "print(restricted_avg.fetch())\n", + "# How many students are enrolled in each section?\n", + "section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')\n", "```\n", "\n", - "## Best Practices\n", - "\n", - "1. **Understand Grouping**:\n", - " - The grouping is determined by the attributes of the primary table (e.g., `Animal` in the examples).\n", - "2. **Use Meaningful Aggregates**:\n", - " - Choose aggregates that provide actionable insights, such as averages, counts, or maximum values.\n", - "3. **Test Incrementally**:\n", - " - Test your aggregations with smaller datasets to verify correctness before applying them to larger datasets.\n", - "4. **Combine with Restrictions**:\n", - " - Apply restrictions to narrow down the data being aggregated for more focused results.\n", - "5. **Avoid Ambiguity**:\n", - " - Clearly define attribute names for aggregates using `renamed_aggregates` to avoid confusion in the results.\n", - "\n", - "## Summary\n", + "### Example: Count Direct Reports per Manager\n", "\n", - "The aggregation operator in DataJoint is an essential tool for summarizing data. By computing statistics like counts, averages, and more, it allows users to derive insights from their pipelines. 
Mastering this operator will enable efficient and meaningful data analysis directly within your database.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ - "import pymysql\n", - "pymysql.install_as_MySQLdb()\n", - "%load_ext sql\n", - "%config SqlMagic.autocommit=True\n", - "%sql mysql://root:simple@127.0.0.1" + "```python\n", + "# For each manager, count their direct reports\n", + "managers = Employee.proj(manager_id='employee_id')\n", + "report_counts = managers.aggr(ReportsTo, n_reports='COUNT(*)')\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 1, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2023-11-01 00:25:59,466][INFO]: Connecting root@fakeservices.datajoint.io:3306\n", - "[2023-11-01 00:25:59,482][INFO]: Connected root@fakeservices.datajoint.io:3306\n" - ] - } - ], "source": [ - "import datajoint as dj\n", - "\n", - "sales = dj.Schema('classicsales')\n", - "sales.spawn_missing_classes()\n", + "## Computing Statistics\n", "\n", - "nations = dj.Schema('nation')\n", - "nations.spawn_missing_classes()\n", + "Aggregation can compute any SQL aggregate function on related data.\n", "\n", - "hotel = dj.Schema('hotel')\n", - "hotel.spawn_missing_classes()\n", + "### Example: Average Grade per Student\n", "\n", - "university = dj.Schema('university')\n", - "university.spawn_missing_classes()\n", + "```python\n", + "# Compute average grade for each student\n", + "avg_grades = Student.aggr(Grade, avg_grade='AVG(grade_value)')\n", + "```\n", "\n", - "app = dj.Schema('app')\n", - "app.spawn_missing_classes()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Concepts\n", + "### Example: GPA Calculation\n", "\n", - "Review the MySQL aggregate functions: https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html\n", + "GPA requires weighting grades by credits:\n", "\n", - "Three types of queries\n", + "```python\n", + "# Compute weighted GPA for each student\n", + "student_gpa = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)',\n", + " total_credits='SUM(credits)'\n", + ")\n", + "```\n", "\n", - "1. Aggregation functions with no `GROUP BY` clause produce 1 row. \n", - "2. Aggregation functions combined with a `GROUP BY` clause. The unique key of the result is composed of the columns of the `GROUP BY` clause.\n", - "3. Most common pattern: `JOIN` or `LEFT JOIN` of a table pair in a one-to-many relationship, grouped by the primary key of the left table. This aggregates the right entity set with respect to the left entity set. \n", + "Here, `Course * Grade * LetterGrade` joins the tables to access both `credits` (from Course) and `points` (from LetterGrade).\n", "\n", - "Note that MySQL with the default settings allows mixing aggregated and non-aggregated values (See https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by). So you have to watch avoid invalid mixes of values.\n", + "### Example: Order Statistics\n", "\n", - "Using `HAVING` is equivalent to using a `WHERE` clause in an outer query." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import datajoint as dj\n", - "schema = dj.Schema('app')\n", - "schema.spawn_missing_classes()\n", - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pymysql\n", - "pymysql.install_as_MySQLdb()\n", - "%load_ext sql\n", - "%config SqlMagic.autocommit=True\n", - "%sql mysql://root:simple@127.0.0.1" + "```python\n", + "# For each order, compute item statistics\n", + "order_stats = Order.aggr(\n", + " OrderItem,\n", + " n_items='COUNT(*)',\n", + " total='SUM(quantity * unit_price)',\n", + " avg_item_price='AVG(unit_price)'\n", + ")\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Aggregation Queries\n", + "## Multiple Aggregate Expressions\n", "\n", - "Queries using aggregation functions, `GROUP BY`, and `HAVING` clauses. Using `LEFT JOIN` in combination with `GROUP BY`.\n", + "You can compute multiple aggregates in a single operation:\n", "\n", - "Aggregation functions: `MAX`, `MIN`, `AVG`, `SUM`, and `COUNT`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%sql\n", - "USE app" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the date of the last purchase \n", - "SELECT * FROM purchase ORDER BY purchase_date DESC LIMIT 1 " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the data of the last pruchase \n", - "SELECT max(purchase_date) last_purchase, min(purchase_date) first_purchase, phone, card_number FROM purchase" + "```python\n", + "# Compute multiple statistics per student\n", + "student_stats = Student.aggr(\n", + " Grade,\n", + " n_grades='COUNT(*)',\n", + " avg_grade='AVG(grade_value)',\n", + " min_grade='MIN(grade_value)',\n", + " max_grade='MAX(grade_value)'\n", + ")\n", + "```\n", + "\n", + "All aggregate expressions are computed simultaneously for each group." 
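+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For example, a sketch (assuming the `Student` and `Enroll` tables used in this chapter) that pairs a count with `GROUP_CONCAT` to both count and list each student's courses in one pass:\n",
+    "\n",
+    "```python\n",
+    "course_overview = Student.aggr(\n",
+    "    Enroll,\n",
+    "    n_courses='COUNT(*)',\n",
+    "    course_list='GROUP_CONCAT(course)'  # comma-separated course numbers\n",
+    ")\n",
+    "```"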
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Aggregation functions MAX, MIN, AVG, SUM, COUNT" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the date of birth of the youngest person\n", - "SELECT * FROM account ORDER BY dob DESC LIMIT 1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the date of birth of the youngest person \n", - "-- This is an invalid query because it mixes aggregation and regular fields\n", - "SELECT max(dob) as dob, phone FROM account" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "SELECT * FROM account where phone=10013740006" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the youngest person \n", - "SELECT * FROM account WHERE dob = (SELECT max(dob) FROM account)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "# show average male age\n", - "dj.U().aggr(Account & 'sex=\"M\"' , avg_age=\"floor(avg(DATEDIFF(now(), dob)) / 365.25)\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "SELECT floor(avg(DATEDIFF(now(), dob)) / 365.25) as avg_age FROM account WHERE sex=\"M\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "SELECT count(*), count(phone), count(DISTINCT first_name, last_name), count(dob) FROM account;" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show how many of purchases have been done for each addon\n", + "## Aggregation with Restrictions\n", "\n", - "SELECT addon_id, count(*) n FROM purchase GROUP BY addon_id " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "SELECT * FROM `#add_on` LIMIT 10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "SELECT * FROM purchase NATURAL JOIN `#add_on` LIMIT 10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the total money spent by each account (limit to top 10)\n", + "Apply restrictions to either the grouping table or the aggregated table.\n", "\n", - "SELECT phone, sum(price) as total_spending \n", - " FROM purchase NATURAL JOIN `#add_on` \n", - " GROUP BY (phone) \n", - " ORDER BY total_spending DESC LIMIT 10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the names of people who spent less than $100\n", - "\n", - 
"SELECT phone, sum(price) as total_spending \n", - " FROM purchase NATURAL JOIN `#add_on` \n", - " WHERE total_spending < 100\n", - " GROUP BY (phone) \n", - " LIMIT 10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the names of people who spent less than $100\n", + "### Restricting the Grouping Table\n", "\n", - "SELECT * FROM (\n", - " SELECT phone, first_name, last_name, sum(price) as total_spending \n", - " FROM account NATURAL JOIN purchase NATURAL JOIN `#add_on` \n", - " GROUP BY (phone)) as q \n", - "WHERE total_spending < 100\n", - "LIMIT 10\n", + "Filter which entities receive aggregated values:\n", "\n", - "-- almost correct but does not include people who spent nothing" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql \n", - "-- explaining LEFT joins\n", - "SELECT * FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` LIMIT 10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [ - "%%sql\n", - "-- show the names of people who spent less than $100\n", - "SELECT * FROM (\n", - " SELECT phone, first_name, last_name, sum(ifnull(price), 0) as total_spending \n", - " FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` \n", - " GROUP BY (phone)) as q \n", - "WHERE total_spending < 100\n", - "LIMIT 10\n" + "```python\n", + "# GPA only for CS majors\n", + "cs_student_gpa = (Student & (StudentMajor & {'dept': 'CS'})).aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```\n", + "\n", + "### Restricting the Aggregated Table\n", + "\n", + "Filter which data is included in the aggregation:\n", + "\n", + "```python\n", + "# Count only current term enrollments per student\n", + "current_enrollments = Student.aggr(\n", + " Enroll & CurrentTerm,\n", + " n_current='COUNT(*)'\n", + ")\n", + "\n", + "# Average grade for math courses only\n", + "math_avg = Student.aggr(\n", + " Grade & {'dept': 'MATH'},\n", + " math_avg='AVG(grade_value)'\n", + ")\n", + "```\n", + "\n", + "### Combining Both\n", + "\n", + "```python\n", + "# For seniors only, compute GPA from upper-division courses\n", + "senior_upper_gpa = (Student & {'class_standing': 'Senior'}).aggr(\n", + " Course * Grade * LetterGrade & 'course >= 3000',\n", + " upper_gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Summary of principles \n", - "1. Without a `GROUP BY`, aggregation functions collapse the table into a single row.\n", - "2. With `GROUP BY`, the grouping attributes become the new primary key of the result. \n", - "3. Do not mix aggregated and non-aggregated values in the result with or without a `GROUP BY`.\n", - "4. `HAVING` plays the same role as the `WHERE` clause in a nesting outer query so it can use the output of the aggregation functions.\n", - "5. `LEFT JOIN` is often follwed with a `GROUP BY` by the primary key attributes of the left table. 
In this scenario the entities in the right table are aggregated for each matching row in the left table.\n" + "## Filtering Aggregation Results\n", + "\n", + "After aggregation, you can restrict based on the computed values:\n", + "\n", + "```python\n", + "# Students with GPA above 3.5\n", + "student_gpa = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "honor_students = student_gpa & 'gpa >= 3.5'\n", + "```\n", + "\n", + "```python\n", + "# Sections with more than 30 students\n", + "section_sizes = Section.aggr(Enroll, n='COUNT(*)')\n", + "large_sections = section_sizes & 'n > 30'\n", + "```\n", + "\n", + "**SQL Equivalent** (using HAVING):\n", + "```sql\n", + "SELECT s.*, COUNT(*) AS n\n", + "FROM section s\n", + "LEFT JOIN enroll e USING (dept, course, section_id)\n", + "GROUP BY s.dept, s.course, s.section_id\n", + "HAVING n > 30;\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "1. Always aggregate entity B grouped by entity A\n", - "2. Then GROUP BY the primary of A\n", - "3. Aggregate the attribute B but not A\n", - "4. SELECT non-aggregated attributes of A but not B\n", - "5. Use an left join if you need to include rows of A for which there is no match in B " + "## Left Join Behavior\n", + "\n", + "Aggregation uses left join semantics: all entities from the grouping table appear in the result, even if they have no matching records in the aggregated table.\n", + "\n", + "### Example: Students Without Grades\n", + "\n", + "```python\n", + "# All students with their grade count (0 for students without grades)\n", + "grade_counts = Student.aggr(Grade, n_grades='COUNT(*)')\n", + "```\n", + "\n", + "Students without any grades will have `n_grades = 0`.\n", + "\n", + "### Example: Preserving All Sections\n", + "\n", + "```python\n", + "# All sections with enrollment count (0 for empty sections)\n", + "all_section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')\n", + "```\n", + "\n", + "Empty sections appear with `n_students = 0`." 
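+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A practical corollary, sketched under the same schema assumptions: entities with zero matches can be isolated by restricting on the computed count, or more directly with the antijoin (`-`) operator. Note that depending on the DataJoint version, `.aggr()` may drop unmatched entities unless `keep_all_rows=True` is passed, and counting an attribute of the aggregated table (rather than `*`) ensures unmatched rows count as zero:\n",
+    "\n",
+    "```python\n",
+    "# Sections with zero enrollments, via aggregation\n",
+    "empty_by_count = (\n",
+    "    Section.aggr(Enroll, n_students='COUNT(student_id)', keep_all_rows=True)\n",
+    "    & 'n_students = 0'\n",
+    ")\n",
+    "\n",
+    "# The same set expressed as an antijoin: sections with no matching enrollment\n",
+    "empty_sections = Section - Enroll\n",
+    "```"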
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "dj.Diagram(sales)" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "%3\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "0\n", - "\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "0->Customer\n", - "\n", - "\n", - "\n", - "\n", - "1\n", - "\n", - "1\n", - "\n", - "\n", - "\n", - "Report\n", - "\n", - "\n", - "Report\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "1->Report\n", - "\n", - "\n", - "\n", - "\n", - "Payment\n", - "\n", - "\n", - "Payment\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Payment\n", - "\n", - "\n", - "\n", - "\n", - "Order\n", - "\n", - "\n", - "Order\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Order\n", - "\n", - "\n", - "\n", - "\n", - "Office\n", - "\n", - "\n", - "Office\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Office->Employee\n", - "\n", - "\n", - "\n", - "\n", - "Order.Item\n", - "\n", - "\n", - "Order.Item\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Order->Order.Item\n", - "\n", - "\n", - "\n", - "\n", - "ProductLine\n", - "\n", - "\n", - "ProductLine\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Product\n", - "\n", - "\n", - "Product\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "ProductLine->Product\n", - "\n", - "\n", - "\n", - "\n", - "Employee->0\n", - "\n", - "\n", - "\n", - "\n", - "Employee->1\n", - "\n", - "\n", - "\n", - "\n", - "Employee->Report\n", - "\n", - "\n", - "\n", - "\n", - "Product->Order.Item\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Diagram(sales)" + "## Examples from the University Database\n", + "\n", + "### Example 1: Student Statistics\n", + "\n", + "```python\n", + "# Comprehensive student statistics\n", + "student_stats = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " n_courses='COUNT(*)',\n", + " total_credits='SUM(credits)',\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```\n", + "\n", + "### Example 2: Department Statistics\n", + "\n", + "```python\n", + "# Count majors per department\n", + "dept_major_counts = Department.aggr(StudentMajor, n_majors='COUNT(*)')\n", + "\n", + "# Average GPA per department (from majors in that department)\n", + "dept_gpa = Department.aggr(\n", + " StudentMajor * Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + " ),\n", + " dept_avg_gpa='AVG(gpa)'\n", + ")\n", + "```\n", + "\n", + "### Example 3: Course Popularity\n", + "\n", + "```python\n", + "# Count total enrollments per course (across all sections and terms)\n", + "course_popularity = Course.aggr(\n", + " Section * Enroll,\n", + " total_enrollments='COUNT(*)'\n", + ")\n", + "\n", + "# Most popular courses\n", + "popular_courses = course_popularity & 'total_enrollments > 100'\n", + "```\n", + "\n", + "### Example 4: Grade Distribution\n", + "\n", + "```python\n", + "# Count each grade type per course\n", + "grade_distribution = Course.aggr(\n", + " Grade,\n", + " a_count='SUM(grade=\"A\")',\n", + " b_count='SUM(grade=\"B\")',\n", + " c_count='SUM(grade=\"C\")',\n", + " d_count='SUM(grade=\"D\")',\n", + " f_count='SUM(grade=\"F\")'\n", + ")\n", + "```" ] }, { "cell_type": "markdown", 
"metadata": {}, - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*order_number n \n", - "+------------+ +----+\n", - "10108 16 \n", - "10109 6 \n", - "10110 16 \n", - "10111 6 \n", - "10112 2 \n", - "10113 4 \n", - " (Total: 6)" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "# Show all the orders made in March 2003 and the total number of items on each\n", - "(Order & 'order_date between \"2003-03-01\" and \"2003-03-31\"').aggr(Order.Item(), n='count(*)', keep_all_rows=True)" + "## Aggregation vs. Join\n", + "\n", + "| Operation | Purpose | Result Rows | Result Attributes |\n", + "|-----------|---------|-------------|-------------------|\n", + "| `A.aggr(B, ...)` | Summarize B for each A | One row per A | A's attributes + aggregates |\n", + "| `A * B` | Combine A and B | One row per A-B pair | All attributes from A and B |\n", + "\n", + "### Example Comparison\n", + "\n", + "```python\n", + "# Join: One row per student-enrollment pair\n", + "Student * Enroll # Many rows per student (one per enrollment)\n", + "\n", + "# Aggregation: One row per student with enrollment count\n", + "Student.aggr(Enroll, n='COUNT(*)') # One row per student\n", + "```\n", + "\n", + "Use **join** when you need individual related records. \n", + "Use **aggregation** when you need summary statistics per entity." ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*reports_to n \n", - "+------------+ +---+\n", - "1002 2 \n", - "1056 4 \n", - "1088 3 \n", - "1102 6 \n", - "1143 6 \n", - "1621 1 \n", - " (Total: 6)" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "# Show all employees and the number of their direct reports\n", + "## SQL Translation\n", + "\n", + "DataJoint's aggregation translates to SQL with `LEFT JOIN` and `GROUP BY`:\n", + "\n", + "### Basic Aggregation\n", "\n", - "```{tab-set}\n", - "```{tab-item} DataJoint\n", "```python\n", - "Employee.proj(reports_to='employee_number').aggr(Report, n='count(employee_number)')\n", + "# DataJoint\n", + "Student.aggr(Enroll, n='COUNT(*)')\n", "```\n", - "```{tab-item} SQL\n", + "\n", "```sql\n", - "SELECT employee.employee_number, first_name, last_name, count(report.employee_number) as n \n", - "FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)\n", - "GROUP BY employee.employee_number\n", + "-- SQL\n", + "SELECT s.*, COUNT(e.student_id) AS n\n", + "FROM student s\n", + "LEFT JOIN enroll e ON s.student_id = e.student_id\n", + "GROUP BY s.student_id;\n", "```\n", + "\n", + "### Aggregation with Joined Tables\n", + "\n", + "```python\n", + "# DataJoint\n", + "Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT s.*, SUM(lg.points * c.credits) / SUM(c.credits) AS gpa\n", + "FROM student s\n", + "LEFT JOIN grade g ON s.student_id = g.student_id\n", + "LEFT JOIN course c ON g.dept = c.dept AND g.course = c.course\n", + "LEFT JOIN letter_grade lg ON g.grade = lg.grade\n", + "GROUP BY s.student_id;\n", + "```\n", + "\n", + "### Aggregation with Restriction on Result\n", + "\n", + "```python\n", + "# DataJoint\n", + "Student.aggr(Enroll, n='COUNT(*)') & 'n > 5'\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT s.*, COUNT(*) AS n\n", + "FROM student s\n", + "LEFT JOIN enroll e ON s.student_id = e.student_id\n", + "GROUP BY s.student_id\n", + "HAVING COUNT(*) > 5;\n", "```" ] }, @@ -948,566 +427,126 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Show all employees and the number of their direct reports\n", + "## Best Practices\n", + "\n", + "### 1. Understand the Grouping\n", + "\n", + "The grouping table determines:\n", + "- The primary key of the result\n", + "- Which entities appear in the output\n", + "- The entity type represented by each row\n", + "\n", + "### 2. Use Meaningful Aggregate Names\n", "\n", - "```{tab-set}\n", - "```{tab-item} DataJoint\n", "```python\n", - "Employee.proj(reports_to='employee_number').aggr(Report, n='count(employee_number)')\n", + "# Good: descriptive names\n", + "Student.aggr(Enroll, n_enrollments='COUNT(*)', total_credits='SUM(credits)')\n", + "\n", + "# Avoid: generic names\n", + "Student.aggr(Enroll, n='COUNT(*)', x='SUM(credits)')\n", "```\n", - "```{tab-item} SQL\n", - "```sql\n", - "SELECT employee.employee_number, first_name, last_name, count(report.employee_number) as n \n", - "FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)\n", - "GROUP BY employee.employee_number\n", + "\n", + "### 3. 
Restrict Before Aggregating When Possible\n", + "\n", + "```python\n", + "# More efficient: restrict first\n", + "(Student & {'home_state': 'CA'}).aggr(Enroll, n='COUNT(*)')\n", + "\n", + "# Less efficient: restrict after\n", + "Student.aggr(Enroll, n='COUNT(*)') & {'home_state': 'CA'}\n", + "```\n", + "\n", + "### 4. Handle NULL Values\n", + "\n", + "When the aggregated table has NULL values, use IFNULL or COALESCE:\n", + "\n", + "```python\n", + "Student.aggr(Grade, avg_grade='AVG(IFNULL(grade_value, 0))')\n", "```\n", - "```\n" + "\n", + "### 5. Test with Small Data First\n", + "\n", + "```python\n", + "# Verify the aggregation works correctly\n", + "test_result = (Student & {'student_id': 1001}).aggr(Grade, n='COUNT(*)')\n", + "print(test_result.fetch())\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 51, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://root:***@127.0.0.1\n", - "0 rows affected.\n", - "23 rows affected.\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
" - ], - "text/plain": [ - "[(1002, 'Diane', 'Murphy', 2),\n", - " (1056, 'Mary', 'Patterson', 4),\n", - " (1076, 'Jeff', 'Firrelli', 0),\n", - " (1088, 'William', 'Patterson', 3),\n", - " (1102, 'Gerard', 'Bondur', 6),\n", - " (1143, 'Anthony', 'Bow', 6),\n", - " (1165, 'Leslie', 'Jennings', 0),\n", - " (1166, 'Leslie', 'Thompson', 0),\n", - " (1188, 'Julie', 'Firrelli', 0),\n", - " (1216, 'Steve', 'Patterson', 0),\n", - " (1286, 'Foon Yue', 'Tseng', 0),\n", - " (1323, 'George', 'Vanauf', 0),\n", - " (1337, 'Loui', 'Bondur', 0),\n", - " (1370, 'Gerard', 'Hernandez', 0),\n", - " (1401, 'Pamela', 'Castillo', 0),\n", - " (1501, 'Larry', 'Bott', 0),\n", - " (1504, 'Barry', 'Jones', 0),\n", - " (1611, 'Andy', 'Fixter', 0),\n", - " (1612, 'Peter', 'Marsh', 0),\n", - " (1619, 'Tom', 'King', 0),\n", - " (1621, 'Mami', 'Nishi', 1),\n", - " (1625, 'Yoshimi', 'Kato', 0),\n", - " (1702, 'Martin', 'Gerard', 0)]" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], + "cell_type": "markdown", + "metadata": {}, "source": [ - "%%sql\n", + "## Summary\n", "\n", - "use classicsales;\n", + "The aggregation operator computes summary statistics from related entities:\n", "\n", - "SELECT employee.employee_number, first_name, last_name, count(report.employee_number) as n \n", - "FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)\n", - "GROUP BY employee.employee_number" + "1. **Syntax**: `GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(...)')`\n", + "2. **Result**: Has the same primary key as the grouping table\n", + "3. **Left join**: All grouping entities appear, even without matches\n", + "4. **Multiple aggregates**: Compute several statistics in one operation\n", + "5. **Chaining**: Combine with restrictions and other operators\n", + "\n", + "Use aggregation when you need summary statistics per entity rather than individual related records." 
] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://root:***@127.0.0.1\n", - "39 rows affected.\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
" - ], - "text/plain": [ - "[(1002, 'Diane', 'Murphy', 1056),\n", - " (1002, 'Diane', 'Murphy', 1076),\n", - " (1056, 'Mary', 'Patterson', 1088),\n", - " (1056, 'Mary', 'Patterson', 1102),\n", - " (1056, 'Mary', 'Patterson', 1143),\n", - " (1056, 'Mary', 'Patterson', 1621),\n", - " (1076, 'Jeff', 'Firrelli', None),\n", - " (1088, 'William', 'Patterson', 1611),\n", - " (1088, 'William', 'Patterson', 1612),\n", - " (1088, 'William', 'Patterson', 1619),\n", - " (1102, 'Gerard', 'Bondur', 1337),\n", - " (1102, 'Gerard', 'Bondur', 1370),\n", - " (1102, 'Gerard', 'Bondur', 1401),\n", - " (1102, 'Gerard', 'Bondur', 1501),\n", - " (1102, 'Gerard', 'Bondur', 1504),\n", - " (1102, 'Gerard', 'Bondur', 1702),\n", - " (1143, 'Anthony', 'Bow', 1165),\n", - " (1143, 'Anthony', 'Bow', 1166),\n", - " (1143, 'Anthony', 'Bow', 1188),\n", - " (1143, 'Anthony', 'Bow', 1216),\n", - " (1143, 'Anthony', 'Bow', 1286),\n", - " (1143, 'Anthony', 'Bow', 1323),\n", - " (1165, 'Leslie', 'Jennings', None),\n", - " (1166, 'Leslie', 'Thompson', None),\n", - " (1188, 'Julie', 'Firrelli', None),\n", - " (1216, 'Steve', 'Patterson', None),\n", - " (1286, 'Foon Yue', 'Tseng', None),\n", - " (1323, 'George', 'Vanauf', None),\n", - " (1337, 'Loui', 'Bondur', None),\n", - " (1370, 'Gerard', 'Hernandez', None),\n", - " (1401, 'Pamela', 'Castillo', None),\n", - " (1501, 'Larry', 'Bott', None),\n", - " (1504, 'Barry', 'Jones', None),\n", - " (1611, 'Andy', 'Fixter', None),\n", - " (1612, 'Peter', 'Marsh', None),\n", - " (1619, 'Tom', 'King', None),\n", - " (1621, 'Mami', 'Nishi', 1625),\n", - " (1625, 'Yoshimi', 'Kato', None),\n", - " (1702, 'Martin', 'Gerard', None)]" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } - ], + "cell_type": "markdown", + "metadata": {}, "source": [ - "%%sql\n", + "## Practice Exercises\n", + "\n", + "### Exercise 1: Count Enrollments\n", + "\n", + "**Task**: Count how many students are enrolled in each section.\n", + "\n", + "```python\n", + "section_counts = Section.aggr(Enroll, n_students='COUNT(*)')\n", + "```\n", + "\n", + "### Exercise 2: Compute GPA\n", + "\n", + "**Task**: Compute weighted GPA for each student.\n", + "\n", + "```python\n", + "student_gpa = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```\n", + "\n", + "### Exercise 3: Department Statistics\n", "\n", - "SELECT employee.employee_number, first_name, last_name, report.employee_number as subordinate \n", - "FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)" + "**Task**: Count the number of courses offered by each department.\n", + "\n", + "```python\n", + "dept_course_counts = Department.aggr(Course, n_courses='COUNT(*)')\n", + "```\n", + "\n", + "### Exercise 4: High Enrollment Courses\n", + "\n", + "**Task**: Find courses with more than 50 total enrollments across all sections.\n", + "\n", + "```python\n", + "course_enrollments = Course.aggr(Section * Enroll, total='COUNT(*)')\n", + "high_enrollment = course_enrollments & 'total > 50'\n", + "```\n", + "\n", + "### Exercise 5: Student Course Load\n", + "\n", + "**Task**: For each student enrolled in the current term, compute the total credits.\n", + "\n", + "```python\n", + "current_load = Student.aggr(\n", + " (Enroll & CurrentTerm) * Course,\n", + " current_credits='SUM(credits)'\n", + ")\n", + "```\n", + "\n", + ":::{seealso}\n", + "For more aggregation examples, see the [University Queries](../80-examples/016-university-queries.ipynb) 
example.\n", + ":::" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "vscode": { - "languageId": "sql" - } - }, - "outputs": [], - "source": [] } ], "metadata": { @@ -1517,18 +556,10 @@ "name": "python3" }, "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.17" + "version": "3.11.0" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/book/50-queries/060-universal.ipynb b/book/50-queries/060-universal.ipynb index 4ec2061..1913052 100644 --- a/book/50-queries/060-universal.ipynb +++ b/book/50-queries/060-universal.ipynb @@ -6,20 +6,427 @@ "source": [ "# Universal Sets\n", "\n", - "Many queries require a special operand -- a *Universal Set* -- which is constructed using the `dj.U` class in DataJoint." + "**Universal sets** are symbolic constructs in DataJoint that represent the set of all possible values for specified attributes. They enable queries that extract unique values or perform aggregations without a natural grouping entity.\n", + "\n", + "## Understanding Universal Sets\n", + "\n", + "The `dj.U()` class creates a universal set—a conceptual table that can be restricted or used in aggregations. Universal sets are not directly fetchable; they serve as operands in query expressions.\n", + "\n", + "### Two Forms\n", + "\n", + "| Form | Meaning | Primary Key |\n", + "|------|---------|-------------|\n", + "| `dj.U('attr1', 'attr2', ...)` | All possible combinations of the specified attributes | The specified attributes |\n", + "| `dj.U()` | A singular universal entity (one conceptual row) | Empty set |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting Unique Values\n", + "\n", + "The most common use of universal sets is extracting distinct values from a table.\n", + "\n", + "### Basic Syntax\n", + "\n", + "```python\n", + "# Get unique values of an attribute\n", + "unique_values = dj.U('attribute_name') & SomeTable\n", + "```\n", + "\n", + "When restricted by an existing table, `dj.U()` returns the distinct values of those attributes present in the table.\n", + "\n", + "### Example: Unique First Names\n", + "\n", + "```python\n", + "# All unique first names among students\n", + "unique_first_names = dj.U('first_name') & Student\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT DISTINCT first_name FROM student;\n", + "```\n", + "\n", + "### Example: Unique Name Combinations\n", + "\n", + "```python\n", + "# All unique first_name + last_name combinations\n", + "unique_full_names = dj.U('first_name', 'last_name') & Student\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT DISTINCT first_name, last_name FROM student;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Restricting Unique Values\n", + "\n", + "Universal sets can be combined with restrictions to find unique values within filtered subsets.\n", + "\n", + "### Example: Unique Names of Male Students\n", + "\n", + "```python\n", + "# Unique first names among male students only\n", + "male_names = dj.U('first_name') & (Student & {'sex': 'M'})\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT DISTINCT first_name FROM student WHERE sex = 'M';\n", + "```\n", + "\n", + "### Example: Birth Years of Current Students\n", + "\n", + "```python\n", + 
"# Unique birth years among students enrolled in current term\n", + "birth_years = dj.U('year') & (\n", + " Student.proj(year='YEAR(date_of_birth)') & (Enroll & CurrentTerm)\n", + ")\n", + "```\n", + "\n", + "This extracts the unique birth years from a projection that computes the year from the date of birth." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Universal Aggregation\n", + "\n", + "The empty universal set `dj.U()` represents a single entity that spans all rows. It's used for aggregations that summarize an entire table rather than grouping by a specific entity.\n", + "\n", + "### Total Count\n", + "\n", + "```python\n", + "# Count total number of students\n", + "total_count = dj.U().aggr(Student, n_students='COUNT(*)')\n", + "```\n", + "\n", + "**Result**: A table with one row and one attribute `n_students` containing the count.\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT COUNT(*) AS n_students FROM student;\n", + "```\n", + "\n", + "### Multiple Aggregate Statistics\n", + "\n", + "```python\n", + "# Compute multiple statistics across all students\n", + "student_stats = dj.U().aggr(\n", + " Student.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())'),\n", + " n_students='COUNT(*)',\n", + " avg_age='AVG(age)',\n", + " min_age='MIN(age)',\n", + " max_age='MAX(age)'\n", + ")\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT \n", + " COUNT(*) AS n_students,\n", + " AVG(TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())) AS avg_age,\n", + " MIN(TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())) AS min_age,\n", + " MAX(TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())) AS max_age\n", + "FROM student;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Aggregation by Arbitrary Groupings\n", + "\n", + "When you need to aggregate data by attributes that don't form a natural entity type in your schema, use `dj.U()` to create an arbitrary grouping.\n", + "\n", + "### Example: Students per Birth Year and Month\n", + "\n", + "```python\n", + "# Count students born in each year and month\n", + "student_counts = dj.U('birth_year', 'birth_month').aggr(\n", + " Student.proj(\n", + " birth_year='YEAR(date_of_birth)', \n", + " birth_month='MONTH(date_of_birth)'\n", + " ),\n", + " n_students='COUNT(*)'\n", + ")\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT \n", + " YEAR(date_of_birth) AS birth_year,\n", + " MONTH(date_of_birth) AS birth_month,\n", + " COUNT(*) AS n_students\n", + "FROM student\n", + "GROUP BY YEAR(date_of_birth), MONTH(date_of_birth);\n", + "```\n", + "\n", + "### Example: Enrollments per Department per Term\n", + "\n", + "```python\n", + "# Count enrollments by department and term\n", + "enrollment_counts = dj.U('dept', 'term').aggr(\n", + " Enroll * Section,\n", + " n_enrollments='COUNT(*)'\n", + ")\n", + "```\n", + "\n", + "This creates a grouping by department and term without requiring a DepartmentTerm entity in your schema." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examples from the University Database\n", + "\n", + "### Example 1: Unique First Names by Gender\n", + "\n", + "```python\n", + "# All unique first names among male students\n", + "male_names = dj.U('first_name') & (Student & {'sex': 'M'})\n", + "\n", + "# All unique first names among female students\n", + "female_names = dj.U('first_name') & (Student & {'sex': 'F'})\n", + "```\n", + "\n", + "### Example 2: Birth Years of Enrolled Students\n", + "\n", + "```python\n", + "# Show all birth years for students enrolled in current term\n", + "birth_years = dj.U('year') & (\n", + " Student.proj(year='YEAR(date_of_birth)') & (Enroll & CurrentTerm)\n", + ")\n", + "```\n", + "\n", + "### Example 3: Department Statistics\n", + "\n", + "```python\n", + "# Count students in each department\n", + "dept_counts = Department.aggr(StudentMajor, count='COUNT(student_id)')\n", + "\n", + "# Count male and female students per department\n", + "gender_counts = Department.aggr(\n", + " StudentMajor * Student, \n", + " males='SUM(sex=\"M\")', \n", + " females='SUM(sex=\"F\")'\n", + ")\n", + "```\n", + "\n", + "### Example 4: GPA and Credits Summary\n", + "\n", + "```python\n", + "# Overall average GPA across all graded students\n", + "overall_gpa = dj.U().aggr(\n", + " Course * Grade * LetterGrade,\n", + " avg_gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, - "source": [] + "source": [ + "## Comparison: Universal Sets vs. Standard Aggregation\n", + "\n", + "| Approach | Use Case | Result Primary Key |\n", + "|----------|----------|-------------------|\n", + "| `Entity.aggr(Related, ...)` | Aggregate by existing entity type | Entity's primary key |\n", + "| `dj.U('attrs').aggr(Table, ...)` | Aggregate by arbitrary grouping | Specified attributes |\n", + "| `dj.U().aggr(Table, ...)` | Universal aggregate (whole table) | Empty (single row) |\n", + "\n", + "### When to Use Each\n", + "\n", + "**Standard aggregation** (`Entity.aggr(...)`):\n", + "- When grouping by an existing entity (e.g., count enrollments per student)\n", + "- Result represents augmented entities\n", + "\n", + "**Arbitrary grouping** (`dj.U('attrs').aggr(...)`):\n", + "- When grouping by computed or non-entity attributes (e.g., count by birth year)\n", + "- Result represents a new grouping not in your schema\n", + "\n", + "**Universal aggregate** (`dj.U().aggr(...)`):\n", + "- When computing totals across an entire table\n", + "- Result is a single row of summary statistics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SQL Translation\n", + "\n", + "### Unique Values\n", + "\n", + "```python\n", + "# DataJoint\n", + "dj.U('first_name') & Student\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT DISTINCT first_name FROM student;\n", + "```\n", + "\n", + "### Universal Aggregation\n", + "\n", + "```python\n", + "# DataJoint\n", + "dj.U().aggr(Student, count='COUNT(*)')\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT COUNT(*) AS count FROM student;\n", + "```\n", + "\n", + "### Arbitrary Grouping\n", + "\n", + "```python\n", + "# DataJoint\n", + "dj.U('home_state').aggr(Student, n='COUNT(*)')\n", + "```\n", + "\n", + "```sql\n", + "-- SQL\n", + "SELECT home_state, COUNT(*) AS n\n", + "FROM student\n", + "GROUP BY home_state;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best Practices\n", + "\n", + "### 1. 
Choose the Right Approach\n", + "\n", + "- For unique values: `dj.U('attr') & Table`\n", + "- For total counts/sums: `dj.U().aggr(Table, ...)`\n", + "- For grouping by non-entity attributes: `dj.U('attr').aggr(Table, ...)`\n", + "\n", + "### 2. Use Projection to Create Grouping Attributes\n", + "\n", + "When grouping by computed values, project them first:\n", + "\n", + "```python\n", + "# Project to create the grouping attribute\n", + "with_year = Student.proj(birth_year='YEAR(date_of_birth)')\n", + "\n", + "# Then aggregate by that attribute\n", + "counts_by_year = dj.U('birth_year').aggr(with_year, n='COUNT(*)')\n", + "```\n", + "\n", + "### 3. Understand the Result Structure\n", + "\n", + "- `dj.U('attr') & Table` — primary key is `attr`\n", + "- `dj.U().aggr(...)` — primary key is empty (single-row result)\n", + "\n", + "### 4. Combine with Restrictions for Filtered Results\n", + "\n", + "```python\n", + "# Unique departments among currently enrolled students\n", + "active_depts = dj.U('dept') & (Enroll & CurrentTerm)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "Universal sets provide three key capabilities:\n", + "\n", + "1. **Extract unique values**: `dj.U('attr') & Table` returns distinct values\n", + "2. **Universal aggregation**: `dj.U().aggr(Table, ...)` summarizes entire tables\n", + "3. **Arbitrary grouping**: `dj.U('attrs').aggr(Table, ...)` groups by non-entity attributes\n", + "\n", + "Use universal sets when:\n", + "- You need distinct values from a table\n", + "- You want totals across an entire table\n", + "- You need to group by attributes that don't form a natural entity in your schema" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practice Exercises\n", + "\n", + "### Exercise 1: Unique Values\n", + "\n", + "**Task**: Find all unique home states represented by students.\n", + "\n", + "```python\n", + "unique_states = dj.U('home_state') & Student\n", + "```\n", + "\n", + "### Exercise 2: Universal Aggregation\n", + "\n", + "**Task**: Count the total number of course enrollments.\n", + "\n", + "```python\n", + "total_enrollments = dj.U().aggr(Enroll, n='COUNT(*)')\n", + "```\n", + "\n", + "### Exercise 3: Filtered Unique Values\n", + "\n", + "**Task**: Find unique departments that have students enrolled in the current term.\n", + "\n", + "```python\n", + "active_depts = dj.U('dept') & (Enroll & CurrentTerm)\n", + "```\n", + "\n", + "### Exercise 4: Arbitrary Grouping\n", + "\n", + "**Task**: Count students by home state.\n", + "\n", + "```python\n", + "students_by_state = dj.U('home_state').aggr(Student, n_students='COUNT(*)')\n", + "```\n", + "\n", + "### Exercise 5: Computed Grouping\n", + "\n", + "**Task**: Count students by birth year.\n", + "\n", + "```python\n", + "students_by_year = dj.U('birth_year').aggr(\n", + " Student.proj(birth_year='YEAR(date_of_birth)'),\n", + " n_students='COUNT(*)'\n", + ")\n", + "```\n", + "\n", + ":::{seealso}\n", + "For more examples using universal sets, see the [University Queries](../80-examples/016-university-queries.ipynb) example.\n", + ":::" + ] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "name": "python", + "version": "3.11.0" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/book/50-queries/080-subqueries.ipynb b/book/50-queries/080-subqueries.ipynb index 0a4a15e..a88d084 100644 --- 
a/book/50-queries/080-subqueries.ipynb +++ b/book/50-queries/080-subqueries.ipynb @@ -1,290 +1,582 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Subqueries and Complex Query Patterns\n", - "\n", - "Subqueries are one of the most powerful features in database querying, allowing you to build complex queries by nesting one query inside another. This chapter explores various subquery patterns and their applications.\n", - "\n", - "## Understanding Subqueries\n", - "\n", - "A **subquery** is a query nested inside another query. The outer query uses the results of the inner query to filter, join, or otherwise process data. Subqueries are essential for answering complex questions that require information from multiple tables.\n", - "\n", - "### Types of Subqueries\n", - "\n", - "1. **Scalar subqueries**: Return a single value\n", - "2. **Row subqueries**: Return a single row with multiple columns\n", - "3. **Table subqueries**: Return multiple rows (used with IN, EXISTS, etc.)\n", - "\n", - "## Basic Subquery Patterns\n", - "\n", - "### Pattern 1: Filtering with IN\n", - "\n", - "The most common subquery pattern uses `IN` to filter one table based on values from another:\n", - "\n", - "```python\n", - "# Find all people who speak English\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT *\n", - "FROM person\n", - "WHERE person_id IN (\n", - " SELECT person_id\n", - " FROM fluency\n", - " WHERE lang_code = 'en'\n", - ");\n", - "```\n", - "\n", - "### Pattern 2: Filtering with NOT IN\n", - "\n", - "Use `NOT IN` to exclude records that match the subquery:\n", - "\n", - "```python\n", - "# Find people who don't speak English\n", - "non_english_speakers = Person - (Fluency & {'lang_code': 'en'})\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT p\n", - "FROM person\n", - "WHERE person_id NOT IN (\n", - " SELECT person_id\n", - " FROM fluency\n", - " WHERE lang_code = 'ENG'\n", - ");\n", - "```\n", - "\n", - "## Complex Subquery Patterns\n", - "\n", - "### Pattern 3: Multiple Conditions with AND\n", - "\n", - "When you need to satisfy multiple conditions, use multiple subqueries with AND:\n", - "\n", - "```python\n", - "# Find people who speak both English AND Spanish\n", - "english_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n", - "spanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})\n", - "bilingual = english_speakers & spanish_speakers\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person\n", - "WHERE person_id IN (\n", - " SELECT f.person_id\n", - " FROM fluency f\n", - " WHERE f.lang_code = 'ENG'\n", - ")\n", - "AND person_id IN (\n", - " SELECT person_id\n", - " FROM fluency\n", - " WHERE lang_code = 'SPA'\n", - ");\n", - "```\n", - "\n", - "### Pattern 4: Multiple Conditions with OR\n", - "\n", - "Use OR to find records that satisfy any of multiple conditions:\n", - "\n", - "```python\n", - "# Find people who speak English OR Spanish\n", - "english_or_spanish = Person & ((Fluency & {'lang_code': 'ENG'}) | (Fluency & {'lang_code': 'SPA'}))\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT *\n", - "FROM person\n", - "WHERE person_id IN (\n", - " SELECT person_id\n", - " FROM fluency\n", - " WHERE lang_code IN ('ENG', 'SPA')\n", - ");\n", - "```\n", - "\n", - "### Pattern 5: Negated Conditions\n", - "\n", - "Find records that don't match specific 
criteria:\n", - "\n", - "```python\n", - "# Find people who speak Japanese but not fluently\n", - "japanese_speakers = Person & (Fluency & {'lang_code': 'JPN'})\n", - "fluent_japanese = Person & (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})\n", - "japanese_non_fluent = japanese_speakers - fluent_japanese\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT DISTINCT *\n", - "FROM person\n", - "WHERE person_id IN (\n", - " SELECT person_id\n", - " FROM fluency\n", - " WHERE lang_code = 'JPN'\n", - ")\n", - "AND p.person_id NOT IN (\n", - " SELECT person_id\n", - " FROM fluency f\n", - " WHERE lang_code = 'JPN' AND fluency_level = 'fluent'\n", - ");\n", - "```\n", - "\n", - "## Self-Referencing Tables\n", - "\n", - "Self-referencing tables create relationships within the same table, such as hierarchical structures.\n", - "\n", - "### Example: Management Hierarchy\n", - "\n", - "```python\n", - "@schema\n", - "class Person(dj.Manual):\n", - " definition = \"\"\"\n", - " person_id : int\n", - " ---\n", - " name : varchar(60)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class ReportsTo(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Person\n", - " manager_id : int # Renamed foreign key to Person\n", - " ---\n", - " \"\"\"\n", - "```\n", - "\n", - "### Querying Hierarchical Data\n", - "\n", - "```python\n", - "# Find all managers (people who have others reporting to them)\n", - "managers = Person & ReportsTo.proj(manager_id='person_id')\n", - "\n", - "# Find all people who have managers\n", - "people_with_managers = Person & ReportsTo\n", - "\n", - "# Find top-level managers (people who don't report to anyone)\n", - "top_managers = Person - ReportsTo.proj(manager_id='person_id')\n", - "```\n", - "\n", - "**SQL Equivalents:**\n", - "```sql\n", - "-- Find all managers\n", - "SELECT *\n", - "FROM person p\n", - "WHERE person_id IN (\n", - " SELECT manager_id\n", - " FROM reports_to\n", - ");\n", - "\n", - "-- Find people with managers\n", - "SELECT *\n", - "FROM person p\n", - "WHERE person_id IN (\n", - " SELECT person_id\n", - " FROM reports_to\n", - ");\n", - "\n", - "-- Find top-level managers\n", - "SELECT *\n", - "FROM person\n", - "WHERE person_id NOT IN (\n", - " SELECT manager_id\n", - " FROM reports_to\n", - ");\n", - "```\n", - "\n", - "## Advanced Subquery Patterns\n", - "\n", - "### Pattern 6: Correlated Subqueries\n", - "\n", - "Correlated subqueries reference columns from the outer query:\n", - "\n", - "```python\n", - "# Find people who speak more languages than the average\n", - "# This requires aggregation and comparison\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT p.*\n", - "FROM person p\n", - "WHERE (\n", - " SELECT COUNT(*)\n", - " FROM fluency f\n", - " WHERE f.person_id = p.person_id\n", - ") > (\n", - " SELECT AVG(lang_count)\n", - " FROM (\n", - " SELECT COUNT(*) as lang_count\n", - " FROM fluency\n", - " GROUP BY person_id\n", - " ) counts\n", - ");\n", - "```\n", - "\n", - "### Pattern 7: EXISTS vs IN\n", - "\n", - "Use EXISTS for better performance when checking for existence:\n", - "\n", - "```python\n", - "# Find people who speak at least one language fluently\n", - "fluent_speakers = Person & (Fluency & {'fluency_level': 'fluent'})\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT p.*\n", - "FROM person p\n", - "WHERE EXISTS (\n", - " SELECT 1\n", - " FROM fluency f\n", - " WHERE f.person_id = p.person_id\n", - " AND f.fluency_level = 'fluent'\n", - ");\n", - "```\n", - "\n", - 
"## Best Practices for Subqueries\n", - "\n", - "1. **Use meaningful aliases**: Make your queries readable with clear table aliases\n", - "2. **Test subqueries independently**: Verify each subquery works before combining them\n", - "3. **Consider performance**: EXISTS is often more efficient than IN for large datasets\n", - "4. **Use parentheses**: Group complex conditions clearly\n", - "5. **Document complex logic**: Add comments explaining the business logic\n", - "\n", - "## Common Pitfalls\n", - "\n", - "1. **NULL handling**: NOT IN with NULLs can produce unexpected results\n", - "2. **Performance**: Nested subqueries can be slow on large datasets\n", - "3. **Readability**: Deeply nested subqueries can be hard to understand\n", - "4. **Maintenance**: Complex subqueries can be difficult to modify\n", - "\n", - "## Summary\n", - "\n", - "Subqueries are essential for complex data analysis. The key patterns covered include:\n", - "\n", - "- **Filtering with IN/NOT IN**: Basic subquery filtering\n", - "- **Multiple conditions**: Combining AND/OR logic\n", - "- **Negated conditions**: Finding records that don't match criteria\n", - "- **Self-referencing tables**: Hierarchical data structures\n", - "- **Correlated subqueries**: Advanced comparisons\n", - "- **EXISTS vs IN**: Performance considerations\n", - "\n", - "Mastering these patterns will enable you to answer complex questions about your data and build sophisticated database applications.\n" - ] - } - ], - "metadata": { - "language_info": { - "name": "python" - } + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Subqueries and Query Patterns\n", + "\n", + "**Subqueries** are queries nested inside other queries. In DataJoint, subqueries emerge naturally when you use query expressions as restriction conditions. This chapter explores common patterns for answering complex questions using composed queries.\n", + "\n", + "## Understanding Subqueries in DataJoint\n", + "\n", + "In DataJoint, you create subqueries by using one query expression to restrict another. The restriction operator (`&` or `-`) accepts query expressions as conditions, effectively creating a semijoin or antijoin.\n", + "\n", + "### Basic Concept\n", + "\n", + "```python\n", + "# Outer query restricted by inner query (subquery)\n", + "result = OuterTable & InnerQuery\n", + "```\n", + "\n", + "The `InnerQuery` acts as a subquery—its primary key values determine which rows from `OuterTable` are included in the result." 
+ ] }, - "nbformat": 4, - "nbformat_minor": 2 + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 1: Existence Check (IN)\n", + "\n", + "Find entities that have related records in another table.\n", + "\n", + "### Pattern\n", + "\n", + "```python\n", + "# Find A where matching B exists\n", + "result = A & B\n", + "```\n", + "\n", + "### Example: Students with Enrollments\n", + "\n", + "```python\n", + "# Find all students who are enrolled in at least one course\n", + "enrolled_students = Student & Enroll\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM student\n", + "WHERE student_id IN (SELECT student_id FROM enroll);\n", + "```\n", + "\n", + "### Example: Students with Math Majors\n", + "\n", + "```python\n", + "# Find students majoring in math\n", + "math_students = Student & (StudentMajor & {'dept': 'MATH'})\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM student\n", + "WHERE student_id IN (\n", + " SELECT student_id FROM student_major WHERE dept = 'MATH'\n", + ");\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 2: Non-Existence Check (NOT IN)\n", + "\n", + "Find entities that do NOT have related records in another table.\n", + "\n", + "### Pattern\n", + "\n", + "```python\n", + "# Find A where no matching B exists\n", + "result = A - B\n", + "```\n", + "\n", + "### Example: Students Without Enrollments\n", + "\n", + "```python\n", + "# Find students who are not enrolled in any course\n", + "unenrolled_students = Student - Enroll\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM student\n", + "WHERE student_id NOT IN (SELECT student_id FROM enroll);\n", + "```\n", + "\n", + "### Example: Students Without Math Courses\n", + "\n", + "```python\n", + "# Find students who have never taken a math course\n", + "no_math_students = Student - (Enroll & {'dept': 'MATH'})\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM student\n", + "WHERE student_id NOT IN (\n", + " SELECT student_id FROM enroll WHERE dept = 'MATH'\n", + ");\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 3: Multiple Conditions (AND)\n", + "\n", + "Find entities that satisfy multiple conditions simultaneously.\n", + "\n", + "### Pattern\n", + "\n", + "```python\n", + "# Find A where both B1 and B2 conditions are met\n", + "result = (A & B1) & B2\n", + "# Or equivalently\n", + "result = A & B1 & B2\n", + "```\n", + "\n", + "### Example: Students Speaking Both Languages\n", + "\n", + "```python\n", + "# Find people who speak BOTH English AND Spanish\n", + "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", + "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", + "bilingual = english_speakers & spanish_speakers\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM person\n", + "WHERE person_id IN (\n", + " SELECT person_id FROM fluency WHERE lang_code = 'en'\n", + ")\n", + "AND person_id IN (\n", + " SELECT person_id FROM fluency WHERE lang_code = 'es'\n", + ");\n", + "```\n", + "\n", + "### Example: Students with Major AND Current Enrollment\n", + "\n", + "```python\n", + "# Find students who have declared a major AND are enrolled this term\n", + "active_declared = (Student & StudentMajor) & (Enroll & CurrentTerm)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 4: Either/Or 
Conditions (OR)\n", + "\n", + "Find entities that satisfy at least one of multiple conditions.\n", + "\n", + "### Pattern Using List Restriction\n", + "\n", + "For simple OR on the same attribute:\n", + "\n", + "```python\n", + "# Find A where condition1 OR condition2\n", + "result = A & [condition1, condition2]\n", + "```\n", + "\n", + "### Pattern Using Union\n", + "\n", + "For OR across different relationships:\n", + "\n", + "```python\n", + "# Find A where B1 OR B2 condition is met\n", + "result = (A & B1) + (A & B2)\n", + "```\n", + "\n", + "### Example: Students in Multiple States\n", + "\n", + "```python\n", + "# Find students from California OR New York (simple OR)\n", + "coastal_students = Student & [{'home_state': 'CA'}, {'home_state': 'NY'}]\n", + "\n", + "# Or using SQL syntax\n", + "coastal_students = Student & 'home_state IN (\"CA\", \"NY\")'\n", + "```\n", + "\n", + "### Example: Students Speaking Either Language\n", + "\n", + "```python\n", + "# Find people who speak English OR Spanish (cross-relationship OR)\n", + "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", + "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", + "either_language = english_speakers + spanish_speakers\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 5: Exclusion with Condition\n", + "\n", + "Find entities that have some relationship but NOT a specific variant of it.\n", + "\n", + "### Pattern\n", + "\n", + "```python\n", + "# Find A where B exists but B with specific condition does not\n", + "result = (A & B) - (B & specific_condition)\n", + "```\n", + "\n", + "### Example: Non-Fluent Speakers\n", + "\n", + "```python\n", + "# Find people who speak Japanese but are NOT fluent\n", + "japanese_speakers = Person & (Fluency & {'lang_code': 'ja'})\n", + "fluent_japanese = Person & (Fluency & {'lang_code': 'ja', 'fluency_level': 'fluent'})\n", + "non_fluent_japanese = japanese_speakers - fluent_japanese\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + "```sql\n", + "SELECT * FROM person\n", + "WHERE person_id IN (\n", + " SELECT person_id FROM fluency WHERE lang_code = 'ja'\n", + ")\n", + "AND person_id NOT IN (\n", + " SELECT person_id FROM fluency \n", + " WHERE lang_code = 'ja' AND fluency_level = 'fluent'\n", + ");\n", + "```\n", + "\n", + "### Example: Students with Incomplete Grades\n", + "\n", + "```python\n", + "# Find students enrolled in current term without grades yet\n", + "currently_enrolled = Student & (Enroll & CurrentTerm)\n", + "graded_this_term = Student & (Grade & CurrentTerm)\n", + "awaiting_grades = currently_enrolled - graded_this_term\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 6: All-or-Nothing (Universal Quantification)\n", + "\n", + "Find entities where ALL related records meet a condition, or where NO related records fail a condition.\n", + "\n", + "### Pattern: All Match\n", + "\n", + "```python\n", + "# Find A where ALL related B satisfy condition\n", + "# Equivalent to: A with B, minus A with B that doesn't satisfy condition\n", + "result = (A & B) - (B - condition)\n", + "```\n", + "\n", + "### Example: All-A Students\n", + "\n", + "```python\n", + "# Find students who have received ONLY 'A' grades (no non-A grades)\n", + "students_with_grades = Student & Grade\n", + "students_with_non_a = Student & (Grade - {'grade': 'A'})\n", + "all_a_students = students_with_grades - students_with_non_a\n", + "```\n", + "\n", + "**SQL Equivalent**:\n", + 
"```sql\n", + "SELECT * FROM student\n", + "WHERE student_id IN (SELECT student_id FROM grade)\n", + "AND student_id NOT IN (\n", + " SELECT student_id FROM grade WHERE grade <> 'A'\n", + ");\n", + "```\n", + "\n", + "### Example: Languages with Only Fluent Speakers\n", + "\n", + "```python\n", + "# Find languages where all speakers are fluent (no non-fluent speakers)\n", + "languages_with_speakers = Language & Fluency\n", + "languages_with_non_fluent = Language & (Fluency - {'fluency_level': 'fluent'})\n", + "all_fluent_languages = languages_with_speakers - languages_with_non_fluent\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pattern 7: Reverse Perspective\n", + "\n", + "Sometimes you need to flip the perspective—instead of asking about entities, ask about their related entities.\n", + "\n", + "### Example: Languages Without Speakers\n", + "\n", + "```python\n", + "# Find languages that no one speaks\n", + "languages_spoken = Language & Fluency\n", + "unspoken_languages = Language - languages_spoken\n", + "```\n", + "\n", + "### Example: Courses Without Enrollments\n", + "\n", + "```python\n", + "# Find courses with no students enrolled this term\n", + "courses_with_enrollment = Course & (Enroll & CurrentTerm)\n", + "empty_courses = Course - courses_with_enrollment\n", + "```\n", + "\n", + "### Example: Departments Without Majors\n", + "\n", + "```python\n", + "# Find departments that have no declared majors\n", + "departments_with_majors = Department & StudentMajor\n", + "departments_without_majors = Department - departments_with_majors\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examples from the University Database\n", + "\n", + "### Example 1: Students with Ungraded Enrollments\n", + "\n", + "Find students enrolled in the current term who haven't received grades yet:\n", + "\n", + "```python\n", + "# Students enrolled this term\n", + "enrolled_current = Student & (Enroll & CurrentTerm)\n", + "\n", + "# Students with grades this term\n", + "graded_current = Student & (Grade & CurrentTerm)\n", + "\n", + "# Students awaiting grades\n", + "awaiting_grades = enrolled_current - graded_current\n", + "```\n", + "\n", + "### Example 2: Students in Specific Courses\n", + "\n", + "```python\n", + "# Students enrolled in Introduction to CS (CS 1410)\n", + "cs_intro_students = Student & (Enroll & {'dept': 'CS', 'course': 1410})\n", + "\n", + "# Students who have taken both CS 1410 and CS 2420\n", + "cs_1410 = Student & (Enroll & {'dept': 'CS', 'course': 1410})\n", + "cs_2420 = Student & (Enroll & {'dept': 'CS', 'course': 2420})\n", + "both_courses = cs_1410 & cs_2420\n", + "```\n", + "\n", + "### Example 3: High-Performing Students\n", + "\n", + "```python\n", + "# Students with only A or B grades (no C or below)\n", + "students_with_grades = Student & Grade\n", + "students_with_low_grades = Student & (Grade & 'grade NOT IN (\"A\", \"B\")')\n", + "honor_roll = students_with_grades - students_with_low_grades\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Self-Referencing Patterns\n", + "\n", + "Some tables reference themselves through foreign keys, creating hierarchies like management structures or prerequisite chains.\n", + "\n", + "### Management Hierarchy Example\n", + "\n", + "Consider a schema where employees can report to other employees:\n", + "\n", + "```python\n", + "@schema\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : 
int\n", + " ---\n", + " name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class ReportsTo(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Employee\n", + " ---\n", + " -> Employee.proj(manager_id='employee_id')\n", + " \"\"\"\n", + "```\n", + "\n", + "### Finding Managers\n", + "\n", + "```python\n", + "# Employees who have direct reports (are managers)\n", + "managers = Employee & ReportsTo.proj(employee_id='manager_id')\n", + "```\n", + "\n", + "### Finding Top-Level Managers\n", + "\n", + "```python\n", + "# Employees who don't report to anyone\n", + "top_managers = Employee - ReportsTo\n", + "```\n", + "\n", + "### Finding Non-Managers\n", + "\n", + "```python\n", + "# Employees with no direct reports\n", + "non_managers = Employee - ReportsTo.proj(employee_id='manager_id')\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building Queries Systematically\n", + "\n", + "Complex queries are best built incrementally. Follow this approach:\n", + "\n", + "### Step 1: Identify the Target Entity\n", + "\n", + "What type of entity do you want in your result?\n", + "\n", + "### Step 2: List the Conditions\n", + "\n", + "What criteria must the entities satisfy?\n", + "\n", + "### Step 3: Build Each Condition as a Query\n", + "\n", + "Create separate query expressions for each condition.\n", + "\n", + "### Step 4: Combine with Appropriate Operators\n", + "\n", + "- Use `&` for AND conditions\n", + "- Use `-` for NOT conditions\n", + "- Use `+` for OR conditions across different paths\n", + "\n", + "### Step 5: Test Incrementally\n", + "\n", + "Verify each intermediate result.\n", + "\n", + "### Example: Building a Complex Query\n", + "\n", + "**Goal**: Find CS majors who are enrolled this term but haven't received any grades yet.\n", + "\n", + "```python\n", + "# Step 1: Target entity is Student\n", + "# Step 2: Conditions:\n", + "# - Has CS major\n", + "# - Enrolled in current term\n", + "# - No grades in current term\n", + "\n", + "# Step 3: Build each condition\n", + "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", + "enrolled_current = Student & (Enroll & CurrentTerm)\n", + "graded_current = Student & (Grade & CurrentTerm)\n", + "\n", + "# Step 4: Combine\n", + "result = cs_majors & enrolled_current - graded_current\n", + "\n", + "# Step 5: Verify counts\n", + "print(f\"CS majors: {len(cs_majors)}\")\n", + "print(f\"Enrolled current term: {len(enrolled_current)}\")\n", + "print(f\"CS majors enrolled, no grades: {len(result)}\")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary of Patterns\n", + "\n", + "| Pattern | DataJoint | SQL Equivalent |\n", + "|---------|-----------|----------------|\n", + "| Existence (IN) | `A & B` | `WHERE id IN (SELECT ...)` |\n", + "| Non-existence (NOT IN) | `A - B` | `WHERE id NOT IN (SELECT ...)` |\n", + "| AND (both conditions) | `A & B1 & B2` | `WHERE ... AND ...` |\n", + "| OR (either condition) | `(A & B1) + (A & B2)` | `WHERE ... OR ...` |\n", + "| Exclusion | `(A & B) - B_condition` | `WHERE IN (...) AND NOT IN (...)` |\n", + "| Universal (all match) | `(A & B) - (B - condition)` | `WHERE IN (...) AND NOT IN (NOT condition)` |\n", + "\n", + "Key principles:\n", + "1. **Build incrementally** — construct complex queries from simpler parts\n", + "2. **Test intermediate results** — verify each step before combining\n", + "3. **Think in sets** — restriction filters sets, not individual records\n", + "4. 
**Primary key is preserved** — restrictions never change the entity type" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practice Exercises\n", + "\n", + "### Exercise 1: Existence\n", + "\n", + "**Task**: Find all departments that have at least one student major.\n", + "\n", + "```python\n", + "active_departments = Department & StudentMajor\n", + "```\n", + "\n", + "### Exercise 2: Non-Existence\n", + "\n", + "**Task**: Find students who have never taken a biology course.\n", + "\n", + "```python\n", + "no_bio = Student - (Enroll & {'dept': 'BIOL'})\n", + "```\n", + "\n", + "### Exercise 3: AND Conditions\n", + "\n", + "**Task**: Find students who major in MATH AND have taken at least one CS course.\n", + "\n", + "```python\n", + "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", + "took_cs = Student & (Enroll & {'dept': 'CS'})\n", + "math_majors_with_cs = math_majors & took_cs\n", + "```\n", + "\n", + "### Exercise 4: All-A Students\n", + "\n", + "**Task**: Find students who have received only 'A' grades.\n", + "\n", + "```python\n", + "has_grades = Student & Grade\n", + "has_non_a = Student & (Grade - {'grade': 'A'})\n", + "all_a = has_grades - has_non_a\n", + "```\n", + "\n", + "### Exercise 5: Complex Query\n", + "\n", + "**Task**: Find departments where all students have a GPA above 3.0.\n", + "\n", + "```python\n", + "# Students with GPA (computed via aggregation)\n", + "student_gpa = Student.aggr(\n", + " Course * Grade * LetterGrade,\n", + " gpa='SUM(points * credits) / SUM(credits)'\n", + ")\n", + "\n", + "# Students with low GPA\n", + "low_gpa_students = student_gpa & 'gpa < 3.0'\n", + "\n", + "# Departments with low-GPA students\n", + "depts_with_low_gpa = Department & (StudentMajor & low_gpa_students)\n", + "\n", + "# Departments where all students have GPA >= 3.0\n", + "all_high_gpa_depts = (Department & StudentMajor) - depts_with_low_gpa\n", + "```\n", + "\n", + ":::{seealso}\n", + "For more subquery examples, see the [University Queries](../80-examples/016-university-queries.ipynb) example.\n", + ":::" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 } From 6d3bffd0053327085fb0606ac528a8c83b8a9dfb Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 23:36:15 +0000 Subject: [PATCH 17/18] Combine and simplify query chapters Merge "Queries in Context" and "Five Query Operators" into a single streamlined chapter. The new chapter focuses on practical pedagogy: - Core principles (entity integrity, algebraic closure) - Quick reference for all five operators with examples - Comparison with SQL - Building complex queries step by step Removes the heavy theoretical/historical content in favor of clear, actionable guidance for learning the query operators. 
--- book/50-queries/008-datajoint-in-context.md | 386 ------------- book/50-queries/010-operators.ipynb | 595 ++++++-------------- 2 files changed, 169 insertions(+), 812 deletions(-) delete mode 100644 book/50-queries/008-datajoint-in-context.md diff --git a/book/50-queries/008-datajoint-in-context.md b/book/50-queries/008-datajoint-in-context.md deleted file mode 100644 index 2ecc759..0000000 --- a/book/50-queries/008-datajoint-in-context.md +++ /dev/null @@ -1,386 +0,0 @@ -# Queries in Context - -A Comparison with Traditional Models - -> **Note for Readers:** This chapter provides historical and theoretical context for DataJoint's design by comparing it with traditional relational databases, SQL, and the Entity-Relationship Model. **If you're primarily interested in learning how to use DataJoint**, you can skip this chapter and proceed directly to the practical query operator chapters. This material is intended for readers who want to understand: -> - Why DataJoint makes certain design choices -> - How DataJoint differs from SQL conceptually -> - The theoretical foundations underlying DataJoint's approach -> - The evolution of database query languages - ---- - -## Query Power - -### From Data Files to Data Insights - -At its core, a database query is a formal question posed to a collection of stored data. More powerfully, queries can be understood as functions that operate on this data to present a precise cross-section, tailored rigorously and efficiently for the analysis at hand. A query language provides a universal, declarative method for specifying the desired result, leaving the complex procedural details of how to locate, retrieve, and combine the data to the database management system. This ability to ask flexible, ad-hoc questions of large datasets is a fundamental departure from older, more rigid methods of data handling and a cornerstone of modern data analysis. - -This approach stands in stark contrast to the typical research workflow where data is managed in files and folders. In such environments, researchers often store data in a variety of formats, organized within a hierarchy of directories. To analyze this data, they must write custom scripts that manually navigate the directory structure, open individual files, parse their contents, and combine the necessary information. While seemingly straightforward for simple tasks, this file-based approach presents significant challenges as data complexity and scale increase. - -The file-and-folder method is fraught with inherent problems that hinder efficient and reliable research [@10.1145/1107499.1107503]: - -**Data Isolation and Fragmentation**: Information is scattered across numerous separate files, often in different formats. Answering a single research question may require writing a complex script to find and integrate data from multiple, isolated sources. - -**Redundancy and Inconsistency**: The same piece of information—such as a subject's ID or a parameter value—is often duplicated across many files. This not only wastes storage but creates a high risk of inconsistency; an update made in one file may not be propagated to all other copies, leading to a loss of data integrity. - -**Data Dependence**: The structure of the data files is tightly coupled to the specific scripts written to read them. If the format of a file changes, any script that relies on the old format is likely to break, creating a significant maintenance burden. 
- -**Lack of Provenance**: As analyses are run and re-run with different parameters, a "massive proliferation of output files" is generated. Researchers often resort to ad-hoc file naming conventions to track versions, but these are easily forgotten, making it difficult to determine the exact origin and processing history (provenance) of a given result and hindering reproducibility. - -A formal query language, as part of a database system, is designed to solve these very problems. By consolidating data into a structured, centralized repository, it reduces redundancy and enforces consistency. It provides a flexible and powerful interface for retrieving data that is independent of the underlying file storage, allowing researchers to ask new and unanticipated questions without having to write new, complex parsing scripts from scratch. This shift from manual data traversal to formal, declarative querying was the critical step that paved the way for the relational revolution. - -## Historical Background: Codd's Vision of Data Independence - -The modern database landscape, dominated by the principles of the relational model, is a testament to the revolutionary ideas put forth by Edgar F. Codd in the early 1970s. Before Codd, the world of data management was fundamentally different, characterized by systems that tightly coupled the logic of applications to the physical storage of data. Codd's primary motivation was not merely the introduction of tables as a data structure, but the pursuit of a far more profound and abstract goal: "data independence". This principle, which sought to sever the dependencies between how data is logically viewed and how it is physically stored, was a radical departure from the prevailing paradigms. It was this very act of abstraction, however, that while liberating application development, also introduced a new set of challenges, ultimately creating a conceptual gap between the real world and its representation in the database—a gap that DataJoint was designed to address. - -### The Pre-Relational Landscape: Navigating Data Dependencies - -In the late 1960s and early 1970s, the dominant database systems were structured according to two main schools of thought: the hierarchical model and the network model. The hierarchical model, exemplified by IBM's Information Management System (IMS), organized data in a tree-like structure, with parent-child relationships. The network model, an evolution of the hierarchical approach, allowed records to have multiple "parent" records, forming a more general graph-like structure. While functional for the specific applications they were designed for, these models shared a critical flaw: they were fundamentally navigational. - -Application programs written for these systems had to possess intimate knowledge of the data's physical layout. To retrieve information, a program was responsible for traversing the predefined links, pointers, and access paths embedded within the database structure. This created a state of tight coupling, where any change to the physical storage—such as reordering records for efficiency, adding new data types, or altering the access paths—risked breaking the applications that relied on that specific structure. 
- -Codd identified this inflexibility as a symptom of a deeper problem and articulated the need to eliminate several specific kinds of data dependencies that plagued these early systems: - -**Ordering Dependence**: Programs often assumed that the order in which records were presented was identical to their stored order. Any change to this physical ordering for performance reasons could cause such programs to fail. - -**Indexing Dependence**: Indices, which should be purely performance-oriented components, were often exposed to the application. Programs that referred to specific indexing chains by name would break if those indices were modified or removed. - -**Access Path Dependence**: The most significant issue was the reliance on predefined access paths. In a hierarchical or network model, a program's logic for retrieving data was inextricably linked to the specific parent-child or owner-member relationships defined in the schema. If a business requirement changed—for example, altering the relationship between Projects and Parts in a manufacturing database—the application programs that navigated the old structure would become logically impaired and cease to function correctly. - -Codd's central argument was that users and applications needed to be "protected from having to know how the data is organized in the machine". This protection, which he termed "data independence," was the foundational goal of his new model. - -### A Relational Model for Data: Core Principles of Relations, Tuples, and Domains - -In his seminal 1970 paper, "A Relational Model of Data for Large Shared Data Banks," Codd proposed a radical new approach grounded in the mathematical theory of relations. Instead of representing data as a graph of interconnected records, he proposed representing it as a simple collection of "relations." In its mathematical sense, a relation $R$ on a collection of sets $S_1, S_2, \ldots, S_n$ (which Codd called "domains") is simply a set of $n$-tuples, where each tuple's $j$-th element is drawn from the domain $S_j$. - -When visualized, a relation can be thought of as a table (though Codd himself used the term "array" in his original paper), but with a set of strict mathematical properties: - -- Each row represents a single tuple in the relation. -- All rows are distinct; duplicates are not permitted in a set. -- The ordering of rows is immaterial, a direct consequence of the set-based definition. -- The ordering of columns is significant, corresponding to the ordering of the underlying domains. Each column is labeled with the name of its domain. - -The most profound innovation of this model was how it represented relationships between data. In the navigational models, relationships were physical constructs—pointers or links. In Codd's relational model, relationships are based entirely on the values stored within the data itself. For example, to associate an employee with a department, one does not create a physical link. Instead, the Employee relation would include an attribute containing the department's unique identifier. The relationship is inferred by matching the value of this attribute to the corresponding unique identifier in the Department relation. - -This approach achieved the goal of data independence. The database management system (DBMS) would be responsible for how the relations were physically stored, indexed, and accessed. The application program only needed to know the logical structure of the relations—their names and the names of their attributes (domains). 
The physical implementation could be changed at will without affecting the application's logic. To support this, Codd formally defined key concepts that remain central to database theory today, including the primary key—a domain or combination of domains whose values uniquely identify each tuple in a relation—and the foreign key, which implements the value-based referencing between relations. - -### The Power of Relational Algebra: Closure and Relational Completeness - -Having defined the data structure, Codd and his colleagues subsequently developed a formal system for manipulating it: relational algebra. This provided the theoretical foundation for a universal data query language. Relational algebra consists of a set of operators that take one or more relations as input and produce a new relation as output. The primitive operators include: - -**Selection (σ)**: Filters the tuples (rows) of a relation based on a specified condition. - -**Projection (π)**: Selects a subset of the attributes (columns) of a relation, removing duplicate rows from the result. - -**Union (∪)**: Combines the tuples of two union-compatible relations into a single relation. - -**Set Difference (−)**: Returns the tuples that are in the first relation but not in the second. - -**Cartesian Product (×)**: Combines every tuple from one relation with every tuple from another, creating a new, wider relation. - -From these primitives, other useful operators like Join (⋈) and Intersection (∩) can be derived. Two properties of this algebra are particularly crucial for understanding its power and elegance. - -First is the **Closure property**. This principle states that the result of any operation in relational algebra is itself a relation. This is a profoundly important feature. Because the output of an operation is the same type of object as the input, operators can be composed and nested to form expressions of arbitrary complexity. A query can be built up from sub-queries, each of which produces a valid relation that can be fed into the next operation. This property is the foundation of modern query languages like SQL. - -Second is the concept of **Relational Completeness**. Relational algebra serves as a theoretical benchmark for the expressive power of any database query language. A language is said to be "relationally complete" if it can be used to formulate any query that is expressible in relational algebra (or its declarative equivalent, relational calculus). This provides a formal yardstick to measure whether a language is sufficiently powerful to perform any standard relational query without needing to resort to procedural constructs like loops or branching. - -The pursuit of data independence was, in essence, a deliberate act of abstraction. By elevating the representation of data to a purely logical, mathematical level, Codd successfully decoupled applications from the intricacies of physical storage. However, this abstraction came at a cost. The relational model, in its pure form, is powerful precisely because it is "semantically poor." It operates on mathematical sets of tuples and logical predicates, remaining agnostic to the real-world meaning of the data it represents. A relation for Students and a relation for Enrollments are, to the algebra, structurally identical. The relationship between them is not an explicit construct within the model but an inference to be made by joining them on common attribute values. 
This focus on logical consistency over semantic richness created a powerful but abstract foundation, one that left a void in conceptual clarity. It answered the question of what could be queried with mathematical precision but offered few guidelines on what makes sense to query. This semantic sparseness created a conceptual gap between the way humans think about the world and the way data was represented, a gap that would soon necessitate a new layer of modeling to reintroduce meaning. - -## The Semantic Layer: Chen's Entity-Relationship Model - -While Codd's relational model provided a mathematically rigorous and logically consistent foundation for data management, its abstract nature quickly revealed a practical challenge. The model's strength—its separation from real-world semantics in favor of pure logical structure—was also a weakness from the perspective of database design. Translating a complex real-world business problem directly into a set of normalized relations was a non-intuitive task that required a high level of expertise. In response to this challenge, Peter Pin-Shan Chen introduced the Entity-Relationship Model (ERM) in his 1976 paper, "The Entity-Relationship Model—Toward a Unified View of Data". The ERM was not a competitor to the relational model but rather a complementary conceptual layer designed to re-introduce the high-level semantics that the purely logical relational model had abstracted away. - -### Modeling the Real World: Entities, Attributes, and Explicit Relationships - -Chen's primary motivation was the observation that the relational model, despite its success in achieving data independence, "may lose some important semantic information about the real world". The ERM was proposed to capture this semantic information by adopting a more natural and intuitive view: that the real world consists of "entities" and "relationships" among them. The model is built upon three fundamental concepts, which are typically visualized using an Entity-Relationship Diagram (ERD): - -**Entity**: An entity is defined as a "thing which can be distinctly identified". This can be a physical object (like a person, a car, or a product) or a conceptual object (like a company, a job, or a course). In an ERD, entities are represented by rectangular boxes. A group of entities of the same type is called an entity set. - -**Attribute**: An attribute is a property or characteristic of an entity or a relationship. For example, an EMPLOYEE entity might have attributes like Name, Age, and Salary. In Chen's original notation, attributes are represented by ovals connected to their respective entity or relationship. - -**Relationship**: This is the most significant departure from the pure relational model. A relationship is an explicit "association among entities". For instance, a relationship named Works_In might associate an EMPLOYEE entity with a DEPARTMENT entity. In an ERD, relationships are represented by diamond-shaped boxes, with lines connecting them to the participating entities. This makes the association a first-class citizen of the model, giving it a name and its own properties (attributes). In the relational model, this same association would be implicit, represented only by a foreign key in the EMPLOYEE table referencing the DEPARTMENT table. - -The invention of the ERM was a direct reaction to the perceived semantic limitations of the relational model. 
Codd's work, published in 1970, provided the mathematical engine for databases, but by 1976, the need for a more human-centric design tool was evident. Chen's paper explicitly aimed to bridge this gap by incorporating the semantic information that he and others felt was being lost during the process of database design. The elevation of the "relationship" to a distinct, named concept was the ERM's central innovation. In the relational model, a many-to-many relationship is implemented as just another relation (an association table), which, at the logical level, is indistinguishable from a relation that represents an entity. The ERM, by contrast, creates a clear conceptual distinction between "things" (entities) and the "connections between things" (relationships), which aligns more closely with human intuition. - -### The ERM as a Conceptual Blueprint for Database Design - -The Entity-Relationship Model quickly became the standard for the conceptual phase of database design. It functions as a high-level blueprint, a tool for communication between database designers, developers, and non-technical business stakeholders. The typical database design workflow evolved into a two-stage process: - -**Conceptual Modeling**: The designer first works with domain experts to understand the business requirements. They identify the key entities, their attributes, and the relationships between them, capturing this understanding in an ERD. This model is at a high level of abstraction and is independent of any specific database technology. - -**Logical Design (Translation)**: Once the conceptual model is finalized and validated, it is translated into a logical model, typically a relational schema. This involves a more-or-less mechanical process of mapping the ERD constructs to relational constructs: entities become tables, attributes become columns, and relationships are implemented using primary and foreign keys. - -This separation of concerns was highly effective. The ERM provides a "user-friendly" semantic layer that allows for clear and intuitive modeling of the problem domain. It allows designers to focus on the "what" (what data needs to be stored and how it relates) before getting bogged down in the "how" (how it will be implemented in a specific RDBMS). Thus, the ERM can be understood not as an alternative to the relational model, but as an essential precursor—a semantic framework built upon the relational model's powerful logical foundation, designed to make the creation of robust and meaningful databases a more structured and less error-prone endeavor. - -## The Conceptual Gap: An Impedance Mismatch in Database Design - -The two-stage process of database design—conceptual modeling with the Entity-Relationship Model followed by logical implementation in the Relational Model—became a cornerstone of software engineering. However, the very existence of a "translation" step between these two models belies a fundamental disconnect. This disconnect, often referred to as a "conceptual gap" or an "impedance mismatch," arises because the process of mapping a rich, semantic ER model onto a purely logical relational schema is not lossless. Important semantic information explicitly captured in the ERM becomes implicit, fragmented, or obscured in the final relational implementation. This gap has profound practical consequences, shifting the burden of maintaining semantic consistency from the database system itself to the application developer, who must mentally bridge this gap with every query they write. 
- -### The Translation Problem: From Conceptual Model to Logical Schema - -The conversion of an ER diagram into a set of relational tables, while guided by a set of established rules, is not an entirely trivial process. The standard procedure involves the following mappings: - -- Each strong entity set in the ERD becomes a table in the relational schema. -- The simple attributes of the entity become columns in the corresponding table. -- The primary key of the entity is designated as the primary key of the table. -- Relationships are implemented using foreign keys. For a one-to-many relationship, the primary key of the "one" side is added as a foreign key column to the table on the "many" side. -- Many-to-many relationships require the creation of a new, intermediate table (often called a "junction," "linking," or "association" table) that holds the foreign keys from both participating entities. - -While this process seems straightforward, it is inherently a process of transformation where the high-level conceptual constructs of the ERM are flattened into the uniform structure of relations. This flattening is where the loss of semantic fidelity begins. - -### Loss of Semantics: How Explicit Relationships Become Implicit Constraints - -The most significant loss of semantic information occurs in the representation of relationships. In the ERM, a relationship is a first-class citizen: a named, explicit association between two or more entities, visualized as a diamond in an ERD. It carries clear semantic weight; for example, the Enrolls relationship connects Student and Course entities, clearly stating the nature of their association. - -During the translation to the relational model, this explicit, named construct vanishes. It is replaced by an implicit link embodied by a foreign key constraint. The Enrolls relationship might be implemented by adding a `student_id` column to the Course table (or more likely, in a junction table between them). From the perspective of the relational database and its query engine, the "relationship" is nothing more than a rule stating that the values in `Enrolls.student_id` must exist in `Student.student_id`. The semantic meaning—the verb "enrolls"—is lost to the system. It is relegated to documentation or the institutional knowledge of the developers. The query algebra operates on tables and columns, not on the conceptual relationships they were designed to represent. - -### The Fragmentation of Entities Through Normalization - -A second source of conceptual dissonance is the process of normalization. Normalization is a formal technique in relational database design for organizing tables to minimize data redundancy and prevent data manipulation anomalies (such as insertion, update, and deletion anomalies). While essential for maintaining data integrity, normalization often has the side effect of fragmenting what a user would consider a single, cohesive real-world entity across multiple tables. - -Consider a simple conceptual entity like a Customer Order. In the real world, an order is a single unit of thought: it has a customer, a date, a shipping address, and a list of items, each with a quantity and price. To represent this correctly in a normalized relational database, this single conceptual entity must be decomposed into several tables: a Customers table, an Orders table (with customer ID and date), an OrderItems table (linking orders to products with quantities), and a Products table (with product details and prices). 
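A rough sketch of that decomposition (the names and attribute choices are illustrative, not a recommended design) makes the fragmentation visible:

```python
from dataclasses import dataclass

# Illustrative only: one conceptual "customer order" split across four normalized tables.

@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # references Customer
    order_date: str

@dataclass
class OrderItem:
    order_id: int      # references Order
    product_id: int    # references Product
    quantity: int

@dataclass
class Product:
    product_id: int
    price: float
```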
- -The holistic concept of an "Order" no longer exists as a single structure within the database. It has been systematically dismantled and distributed across the schema. To reconstitute this conceptual entity—for example, to display a complete order to a user—an application developer must write a complex query involving multiple joins across these fragmented tables. The logic of the query becomes less about the business concept ("get the order") and more about the implementation details of the relational schema ("join Orders to Customers on customer_id, then join Orders to OrderItems on order_id, and finally join OrderItems to Products on product_id"). - -### The Ambiguity of the Natural Join in Standard Relational Implementations - -This burden of reconstruction is further complicated by the nature of the join operators in standard SQL. The `NATURAL JOIN` operator, which is conceptually closest to Codd's original join, operates by automatically joining two tables on all columns that share the same name. While this can be convenient, it is also notoriously fragile and prone to error. If two tables happen to share a column name that is not intended to represent a semantic relationship—for example, if both a Sessions table and a LogMessages table have a column named `timestamp` or `user_id` for different purposes—a `NATURAL JOIN` will produce incorrect and often nonsensical results by joining on this incidental name collision. - -The more explicit `INNER JOIN... ON` syntax mitigates this ambiguity by forcing the user to specify the exact join columns. However, it is entirely permissive. It allows a user to join any two tables on any arbitrary condition, regardless of whether a formal foreign key relationship has been defined in the schema. This flexibility places the entire burden of semantic correctness on the user. The query engine will not prevent a developer from joining Employees to Products on `hire_date = launch_date`, even though such a query is semantically meaningless. The database's structural knowledge (its foreign key constraints) is divorced from the operational logic of its query language. - -This collection of issues—the demotion of relationships to implicit constraints, the fragmentation of entities through normalization, and the semantic ambiguity of join operations—constitutes the conceptual gap. It is an "impedance mismatch" between the high-level, object-oriented way humans conceptualize a problem domain and the low-level, set-oriented, and fragmented way it is represented in a logical database schema. The practical result is that queries become more complex, more difficult to write, more prone to error, and less intuitive. The developer is forced to constantly perform a mental translation between the conceptual model they have in their head and the logical schema they must query, a cognitive load that increases complexity and reduces productivity. - -## The DataJoint Model: A Principled Refinement - -The conceptual gap between the Entity-Relationship Model and the Relational Model is not an academic curiosity; it is a persistent source of complexity and error in practical database programming. DataJoint was designed from the ground up with a clear understanding of this gap, and its data model represents a principled effort to bridge it. Rather than inventing entirely new concepts, DataJoint introduces a "conceptual refinement of the relational data model" that also draws heavily on the principles of the ERM. 
By enforcing the best practices of conceptual modeling at the logical level of its query algebra, DataJoint creates a more intuitive, robust, and semantically coherent framework for database programming. - -### The Core Philosophy: Entity Normalization as a First Principle - -The unifying principle at the heart of the DataJoint model is **Entity Normalization**. This is a crucial refinement of the classical concept of normalization in relational theory. While classical normal forms (like Boyce-Codd Normal Form) are defined in terms of functional dependencies and are aimed at preventing data anomalies, DataJoint's entity normalization is a more conceptual and overarching principle. It states that all data, whether stored in base tables or derived as the result of a query, must be represented as well-formed entity sets. - -A well-formed entity set in DataJoint must satisfy a strict set of criteria that directly reflect the ideals of conceptual modeling: - -**Represents a Single Entity Type**: All elements (tuples) within the set must belong to the same well-defined and readily identifiable entity type from the modeled world (e.g., Mouse, ExperimentalSession, SpikeTrain). - -**Attributes are Directly Applicable**: All attributes (columns) must be properties that apply directly to each entity in the set. - -**Unique Identification via Primary Key**: All elements must be distinguishable from one another by the same primary key. - -**Non-Null Primary Key**: The values of the attributes that form the primary key cannot be missing or set to NULL. - -This principle effectively takes the conceptual ideal of the ERM—that a table should represent a distinct, well-defined set of real-world entities—and elevates it to a mandatory, computationally enforced rule for all data within the system. - -### From Static Schema to Dynamic Workflow - -A key innovation in DataJoint's philosophy is its departure from the static view of a database schema common in ERM-based design. DataJoint treats the database as a dynamic data pipeline or workflow. In this paradigm, each entity set (table) is not merely a passive container for data but represents an active step in a larger process. The schema itself becomes a directed acyclic graph (DAG) where nodes are entity sets and the directed edges are dependencies representing the flow of data and computation. - -This workflow-centric view fundamentally reframes the concept of a "relationship." In the ERM, a relationship is a static, named association between entities. In DataJoint, what might be modeled as a relationship set in an ERD is instead viewed as a computational step that requires the association of entities created upstream in the workflow. For example, an Enrollment table doesn't just represent a static link between Student and Course; it represents a step in the workflow where a student is enrolled in a course, a step that depends on the prior existence of both the student and the course entities. This makes computational dependencies a first-class citizen of the data model, integrating the structure of the data with the process of its creation and analysis. - -### Integrating Conceptual Design into the Logical Model - -DataJoint's design philosophy encourages a much tighter integration between the conceptual and logical models than is typical in standard SQL-based development. This begins with its schema definition language, which is more expressive and less error-prone than SQL's Data Definition Language (DDL). 
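As a minimal sketch (the schema name and attribute definitions are hypothetical), such table definitions might read:

```python
import datajoint as dj

schema = dj.Schema('university')   # hypothetical schema name

@schema
class Student(dj.Manual):
    definition = """
    student_id : int
    ---
    student_name : varchar(64)
    """

@schema
class Course(dj.Manual):
    definition = """
    course_id : varchar(16)
    ---
    course_title : varchar(120)
    """

@schema
class Enrollment(dj.Manual):
    definition = """
    # a student's enrollment in a course
    -> Student
    -> Course
    """
```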
- -In DataJoint, dependencies (which implement foreign key relationships) are a primary construct in the table definition syntax, denoted by a simple arrow (`->`). This small syntactic choice has a large conceptual impact. It makes the relationships between entities a central and highly visible part of the schema definition, encouraging designers to think in terms of a dependency graph of interconnected entities, much like an ERD. This contrasts with SQL, where foreign key constraints are often added as an afterthought at the end of a `CREATE TABLE` statement. By making dependencies explicit and central, DataJoint's language guides the designer toward a schema that more faithfully represents the conceptual model. - -## DataJoint's Query Algebra: Bridging the Conceptual Gap - -Having established DataJoint's philosophical foundations and schema design principles, we now turn to the heart of its innovation: the query algebra. DataJoint's operators are specifically designed to preserve semantic coherence while providing the full expressive power needed for complex data analysis. This section examines how the algebra maintains entity integrity and how two key operators—the semantic join and binary aggregation—directly address the problems identified in the conceptual gap. - -### Terminology: A Rosetta Stone for Data Models - -To ensure clarity in the subsequent technical discussion, the following table provides a mapping of key concepts across different data models. This lexicon grounds the analysis in consistent vocabulary and highlights how DataJoint synthesizes ideas from multiple traditions. - -**Table 1: Comparison of Data Model Terminology** - -| Formal Relational Model | Entity-Relationship Model | SQL Implementation | DataJoint Model | -|-------------------------|---------------------------|-------------------|-----------------| -| Relation | Entity Set / Relationship Set | Table | Entity Set / Table | -| Tuple | Entity / Relationship | Row | Entity / Tuple | -| Attribute | Attribute | Column / Field | Attribute | -| Domain | Value Set | Data Type | Data Type | -| Primary Key | Primary Key | PRIMARY KEY | Primary Key | -| - | Relationship | FOREIGN KEY constraint | Dependency (`->`) | -| Derived Relation | - | View / Query Result | Query Expression | - -### Core Properties: Algebraic Closure and Entity Integrity Preservation - -The principle of entity normalization is not merely a guideline for schema design; it is a strict constraint on DataJoint's query language. DataJoint implements a complete relational algebra with five primary operators: restrict (`&`), join (`*`), project (`proj`), aggregate (`aggr`), and union (`+`). This algebra is designed around two critical properties that work in concert to maintain semantic cohesion: - -**Algebraic Closure**: Like classical relational algebra, DataJoint's algebra possesses the closure property. All operators take entity sets as input and produce a valid entity set as output. This allows for the seamless composition and nesting of query expressions. - -**Entity Integrity Preservation**: This is DataJoint's crucial extension to the closure property. The output of every operator is not just any relation; it is guaranteed to be a well-formed entity set with a well-defined primary key. 
This is a much stronger guarantee than that provided by standard SQL, where a query (e.g., one involving `GROUP BY` or a projection that removes key attributes) can easily produce a result that is not a proper entity set because its rows are not uniquely identified. - -This unwavering commitment to preserving entity integrity throughout every step of a query is the foundational mechanism by which DataJoint bridges the conceptual gap. An operator is only considered valid within the DataJoint algebra if its application results in a conceptually sound entity set. This represents a fundamental shift from a query language that is agnostic to the conceptual meaning of the data to one that respects and preserves the semantic structure established during schema design. - -#### The Trade-off: Semantic Clarity Over Relational Completeness - -DataJoint's strict adherence to entity integrity leads to a deliberate departure from the classical definition of relational completeness. Recall that a language is relationally complete if it can express any query that Codd's original relational algebra can. DataJoint cannot reproduce certain classical operators precisely because it deems their outputs to be semantically incoherent or in violation of entity normalization. - -Consider the projection operator. In Codd's relational algebra, projection selects a subset of columns and then removes any duplicate rows that result from this selection. This operation can easily produce a result where the original primary key is removed, leading to a set of tuples that are no longer uniquely identifiable—a violation of entity normalization. - -DataJoint's projection operator (`proj`) is intentionally more restrictive. It prohibits projecting out attributes that are part of the primary key, thereby guaranteeing that the output always has the same number of entities as the input, with every entity remaining unique and identifiable by the original primary key. The entity type and its primary key are preserved. - -This design choice reflects a core philosophy: **semantic clarity and the preservation of entity integrity are prioritized over the ability to perform operations that, while mathematically valid in pure set theory, can lead to conceptually ambiguous or meaningless results in a structured data model.** - -### The Semantic Join: Restoring Relationship Semantics - -The join operator is where the conceptual gap is most acutely felt. As discussed earlier, the translation from ERM to relational schemas demotes explicit relationships to implicit foreign key constraints, and SQL's join operators fail to enforce these semantic connections. DataJoint's join operator (`*`) directly addresses this problem by enforcing **semantic matching**—ensuring that joins follow the meaningful relationships defined in the schema. - -#### How It Works: From Matching Names to Matching Semantics - -The DataJoint join operator, written as `A * B`, is defined as an operation that "combines the matching information in A and B". The result contains all matching combinations of entities from both operands. At its core, it performs an equijoin on all namesake attributes, similar in spirit to a natural join. However, its behavior is governed by a critical set of rules that distinguish it sharply from its SQL counterparts. - -Standard SQL offers two primary join paradigms. 
The first, `NATURAL JOIN`, is dangerously implicit; it joins tables on all columns that happen to share the same name, which can lead to spurious and incorrect joins if column names are reused for different semantic purposes across tables. The second, `INNER JOIN... ON`, is explicit but overly permissive; it allows the user to specify any join condition, providing no systemic safeguard against joining on attributes that do not represent a valid, schema-defined relationship. DataJoint's join operator was designed to find a principled middle ground: to be as simple and implicit as a natural join, but as safe and semantically rigorous as a schema-enforced constraint. - -#### The Semantic Matching Principle - -The power and safety of DataJoint's join stem from the principle of **Semantic Matching**. For two tables (or query expressions) A and B to be joinable, their common attributes must satisfy a crucial condition: **these shared attributes must be part of a primary key or a foreign key in at least one of the operand tables, and should ultimately derive from the same original source attribute through the dependency graph.** - -This rule has profound implications: - -1. **Schema-Enforced Relationships**: A join is only permitted if the database schema has explicitly defined a semantic link between the entities involved—either through a direct dependency (foreign key) or by sharing a common primary key attribute that originated from an upstream entity. - -2. **Active Constraints**: Foreign keys are elevated from passive integrity constraints (that merely prevent orphaned records) to active preconditions for join operations. The schema's intended meaning directly governs query behavior. - -3. **Semantic Query Interpretation**: The expression `A * B` is not merely asking "find rows in A and B with matching values in their common columns." Instead, it asks a semantic question: "Find the entities in A and B that are related to each other according to the schema's defined dependency structure, and combine their information." - -This fundamentally changes the nature of querying, aligning the operational logic of the query language with the conceptual model of the data. The join operation becomes a constrained traversal along the directed acyclic graph of dependencies that constitutes the schema, not a free-form text-matching exercise on column names. - -#### A Comparative Analysis: DataJoint's `*` Operator vs. SQL NATURAL JOIN and INNER JOIN - -The practical benefit of this semantic constraint is best illustrated with a concrete example. Consider a neuroscience data pipeline with a Session table, representing an experimental session, and a SpikeSorting table, containing the results of spike sorting for that session. Both tables might logically include a non-key attribute named `timestamp` to record when the entry was created or last modified. - -- **Session table**: `(session_id, ..., timestamp)` where `session_id` is the primary key. -- **SpikeSorting table**: `(session_id, ..., timestamp)` where `session_id` is the primary key and a foreign key referencing Session. - -Now, consider the following join operations: - -**SQL NATURAL JOIN:** -```sql -SELECT * FROM Session NATURAL JOIN SpikeSorting; -``` - -This query would attempt to join on all common columns: `session_id` AND `timestamp`. This is semantically incorrect. It would only return results where the session entry and the spike sorting entry were created at the exact same microsecond, which is almost certainly not the user's intent. 
The query fails due to the incidental name collision of the `timestamp` attribute. - -**DataJoint Join (`*`):** -```python -Session * SpikeSorting -``` - -This query would be evaluated against the semantic matching rule. The common attributes are `session_id` and `timestamp`. `session_id` is a primary key in Session and part of a foreign key in SpikeSorting, so it satisfies the rule. However, the `timestamp` attribute is a secondary attribute in both tables; it is not part of any primary or foreign key. Therefore, the operation fails the semantic matching check, and DataJoint will raise an error, preventing the semantically meaningless join from executing. The system actively protects the user from making a logical error. To perform the correct join, the user would first need to project away the ambiguous attribute. - -This comparison reveals the DataJoint join as an operator that is not only powerful but also inherently safe, guiding the user toward queries that are consistent with the intended semantics of the database schema. The following table provides a systematic comparison of the different join paradigms. - -**Table 2: A Comparative Overview of Join Operations** - -| Feature | SQL INNER JOIN... ON | SQL NATURAL JOIN | DataJoint Join (`*`) | -|---------|---------------------|------------------|---------------------| -| Join Condition | Explicitly specified by user in ON clause. | Implicitly defined on all columns with matching names. | Implicitly defined on all common attributes. | -| Precondition | None. Can join on any columns of compatible types. | None. Will join if any columns share names. | Semantic Matching: Common attributes must be part of a primary or foreign key and share a common origin. | -| Semantic Guarantee | Low. Relies entirely on user correctness. High risk of error. | Low. Prone to spurious joins on incidental name matches. | High. The join is guaranteed to follow a path defined in the schema's dependency graph. | -| Example Behavior | `A JOIN B ON A.x = B.y` is valid even if x and y are unrelated. | `A NATURAL JOIN B` will fail or produce wrong results if A and B share an unrelated column name (e.g., timestamp). | `A * B` will raise an error if A and B share an unrelated column name, enforcing semantic correctness. | - -By enforcing semantic matching, DataJoint's `*` operator effectively restores the explicit nature of relationships from the ERM at the query level. It ensures that joins are not arbitrary combinations of data but meaningful compositions of related entities, thereby bridging a critical part of the conceptual gap. - -### Binary Aggregation: Reassembling Fragmented Entities - -The second major challenge arising from the conceptual gap is entity fragmentation through normalization. As discussed earlier, a single real-world concept (like a complete customer order) often must be decomposed into multiple tables for proper normalization. Standard SQL's `GROUP BY` clause can summarize this fragmented data, but at the cost of creating new, transformed entity sets that lose their connection to the original conceptual entities. - -DataJoint's binary aggregation operator (`aggr`) takes a fundamentally different approach. Instead of transforming entities, it **annotates** them, adding summary information while preserving their identity. This directly addresses the fragmentation problem, allowing users to progressively enrich a primary entity with information from its constituent parts. 
- -#### The Problem: SQL's GROUP BY Transforms Entity Identity - -While SQL's `GROUP BY` clause is powerful for summarization, it fundamentally transforms the entity set being queried. Consider this query to count students per course: - -```sql -SELECT course_id, COUNT(student_id) AS num_students -FROM Enrollment -GROUP BY course_id; -``` - -The output is **not** a set of Course entities. It's a new entity set of "course enrollment counts"—a summary report with a different primary key, different structure, and different meaning from the original Course table. The conceptual identity is broken. Any subsequent operations must work with this derived entity set, which has lost its direct connection to the original Course entities. - -This pattern repeats throughout SQL usage: to answer questions about fragmented entities, users must create intermediate summary tables that are conceptually disconnected from the entities they care about. - -#### DataJoint's Solution: Annotation, Not Transformation - -DataJoint's binary `aggr` operator takes the form `A.aggr(B, ...)` where A is the target entity set to be annotated and B is the entity set containing the information to be aggregated. For example: - -```python -Section.aggr(Enroll, n='count(*)') -``` - -This query says: "Take the existing Section entity set and add a new attribute `n` to each Section entity, where `n` is calculated by counting the matching entries in the Enroll table." - -The result is still a set of Section entities—same primary key, same entity type, same number of rows—just enriched with additional information. This distinction is crucial for maintaining conceptual coherence. - -#### How It Works: Implicit Grouping by Primary Key - -The `aggr` operator achieves its annotation behavior through a clever transpilation to SQL. The expression `A.aggr(B, ...)` generates SQL that: - -1. Performs a `NATURAL LEFT JOIN` from A to B (ensuring all entities from A are included) -2. Applies `GROUP BY` using **A's primary key** (not the user-specified columns) -3. Projects A's attributes along with the computed aggregate attributes - -The key insight is step 2: by always grouping by A's primary key, the operation guarantees exactly one output row for every entity in A. The original entity set serves as the scaffold for the result. Aggregation functions operate on the matched entities from B for each unique entity in A, but the result remains a set of A entities, just enriched with new information. - -#### Practical Implications - -This annotation-based approach has profound consequences: - -**Entity Identity Preservation**: The result of `A.aggr(B, ...)` has the same entity class, same primary key, and same number of elements as A. It's still a well-formed entity set of type A. - -**Seamless Composability**: Because conceptual identity is preserved, the result can be immediately used in subsequent operations. For example, `Section.aggr(Enroll, n='count(*)')` remains a valid Section entity set that can be directly joined with Course: `(Section.aggr(Enroll, n='count(*)')) * Course`. - -**Progressive Enrichment**: Users can start with a primary entity (e.g., Course) and progressively annotate it with summary information from its dependent parts (Section, Enrollment) without ever losing the conceptual integrity of the Course entities. The query logic follows the user's conceptual model of the world, directly bridging the gap between the holistic real-world entity and its fragmented logical representation. 
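As a brief sketch building on the same hypothetical Section, Enroll, and Course tables, the annotated result composes freely with the other operators:

```python
# Still a Section entity set, now annotated with an enrollment count.
section_counts = Section.aggr(Enroll, n='count(*)')

# Entity identity is preserved, so further operators apply naturally.
busy_sections = section_counts & 'n > 100'   # restrict on the computed attribute
enriched = busy_sections * Course            # join back to Course information
```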
- -The following table summarizes the fundamental differences between the two aggregation paradigms. - -**Table 3: A Comparative Overview of Aggregation Operations** - -| Feature | SQL GROUP BY | DataJoint `aggr` Operator | -|---------|--------------|--------------------------| -| Operation Type | Unary (operates on the result of a FROM clause). | Binary (`A.aggr(B, ...)`). | -| Grouping Basis | Explicitly specified columns in the GROUP BY clause. | Implicitly the primary key of the first operand (A). | -| Output Primary Key | The set of columns in the GROUP BY clause. Often different from any input table's primary key. | The primary key of the first operand (A). Always preserved. | -| Output Entity Set | A new entity set representing the grouped aggregates. The original entity set is lost. | The same entity set as the first operand (A), annotated with new attributes. | -| Conceptual Effect | Transformation/Summarization. Creates a new kind of result. | Annotation/Enrichment. Adds information to an existing set of entities. | -| Algebraic Consequence | The result often loses its original entity identity, making further semantic joins difficult. | The result retains its entity identity, allowing seamless use in subsequent joins and other operations. | - ---- - -## Conclusion: A Unified View of Data and Queries - -The history of relational databases is a story of abstraction. Edgar F. Codd's relational model achieved the revolutionary goal of data independence by abstracting the logical representation of data from its physical storage, grounding it in the rigorous mathematics of set theory. This created a powerful and flexible foundation for data management but also introduced a semantic void. Peter Chen's Entity-Relationship Model emerged to fill this void, providing a conceptual framework that aligned more closely with human intuition about real-world entities and their explicit relationships. The translation from the conceptual ERM to the logical RM, however, created a persistent "conceptual gap," where the rich semantics of the design phase were lost or obscured in the final implementation, placing a significant cognitive burden on developers and analysts. - -### How DataJoint Bridges the Gap - -DataJoint represents a significant step in the evolution of the relational model by systematically addressing the conceptual gap. Rather than abandoning relational principles, it refines them, enforcing the conceptual clarity of the ERM at the very core of its query algebra. The unifying principle of Entity Normalization—requiring that all data, whether stored or derived, must be a well-formed entity set—serves as the foundation for this refinement. - -This analysis has demonstrated how DataJoint's query operators directly address the specific problems identified in the conceptual gap: - -**Problem: Loss of Relationship Semantics → Solution: Semantic Join (`*`)** - -The semantic join re-establishes the primacy of schema-defined relationships. By enforcing semantic matching, it constrains joins to traverse only paths explicitly defined through foreign key dependencies. This transforms the join from an ambiguous value-matching operation into a safe, schema-aware traversal of the entity graph, effectively restoring the explicit "relationship" construct of the ERM within the query language itself. - -**Problem: Entity Fragmentation → Solution: Binary Aggregation (`aggr`)** - -The binary aggregation operator counteracts conceptual fragmentation caused by normalization. 
By reframing aggregation as annotation rather than transformation, it allows users to enrich entity sets with summary information from constituent parts without destroying entity identity. The operator's guarantee to preserve the primary key and entity type ensures that entity integrity is maintained throughout the query, enabling users to conceptually reassemble fragmented entities in an intuitive and algebraically sound manner. - -Together, these operators create a query language that is more than just a set of instructions for data manipulation—it is a **system for semantic inquiry**. The algebra itself understands and respects the conceptual structure of the database, guiding users toward queries that are not only syntactically correct but also semantically meaningful. - -### Practical Impact: Scientific Data Pipelines - -The theoretical advantages of DataJoint's model translate directly into practical benefits, particularly in its primary domain: large-scale scientific data pipelines. Modern scientific research, especially in fields like neuroscience, involves complex, multi-stage workflows generating vast and heterogeneous datasets. In this environment, where data integrity, reproducibility, and collaboration are paramount, DataJoint's semantic guarantees are not mere conveniences—they are essential. - -**Reproducibility**: Unambiguous queries directly tied to the schema's defined logic reduce the risk of subtle errors that can compromise the reproducibility of scientific findings. - -**Collaboration**: A semantically coherent query language makes data pipeline logic transparent and accessible to all team members, from experimentalists to computational analysts. The code more closely reflects the scientific logic of the workflow, facilitating communication and reducing onboarding time for new researchers. - -**Integrity and Correctness**: Systemic enforcement of entity integrity and schema-defined relationships provides strong defense against data corruption, ensuring that complex analyses rest on a foundation of consistent and correctly associated data. - -In essence, DataJoint creates a **unified view** where the conceptual model of an experiment, the logical structure of the database, and the operational queries performed on it are all aligned. The query language becomes more than a tool for data retrieval—it becomes an active participant in enforcing the scientific logic and integrity of the entire research workflow. - -By bridging the historical gap between conceptual modeling and logical implementation, DataJoint provides a more powerful, intuitive, and reliable framework for the future of data-intensive science. - diff --git a/book/50-queries/010-operators.ipynb b/book/50-queries/010-operators.ipynb index 3bd0350..430592d 100644 --- a/book/50-queries/010-operators.ipynb +++ b/book/50-queries/010-operators.ipynb @@ -4,516 +4,259 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Five Query Operators\n", + "# The Query Operators\n", "\n", - "## Clarity in Complexity: Why DataJoint's Five Query Operators Are All You Need\n", + "DataJoint provides five operators for building queries. These operators form a complete query language—any question you can ask of your data can be expressed using these five tools.\n", "\n", - "Navigating complex data demands tools that offer both power and clarity. **DataJoint** is designed for building and managing scientific data pipelines. 
The upcoming release of **DataJoint Specs 2.0** marks the first time DataJoint will be developed against a formal, open specification document, embodying the philosophy of an open standard and capturing its core theoretical concepts to ensure consistent implementations across different platforms.\n", - "\n", - "While many are familiar with **SQL**, the *lingua franca* of relational databases, DataJoint's query language, as defined in these new specs, employs a remarkably concise set of just **five core operators**. This naturally begs the question: in a world accustomed to SQL's extensive vocabulary, can just five operators truly be enough?\n", - "\n", - "This chapter argues an emphatic **\"yes\"**—not despite their small number, but precisely because of their rigorous design and unwavering commitment to fundamental relational principles.\n", - "\n", - "---\n", - "\n", - "## The Theoretical Bedrock: From Codd to Chen to SQL\n", - "\n", - "To appreciate DataJoint's approach, we must first understand the foundations.\n", - "\n", - "### Relational Database Theory\n", - "\n", - "**Relational database theory**, pioneered by **Edgar F. Codd** in the late 1960s and early 1970s, is built on rigorous mathematics. Codd introduced two fundamental formalisms:\n", - "\n", - "* **Relational algebra**: A procedural language where operators like selection, projection, and join manipulate tables (relations) to produce new tables.\n", - "* **Relational calculus**: A declarative language allowing users to specify what data they want.\n", - "\n", - "Codd proved these two formalisms were equivalent in power, establishing the concept of **relational completeness**—any query expressible in one formalism could be expressed in the other.\n", - "\n", - "### The Entity-Relationship Model\n", - "\n", - "In 1976, a pivotal moment arrived with **Peter Chen's** introduction of the **Entity-Relationship Model (ERM)**. Chen proposed modeling data in terms of:\n", - "\n", - "* **Entities**: Distinguishable \"things\" like a student, a course, or an experiment\n", - "* **Relationships**: Connections between entities, like a student \"enrolling\" in a course\n", - "\n", - "The ERM provided **ER diagrams**—a powerful visual language that became incredibly influential for database schema design and for communication between designers, domain experts, and stakeholders. It offered an intuitive framework for translating real-world scenarios into structured data models, naturally leading to well-normalized schemas.\n", - "\n", - "### The Disconnect with SQL\n", - "\n", - "A significant disconnect emerged: while ERM became a standard for conceptual design, its elegant, entity-centric syntax was **never directly mirrored** in SQL's Data Definition Language (DDL) or its Data Query Language (DQL).\n", - "\n", - "* SQL's `CREATE TABLE` defines columns and foreign keys (which implement ERM relationships), but doesn't speak the direct language of \"entity sets\" and \"relationship sets\" in the way ERM diagrams do.\n", - "* SQL's `JOIN` syntax, while powerful, doesn't inherently guide users to join tables based on the semantically defined relationships from an ERM perspective.\n", - "\n", - "This left a gap between the clarity of the conceptual design and the often more intricate, attribute-level syntax of SQL implementation and querying.\n", - "\n", - "SQL itself emerged as a practical implementation drawing from both relational algebra and calculus. Its `SELECT... FROM... 
WHERE` structure has a declarative feel, whereas `JOIN` is a relational algebra operator. While SQL's early vision aspired to be a natural language interface, aiming for queries that read like English prose, this came at the cost of the explicit operator sequencing and rigorous composability found in more formal algebraic systems.\n", - "\n", - "Through its evolution, SQL accumulated **\"conceptual baggage\"**—layers of complexity and ambiguity that can obscure the underlying simplicity of relational operations.\n", - "\n", - "---\n", - "\n", - "## The Cornerstone: Well-Defined Query Results\n", - "\n", - "A central tenet of the DataJoint philosophy, crystallized in the new Specs 2.0, is that all data, whether stored in base tables or derived through queries, must represent **well-formed entity sets (or relations)**.\n", - "\n", - "In practice, this means every table—including any intermediate or final query result—must:\n", - "\n", - "* Clearly represent a single, identifiable type of entity (e.g., \"Students,\" \"Experiments,\" \"MeasurementEvents\")\n", - "* Have a **well-defined primary key**—a set of attributes whose values uniquely identify each entity (row) within that set\n", - "* Ensure that all its attributes properly describe the entity identified by that primary key\n", - "\n", - "This commitment is upheld through what the DataJoint Specs refer to as **algebraic closure**. Each of DataJoint's query operators is designed such that if you give it well-formed relations as input, it will **always produce another well-formed relation as output**, complete with its own clear primary key and entity type.\n", - "\n", - "> **Algebraic Closure**: The result of any relational operation is itself a valid relation with a well-defined primary key and entity type.\n", - "\n", - "This principle enables unlimited composition—you can chain operations indefinitely, and each intermediate result remains a meaningful, well-formed entity set.\n", - "\n", - "---\n", + "| Operator | Symbol | Purpose |\n", + "|----------|--------|--------|\n", + "| **Restriction** | `&`, `-` | Filter entities by conditions |\n", + "| **Projection** | `.proj()` | Select and compute attributes |\n", + "| **Join** | `*` | Combine related entities |\n", + "| **Aggregation** | `.aggr()` | Summarize related data |\n", + "| **Union** | `+` | Combine entity sets of the same type |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Core Principles\n", "\n", - "## DataJoint's \"Fab Five\": A Modern Interface to Relational Power\n", + "Before diving into the operators, two principles guide how DataJoint queries work:\n", "\n", - "DataJoint, guided by its new Specs 2.0, proposes a refined, modern set of five operators designed for clarity and power:\n", + "### Entity Integrity\n", "\n", - "### 1. **Restriction** (`&`, `-`)\n", + "Every query result is a well-formed entity set with:\n", + "- A clear entity type (what kind of things are in the result)\n", + "- A defined primary key (how entities are uniquely identified)\n", + "- Attributes that directly describe each entity\n", "\n", - "This is your precision filter. It selects a subset of rows from a table based on specified conditions without altering the table's structure or primary key. The resulting table contains the same type of entities and the same primary key.\n", + "This means you always know what you're working with. 
A query on `Student` entities returns `Student` entities—not some ambiguous collection of data.\n", "\n", - "**Syntax:**\n", - "```python\n", - "# Positive restriction (semijoin)\n", - "Table & restriction_condition\n", + "### Algebraic Closure\n", "\n", - "# Negative restriction (antijoin)\n", - "Table - restriction_condition\n", - "```\n", + "The output of any operator is itself a valid entity set that can be used as input to another operator. This enables unlimited composition:\n", "\n", - "**Example:**\n", "```python\n", - "# Find all people born after 1990\n", - "young_people = Person & 'date_of_birth > \"1990-01-01\"'\n", - "\n", - "# Find people who speak English (semijoin)\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "\n", - "# Find people who don't speak English (antijoin)\n", - "non_english_speakers = Person - (Fluency & {'lang_code': 'en'})\n", + "# Chain operators freely\n", + "result = ((Student & 'gpa > 3.5') * Enrollment).proj('course_name')\n", "```\n", "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "-- Positive restriction\n", - "SELECT * FROM person WHERE date_of_birth > '1990-01-01';\n", - "\n", - "-- Semijoin\n", - "SELECT * FROM person \n", - "WHERE person_id IN (\n", - " SELECT person_id FROM fluency WHERE lang_code = 'en'\n", - ");\n", - "\n", - "-- Antijoin\n", - "SELECT * FROM person \n", - "WHERE person_id NOT IN (\n", - " SELECT person_id FROM fluency WHERE lang_code = 'en'\n", - ");\n", - "```\n", + "Each intermediate step produces a valid result you can inspect, debug, or use further." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Quick Reference\n", "\n", - "### 2. **Projection** (`.proj()`)\n", + "Here's a summary of what each operator does and when to use it:\n", "\n", - "This operator reshapes your view of a table by selecting specific attributes, renaming them, or computing new attributes from existing ones. 
Crucially, the primary key of the original table is preserved, ensuring the identity of the entities remains intact.\n", + "### Restriction (`&`, `-`)\n", "\n", - "**Syntax:**\n", - "```python\n", - "Table.proj(kept_attr1, kept_attr2, new_attr='expression', ...)\n", - "```\n", + "**Use when:** You want to filter entities based on conditions.\n", "\n", - "**Example:**\n", "```python\n", - "# Select specific attributes\n", - "names = Person.proj('name', 'date_of_birth')\n", - "\n", - "# Compute new attributes\n", - "people_with_age = Person.proj(\n", - " 'name',\n", - " age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'\n", - ")\n", - "\n", - "# Rename attributes\n", - "renamed = Person.proj(birth_date='date_of_birth')\n", - "```\n", - "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "-- Select specific attributes (primary key always included)\n", - "SELECT person_id, name, date_of_birth FROM person;\n", + "# Keep entities matching a condition\n", + "young_mice = Mouse & 'age < 30'\n", "\n", - "-- Compute new attributes\n", - "SELECT person_id, name, \n", - " TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age\n", - "FROM person;\n", + "# Keep entities matching values from another table (semijoin)\n", + "mice_with_sessions = Mouse & Session\n", "\n", - "-- Rename attributes\n", - "SELECT person_id, date_of_birth AS birth_date FROM person;\n", + "# Exclude entities (antijoin)\n", + "mice_without_sessions = Mouse - Session\n", "```\n", "\n", - "**Key Insight:** Unlike SQL's `SELECT`, which can arbitrarily choose columns and potentially lose entity identity, DataJoint's projection always preserves the primary key, maintaining entity integrity.\n", + "**Result:** Same entity type, same primary key, fewer entities.\n", "\n", - "### 3. **Join** (`*`)\n", + "---\n", "\n", - "This operator combines information from two tables. It's a **\"semantic join\"** that ensures the resulting table represents a meaningful fusion of entities, with a clearly defined primary key derived from its operands.\n", + "### Projection (`.proj()`)\n", "\n", - "**Syntax:**\n", - "```python\n", - "Table1 * Table2\n", - "```\n", + "**Use when:** You want to select, rename, or compute attributes.\n", "\n", - "**Example:**\n", "```python\n", - "# Join Person with Fluency (via foreign key)\n", - "person_fluency = Person * Fluency\n", + "# Select specific attributes\n", + "Mouse.proj('sex', 'date_of_birth')\n", "\n", - "# The result has primary key (person_id, lang_code)\n", - "# and contains all attributes from both tables\n", - "```\n", + "# Rename an attribute \n", + "Mouse.proj(dob='date_of_birth')\n", "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT p.*, f.*\n", - "FROM person p\n", - "JOIN fluency f ON p.person_id = f.person_id;\n", + "# Compute a new attribute\n", + "Mouse.proj(age_days='DATEDIFF(NOW(), date_of_birth)')\n", "```\n", "\n", - "**Semantic Matching:** DataJoint institutionalizes semantic matching for its join operator. For attributes to be matched, they must not only share the same name but also trace their lineage through an uninterrupted chain of foreign keys to the same original attribute definition. 
If identically named attributes don't meet this criterion, it's a \"collision,\" and DataJoint raises an error, compelling the user to explicitly rename attributes using projection before the join.\n", + "**Result:** Same entity type, same primary key, same number of entities, different attributes.\n", "\n", - "This prevents the dangerous behavior of SQL's `NATURAL JOIN`, which blindly matches on any identically named columns.\n", + "---\n", "\n", - "### 4. **Aggregation** (`.aggr()`)\n", + "### Join (`*`)\n", "\n", - "This operator is an advanced form of projection. It can calculate new attributes for each entity in table A by summarizing related data from table B. The resulting table still has A's primary key and represents entities of type A, now augmented with new information.\n", + "**Use when:** You want to combine information from related tables.\n", "\n", - "**Syntax:**\n", "```python\n", - "TableA.aggr(TableB, summary_attr='AGG_FUNC(expression)', ...)\n", - "```\n", - "\n", - "**Example:**\n", - "```python\n", - "# Count the number of languages each person speaks\n", - "language_counts = Person.aggr(\n", - " Fluency, \n", - " n_languages='COUNT(*)'\n", - ")\n", + "# Combine mouse info with their sessions\n", + "Mouse * Session\n", "\n", - "# Count students enrolled in each section\n", - "section_counts = Section.aggr(\n", - " Enroll,\n", - " n_students='COUNT(*)'\n", - ")\n", - "\n", - "# Average grade per student\n", - "avg_grades = Student.aggr(\n", - " Grade,\n", - " gpa='AVG(grade_value)'\n", - ")\n", + "# Chain multiple joins\n", + "Mouse * Session * Scan\n", "```\n", "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "-- Count languages per person\n", - "SELECT p.*, COUNT(f.lang_code) AS n_languages\n", - "FROM person p\n", - "LEFT JOIN fluency f ON p.person_id = f.person_id\n", - "GROUP BY p.person_id;\n", - "\n", - "-- Count students per section\n", - "SELECT s.*, COUNT(e.student_id) AS n_students\n", - "FROM section s\n", - "LEFT JOIN enroll e USING (course_id, section_id)\n", - "GROUP BY s.course_id, s.section_id;\n", - "```\n", + "**Result:** Combined entity type, combined primary key, only matching combinations.\n", "\n", - "**Key Insight:** The `.aggr()` operator cleanly achieves what SQL often requires a `LEFT OUTER JOIN` with `GROUP BY` to accomplish. It preserves the primary entity type (A) while augmenting it with summaries from related entities (B).\n", + "---\n", "\n", - "### 5. **Union** (`+`)\n", + "### Aggregation (`.aggr()`)\n", "\n", - "This operator combines rows from two tables, A and B. 
For this to be valid, A and B must represent the same type of entity and share the same primary key structure; the result inherits this structure.\n", + "**Use when:** You want to summarize related data for each entity.\n", "\n", - "**Syntax:**\n", "```python\n", - "TableA + TableB\n", - "```\n", + "# Count sessions per mouse\n", + "Mouse.aggr(Session, n_sessions='COUNT(*)')\n", "\n", - "**Example:**\n", - "```python\n", - "# Combine English and Spanish speakers\n", - "bilingual_subset = (\n", - " (Person & (Fluency & {'lang_code': 'en'})) +\n", - " (Person & (Fluency & {'lang_code': 'es'}))\n", - ")\n", + "# Average score per student\n", + "Student.aggr(Grade, avg_grade='AVG(score)')\n", "```\n", "\n", - "**SQL Equivalent:**\n", - "```sql\n", - "SELECT * FROM person \n", - "WHERE person_id IN (SELECT person_id FROM fluency WHERE lang_code = 'en')\n", - "UNION\n", - "SELECT * FROM person \n", - "WHERE person_id IN (SELECT person_id FROM fluency WHERE lang_code = 'es');\n", - "```\n", - "\n", - "These five operators, through their strict adherence to producing well-defined results, form the backbone of DataJoint's expressive power.\n", + "**Result:** Same entity type as the first operand, enriched with summary attributes.\n", "\n", "---\n", "\n", - "## Untangling SQL: Where Simplicity Meets Complexity\n", - "\n", - "### SQL's Operator Count—A Fuzzy Number\n", - "\n", - "It's notoriously hard to quantify how many \"operators\" SQL effectively has because many distinct logical operations are bundled into the complex `SELECT` statement. A single `SELECT` can perform filtering (restriction), column selection and computation (projection), table combination (join), grouping, and ordering, all intertwined.\n", - "\n", - "Furthermore, seemingly simple modifiers can act like entirely new, transformative operators:\n", - "\n", - "* Adding **`DISTINCT`** to a `SELECT` query fundamentally changes the resulting relation, implying a new primary key based on all the selected columns.\n", - "* Aggregate functions like `COUNT()` or `AVG()` with a **`GROUP BY`** clause transform the output into a new type of entity (e.g., \"summary per department\"), with the grouping columns forming the new primary key.\n", - "\n", - "If every distinct transformation SQL can perform were \"unrolled,\" the operator count would be vastly larger and far more entangled than DataJoint's explicit five.\n", - "\n", - "### The SELECT Statement's Hidden Logic\n", - "\n", - "The order in which SQL clauses are written (`SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`) **doesn't reflect their logical execution order**. SQL actually processes clauses in this order:\n", - "\n", - "1. `FROM` (including joins)\n", - "2. `WHERE`\n", - "3. `GROUP BY`\n", - "4. `HAVING`\n", - "5. `SELECT`\n", - "6. `ORDER BY`\n", - "7. `LIMIT`\n", - "\n", - "This \"hidden logic\" often confuses users, especially when trying to reference computed columns in `WHERE` clauses (which fails because `WHERE` executes before `SELECT`).\n", - "\n", - "DataJoint's explicit, sequential application of operators avoids this ambiguity entirely. 
What you write is what executes.\n", - "\n", - "### The Labyrinth of SQL Joins\n", + "### Union (`+`)\n", "\n", - "SQL offers various join implementations:\n", + "**Use when:** You want to combine entities from compatible tables.\n", "\n", - "* `INNER JOIN`\n", - "* `LEFT/RIGHT/FULL OUTER JOIN`\n", - "* `CROSS JOIN`\n", - "* With modifiers: `NATURAL`, `USING`, and `ON `\n", - "\n", - "**The Danger of NATURAL JOIN:** SQL's `NATURAL JOIN` (matching on identically named columns) can be treacherous, as it may join attributes that share a name but have completely different meanings. The ERM guided that meaningful joins should occur on foreign keys between related tables.\n", - "\n", - "**DataJoint's Solution:** DataJoint has one join operator (`*`) that enforces semantic matching. Attributes must share not just a name, but a semantic lineage through foreign keys. This prevents accidental joins on coincidentally named columns.\n", - "\n", - "### Semijoin and Antijoin: The Misnamed \"Joins\"\n", - "\n", - "Relational algebra textbooks discuss **semijoin** (⋉) and **antijoin** (▷):\n", - "\n", - "* A **semijoin** returns rows from table A for which there is at least one matching row in table B, but it only includes columns from table A.\n", - "* An **antijoin** returns rows from table A for which there are no matching rows in table B, again only including columns from table A.\n", - "\n", - "While called \"joins,\" they fundamentally act as **filters on table A** based on the existence (or non-existence) of related records in table B. They don't combine attributes from both tables to form a new, wider entity. This is precisely the definition of a **restriction** operation.\n", - "\n", - "In SQL, these are implemented using subqueries with `EXISTS`, `NOT EXISTS`, `IN`, and `NOT IN` operators. DataJoint correctly categorizes these operations under its versatile **Restriction operator** (`&` for semijoin, `-` for antijoin).\n", - "\n", - "### SQL's OUTER JOINs: A Mix of Entity Types\n", - "\n", - "SQL's `OUTER JOIN` variants often create results that are a jumble of entity types. Some rows might represent a complete pairing, while others represent only one entity, padded with `NULL`s. The resulting table doesn't have a clear, consistent primary key or entity type.\n", - "\n", - "DataJoint's Specs 2.0 clearly state that it effectively has **no direct \"outer join\" operator** because such an operation typically violates the principle of yielding a single, well-defined entity set with a consistent primary key.\n", - "\n", - "Instead, the **`.aggr()` operator** cleanly achieves the common goal of augmenting one entity set with summaries from another, preserving the primary entity's type and identity.\n", - "\n", - "### Redundancy in Restriction\n", - "\n", - "SQL uses multiple clauses for filtering:\n", - "\n", - "* `WHERE` (filters rows before grouping)\n", - "* `ON` (in joins)\n", - "* `HAVING` (filters groups after aggregation)\n", - "* `LIMIT`/`OFFSET` (limits result sets)\n", - "\n", - "DataJoint streamlines this with its single, powerful **Restriction operator** (`&` and its complement `-`), which works consistently across all contexts.\n", - "\n", - "---\n", - "\n", - "## Illustrative Examples: DataJoint vs. 
SQL\n", - "\n", - "Let's use a simplified university database with `Student`, `Course`, `Section`, `Enroll`, and `Grade` tables.\n", - "\n", - "### Example 1: Finding Students Enrolled in Any Class\n", - "\n", - "| **DataJoint** | **SQL** |\n", - "|---------------|---------|\n", - "| `Student & Enroll` | `SELECT * FROM Student WHERE student_id IN (SELECT student_id FROM Enroll);` |\n", - "| **Result:** A well-defined set of `Student` entities | **Result:** Rows from the `Student` table, but the logic is more verbose |\n", - "\n", - "The DataJoint version is concise and clearly expresses the intent: \"Students who are in the Enroll table.\"\n", - "\n", - "### Example 2: Counting Enrolled Students per Section\n", - "\n", - "| **DataJoint** | **SQL** |\n", - "|---------------|---------|\n", - "| `Section.aggr(Enroll, n_students='COUNT(*)')` | `SELECT s.*, COUNT(e.student_id) AS n_students FROM Section s LEFT JOIN Enroll e USING (course_id, section_id) GROUP BY s.course_id, s.section_id;` |\n", - "| **Result:** `Section` entities, augmented with `n_students` | **Result:** Requires explicit join and grouping by all parts of `Section`'s primary key |\n", - "\n", - "The DataJoint version clearly states: \"For each Section, count the related Enroll records.\" The SQL version requires understanding of joins, grouping, and careful specification of grouping columns.\n", - "\n", - "### Example 3: Building Complex Queries Step by Step\n", - "\n", - "**Task:** Find students over 21 who have a GPA above 3.5\n", - "\n", - "**DataJoint Approach:**\n", "```python\n", - "# Step 1: Add computed age\n", - "students_with_age = Student.proj(\n", - " 'name',\n", - " age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'\n", - ")\n", - "\n", - "# Step 2: Filter by age\n", - "adult_students = students_with_age & 'age > 21'\n", - "\n", - "# Step 3: Add GPA\n", - "students_with_gpa = adult_students.aggr(\n", - " Grade,\n", - " gpa='AVG(grade_value)'\n", - ")\n", - "\n", - "# Step 4: Filter by GPA\n", - "result = students_with_gpa & 'gpa > 3.5'\n", - "```\n", - "\n", - "**SQL Approach:**\n", - "```sql\n", - "SELECT s.student_id, s.name, AVG(g.grade_value) AS gpa\n", - "FROM student s\n", - "JOIN grade g ON s.student_id = g.student_id\n", - "WHERE TIMESTAMPDIFF(YEAR, s.date_of_birth, NOW()) > 21\n", - "GROUP BY s.student_id, s.name\n", - "HAVING AVG(g.grade_value) > 3.5;\n", + "# Combine two groups of mice\n", + "experimental_mice = Mouse & 'group=\"experimental\"'\n", + "control_mice = Mouse & 'group=\"control\"'\n", + "all_selected = experimental_mice + control_mice\n", "```\n", "\n", - "The DataJoint approach builds incrementally, with each step producing a valid entity set. The SQL approach requires understanding the execution order and carefully placing conditions in either `WHERE` or `HAVING`.\n", - "\n", - "---\n", - "\n", - "## The DataJoint Advantage: Why These Five Excel\n", - "\n", - "DataJoint's design philosophy demonstrates that true power comes from a concise set of orthogonal, well-defined operators that compose reliably.\n", - "\n", - "### 1. Consistently Well-Defined Results (Algebraic Closure)\n", - "\n", - "Every operation yields a predictable, valid table with a defined primary key and entity type. This means:\n", - "\n", - "* You always know what kind of entities you're working with\n", - "* Intermediate results are meaningful and inspectable\n", - "* Query composition is guaranteed to work\n", - "\n", - "### 2. 
Semantic Precision\n", - "\n", - "Binary operations like join are based on meaningful relational links, not just coincidental name matches. This prevents:\n", - "\n", - "* Accidental joins on unrelated columns\n", - "* Silent errors from name collisions\n", - "* Confusion about what data is being combined\n", - "\n", - "### 3. Composability\n", - "\n", - "Simple, reliable steps can be combined to build sophisticated queries. Each operator:\n", - "\n", - "* Has a clear, single purpose\n", - "* Works predictably with other operators\n", - "* Produces output suitable for further operations\n", - "\n", - "### 4. Interpretability\n", - "\n", - "The nature of the data remains clear at every stage of the query. You can:\n", - "\n", - "* Inspect intermediate results\n", - "* Understand what entities you're working with\n", - "* Debug queries step by step\n", + "**Result:** Same entity type, combined entities (duplicates removed)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How DataJoint Differs from SQL\n", "\n", - "### 5. Entity-Oriented Focus\n", + "If you're familiar with SQL, here are the key differences:\n", "\n", - "The operators encourage thinking in terms of whole entities and their relationships, aligning well with conceptual modeling principles championed by the ERM. This bridges the gap between:\n", + "| Aspect | SQL | DataJoint |\n", + "|--------|-----|----------|\n", + "| **Results** | Can produce arbitrary column sets | Always produces well-formed entity sets |\n", + "| **Joins** | Any columns can be joined | Only semantically related attributes |\n", + "| **Aggregation** | Transforms entity type | Enriches existing entities |\n", + "| **Order of operations** | Hidden execution order | Explicit, left-to-right |\n", + "| **Primary key** | Can be lost in queries | Always preserved or well-defined |\n", "\n", - "* How we think about data (entities and relationships)\n", - "* How we query data (operations on tables)\n", - "* How we store data (normalized tables with foreign keys)\n", + "### Semantic Joins\n", "\n", - "---\n", + "DataJoint's join operator (`*`) only matches attributes that share a semantic relationship—they must trace back to the same source through the schema's dependencies. This prevents accidental joins on coincidentally named columns.\n", "\n", - "## Practical Implications: Order of Operations\n", + "```python\n", + "# DataJoint: Only joins on schema-defined relationships\n", + "Session * Scan # Works: scan depends on session\n", "\n", - "One of the most powerful aspects of DataJoint's approach is how it handles computed attributes and the order of operations.\n", + "# SQL NATURAL JOIN: Joins on ANY matching column names\n", + "# Can produce wrong results if tables share unrelated column names\n", + "```\n", "\n", - "### The Problem in SQL\n", + "### Aggregation Preserves Identity\n", "\n", - "In SQL, you cannot reference a computed column alias in the `WHERE` clause:\n", + "SQL's `GROUP BY` creates a new entity type. 
DataJoint's `.aggr()` enriches existing entities:\n", "\n", - "```sql\n", - "-- THIS FAILS\n", - "SELECT person_id, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age\n", - "FROM person\n", - "WHERE age > 25; -- Error: Unknown column 'age'\n", + "```python\n", + "# DataJoint: Still Mouse entities, now with session counts\n", + "Mouse.aggr(Session, n='COUNT(*)')\n", "\n", - "-- YOU MUST REPEAT THE CALCULATION\n", - "SELECT person_id, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age\n", - "FROM person\n", - "WHERE TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) > 25;\n", - "```\n", + "# SQL: Creates \"mouse summary\" entities, losing direct connection to Mouse\n", + "# SELECT mouse_id, COUNT(*) FROM session GROUP BY mouse_id\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Building Complex Queries\n", "\n", - "This is because `WHERE` executes before `SELECT` in SQL's logical processing order.\n", + "The power of these operators comes from composition. Build complex queries step by step:\n", "\n", - "### The Solution in DataJoint\n", + "```python\n", + "# Goal: Find adult mice with more than 5 sessions, showing their average session duration\n", "\n", - "DataJoint's explicit operator sequencing solves this elegantly:\n", + "# Step 1: Filter to adult mice\n", + "adults = Mouse & 'age > 90'\n", "\n", - "```python\n", - "# Step 1: Compute the attribute\n", - "people_with_age = Person.proj(\n", - " age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'\n", + "# Step 2: Add session statistics\n", + "with_stats = adults.aggr(\n", + " Session,\n", + " n_sessions='COUNT(*)',\n", + " avg_duration='AVG(duration)'\n", ")\n", "\n", - "# Step 2: Use the computed attribute in a restriction\n", - "adults = people_with_age & 'age > 25'\n", + "# Step 3: Filter to mice with many sessions\n", + "result = with_stats & 'n_sessions > 5'\n", "```\n", "\n", - "Each step produces a valid relation. The second step operates on a relation that already has the `age` attribute, so there's no ambiguity or need for repetition.\n", + "Each step produces a valid entity set you can examine:\n", "\n", - "---\n", - "\n", - "## Conclusion: A Clearer Lens for Data Discovery\n", - "\n", - "SQL's position as a foundational data language is secure, and its contributions are undeniable. However, for the complex, high-stakes data work found in scientific research and other demanding domains, a query interface that prioritizes conceptual clarity, predictability, and semantic integrity can be transformative.\n", - "\n", - "DataJoint, as guided by its new Specs 2.0, isn't about minimalism for its own sake. It's about providing a complete and conceptually sound set of query operators that empower users. 
By ensuring every operation results in a well-defined entity set and by enforcing semantic integrity in operations like joins, DataJoint aims to strip away ambiguity and allow researchers to interact with their data with greater confidence and insight.\n", + "```python\n", + "adults # Check: which mice are adults?\n", + "with_stats # Check: what are the session counts?\n", + "result # Final answer\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next Steps\n", "\n", - "It's a compelling case that sometimes, to see further, we need not more tools, but **clearer lenses**.\n", + "The following chapters explore each operator in detail:\n", "\n", - "---\n", + "- **Restriction** — Filtering with conditions, semijoins, and antijoins\n", + "- **Projection** — Selecting, renaming, and computing attributes\n", + "- **Join** — Combining related tables\n", + "- **Union** — Merging compatible entity sets\n", + "- **Aggregation** — Computing summaries across related data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", "\n", - "## Key Takeaways\n", + "DataJoint's five query operators provide a complete, composable query language:\n", "\n", - "1. **Five operators are sufficient**: Restriction, Projection, Join, Aggregation, and Union provide complete query expressiveness\n", - "2. **Algebraic closure ensures composability**: Every operation produces a valid relation, enabling unlimited chaining\n", - "3. **Entity integrity is paramount**: All query results have well-defined primary keys and entity types\n", - "4. **Semantic matching prevents errors**: Joins work on meaningful relationships, not coincidental name matches\n", - "5. **Explicit ordering avoids confusion**: Operations execute in the order written, with no hidden logic\n", - "6. **Entity-oriented thinking**: DataJoint bridges the gap between ERM conceptual design and practical querying\n", + "1. **Restriction** (`&`, `-`) — Filter entities\n", + "2. **Projection** (`.proj()`) — Shape attributes\n", + "3. **Join** (`*`) — Combine related data\n", + "4. **Aggregation** (`.aggr()`) — Summarize relationships\n", + "5. **Union** (`+`) — Merge entity sets\n", "\n", - "Master these five operators, understand their principles, and you'll have a powerful, clear framework for expressing any database query." + "Every operation preserves entity integrity, ensuring results are always meaningful and can be used in further operations. This makes queries predictable, debuggable, and composable." 
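+    "\n",
+    "To make the composition concrete, here is a minimal sketch that strings all five operators together on two illustrative tables (a hypothetical `Mouse`/`Session` pair like the one used in the examples above). The schema name and attributes are placeholders, and the snippet assumes a configured DataJoint connection:\n",
+    "\n",
+    "```python\n",
+    "import datajoint as dj\n",
+    "\n",
+    "schema = dj.Schema('operators_sketch')  # illustrative schema name\n",
+    "\n",
+    "@schema\n",
+    "class Mouse(dj.Manual):\n",
+    "    definition = \"\"\"\n",
+    "    mouse_id : int\n",
+    "    ---\n",
+    "    age : int  # age in days\n",
+    "    \"\"\"\n",
+    "\n",
+    "@schema\n",
+    "class Session(dj.Manual):\n",
+    "    definition = \"\"\"\n",
+    "    -> Mouse\n",
+    "    session_id : int\n",
+    "    ---\n",
+    "    duration : float  # minutes\n",
+    "    \"\"\"\n",
+    "\n",
+    "adults = Mouse & 'age > 90'                           # restriction\n",
+    "ages = Mouse.proj('age')                              # projection\n",
+    "mouse_sessions = Mouse * Session                      # join\n",
+    "counts = Mouse.aggr(Session, n_sessions='COUNT(*)')   # aggregation\n",
+    "young_or_old = (Mouse & 'age < 30') + adults          # union\n",
+    "```\n",
+    "\n",
+    "Because each intermediate result is itself a well-formed entity set, any of these expressions can be restricted, joined, or aggregated further."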
] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "name": "python", + "version": "3.11.0" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } From 24e6262a41466100d673b4a3a00b6606d260475b Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 12 Dec 2025 23:54:19 +0000 Subject: [PATCH 18/18] Merge subqueries chapter into Restriction and rename Query Operators - Combined subquery patterns into the Restriction chapter since subqueries are just a special case of restrictions (semijoins/antijoins) - Added Pattern 6 (Universal Quantification), Pattern 7 (Reverse Perspective), Self-Referencing Patterns, and Building Queries Systematically sections - Added Summary of Patterns table for quick reference - Added more practice exercises covering advanced subquery patterns - Renamed "The Query Operators" to "Query Operators" - Deleted the now-redundant 080-subqueries.ipynb chapter --- book/50-queries/010-operators.ipynb | 16 +- book/50-queries/020-restriction.ipynb | 390 +---------------- book/50-queries/080-subqueries.ipynb | 582 -------------------------- 3 files changed, 4 insertions(+), 984 deletions(-) delete mode 100644 book/50-queries/080-subqueries.ipynb diff --git a/book/50-queries/010-operators.ipynb b/book/50-queries/010-operators.ipynb index 430592d..862f046 100644 --- a/book/50-queries/010-operators.ipynb +++ b/book/50-queries/010-operators.ipynb @@ -3,19 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# The Query Operators\n", - "\n", - "DataJoint provides five operators for building queries. These operators form a complete query language—any question you can ask of your data can be expressed using these five tools.\n", - "\n", - "| Operator | Symbol | Purpose |\n", - "|----------|--------|--------|\n", - "| **Restriction** | `&`, `-` | Filter entities by conditions |\n", - "| **Projection** | `.proj()` | Select and compute attributes |\n", - "| **Join** | `*` | Combine related entities |\n", - "| **Aggregation** | `.aggr()` | Summarize related data |\n", - "| **Union** | `+` | Combine entity sets of the same type |" - ] + "source": "# Query Operators\n\nDataJoint provides five operators for building queries. These operators form a complete query language—any question you can ask of your data can be expressed using these five tools.\n\n| Operator | Symbol | Purpose |\n|----------|--------|--------|\n| **Restriction** | `&`, `-` | Filter entities by conditions |\n| **Projection** | `.proj()` | Select and compute attributes |\n| **Join** | `*` | Combine related entities |\n| **Aggregation** | `.aggr()` | Summarize related data |\n| **Union** | `+` | Combine entity sets of the same type |" }, { "cell_type": "markdown", @@ -259,4 +247,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/book/50-queries/020-restriction.ipynb b/book/50-queries/020-restriction.ipynb index 17bc6ec..852fa99 100644 --- a/book/50-queries/020-restriction.ipynb +++ b/book/50-queries/020-restriction.ipynb @@ -3,398 +3,12 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Operator: Restriction\n", - "\n", - "The **restriction operator** is one of the fundamental operations in relational algebra. 
It selects rows from a table that satisfy specific conditions, allowing you to filter data based on criteria you define.\n", - "\n", - "## Understanding Restriction\n", - "\n", - "Restriction **selects rows** (not columns) from a table based on conditions. It's the \"WHERE clause\" equivalent in SQL, but in DataJoint it's represented by the `&` operator for inclusion and `-` operator for exclusion.\n", - "\n", - "### Key Concepts\n", - "\n", - "- **Restriction never changes the primary key** - the result still has the same entity type as the input\n", - "- **Algebraic closure** - the result of restriction is still a valid relation that can be used in further operations\n", - "- **Entity integrity** - restriction preserves the one-to-one correspondence between records and real-world entities\n", - "\n", - "### Basic Syntax\n", - "\n", - "```python\n", - "# Include rows matching condition\n", - "result = Table & condition\n", - "\n", - "# Exclude rows matching condition \n", - "result = Table - condition\n", - "```\n", - "\n", - "## Types of Restriction Conditions\n", - "\n", - "### 1. Dictionary Conditions (Equality)\n", - "\n", - "Use dictionaries for exact equality matches:\n", - "\n", - "```python\n", - "# Create example database from the lecture\n", - "import datajoint as dj\n", - "schema = dj.Schema('languages_demo')\n", - "\n", - "@schema\n", - "class Person(dj.Manual):\n", - " definition = \"\"\"\n", - " person_id : int\n", - " ---\n", - " name : varchar(60)\n", - " date_of_birth : date\n", - " \"\"\"\n", - "\n", - "# Restrict by primary key (returns 0 or 1 record)\n", - "person_1 = Person & {'person_id': 1}\n", - "\n", - "# Restrict by secondary attributes (may return multiple records)\n", - "millennials = Person & {'name': 'John Doe', date_of_birth': '1990-01-01'}\n", - "```\n", - "\n", - "**Key principle**: Restricting by primary key always returns at most one record because primary keys are unique.\n", - "\n", - "### 2. String Conditions (Inequalities and Ranges)\n", - "\n", - "Use strings for more complex conditions:\n", - "\n", - "```python\n", - "# Range conditions\n", - "gen_z = Person & 'date_of_birth BETWEEN \"2000-01-01\" AND \"2013-12-31\"'\n", - "\n", - "# Inequality conditions \n", - "adults = Person & 'date_of_birth < \"2005-01-01\"'\n", - "\n", - "# Pattern matching\n", - "j_names = Person & 'name LIKE \"J%\"'\n", - "```\n", - "\n", - "### 3. 
Subquery Conditions\n", - "\n", - "The most powerful form - restrict one table based on another:\n", - "\n", - "```python\n", - "@schema\n", - "class Language(dj.Lookup):\n", - " definition = \"\"\"\n", - " lang_code : char(4)\n", - " ---\n", - " language : varchar(30)\n", - " \"\"\"\n", - " contents = [\n", - " ('en', 'English'),\n", - " ('es', 'Spanish'), \n", - " ('ja', 'Japanese')\n", - " ]\n", - "\n", - "@schema \n", - "class Fluency(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Person\n", - " -> Language \n", - " ---\n", - " fluency_level : enum('beginner', 'intermediate', 'fluent')\n", - " \"\"\"\n", - "\n", - "# Find people who speak English\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "```\n", - "\n", - "## Systematic Query Patterns\n", - "\n", - "Following the lecture approach, let's examine systematic patterns for building complex restrictions.\n", - "\n", - "### Pattern 1: Basic Subquery (IN)\n", - "\n", - "**Goal**: Find all people who speak English\n", - "\n", - "```python\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person p \n", - "WHERE p.person_id IN (\n", - " SELECT f.person_id\n", - " FROM fluency f\n", - " WHERE f.lang_code = 'en'\n", - ");\n", - "```\n", - "\n", - "**Analysis**: This selects people whose `person_id` appears in the fluency table with English.\n", - "\n", - "### Pattern 2: Negated Subquery (NOT IN)\n", - "\n", - "**Goal**: Find people who do NOT speak English\n", - "\n", - "```python\n", - "non_english_speakers = Person - (Fluency & {'lang_code': 'en'})\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person p\n", - "WHERE p.person_id NOT IN (\n", - " SELECT f.person_id \n", - " FROM fluency f\n", - " WHERE f.lang_code = 'en'\n", - ");\n", - "```\n", - "\n", - "### Pattern 3: Multiple Conditions (AND)\n", - "\n", - "**Goal**: Find people who speak BOTH English AND Spanish\n", - "\n", - "```python\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", - "bilingual = english_speakers & spanish_speakers\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person p\n", - "WHERE p.person_id IN (\n", - " SELECT f.person_id FROM fluency f WHERE f.lang_code = 'en'\n", - ")\n", - "AND p.person_id IN (\n", - " SELECT f.person_id FROM fluency f WHERE f.lang_code = 'es' \n", - ");\n", - "```\n", - "\n", - "**Key insight**: When you need \"both conditions\", use separate subqueries connected with AND.\n", - "\n", - "### Pattern 4: Multiple Conditions (OR)\n", - "\n", - "**Goal**: Find people who speak English OR Spanish\n", - "\n", - "```python\n", - "# Method 1: Using DataJoint\n", - "english_or_spanish = Person & (Fluency & 'lang_code IN (\"en\", \"es\")')\n", - "\n", - "# Method 2: More explicit\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "spanish_speakers = Person & (Fluency & {'lang_code': 'es'}) \n", - "either_language = english_speakers.proj() + spanish_speakers.proj()\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person p\n", - "WHERE p.person_id IN (\n", - " SELECT f.person_id\n", - " FROM fluency f \n", - " WHERE f.lang_code IN ('en', 'es')\n", - ");\n", - "```\n", - "\n", - "### Pattern 5: Complex Negation\n", - "\n", - "**Goal**: Find people who 
speak Japanese but NOT fluently\n", - "\n", - "```python\n", - "japanese_speakers = Person & (Fluency & {'lang_code': 'ja'})\n", - "fluent_japanese = Person & (Fluency & {'lang_code': 'ja', 'fluency_level': 'fluent'})\n", - "japanese_non_fluent = japanese_speakers - fluent_japanese\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT DISTINCT p.*\n", - "FROM person p\n", - "WHERE p.person_id IN (\n", - " SELECT f.person_id FROM fluency f WHERE f.lang_code = 'ja'\n", - ")\n", - "AND p.person_id NOT IN (\n", - " SELECT f.person_id FROM fluency f \n", - " WHERE f.lang_code = 'ja' AND f.fluency_level = 'fluent'\n", - ");\n", - "```\n", - "\n", - "## Advanced Examples from the Lecture\n", - "\n", - "### Example 1: Languages Without Fluent Speakers\n", - "\n", - "**Goal**: Find languages that no one speaks fluently\n", - "\n", - "```python\n", - "fluent_speakers = Fluency & {'fluency_level': 'fluent'}\n", - "languages_with_fluent_speakers = Language & fluent_speakers\n", - "languages_without_fluent_speakers = Language - languages_with_fluent_speakers\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT l.*\n", - "FROM language l\n", - "WHERE l.lang_code NOT IN (\n", - " SELECT f.lang_code\n", - " FROM fluency f\n", - " WHERE f.fluency_level = 'fluent'\n", - ");\n", - "```\n", - "\n", - "### Example 2: Generational Filtering\n", - "\n", - "```python\n", - "# Define generations by birth year ranges\n", - "gen_z = Person & 'date_of_birth BETWEEN \"2000-01-01\" AND \"2013-12-31\"'\n", - "millennials = Person & 'date_of_birth BETWEEN \"1981-01-01\" AND \"1999-12-31\"'\n", - "\n", - "# Find Gen Z English speakers\n", - "gen_z_english = gen_z & (Fluency & {'lang_code': 'ENG'})\n", - "```\n", - "\n", - "## Understanding Query Execution\n", - "\n", - "### Order of Operations\n", - "\n", - "Unlike SQL where SELECT and WHERE are in one statement, DataJoint separates concerns:\n", - "\n", - "1. **DataJoint approach**: \n", - " ```python\n", - " result = Person & condition # Restriction first\n", - " result = result.proj(...) # Projection second\n", - " ```\n", - "\n", - "2. **SQL approach**:\n", - " ```sql\n", - " SELECT columns -- Projection \n", - " FROM table \n", - " WHERE condition -- Restriction (executed first internally)\n", - " ```\n", - "\n", - "### Primary Key Preservation\n", - "\n", - "**Critical concept**: Restriction never changes the primary key or entity type.\n", - "\n", - "```python\n", - "# All of these have the same primary key: person_id\n", - "people = Person # Primary key: person_id\n", - "english_speakers = Person & (...) # Primary key: person_id \n", - "gen_z = Person & (...) # Primary key: person_id\n", - "```\n", - "\n", - "This enables **algebraic closure** - you can chain restrictions infinitely:\n", - "\n", - "```python\n", - "result = Person & condition1 & condition2 & condition3 # Still a Person table\n", - "```\n", - "\n", - "## Best Practices from the Lecture\n", - "\n", - "### 1. Think in Sets and Logic\n", - "\n", - "When designing restrictions, think about:\n", - "- What set am I starting with?\n", - "- What subset do I want?\n", - "- How do I express that mathematically?\n", - "\n", - "### 2. 
Build Complex Queries Incrementally\n", - "\n", - "```python\n", - "# Start simple\n", - "english_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n", - "print(f\"English speakers: {len(english_speakers)}\")\n", - "\n", - "# Add complexity \n", - "fluent_english = english_speakers & (Fluency & {'fluency_level': 'fluent'})\n", - "print(f\"Fluent English speakers: {len(fluent_english)}\")\n", - "\n", - "# Add more conditions\n", - "gen_z_fluent_english = fluent_english & 'date_of_birth > \"2000-01-01\"'\n", - "```\n", - "\n", - "### 3. Understand Foreign Key Relationships\n", - "\n", - "Subqueries work because of foreign key relationships:\n", - "\n", - "```python\n", - "# This works because Fluency.person_id references Person.person_id\n", - "english_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n", - "```\n", - "\n", - "The restriction automatically matches on the shared attributes (foreign key relationships).\n", - "\n", - "### 4. Test Your Logic\n", - "\n", - "For complex queries, verify your logic:\n", - "\n", - "```python\n", - "# Test: People who speak both English and Spanish\n", - "english = Person & (Fluency & {'lang_code': 'ENG'})\n", - "spanish = Person & (Fluency & {'lang_code': 'SPA'})\n", - "both = english & spanish\n", - "\n", - "# Verify: Should be subset of both individual sets\n", - "assert len(both) <= len(english)\n", - "assert len(both) <= len(spanish)\n", - "```\n", - "\n", - "## SQL Translation Patterns\n", - "\n", - "Every DataJoint restriction follows predictable SQL patterns:\n", - "\n", - "### Dictionary Restriction\n", - "```python\n", - "# DataJoint\n", - "Person & {'person_id': 1}\n", - "\n", - "# SQL \n", - "SELECT * FROM person WHERE person_id = 1;\n", - "```\n", - "\n", - "### String Restriction\n", - "```python\n", - "# DataJoint\n", - "Person & 'age > 25'\n", - "\n", - "# SQL\n", - "SELECT * FROM person WHERE age > 25;\n", - "```\n", - "\n", - "### Subquery Restriction\n", - "```python\n", - "# DataJoint\n", - "Person & (Fluency & {'lang_code': 'ENG'})\n", - "\n", - "# SQL\n", - "SELECT DISTINCT p.* \n", - "FROM person p\n", - "WHERE p.person_id IN (\n", - " SELECT f.person_id \n", - " FROM fluency f \n", - " WHERE f.lang_code = 'ENG'\n", - ");\n", - "```\n", - "\n", - "## Summary\n", - "\n", - "The restriction operator is fundamental to database querying. Key takeaways:\n", - "\n", - "1. **Restriction selects rows** based on conditions\n", - "2. **Primary key is preserved** - algebraic closure is maintained \n", - "3. **Three condition types**: dictionaries (equality), strings (inequalities), subqueries (relationships)\n", - "4. **Build systematically**: Start simple, add complexity incrementally\n", - "5. **Think in sets**: Use mathematical logic to design queries\n", - "6. **Foreign keys enable subqueries**: Relationships between tables drive complex restrictions\n", - "\n", - "Master these patterns and you can answer any query that asks \"find records where...\"\n", - "\n" - ] + "source": "# Operator: Restriction\n\nThe **restriction operator** is one of the fundamental operations in relational algebra. It selects rows from a table that satisfy specific conditions, allowing you to filter data based on criteria you define.\n\n## Understanding Restriction\n\nRestriction **selects rows** (not columns) from a table based on conditions. 
It's the \"WHERE clause\" equivalent in SQL, but in DataJoint it's represented by the `&` operator for inclusion and `-` operator for exclusion.\n\n### Key Concepts\n\n- **Restriction never changes the primary key** - the result still has the same entity type as the input\n- **Algebraic closure** - the result of restriction is still a valid relation that can be used in further operations\n- **Entity integrity** - restriction preserves the one-to-one correspondence between records and real-world entities\n\n### Basic Syntax\n\n```python\n# Include rows matching condition\nresult = Table & condition\n\n# Exclude rows matching condition \nresult = Table - condition\n```\n\n## Types of Restriction Conditions\n\n### 1. Dictionary Conditions (Equality)\n\nUse dictionaries for exact equality matches:\n\n```python\n# Create example database from the lecture\nimport datajoint as dj\nschema = dj.Schema('languages_demo')\n\n@schema\nclass Person(dj.Manual):\n definition = \"\"\"\n person_id : int\n ---\n name : varchar(60)\n date_of_birth : date\n \"\"\"\n\n# Restrict by primary key (returns 0 or 1 record)\nperson_1 = Person & {'person_id': 1}\n\n# Restrict by secondary attributes (may return multiple records)\nmatching_people = Person & {'name': 'John Doe', 'date_of_birth': '1990-01-01'}\n```\n\n**Key principle**: Restricting by primary key always returns at most one record because primary keys are unique.\n\n### 2. String Conditions (Inequalities and Ranges)\n\nUse strings for more complex conditions:\n\n```python\n# Range conditions\ngen_z = Person & 'date_of_birth BETWEEN \"2000-01-01\" AND \"2013-12-31\"'\n\n# Inequality conditions \nadults = Person & 'date_of_birth < \"2005-01-01\"'\n\n# Pattern matching\nj_names = Person & 'name LIKE \"J%\"'\n```\n\n### 3. Subquery Conditions (Semijoins and Antijoins)\n\nThe most powerful form of restriction uses one query expression to restrict another. This creates **subqueries**—queries nested inside other queries. 
In DataJoint, subqueries emerge naturally when you use query expressions as restriction conditions, effectively creating a **semijoin** (with `&`) or **antijoin** (with `-`).\n\n```python\n@schema\nclass Language(dj.Lookup):\n definition = \"\"\"\n lang_code : char(4)\n ---\n language : varchar(30)\n \"\"\"\n contents = [\n ('en', 'English'),\n ('es', 'Spanish'), \n ('ja', 'Japanese')\n ]\n\n@schema \nclass Fluency(dj.Manual):\n definition = \"\"\"\n -> Person\n -> Language \n ---\n fluency_level : enum('beginner', 'intermediate', 'fluent')\n \"\"\"\n\n# Find people who speak English (semijoin)\nenglish_speakers = Person & (Fluency & {'lang_code': 'en'})\n\n# Find people who do NOT speak English (antijoin)\nnon_english_speakers = Person - (Fluency & {'lang_code': 'en'})\n```\n\nThe inner query `(Fluency & {'lang_code': 'en'})` acts as a subquery—its primary key values determine which rows from `Person` are included in or excluded from the result.\n\n## Systematic Query Patterns\n\nFollowing the lecture approach, let's examine systematic patterns for building complex restrictions.\n\n### Pattern 1: Existence Check (IN)\n\nFind entities that have related records in another table.\n\n```python\n# Find A where matching B exists\nresult = A & B\n```\n\n**Example: Students with Enrollments**\n\n```python\n# Find all students who are enrolled in at least one course\nenrolled_students = Student & Enroll\n```\n\n**SQL Equivalent**:\n```sql\nSELECT * FROM student\nWHERE student_id IN (SELECT student_id FROM enroll);\n```\n\n**Example: Students with Math Majors**\n\n```python\n# Find students majoring in math\nmath_students = Student & (StudentMajor & {'dept': 'MATH'})\n```\n\n### Pattern 2: Non-Existence Check (NOT IN)\n\nFind entities that do NOT have related records in another table.\n\n```python\n# Find A where no matching B exists\nresult = A - B\n```\n\n**Example: Students Without Enrollments**\n\n```python\n# Find students who are not enrolled in any course\nunenrolled_students = Student - Enroll\n```\n\n**SQL Equivalent**:\n```sql\nSELECT * FROM student\nWHERE student_id NOT IN (SELECT student_id FROM enroll);\n```\n\n### Pattern 3: Multiple Conditions (AND)\n\nFind entities that satisfy multiple conditions simultaneously.\n\n```python\n# Find A where both B1 and B2 conditions are met\nresult = A & B1 & B2\n```\n\n**Example: Students Speaking Both Languages**\n\n```python\n# Find people who speak BOTH English AND Spanish\nenglish_speakers = Person & (Fluency & {'lang_code': 'en'})\nspanish_speakers = Person & (Fluency & {'lang_code': 'es'})\nbilingual = english_speakers & spanish_speakers\n```\n\n**SQL Equivalent**:\n```sql\nSELECT * FROM person\nWHERE person_id IN (\n SELECT person_id FROM fluency WHERE lang_code = 'en'\n)\nAND person_id IN (\n SELECT person_id FROM fluency WHERE lang_code = 'es'\n);\n```\n\n### Pattern 4: Either/Or Conditions (OR)\n\nFind entities that satisfy at least one of multiple conditions.\n\n**Using List Restriction** (for simple OR on the same attribute):\n\n```python\n# Find A where condition1 OR condition2\nresult = A & [condition1, condition2]\n```\n\n**Using Union** (for OR across different relationships):\n\n```python\n# Find A where B1 OR B2 condition is met\nresult = (A & B1) + (A & B2)\n```\n\n**Example: Students in Multiple States**\n\n```python\n# Find students from California OR New York (simple OR)\ncoastal_students = Student & [{'home_state': 'CA'}, {'home_state': 'NY'}]\n\n# Or using SQL syntax\ncoastal_students = Student & 'home_state IN (\"CA\", 
\"NY\")'\n```\n\n### Pattern 5: Exclusion with Condition\n\nFind entities that have some relationship but NOT a specific variant of it.\n\n```python\n# Find A where B exists but B with specific condition does not\nresult = (A & B) - (B & specific_condition)\n```\n\n**Example: Non-Fluent Speakers**\n\n```python\n# Find people who speak Japanese but are NOT fluent\njapanese_speakers = Person & (Fluency & {'lang_code': 'ja'})\nfluent_japanese = Person & (Fluency & {'lang_code': 'ja', 'fluency_level': 'fluent'})\nnon_fluent_japanese = japanese_speakers - fluent_japanese\n```\n\n**SQL Equivalent**:\n```sql\nSELECT * FROM person\nWHERE person_id IN (\n SELECT person_id FROM fluency WHERE lang_code = 'ja'\n)\nAND person_id NOT IN (\n SELECT person_id FROM fluency \n WHERE lang_code = 'ja' AND fluency_level = 'fluent'\n);\n```\n\n### Pattern 6: All-or-Nothing (Universal Quantification)\n\nFind entities where ALL related records meet a condition, or where NO related records fail a condition.\n\n```python\n# Find A where ALL related B satisfy condition\n# Equivalent to: A with B, minus A with B that doesn't satisfy condition\nresult = (A & B) - (B - condition)\n```\n\n**Example: All-A Students**\n\n```python\n# Find students who have received ONLY 'A' grades (no non-A grades)\nstudents_with_grades = Student & Grade\nstudents_with_non_a = Student & (Grade - {'grade': 'A'})\nall_a_students = students_with_grades - students_with_non_a\n```\n\n**SQL Equivalent**:\n```sql\nSELECT * FROM student\nWHERE student_id IN (SELECT student_id FROM grade)\nAND student_id NOT IN (\n SELECT student_id FROM grade WHERE grade <> 'A'\n);\n```\n\n### Pattern 7: Reverse Perspective\n\nSometimes you need to flip the perspective—instead of asking about entities, ask about their related entities.\n\n**Example: Languages Without Speakers**\n\n```python\n# Find languages that no one speaks\nlanguages_spoken = Language & Fluency\nunspoken_languages = Language - languages_spoken\n```\n\n**Example: Courses Without Enrollments**\n\n```python\n# Find courses with no students enrolled this term\ncourses_with_enrollment = Course & (Enroll & CurrentTerm)\nempty_courses = Course - courses_with_enrollment\n```\n\n## Self-Referencing Patterns\n\nSome tables reference themselves through foreign keys, creating hierarchies like management structures or prerequisite chains.\n\n### Management Hierarchy Example\n\nConsider a schema where employees can report to other employees:\n\n```python\n@schema\nclass Employee(dj.Manual):\n definition = \"\"\"\n employee_id : int\n ---\n name : varchar(60)\n \"\"\"\n\n@schema\nclass ReportsTo(dj.Manual):\n definition = \"\"\"\n -> Employee\n ---\n -> Employee.proj(manager_id='employee_id')\n \"\"\"\n```\n\n### Finding Managers\n\n```python\n# Employees who have direct reports (are managers)\nmanagers = Employee & ReportsTo.proj(employee_id='manager_id')\n```\n\n### Finding Top-Level Managers\n\n```python\n# Employees who don't report to anyone\ntop_managers = Employee - ReportsTo\n```\n\n### Finding Non-Managers\n\n```python\n# Employees with no direct reports\nnon_managers = Employee - ReportsTo.proj(employee_id='manager_id')\n```\n\n## Building Queries Systematically\n\nComplex queries are best built incrementally. 
Follow this approach:\n\n### Step 1: Identify the Target Entity\n\nWhat type of entity do you want in your result?\n\n### Step 2: List the Conditions\n\nWhat criteria must the entities satisfy?\n\n### Step 3: Build Each Condition as a Query\n\nCreate separate query expressions for each condition.\n\n### Step 4: Combine with Appropriate Operators\n\n- Use `&` for AND conditions\n- Use `-` for NOT conditions\n- Use `+` for OR conditions across different paths\n\n### Step 5: Test Incrementally\n\nVerify each intermediate result.\n\n### Example: Building a Complex Query\n\n**Goal**: Find CS majors who are enrolled this term but haven't received any grades yet.\n\n```python\n# Step 1: Target entity is Student\n# Step 2: Conditions:\n# - Has CS major\n# - Enrolled in current term\n# - No grades in current term\n\n# Step 3: Build each condition\ncs_majors = Student & (StudentMajor & {'dept': 'CS'})\nenrolled_current = Student & (Enroll & CurrentTerm)\ngraded_current = Student & (Grade & CurrentTerm)\n\n# Step 4: Combine\nresult = cs_majors & enrolled_current - graded_current\n\n# Step 5: Verify counts\nprint(f\"CS majors: {len(cs_majors)}\")\nprint(f\"Enrolled current term: {len(enrolled_current)}\")\nprint(f\"CS majors enrolled, no grades: {len(result)}\")\n```\n\n## Summary of Patterns\n\n| Pattern | DataJoint | SQL Equivalent |\n|---------|-----------|----------------|\n| Existence (IN) | `A & B` | `WHERE id IN (SELECT ...)` |\n| Non-existence (NOT IN) | `A - B` | `WHERE id NOT IN (SELECT ...)` |\n| AND (both conditions) | `A & B1 & B2` | `WHERE ... AND ...` |\n| OR (either condition) | `(A & B1) + (A & B2)` | `WHERE ... OR ...` |\n| Exclusion | `(A & B) - B_condition` | `WHERE IN (...) AND NOT IN (...)` |\n| Universal (all match) | `(A & B) - (B - condition)` | `WHERE IN (...) AND NOT IN (NOT condition)` |\n\nKey principles:\n1. **Build incrementally** — construct complex queries from simpler parts\n2. **Test intermediate results** — verify each step before combining\n3. **Think in sets** — restriction filters sets, not individual records\n4. **Primary key is preserved** — restrictions never change the entity type\n\n## Understanding Query Execution\n\n### Order of Operations\n\nUnlike SQL where SELECT and WHERE are in one statement, DataJoint separates concerns:\n\n1. **DataJoint approach**: \n ```python\n result = Person & condition # Restriction first\n result = result.proj(...) # Projection second\n ```\n\n2. **SQL approach**:\n ```sql\n SELECT columns -- Projection \n FROM table \n WHERE condition -- Restriction (executed first internally)\n ```\n\n### Primary Key Preservation\n\n**Critical concept**: Restriction never changes the primary key or entity type.\n\n```python\n# All of these have the same primary key: person_id\npeople = Person # Primary key: person_id\nenglish_speakers = Person & (...) # Primary key: person_id \ngen_z = Person & (...) # Primary key: person_id\n```\n\nThis enables **algebraic closure** - you can chain restrictions infinitely:\n\n```python\nresult = Person & condition1 & condition2 & condition3 # Still a Person table\n```\n\n## Best Practices\n\n### 1. Think in Sets and Logic\n\nWhen designing restrictions, think about:\n- What set am I starting with?\n- What subset do I want?\n- How do I express that mathematically?\n\n### 2. 
Build Complex Queries Incrementally\n\n```python\n# Start simple\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\nprint(f\"English speakers: {len(english_speakers)}\")\n\n# Add complexity \nfluent_english = english_speakers & (Fluency & {'fluency_level': 'fluent'})\nprint(f\"Fluent English speakers: {len(fluent_english)}\")\n\n# Add more conditions\ngen_z_fluent_english = fluent_english & 'date_of_birth > \"2000-01-01\"'\n```\n\n### 3. Understand Foreign Key Relationships\n\nSubqueries work because of foreign key relationships:\n\n```python\n# This works because Fluency.person_id references Person.person_id\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n```\n\nThe restriction automatically matches on the shared attributes (foreign key relationships).\n\n### 4. Test Your Logic\n\nFor complex queries, verify your logic:\n\n```python\n# Test: People who speak both English and Spanish\nenglish = Person & (Fluency & {'lang_code': 'ENG'})\nspanish = Person & (Fluency & {'lang_code': 'SPA'})\nboth = english & spanish\n\n# Verify: Should be subset of both individual sets\nassert len(both) <= len(english)\nassert len(both) <= len(spanish)\n```\n\n## Summary\n\nThe restriction operator is fundamental to database querying. Key takeaways:\n\n1. **Restriction selects rows** based on conditions\n2. **Primary key is preserved** - algebraic closure is maintained \n3. **Three condition types**: dictionaries (equality), strings (inequalities), subqueries (relationships)\n4. **Build systematically**: Start simple, add complexity incrementally\n5. **Think in sets**: Use mathematical logic to design queries\n6. **Foreign keys enable subqueries**: Relationships between tables drive complex restrictions\n\nMaster these patterns and you can answer any query that asks \"find records where...\"" }, { "cell_type": "markdown", "metadata": {}, - "source": "## Practice Exercises: Systematic Query Building\n\nLet's work through practical examples using the languages database from the lecture. These exercises will help you develop systematic thinking about restriction queries.\n\n### Setup: Languages Database\n\n```python\nimport datajoint as dj\nschema = dj.Schema('languages_practice')\n\n@schema\nclass Language(dj.Lookup):\n definition = \"\"\"\n lang_code : char(4)\n ---\n language : varchar(30)\n \"\"\"\n contents = [\n ('ENG', 'English'),\n ('SPA', 'Spanish'),\n ('JPN', 'Japanese'),\n ('TAG', 'Tagalog'),\n ('MAN', 'Mandarin'),\n ('POR', 'Portuguese')\n ]\n\n@schema\nclass Person(dj.Manual):\n definition = \"\"\"\n person_id : int\n ---\n name : varchar(60)\n date_of_birth : date\n \"\"\"\n\n@schema\nclass Fluency(dj.Manual):\n definition = \"\"\"\n -> Person\n -> Language\n ---\n fluency_level : enum('beginner', 'intermediate', 'fluent')\n \"\"\"\n\n# Populate with sample data...\n```\n\n### Exercise 1: Basic Restrictions\n\n**Question**: How would you find person with ID 5?\n\n**Solution**:\n```python\nperson_5 = Person & {'person_id': 5}\n```\n\n**Key insight**: Primary key restrictions return 0 or 1 record.\n\n**Question**: How would you find all people born after 2000?\n\n**Solution**:\n```python\ngen_z = Person & 'date_of_birth > \"2000-01-01\"'\n```\n\n### Exercise 2: Simple Subqueries\n\n**Question**: Find all people who speak English.\n\n**Step-by-step thinking**:\n1. I want people (start with `Person` table)\n2. Who speak English (condition in `Fluency` table)\n3. 
English speakers are those whose `person_id` appears in `Fluency` with `lang_code = 'ENG'`\n\n**Solution**:\n```python\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n```\n\n**SQL equivalent**:\n```sql\nSELECT DISTINCT p.*\nFROM person p\nWHERE p.person_id IN (\n SELECT f.person_id\n FROM fluency f\n WHERE f.lang_code = 'ENG'\n);\n```\n\n### Exercise 3: Negation\n\n**Question**: Find people who do NOT speak English.\n\n**Step-by-step thinking**:\n1. I want people (start with `Person` table)\n2. Who do NOT speak English (exclude those in the English speakers set)\n3. Use subtraction operator `-`\n\n**Solution**:\n```python\nnon_english_speakers = Person - (Fluency & {'lang_code': 'ENG'})\n```\n\n**SQL equivalent**:\n```sql\nSELECT DISTINCT p.*\nFROM person p\nWHERE p.person_id NOT IN (\n SELECT f.person_id\n FROM fluency f\n WHERE f.lang_code = 'ENG'\n);\n```\n\n### Exercise 4: Multiple Conditions (AND)\n\n**Question**: Find people who speak BOTH English AND Spanish.\n\n**Step-by-step thinking**:\n1. I want people who speak English AND Spanish\n2. This means they must be in BOTH sets\n3. Create each set separately, then intersect with `&`\n\n**Solution**:\n```python\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\nspanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})\nbilingual = english_speakers & spanish_speakers\n```\n\n**SQL equivalent**:\n```sql\nSELECT DISTINCT p.*\nFROM person p\nWHERE p.person_id IN (\n SELECT f.person_id FROM fluency f WHERE f.lang_code = 'ENG'\n)\nAND p.person_id IN (\n SELECT f.person_id FROM fluency f WHERE f.lang_code = 'SPA'\n);\n```\n\n### Exercise 5: Multiple Conditions (OR)\n\n**Question**: Find people who speak English OR Spanish.\n\n**Solution Method 1** (using IN):\n```python\nenglish_or_spanish = Person & (Fluency & 'lang_code IN (\"ENG\", \"SPA\")')\n```\n\n**Solution Method 2** (explicit union):\n```python\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\nspanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})\n# Note: Union removes duplicates automatically\n```\n\n### Exercise 6: Complex Negation\n\n**Question**: Find people who speak Japanese but NOT fluently.\n\n**Step-by-step thinking**:\n1. I want Japanese speakers (those in Fluency with Japanese)\n2. But NOT fluent ones (exclude those with fluency_level = 'fluent')\n3. Japanese speakers MINUS fluent Japanese speakers\n\n**Solution**:\n```python\njapanese_speakers = Person & (Fluency & {'lang_code': 'JPN'})\nfluent_japanese = Person & (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})\njapanese_non_fluent = japanese_speakers - fluent_japanese\n```\n\n**Alternative solution** (direct):\n```python\njapanese_non_fluent = Person & (Fluency & {'lang_code': 'JPN'}) - \\\n (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})\n```\n\n### Exercise 7: Reverse Perspective\n\n**Question**: Find languages that are NOT spoken by anyone fluently.\n\n**Step-by-step thinking**:\n1. I want languages (start with `Language` table)\n2. That are NOT spoken fluently (exclude those that appear in fluent records)\n3. Languages MINUS languages with fluent speakers\n\n**Solution**:\n```python\nfluent_records = Fluency & {'fluency_level': 'fluent'}\nlanguages_with_fluent_speakers = Language & fluent_records\nlanguages_without_fluent_speakers = Language - languages_with_fluent_speakers\n```\n\n### Exercise 8: Chaining Restrictions\n\n**Question**: Find Gen Z people who speak English fluently.\n\n**Step-by-step thinking**:\n1. Start with all people\n2. 
Restrict to Gen Z (born after 2000)\n3. Further restrict to English speakers\n4. Further restrict to fluent level\n\n**Solution**:\n```python\ngen_z = Person & 'date_of_birth > \"2000-01-01\"'\ngen_z_english = gen_z & (Fluency & {'lang_code': 'ENG'})\ngen_z_english_fluent = gen_z_english & (Fluency & {'fluency_level': 'fluent'})\n\n# Or in one line:\nresult = Person & 'date_of_birth > \"2000-01-01\"' & \\\n (Fluency & {'lang_code': 'ENG', 'fluency_level': 'fluent'})\n```\n\n## Debugging and Verification Techniques\n\n### Test Your Logic\n\n```python\n# Always verify your logic makes sense\nenglish = Person & (Fluency & {'lang_code': 'ENG'})\nspanish = Person & (Fluency & {'lang_code': 'SPA'})\nboth = english & spanish\n\n# Sanity checks:\nprint(f\"English speakers: {len(english)}\")\nprint(f\"Spanish speakers: {len(spanish)}\")\nprint(f\"Bilingual: {len(both)}\")\n\n# Both should be <= each individual set\nassert len(both) <= len(english)\nassert len(both) <= len(spanish)\n```\n\n### Build Incrementally\n\n```python\n# Start simple and add complexity\nstep1 = Person\nprint(f\"All people: {len(step1)}\")\n\nstep2 = step1 & (Fluency & {'lang_code': 'ENG'})\nprint(f\"English speakers: {len(step2)}\")\n\nstep3 = step2 & (Fluency & {'fluency_level': 'fluent'})\nprint(f\"Fluent English speakers: {len(step3)}\")\n```\n\n### Common Patterns Summary\n\n1. **Basic inclusion**: `Table & condition`\n2. **Basic exclusion**: `Table - condition`\n3. **Logicial Donjunction (AND-list)**: `Table & cond1 & cond2 & cond3`\n4. **Logical Disjunction (OR-list)**: `Table & [cond1, cond2, cond3]`\n\nThese patterns form the building blocks for any restriction query you'll encounter.\n\n## Further Practice\n\n:::{seealso}\nFor comprehensive query examples covering restriction and all other operators on a realistic academic database, see the [University Queries](../80-examples/016-university-queries.ipynb) example, which demonstrates these patterns with 2,000 students, multiple departments, course enrollments, and grade tracking.\n:::" + "source": "## Practice Exercises\n\n### Setup: Languages Database\n\n```python\nimport datajoint as dj\nschema = dj.Schema('languages_practice')\n\n@schema\nclass Language(dj.Lookup):\n definition = \"\"\"\n lang_code : char(4)\n ---\n language : varchar(30)\n \"\"\"\n contents = [\n ('ENG', 'English'),\n ('SPA', 'Spanish'),\n ('JPN', 'Japanese'),\n ('TAG', 'Tagalog'),\n ('MAN', 'Mandarin'),\n ('POR', 'Portuguese')\n ]\n\n@schema\nclass Person(dj.Manual):\n definition = \"\"\"\n person_id : int\n ---\n name : varchar(60)\n date_of_birth : date\n \"\"\"\n\n@schema\nclass Fluency(dj.Manual):\n definition = \"\"\"\n -> Person\n -> Language\n ---\n fluency_level : enum('beginner', 'intermediate', 'fluent')\n \"\"\"\n\n# Populate with sample data...\n```\n\n### Exercise 1: Basic Restrictions\n\n**Question**: How would you find person with ID 5?\n\n**Solution**:\n```python\nperson_5 = Person & {'person_id': 5}\n```\n\n**Question**: How would you find all people born after 2000?\n\n**Solution**:\n```python\ngen_z = Person & 'date_of_birth > \"2000-01-01\"'\n```\n\n### Exercise 2: Simple Subqueries\n\n**Question**: Find all people who speak English.\n\n**Solution**:\n```python\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\n```\n\n### Exercise 3: Negation\n\n**Question**: Find people who do NOT speak English.\n\n**Solution**:\n```python\nnon_english_speakers = Person - (Fluency & {'lang_code': 'ENG'})\n```\n\n### Exercise 4: Multiple Conditions (AND)\n\n**Question**: Find 
people who speak BOTH English AND Spanish.\n\n**Solution**:\n```python\nenglish_speakers = Person & (Fluency & {'lang_code': 'ENG'})\nspanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})\nbilingual = english_speakers & spanish_speakers\n```\n\n### Exercise 5: Multiple Conditions (OR)\n\n**Question**: Find people who speak English OR Spanish.\n\n**Solution**:\n```python\nenglish_or_spanish = Person & (Fluency & 'lang_code IN (\"ENG\", \"SPA\")')\n```\n\n### Exercise 6: Complex Negation\n\n**Question**: Find people who speak Japanese but NOT fluently.\n\n**Solution**:\n```python\njapanese_speakers = Person & (Fluency & {'lang_code': 'JPN'})\nfluent_japanese = Person & (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})\njapanese_non_fluent = japanese_speakers - fluent_japanese\n```\n\n### Exercise 7: Reverse Perspective\n\n**Question**: Find languages that are NOT spoken by anyone fluently.\n\n**Solution**:\n```python\nfluent_records = Fluency & {'fluency_level': 'fluent'}\nlanguages_with_fluent_speakers = Language & fluent_records\nlanguages_without_fluent_speakers = Language - languages_with_fluent_speakers\n```\n\n### Exercise 8: All-or-Nothing\n\n**Question**: Find students who have received only 'A' grades.\n\n**Solution**:\n```python\nhas_grades = Student & Grade\nhas_non_a = Student & (Grade - {'grade': 'A'})\nall_a = has_grades - has_non_a\n```\n\n### Exercise 9: Chaining Restrictions\n\n**Question**: Find Gen Z people who speak English fluently.\n\n**Solution**:\n```python\nresult = Person & 'date_of_birth > \"2000-01-01\"' & \\\n (Fluency & {'lang_code': 'ENG', 'fluency_level': 'fluent'})\n```\n\n## Debugging and Verification Techniques\n\n### Test Your Logic\n\n```python\n# Always verify your logic makes sense\nenglish = Person & (Fluency & {'lang_code': 'ENG'})\nspanish = Person & (Fluency & {'lang_code': 'SPA'})\nboth = english & spanish\n\n# Sanity checks:\nprint(f\"English speakers: {len(english)}\")\nprint(f\"Spanish speakers: {len(spanish)}\")\nprint(f\"Bilingual: {len(both)}\")\n\n# Both should be <= each individual set\nassert len(both) <= len(english)\nassert len(both) <= len(spanish)\n```\n\n### Build Incrementally\n\n```python\n# Start simple and add complexity\nstep1 = Person\nprint(f\"All people: {len(step1)}\")\n\nstep2 = step1 & (Fluency & {'lang_code': 'ENG'})\nprint(f\"English speakers: {len(step2)}\")\n\nstep3 = step2 & (Fluency & {'fluency_level': 'fluent'})\nprint(f\"Fluent English speakers: {len(step3)}\")\n```\n\n### Common Patterns Summary\n\n1. **Basic inclusion**: `Table & condition`\n2. **Basic exclusion**: `Table - condition`\n3. **Logical Conjunction (AND-list)**: `Table & cond1 & cond2 & cond3`\n4. 
**Logical Disjunction (OR-list)**: `Table & [cond1, cond2, cond3]`\n\nThese patterns form the building blocks for any restriction query you'll encounter.\n\n:::{seealso}\nFor comprehensive query examples covering restriction and all other operators on a realistic academic database, see the [University Queries](../80-examples/016-university-queries.ipynb) example, which demonstrates these patterns with 2,000 students, multiple departments, course enrollments, and grade tracking.\n:::" } ], "metadata": { diff --git a/book/50-queries/080-subqueries.ipynb b/book/50-queries/080-subqueries.ipynb deleted file mode 100644 index a88d084..0000000 --- a/book/50-queries/080-subqueries.ipynb +++ /dev/null @@ -1,582 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Subqueries and Query Patterns\n", - "\n", - "**Subqueries** are queries nested inside other queries. In DataJoint, subqueries emerge naturally when you use query expressions as restriction conditions. This chapter explores common patterns for answering complex questions using composed queries.\n", - "\n", - "## Understanding Subqueries in DataJoint\n", - "\n", - "In DataJoint, you create subqueries by using one query expression to restrict another. The restriction operator (`&` or `-`) accepts query expressions as conditions, effectively creating a semijoin or antijoin.\n", - "\n", - "### Basic Concept\n", - "\n", - "```python\n", - "# Outer query restricted by inner query (subquery)\n", - "result = OuterTable & InnerQuery\n", - "```\n", - "\n", - "The `InnerQuery` acts as a subquery—its primary key values determine which rows from `OuterTable` are included in the result." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 1: Existence Check (IN)\n", - "\n", - "Find entities that have related records in another table.\n", - "\n", - "### Pattern\n", - "\n", - "```python\n", - "# Find A where matching B exists\n", - "result = A & B\n", - "```\n", - "\n", - "### Example: Students with Enrollments\n", - "\n", - "```python\n", - "# Find all students who are enrolled in at least one course\n", - "enrolled_students = Student & Enroll\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM student\n", - "WHERE student_id IN (SELECT student_id FROM enroll);\n", - "```\n", - "\n", - "### Example: Students with Math Majors\n", - "\n", - "```python\n", - "# Find students majoring in math\n", - "math_students = Student & (StudentMajor & {'dept': 'MATH'})\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM student\n", - "WHERE student_id IN (\n", - " SELECT student_id FROM student_major WHERE dept = 'MATH'\n", - ");\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 2: Non-Existence Check (NOT IN)\n", - "\n", - "Find entities that do NOT have related records in another table.\n", - "\n", - "### Pattern\n", - "\n", - "```python\n", - "# Find A where no matching B exists\n", - "result = A - B\n", - "```\n", - "\n", - "### Example: Students Without Enrollments\n", - "\n", - "```python\n", - "# Find students who are not enrolled in any course\n", - "unenrolled_students = Student - Enroll\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM student\n", - "WHERE student_id NOT IN (SELECT student_id FROM enroll);\n", - "```\n", - "\n", - "### Example: Students Without Math Courses\n", - "\n", - "```python\n", - "# Find students who have never taken a math 
course\n", - "no_math_students = Student - (Enroll & {'dept': 'MATH'})\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM student\n", - "WHERE student_id NOT IN (\n", - " SELECT student_id FROM enroll WHERE dept = 'MATH'\n", - ");\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 3: Multiple Conditions (AND)\n", - "\n", - "Find entities that satisfy multiple conditions simultaneously.\n", - "\n", - "### Pattern\n", - "\n", - "```python\n", - "# Find A where both B1 and B2 conditions are met\n", - "result = (A & B1) & B2\n", - "# Or equivalently\n", - "result = A & B1 & B2\n", - "```\n", - "\n", - "### Example: Students Speaking Both Languages\n", - "\n", - "```python\n", - "# Find people who speak BOTH English AND Spanish\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", - "bilingual = english_speakers & spanish_speakers\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM person\n", - "WHERE person_id IN (\n", - " SELECT person_id FROM fluency WHERE lang_code = 'en'\n", - ")\n", - "AND person_id IN (\n", - " SELECT person_id FROM fluency WHERE lang_code = 'es'\n", - ");\n", - "```\n", - "\n", - "### Example: Students with Major AND Current Enrollment\n", - "\n", - "```python\n", - "# Find students who have declared a major AND are enrolled this term\n", - "active_declared = (Student & StudentMajor) & (Enroll & CurrentTerm)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 4: Either/Or Conditions (OR)\n", - "\n", - "Find entities that satisfy at least one of multiple conditions.\n", - "\n", - "### Pattern Using List Restriction\n", - "\n", - "For simple OR on the same attribute:\n", - "\n", - "```python\n", - "# Find A where condition1 OR condition2\n", - "result = A & [condition1, condition2]\n", - "```\n", - "\n", - "### Pattern Using Union\n", - "\n", - "For OR across different relationships:\n", - "\n", - "```python\n", - "# Find A where B1 OR B2 condition is met\n", - "result = (A & B1) + (A & B2)\n", - "```\n", - "\n", - "### Example: Students in Multiple States\n", - "\n", - "```python\n", - "# Find students from California OR New York (simple OR)\n", - "coastal_students = Student & [{'home_state': 'CA'}, {'home_state': 'NY'}]\n", - "\n", - "# Or using SQL syntax\n", - "coastal_students = Student & 'home_state IN (\"CA\", \"NY\")'\n", - "```\n", - "\n", - "### Example: Students Speaking Either Language\n", - "\n", - "```python\n", - "# Find people who speak English OR Spanish (cross-relationship OR)\n", - "english_speakers = Person & (Fluency & {'lang_code': 'en'})\n", - "spanish_speakers = Person & (Fluency & {'lang_code': 'es'})\n", - "either_language = english_speakers + spanish_speakers\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 5: Exclusion with Condition\n", - "\n", - "Find entities that have some relationship but NOT a specific variant of it.\n", - "\n", - "### Pattern\n", - "\n", - "```python\n", - "# Find A where B exists but B with specific condition does not\n", - "result = (A & B) - (B & specific_condition)\n", - "```\n", - "\n", - "### Example: Non-Fluent Speakers\n", - "\n", - "```python\n", - "# Find people who speak Japanese but are NOT fluent\n", - "japanese_speakers = Person & (Fluency & {'lang_code': 'ja'})\n", - "fluent_japanese = Person & (Fluency & {'lang_code': 'ja', 
'fluency_level': 'fluent'})\n", - "non_fluent_japanese = japanese_speakers - fluent_japanese\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM person\n", - "WHERE person_id IN (\n", - " SELECT person_id FROM fluency WHERE lang_code = 'ja'\n", - ")\n", - "AND person_id NOT IN (\n", - " SELECT person_id FROM fluency \n", - " WHERE lang_code = 'ja' AND fluency_level = 'fluent'\n", - ");\n", - "```\n", - "\n", - "### Example: Students with Incomplete Grades\n", - "\n", - "```python\n", - "# Find students enrolled in current term without grades yet\n", - "currently_enrolled = Student & (Enroll & CurrentTerm)\n", - "graded_this_term = Student & (Grade & CurrentTerm)\n", - "awaiting_grades = currently_enrolled - graded_this_term\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 6: All-or-Nothing (Universal Quantification)\n", - "\n", - "Find entities where ALL related records meet a condition, or where NO related records fail a condition.\n", - "\n", - "### Pattern: All Match\n", - "\n", - "```python\n", - "# Find A where ALL related B satisfy condition\n", - "# Equivalent to: A with B, minus A with B that doesn't satisfy condition\n", - "result = (A & B) - (B - condition)\n", - "```\n", - "\n", - "### Example: All-A Students\n", - "\n", - "```python\n", - "# Find students who have received ONLY 'A' grades (no non-A grades)\n", - "students_with_grades = Student & Grade\n", - "students_with_non_a = Student & (Grade - {'grade': 'A'})\n", - "all_a_students = students_with_grades - students_with_non_a\n", - "```\n", - "\n", - "**SQL Equivalent**:\n", - "```sql\n", - "SELECT * FROM student\n", - "WHERE student_id IN (SELECT student_id FROM grade)\n", - "AND student_id NOT IN (\n", - " SELECT student_id FROM grade WHERE grade <> 'A'\n", - ");\n", - "```\n", - "\n", - "### Example: Languages with Only Fluent Speakers\n", - "\n", - "```python\n", - "# Find languages where all speakers are fluent (no non-fluent speakers)\n", - "languages_with_speakers = Language & Fluency\n", - "languages_with_non_fluent = Language & (Fluency - {'fluency_level': 'fluent'})\n", - "all_fluent_languages = languages_with_speakers - languages_with_non_fluent\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pattern 7: Reverse Perspective\n", - "\n", - "Sometimes you need to flip the perspective—instead of asking about entities, ask about their related entities.\n", - "\n", - "### Example: Languages Without Speakers\n", - "\n", - "```python\n", - "# Find languages that no one speaks\n", - "languages_spoken = Language & Fluency\n", - "unspoken_languages = Language - languages_spoken\n", - "```\n", - "\n", - "### Example: Courses Without Enrollments\n", - "\n", - "```python\n", - "# Find courses with no students enrolled this term\n", - "courses_with_enrollment = Course & (Enroll & CurrentTerm)\n", - "empty_courses = Course - courses_with_enrollment\n", - "```\n", - "\n", - "### Example: Departments Without Majors\n", - "\n", - "```python\n", - "# Find departments that have no declared majors\n", - "departments_with_majors = Department & StudentMajor\n", - "departments_without_majors = Department - departments_with_majors\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Examples from the University Database\n", - "\n", - "### Example 1: Students with Ungraded Enrollments\n", - "\n", - "Find students enrolled in the current term who haven't received grades yet:\n", - "\n", 
- "```python\n", - "# Students enrolled this term\n", - "enrolled_current = Student & (Enroll & CurrentTerm)\n", - "\n", - "# Students with grades this term\n", - "graded_current = Student & (Grade & CurrentTerm)\n", - "\n", - "# Students awaiting grades\n", - "awaiting_grades = enrolled_current - graded_current\n", - "```\n", - "\n", - "### Example 2: Students in Specific Courses\n", - "\n", - "```python\n", - "# Students enrolled in Introduction to CS (CS 1410)\n", - "cs_intro_students = Student & (Enroll & {'dept': 'CS', 'course': 1410})\n", - "\n", - "# Students who have taken both CS 1410 and CS 2420\n", - "cs_1410 = Student & (Enroll & {'dept': 'CS', 'course': 1410})\n", - "cs_2420 = Student & (Enroll & {'dept': 'CS', 'course': 2420})\n", - "both_courses = cs_1410 & cs_2420\n", - "```\n", - "\n", - "### Example 3: High-Performing Students\n", - "\n", - "```python\n", - "# Students with only A or B grades (no C or below)\n", - "students_with_grades = Student & Grade\n", - "students_with_low_grades = Student & (Grade & 'grade NOT IN (\"A\", \"B\")')\n", - "honor_roll = students_with_grades - students_with_low_grades\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Self-Referencing Patterns\n", - "\n", - "Some tables reference themselves through foreign keys, creating hierarchies like management structures or prerequisite chains.\n", - "\n", - "### Management Hierarchy Example\n", - "\n", - "Consider a schema where employees can report to other employees:\n", - "\n", - "```python\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " name : varchar(60)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class ReportsTo(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee\n", - " ---\n", - " -> Employee.proj(manager_id='employee_id')\n", - " \"\"\"\n", - "```\n", - "\n", - "### Finding Managers\n", - "\n", - "```python\n", - "# Employees who have direct reports (are managers)\n", - "managers = Employee & ReportsTo.proj(employee_id='manager_id')\n", - "```\n", - "\n", - "### Finding Top-Level Managers\n", - "\n", - "```python\n", - "# Employees who don't report to anyone\n", - "top_managers = Employee - ReportsTo\n", - "```\n", - "\n", - "### Finding Non-Managers\n", - "\n", - "```python\n", - "# Employees with no direct reports\n", - "non_managers = Employee - ReportsTo.proj(employee_id='manager_id')\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Building Queries Systematically\n", - "\n", - "Complex queries are best built incrementally. 
Follow this approach:\n", - "\n", - "### Step 1: Identify the Target Entity\n", - "\n", - "What type of entity do you want in your result?\n", - "\n", - "### Step 2: List the Conditions\n", - "\n", - "What criteria must the entities satisfy?\n", - "\n", - "### Step 3: Build Each Condition as a Query\n", - "\n", - "Create separate query expressions for each condition.\n", - "\n", - "### Step 4: Combine with Appropriate Operators\n", - "\n", - "- Use `&` for AND conditions\n", - "- Use `-` for NOT conditions\n", - "- Use `+` for OR conditions across different paths\n", - "\n", - "### Step 5: Test Incrementally\n", - "\n", - "Verify each intermediate result.\n", - "\n", - "### Example: Building a Complex Query\n", - "\n", - "**Goal**: Find CS majors who are enrolled this term but haven't received any grades yet.\n", - "\n", - "```python\n", - "# Step 1: Target entity is Student\n", - "# Step 2: Conditions:\n", - "# - Has CS major\n", - "# - Enrolled in current term\n", - "# - No grades in current term\n", - "\n", - "# Step 3: Build each condition\n", - "cs_majors = Student & (StudentMajor & {'dept': 'CS'})\n", - "enrolled_current = Student & (Enroll & CurrentTerm)\n", - "graded_current = Student & (Grade & CurrentTerm)\n", - "\n", - "# Step 4: Combine\n", - "result = cs_majors & enrolled_current - graded_current\n", - "\n", - "# Step 5: Verify counts\n", - "print(f\"CS majors: {len(cs_majors)}\")\n", - "print(f\"Enrolled current term: {len(enrolled_current)}\")\n", - "print(f\"CS majors enrolled, no grades: {len(result)}\")\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary of Patterns\n", - "\n", - "| Pattern | DataJoint | SQL Equivalent |\n", - "|---------|-----------|----------------|\n", - "| Existence (IN) | `A & B` | `WHERE id IN (SELECT ...)` |\n", - "| Non-existence (NOT IN) | `A - B` | `WHERE id NOT IN (SELECT ...)` |\n", - "| AND (both conditions) | `A & B1 & B2` | `WHERE ... AND ...` |\n", - "| OR (either condition) | `(A & B1) + (A & B2)` | `WHERE ... OR ...` |\n", - "| Exclusion | `(A & B) - B_condition` | `WHERE IN (...) AND NOT IN (...)` |\n", - "| Universal (all match) | `(A & B) - (B - condition)` | `WHERE IN (...) AND NOT IN (NOT condition)` |\n", - "\n", - "Key principles:\n", - "1. **Build incrementally** — construct complex queries from simpler parts\n", - "2. **Test intermediate results** — verify each step before combining\n", - "3. **Think in sets** — restriction filters sets, not individual records\n", - "4. 
**Primary key is preserved** — restrictions never change the entity type" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Practice Exercises\n", - "\n", - "### Exercise 1: Existence\n", - "\n", - "**Task**: Find all departments that have at least one student major.\n", - "\n", - "```python\n", - "active_departments = Department & StudentMajor\n", - "```\n", - "\n", - "### Exercise 2: Non-Existence\n", - "\n", - "**Task**: Find students who have never taken a biology course.\n", - "\n", - "```python\n", - "no_bio = Student - (Enroll & {'dept': 'BIOL'})\n", - "```\n", - "\n", - "### Exercise 3: AND Conditions\n", - "\n", - "**Task**: Find students who major in MATH AND have taken at least one CS course.\n", - "\n", - "```python\n", - "math_majors = Student & (StudentMajor & {'dept': 'MATH'})\n", - "took_cs = Student & (Enroll & {'dept': 'CS'})\n", - "math_majors_with_cs = math_majors & took_cs\n", - "```\n", - "\n", - "### Exercise 4: All-A Students\n", - "\n", - "**Task**: Find students who have received only 'A' grades.\n", - "\n", - "```python\n", - "has_grades = Student & Grade\n", - "has_non_a = Student & (Grade - {'grade': 'A'})\n", - "all_a = has_grades - has_non_a\n", - "```\n", - "\n", - "### Exercise 5: Complex Query\n", - "\n", - "**Task**: Find departments where all students have a GPA above 3.0.\n", - "\n", - "```python\n", - "# Students with GPA (computed via aggregation)\n", - "student_gpa = Student.aggr(\n", - " Course * Grade * LetterGrade,\n", - " gpa='SUM(points * credits) / SUM(credits)'\n", - ")\n", - "\n", - "# Students with low GPA\n", - "low_gpa_students = student_gpa & 'gpa < 3.0'\n", - "\n", - "# Departments with low-GPA students\n", - "depts_with_low_gpa = Department & (StudentMajor & low_gpa_students)\n", - "\n", - "# Departments where all students have GPA >= 3.0\n", - "all_high_gpa_depts = (Department & StudentMajor) - depts_with_low_gpa\n", - "```\n", - "\n", - ":::{seealso}\n", - "For more subquery examples, see the [University Queries](../80-examples/016-university-queries.ipynb) example.\n", - ":::" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.11.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}
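
The existence, non-existence, and universal-quantification patterns above can be tried end to end with a short, self-contained sketch. This is a minimal illustration, assuming a configured DataJoint connection and a throwaway schema named `tutorial_subqueries` with simplified `Student` and `Grade` tables (hypothetical names, not the book's full university schema):

```python
import datajoint as dj

# Assumes dj.config already points at a reachable database server;
# the schema name and table definitions below are illustrative only.
schema = dj.Schema('tutorial_subqueries')


@schema
class Student(dj.Manual):
    definition = """
    student_id : int
    ---
    first_name : varchar(30)
    last_name  : varchar(30)
    """


@schema
class Grade(dj.Manual):
    definition = """
    -> Student
    course_id  : int
    ---
    grade      : char(1)
    """


# Existence (IN): students with at least one recorded grade
has_grades = Student & Grade

# Non-existence (NOT IN): students with no grades at all
no_grades = Student - Grade

# Universal quantification: students whose grades are ALL 'A'
# (students with grades, minus students holding any non-'A' grade)
all_a = (Student & Grade) - (Grade - {'grade': 'A'})

# Each intermediate expression can be counted or previewed on its own
print(len(has_grades), len(no_grades), len(all_a))
```

Because restriction preserves the entity type, each intermediate expression (`has_grades`, `no_grades`, `all_a`) remains a query over students and can be inspected before being composed further, which matches the incremental build-and-verify approach recommended above.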