diff --git a/book/20-concepts/00-databases.md b/book/20-concepts/00-databases.md index f2c4f13..894fd71 100644 --- a/book/20-concepts/00-databases.md +++ b/book/20-concepts/00-databases.md @@ -13,14 +13,14 @@ The database not only tracks the current state of the enterprise's processes but **Key traits of databases**: - Structured data reflects the logic of the enterprise's operations - Supports the organization's operations by reflecting and enforcing its rules and constraints (data integrity) -- **Precise access control ensures only authorized users can view or modify specific data** +- Precise access control ensures only authorized users can view or modify specific data - Ability to evolve over time - Facilitates distributed, concurrent access by multiple users - Centralized data consistency, appearing as a single source of data even if physically distributed, reflecting all changes - Allows specific and precise queries through various interfaces for different users ``` -Databases are crucial for the smooth and organized operation of various entities, from hotels and airlines to universities, banks, and research projects. They ensure that processes are accurately tracked, essential rules are enforced, only valid transactions are allowed, and **sensitive data is protected** from unauthorized access. This combination of data integrity and data security makes databases indispensable for any operation where data reliability and confidentiality matter. +Databases are crucial for the smooth and organized operation of various entities, from hotels and airlines to universities, banks, and research projects. They ensure that processes are accurately tracked, essential rules are enforced, only valid transactions are allowed, and sensitive data is protected from unauthorized access. This combination of data integrity and data security makes databases indispensable for any operation where data reliability and confidentiality matter. ## Database Management Systems (DBMS) @@ -29,20 +29,20 @@ A Database Management System (DBMS) is a software system that serves as the comp It defines and enforces the structure of the data, ensuring that the organization's rules are consistently applied. A DBMS manages data storage and efficiently executes data updates and queries while safeguarding the data's structure and integrity, particularly in environments with multiple concurrent users. -**Critically, a DBMS also manages user authentication and authorization**, controlling who can access which data and what operations they can perform. +Critically, a DBMS also manages user authentication and authorization, controlling who can access which data and what operations they can perform. ``` Consider an airline's database for flight schedules and ticket bookings. The airline must adhere to several key rules: * A seat cannot be booked by two passengers for the same flight * A seat is considered reserved only after all details are verified and payment is processed -* **Only authorized ticketing agents can modify reservations** -* **Passengers can view only their own booking information** -* **Financial data is accessible only to accounting staff** +* Only authorized ticketing agents can modify reservations +* Passengers can view only their own booking information +* Financial data is accessible only to accounting staff A robust DBMS enforces such rules reliably, ensuring smooth operations while interacting with multiple users and systems at once. 
The same system that prevents double-booking also prevents unauthorized access to passenger records. -Databases are dynamic, with data continuously updated by both users and systems. Even in the face of disruptions like power outages, errors, or cyberattacks, the DBMS ensures that the system recovers quickly and returns to a stable state. For users, the database should function seamlessly, allowing actions to be performed without interference from others working on the system simultaneously—**while ensuring they can only perform actions they're authorized to do**. +Databases are dynamic, with data continuously updated by both users and systems. Even in the face of disruptions like power outages, errors, or cyberattacks, the DBMS ensures that the system recovers quickly and returns to a stable state. For users, the database should function seamlessly, allowing actions to be performed without interference from others working on the system simultaneously—while ensuring they can only perform actions they're authorized to do. ## Data Security and Access Management @@ -50,7 +50,7 @@ One of the most critical features distinguishing databases from simple file stor ### Authentication and Authorization -Before you can work with a database, you must **authentication**—prove your identity with a username and password. Once authenticated, the database enforces **authorization** rules that determine what you can do: +Before you can work with a database, you must authenticate—prove your identity with a username and password. Once authenticated, the database enforces authorization rules that determine what you can do: - **Read**: View specific tables or columns - **Write**: Add new data to certain tables @@ -109,10 +109,19 @@ This book focuses on **DataJoint**, a framework that extends relational database The relational data model—introduced by Edgar F. Codd in 1970—revolutionized data management by organizing data into tables with well-defined relationships. This model has dominated database systems for over five decades due to its mathematical rigor and versatility. Modern relational databases like MySQL and PostgreSQL continue to evolve, incorporating new capabilities for scalability and security while maintaining the core principles that make them reliable and powerful. The following chapters build the conceptual foundation you need to understand DataJoint's approach: -- **Data Models**: What data models are and why schemas matter for scientific work -- **Relational Theory**: The mathematical foundations that make relational databases powerful -- **Relational Practice**: Hands-on experience with database operations -- **Relational Workflows**: How DataJoint extends relational theory for computational pipelines -- **Scientific Data Pipelines**: How workflows scale into complete research data operations systems +- [Data Models](01-models.md): What data models are and why schemas matter for scientific work +- [Relational Theory](02-relational.md): The mathematical foundations that make relational databases powerful +- [Data Integrity](04-integrity.md): How databases enforce consistency through constraints and referential integrity +- [Relational Workflows](05-workflows.md): How DataJoint extends relational theory for computational pipelines +- [Scientific Data Pipelines](06-pipelines.md): How workflows scale into complete research data operations systems By the end, you'll understand both the mathematical foundations and their practical application to your research.
+ +## Links + +- [MySQL](https://www.mysql.com/) — Popular open-source relational database management system +- [PostgreSQL](https://www.postgresql.org/) — Advanced open-source relational database +- [SQLite](https://www.sqlite.org/) — Embedded relational database engine +- [Google Spanner](https://cloud.google.com/spanner) — Distributed relational database service +- [CockroachDB](https://www.cockroachlabs.com/) — Distributed SQL database +- [DataJoint](https://datajoint.com/) — Framework for scientific data pipelines diff --git a/book/20-concepts/concepts-quiz.md b/book/20-concepts/concepts-quiz.md deleted file mode 100644 index 4b7fddb..0000000 --- a/book/20-concepts/concepts-quiz.md +++ /dev/null @@ -1,1298 +0,0 @@ -# Knowledge Check: Concepts - -This assessment covers Chapters 0-5 of the Database Concepts section. Questions include both single-answer and multiple-answer formats. - -**Instructions:** -- **Single-answer questions [SA]**: Select the ONE best answer -- **Multiple-answer questions [MA]**: Select ALL that apply -- Click "Show Answer" to reveal the correct answer and explanation - -**Scoring:** 82 points maximum -- Single Answer: 46 questions (1 point each) -- Multiple Answer: 18 questions (2 points each if all correct) - ---- - -## Chapter 0: Databases - -### Question 1.1 [SA] -What is the primary distinguishing feature of a database compared to simple file storage? - -A) Databases are larger in size -B) Databases enforce business rules and ensure data integrity -C) Databases use binary file formats -D) Databases require internet connectivity - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The key distinction is that databases actively enforce rules and maintain integrity, not just store data. They ensure valid transactions and prevent inconsistencies. -``` - ---- - -### Question 1.2 [MA] -Which of the following are key traits of databases? (Select all that apply) - -A) Structured data reflects the logic of operations -B) Data cannot be modified once entered -C) Supports distributed, concurrent access by multiple users -D) Allows specific and precise queries -E) Requires all data to be numeric - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** Databases organize data logically (A), support concurrent access (C), and enable precise querying (D). However, data can be modified (B is false), and databases handle all data types, not just numeric (E is false). -``` - ---- - -### Question 1.3 [SA] -What role does a Database Management System (DBMS) play? - -A) It's the physical hardware that stores data -B) It's a backup system for databases -C) It's the software engine that defines structure, enforces rules, and executes queries -D) It's a user interface for data entry - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** A DBMS is the computational engine that defines and enforces data structure, manages storage, and executes queries while maintaining integrity. -``` - ---- - -### Question 1.4 [SA] -In the airline booking example, why is a DBMS essential? - -A) To make the website look attractive -B) To enforce rules like "a seat cannot be double-booked" reliably -C) To store passenger names alphabetically -D) To print boarding passes - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The DBMS actively enforces business rules (like preventing double-booking) automatically, which is critical for operational integrity. 
-``` - ---- - -## Chapter 1: Data Models - -### Question 2.1 [SA] -What is a data model? - -A) A physical database server -B) A conceptual framework defining how data is organized, represented, and transformed -C) A specific database implementation -D) A programming language for databases - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** A data model is the conceptual framework—the principles and constructs—for organizing and working with data, not the implementation itself. -``` - ---- - -### Question 2.2 [SA] -What is a schema? - -A) A sample of actual data -B) A database query -C) A formal specification of data structure that exists separately from the data -D) A programming language - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** A schema is the formal specification of structure (tables, columns, types, relationships) defined separately from any actual data instances. -``` - ---- - -### Question 2.3 [MA] -Which statements correctly describe the difference between structured and schemaless data models? (Select all that apply) - -A) Structured models define schema before storing data -B) Schemaless models have no structure at all -C) Schemaless models embed structure within each data instance -D) Structured models validate data against predefined rules -E) Schemaless models are always better for scientific research - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** Structured models use predefined schemas (A) that validate data (D). Schemaless models do have structure, but it's self-describing within the data (C), not enforced externally. Neither approach is universally "better"—they serve different purposes (E is false). -``` - ---- - -### Question 2.4 [SA] -What is the key difference between metadata and schemas? - -A) Metadata is newer technology than schemas -B) Metadata describes relationships externally; schemas enforce them actively -C) Schemas are only used for small datasets -D) They are the same thing with different names - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Metadata provides descriptive information but relies on external interpretation. Schemas actively enforce structure and relationships through the database system itself. -``` - ---- - -### Question 2.5 [SA] -Using the passenger/luggage analogy from the book, what does a schema represent? - -A) Destination tags on luggage -B) The passenger's travel preferences -C) The assigned seat that's guaranteed and can't be double-booked -D) The passenger's name tag - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** The schema (assigned seat) provides active enforcement—preventing double-booking and ensuring the passenger and luggage travel together. Tags (metadata) just provide information. -``` - ---- - -### Question 2.6 [MA] -Which data models were discussed as essential examples in the book? (Select all that apply) - -A) Binary files -B) Spreadsheets -C) Relational databases -D) JSON/Document databases -E) Quantum databases - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, C, D - -**Explanation:** The book covers binary files (baseline), spreadsheets (familiar), relational (focus), and JSON (modern alternative). Quantum databases were not discussed. -``` - ---- - -### Question 2.7 [SA] -What is a major limitation of spreadsheets for complex scientific workflows? 
- -A) They can't display numbers -B) They have no referential integrity or workflow enforcement -C) They're too expensive -D) They only work on Windows - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Spreadsheets lack referential integrity—formulas can break when rows are deleted, and there's no enforcement of computational dependencies. -``` - ---- - -### Question 2.8 [MA] -According to the book, when should you use structured, schema-enforced approaches? (Select all that apply) - -A) When data integrity is non-negotiable -B) When relationships must remain valid as data evolves -C) When exploring completely unknown data structures -D) When provenance and reproducibility are essential -E) When rapid prototyping with no quality concerns - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D - -**Explanation:** Structured approaches are essential when integrity (A), valid relationships (B), and provenance (D) matter. For pure exploration (C) or when quality doesn't matter (E), flexible approaches may suffice. -``` - ---- - -### Question 2.9 [SA] -What analogy did the book use to compare AI working with unstructured vs. structured data? - -A) A teacher grading random vs. organized essays -B) A detective with disorganized vs. organized evidence -C) A chef with mixed vs. separated ingredients -D) A musician with random vs. sheet music - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The book compared unstructured data to a detective sifting through a disorganized crime scene, versus structured data like organized evidence logs and reports. -``` - ---- - -### Question 2.10 [MA] -What challenges do scientists face with flexible, unstructured data approaches? (Select all that apply) - -A) Heterogeneous datasets lacking consistency -B) Difficulty sharing and publishing data -C) Inability to store any data -D) Need for "data standards" imposed afterward -E) Too much structure limiting creativity - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D - -**Explanation:** Flexible approaches lead to heterogeneous data (A) that's hard to share (B), requiring standards after the fact (D). They can store data fine (C is false), and the problem is too little structure, not too much (E is false). -``` - ---- - -### Question 2.11 [SA] -What distinguishes DataJoint from traditional relational databases? - -A) DataJoint uses a completely different type of database -B) DataJoint treats computational dependencies as first-class schema elements -C) DataJoint doesn't use SQL at all -D) DataJoint is only for spreadsheets - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** DataJoint extends relational theory by making computational dependencies explicit in the schema, not just data relationships. -``` - ---- - -## Chapter 2: Relational Model - -### Question 3.1 [SA] -In relational theory, what is a relation? - -A) A foreign key constraint -B) A subset of a Cartesian product of sets (a set of tuples) -C) A database query -D) A connection between two databases - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Mathematically, a relation is a subset of a Cartesian product—a set of tuples where each tuple is an ordered combination of values from the participating domains. -``` - ---- - -### Question 3.2 [SA] -What is the cardinality of a relation? 
- -A) The number of domains (attributes) in the relation -B) The number of tuples (rows) in the relation -C) The size in megabytes -D) The number of foreign keys - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Cardinality refers to the number of tuples (rows) in the relation. The number of domains is called the "order" or "degree." -``` - ---- - -### Question 3.3 [MA] -Which mathematicians laid the foundations for relational theory? (Select all that apply) - -A) Augustus De Morgan -B) Albert Einstein -C) Georg Cantor -D) Edgar F. Codd -E) Isaac Newton - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** De Morgan developed early relational concepts, Cantor formalized set theory, and Codd applied these to database theory. Einstein and Newton worked in physics, not database foundations. -``` - ---- - -### Question 3.4 [SA] -What was Edgar F. Codd's major contribution? - -A) Inventing the computer -B) Translating mathematical relational theory into a practical data management system -C) Creating the first spreadsheet -D) Developing the Internet - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Codd formalized the relational data model by applying set theory and predicate logic to data management, creating a rigorous mathematical foundation for databases. -``` - ---- - -### Question 3.5 [SA] -What is algebraic closure in relational algebra? - -A) Databases must be closed on weekends -B) Operations on relations produce relations, enabling composition -C) Tables must have fixed sizes -D) Queries must complete quickly - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Algebraic closure means operators take relations as inputs and produce relations as outputs, allowing operations to be composed into complex expressions. -``` - ---- - -### Question 3.6 [MA] -Which are examples of relational algebra operators? (Select all that apply) - -A) Selection (σ) -B) Projection (π) -C) Compilation -D) Join (⋈) -E) Debugging - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D - -**Explanation:** Selection, projection, and join are fundamental relational algebra operators. Compilation and debugging are programming concepts, not relational operations. -``` - ---- - -### Question 3.7 [SA] -Who introduced the Entity-Relationship Model (ERM)? - -A) Edgar F. Codd -B) Peter Chen -C) Bill Gates -D) Tim Berners-Lee - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Peter Chen introduced the Entity-Relationship Model in 1976 as a conceptual approach to database design. -``` - ---- - -### Question 3.8 [SA] -What problem does the Entity-Relationship Model solve? - -A) Making databases faster -B) Providing an intuitive, visual way to design databases before implementation -C) Replacing relational databases -D) Eliminating the need for SQL - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The ERM provides a conceptual modeling layer with visual diagrams (ERDs) that help designers think about entities and relationships before implementing in SQL. -``` - ---- - -### Question 3.9 [MA] -What are the "three levels of abstraction" in relational database thinking? 
(Select all that apply) - -A) Mathematical foundation (Codd) -B) Physical hardware -C) Conceptual modeling (Chen - ERM) -D) Implementation language (SQL) -E) User interface design - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** The three levels are: mathematical theory (Codd's relational algebra), conceptual design (Chen's ERM), and implementation (SQL). Hardware and UI are not part of this framework. -``` - ---- - -### Question 3.10 [SA] -Why does the book argue that structured approaches emerged from "mathematical rigor, not rigidity"? - -A) To justify bureaucracy -B) To show that schemas provide provable properties and formal guarantees -C) To make databases more complex -D) To eliminate all flexibility - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The mathematical foundations enable provable query optimization, formal integrity guarantees, and principled evolution—practical benefits, not arbitrary restrictions. -``` - ---- - -### Question 3.11 [MA] -What practical benefits do mathematical foundations provide for scientific research? (Select all that apply) - -A) Query optimizers can prove query equivalence -B) Constraints provide guaranteed integrity -C) Eliminates need for any planning -D) Declarative queries map to scientific questions -E) Schemas can be evolved with mathematical backing - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D, E - -**Explanation:** Mathematical foundations enable proven optimization (A), guaranteed integrity (B), declarative expression (D), and principled evolution (E). They don't eliminate planning needs (C). -``` - ---- - -### Question 3.12 [SA] -What is referential integrity? - -A) Making sure column names are spelled correctly -B) Ensuring relationships between tables remain valid (foreign keys exist) -C) Backing up the database regularly -D) Running queries quickly - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Referential integrity, enforced by foreign keys, ensures that references between tables remain valid—you can't have orphaned records. -``` - ---- - -### Question 3.13 [MA] -According to the book, what capabilities are missing from traditional relational databases for computational workflows? (Select all that apply) - -A) "This result was computed FROM this input" semantics -B) Storing large amounts of data -C) Automatic recomputation when inputs change -D) Running queries -E) Tracking which code version produced results - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, E - -**Explanation:** Traditional databases can store data (B) and run queries (D) fine, but lack computational dependency semantics (A), automatic recomputation (C), and built-in provenance tracking (E). -``` - ---- - -## Chapter 3: Relational Databases in Practice - -### Question 4.1 [SA] -In the research lab database example, what does a foreign key in the Experiment table referencing Researcher accomplish? - -A) It stores the researcher's email -B) It ensures every experiment is linked to an existing researcher -C) It makes queries run faster -D) It deletes old experiments - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The foreign key constraint ensures referential integrity—you can't create an experiment referencing a non-existent researcher. -``` - ---- - -### Question 4.2 [SA] -What does the `ON DELETE CASCADE` clause do? 
- -A) Speeds up delete operations -B) Prevents any deletions -C) Automatically removes dependent records when parent is deleted -D) Sends an email notification - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** CASCADE means when you delete a parent record, all child records that depend on it are automatically deleted too. -``` - ---- - -### Question 4.3 [SA] -In the quality control scenario where Recording #1 had incorrect amplifier gain, what problem did the traditional database approach reveal? - -A) The database was too slow -B) There's no automatic tracking that NeuralUnit depends computationally on Recording -C) The database couldn't store the correction -D) SQL syntax was too complex - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The database knows NeuralUnit references Recording (foreign key), but doesn't know the spike rates were *computed from* the recording and need recomputation when it changes. -``` - ---- - -### Question 4.4 [MA] -What operations did the book demonstrate in the SQL section? (Select all that apply) - -A) SELECT with WHERE clauses -B) JOIN to combine related tables -C) INSERT to add new records -D) TIME TRAVEL to past states -E) UPDATE to modify existing data - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, C, E - -**Explanation:** The chapter covered SELECT (A), JOIN (B), INSERT (C), and UPDATE (E). Time travel is not a standard SQL operation (D). -``` - ---- - -### Question 4.5 [SA] -Why is UPDATE problematic for derived data in scientific workflows? - -A) UPDATE is too slow -B) UPDATE requires special permissions -C) UPDATE can modify computed results without recomputing, breaking provenance -D) UPDATE only works on small tables - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** You can UPDATE a computed result without actually recomputing it, silently breaking the connection between the result and its source data. -``` - ---- - -### Question 4.6 [MA] -What's missing from traditional relational databases for scientific workflows? (Select all that apply) - -A) Temporal semantics (when/how data was created) -B) The ability to store data -C) Computational dependencies (this was derived from that) -D) Automatic execution when inputs are ready -E) Sending emails - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** Traditional databases lack temporal awareness (A), computational dependency tracking (C), and automatic execution (D). They can store data fine (B), and email is not a database feature (E). -``` - ---- - -### Question 4.7 [SA] -In an Entity-Relationship Diagram using Crow's Foot notation, what does `||--o{` mean? - -A) One-to-one relationship -B) Many-to-many relationship -C) One-to-many relationship (one on left, many on right) -D) The database is broken - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** In Crow's Foot notation, `||` means "exactly one" and `o{` means "zero or many," indicating a one-to-many relationship. -``` - ---- - -### Question 4.8 [SA] -What is the proper order for inserting data when foreign keys exist? 
- -A) Any order is fine -B) Parent entities must be inserted before child entities -C) Child entities must be inserted before parent entities -D) All data must be inserted simultaneously - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Foreign key constraints require that referenced records exist, so parents must be inserted before children. -``` - ---- - -## Chapter 4: Relational Workflows - -### Question 5.1 [SA] -What is the core innovation of the Relational Workflow Model? - -A) Replacing SQL with a new language -B) Making databases faster -C) Treating the database schema as an executable workflow specification -D) Eliminating all constraints - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** The schema itself specifies the workflow—what depends on what, what's computed how, creating an executable specification, not just a data structure. -``` - ---- - -### Question 5.2 [MA] -What are the four fundamental concepts of the Relational Workflow Model? (Select all that apply) - -A) Workflow Entity -B) Shopping Cart -C) Workflow Dependencies -D) Workflow Steps -E) Directed Acyclic Graph (DAG) - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D, E - -**Explanation:** The four concepts are: Workflow Entity (A), Workflow Dependencies (C), Workflow Steps (D), and DAG structure (E). Shopping Cart is not a database concept (B). -``` - ---- - -### Question 5.3 [SA] -What distinguishes a "workflow entity" from a traditional entity? - -A) Workflow entities are larger -B) Workflow entities are created at a specific step in a workflow -C) Workflow entities cannot have foreign keys -D) Workflow entities are only text - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Workflow entities are artifacts of workflow execution—they exist because a specific workflow step created them. -``` - ---- - -### Question 5.4 [SA] -What is a Directed Acyclic Graph (DAG)? - -A) A graph with cycles -B) A graph with no direction -C) A graph with directed edges and no cycles -D) A type of chart for presentations - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** A DAG has directed edges (arrows showing dependencies) but no cycles (no circular dependencies), ensuring workflows can execute without infinite loops. -``` - ---- - -### Question 5.5 [MA] -What are the four table tiers in DataJoint? (Select all that apply) - -A) Lookup tables -B) Shopping tables -C) Manual tables -D) Imported tables -E) Computed tables - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D, E - -**Explanation:** DataJoint's four tiers are: Lookup (reference data), Manual (human-entered), Imported (from instruments), and Computed (derived). "Shopping tables" is not a tier. -``` - ---- - -### Question 5.6 [SA] -In DataJoint's visual representation, what color represents computed tables? - -A) Green -B) Blue -C) Red -D) Gray - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** Red indicates computed tables (automated processing), green is manual, blue is imported, gray is lookup. -``` - ---- - -### Question 5.7 [SA] -How does DataJoint handle relationships differently from traditional ERM? 
- -A) DataJoint doesn't allow relationships -B) Relationships emerge from workflow convergence, not explicit junction tables -C) DataJoint requires manual relationship definition -D) Relationships must be defined twice - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** In DataJoint, relationships emerge naturally when workflows converge—you don't need explicit "relationship" concepts or junction tables. -``` - ---- - -### Question 5.8 [SA] -What is "computational validity" in DataJoint? - -A) Code must compile without errors -B) Results must remain consistent with their current inputs -C) Queries must return quickly -D) All tables must be the same size - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Computational validity means if Result R was computed from Input I, then R must correspond to the current state of I (or both must be deleted/recomputed together). -``` - ---- - -### Question 5.9 [SA] -What happens in DataJoint when you delete an upstream entity? - -A) Nothing—the database allows it -B) Only that entity is deleted -C) The operation cascades to delete all dependent downstream entities -D) The database crashes - -```{admonition} Show Answer -:class: dropdown - -**Answer:** C - -**Explanation:** DataJoint enforces computational validity by cascading deletes to remove all dependent entities, preventing orphaned results. -``` - ---- - -### Question 5.10 [SA] -What is the proper way to correct an error in upstream data in DataJoint? - -A) Just UPDATE the wrong value -B) Delete the incorrect data (cascading to dependents), reinsert corrected data, recompute -C) Keep the error and document it -D) Create a duplicate entry - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** To maintain computational validity, delete the error (which cascades), insert the correction, then recompute—ensuring all results reflect corrected inputs. -``` - ---- - -### Question 5.11 [SA] -What does the `populate()` operation do in DataJoint? - -A) Fills tables with random data -B) Automatically identifies missing work and computes results in correct order -C) Deletes old data -D) Backs up the database - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** `populate()` finds what needs computing based on the schema dependencies and executes computations in the correct order automatically. -``` - ---- - -### Question 5.12 [MA] -What are the five core query operators in DataJoint? (Select all that apply) - -A) Restriction (&) -B) Compilation -C) Join (*) -D) Projection (.proj()) -E) Aggregation (.aggr()) -F) Union -G) Deletion - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D, E, F - -**Explanation:** The five operators are: Restriction (A), Join (C), Projection (D), Aggregation (E), and Union (F). Compilation and deletion are not query operators. -``` - ---- - -### Question 5.13 [SA] -Why does DataJoint emphasize immutability by default? - -A) To make the database read-only -B) To preserve workflow execution history and provenance -C) To save disk space -D) To make queries faster - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Immutability preserves provenance—you can always trace results back to inputs. Changes happen via delete-and-reinsert, maintaining computational validity. -``` - ---- - -### Question 5.14 [SA] -What does "schema as executable specification" mean? 
- -A) The schema includes JavaScript code -B) The schema defines both structure AND how computations flow -C) The schema can be run as a program -D) The schema is written in Python - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The schema specifies not just data structure, but the entire workflow: what depends on what, what's computed how, creating an executable specification. -``` - ---- - -### Question 5.15 [MA] -How does the Relational Workflow Model address ERM's limitations? (Select all that apply) - -A) Adds temporal dimension (when entities are created) -B) Eliminates all foreign keys -C) Treats relationships as workflow convergence -D) Provides unified design-implementation (no translation gap) -E) Makes databases slower but more accurate - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** The Workflow Model adds temporal awareness (A), handles relationships through convergence (C), and unifies design/implementation (D). It keeps foreign keys (B is false) and doesn't sacrifice performance (E is false). -``` - ---- - -### Question 5.16 [SA] -What does the book mean by "from transactions to transformations"? - -A) Databases should process credit cards -B) A shift from storage-centric to transformation-centric thinking -C) SQL should be replaced -D) Databases should transform into spreadsheets - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** This represents the conceptual shift from thinking of databases as storage systems to thinking of them as workflow engines that transform data through computational steps. -``` - ---- - -## Chapter 5: Scientific Data Pipelines - -### Question 6.1 [SA] -What is a scientific data pipeline according to the book? - -A) Just a database with additional tables -B) A comprehensive data operations system managing the complete lifecycle of scientific data -C) A file backup system for research data -D) A programming language for scientists - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** A scientific data pipeline is more than just a database—it's a comprehensive system that manages data from acquisition to publication, integrates diverse tools, and enables collaboration across teams. -``` - ---- - -### Question 6.2 [MA] -What are the three components of the DataJoint open-source core? (Select all that apply) - -A) Relational database -B) Web browser -C) Code repository -D) Object store -E) Spreadsheet software - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** The open-source core consists of the relational database (system of record), code repository (schema definitions and computational methods), and object store (for large scientific data objects). -``` - ---- - -### Question 6.3 [SA] -What role does the relational database play in the DataJoint Platform architecture? - -A) It only stores configuration files -B) It serves as the system of record with structured storage and referential integrity -C) It handles only user authentication -D) It replaces the need for any code - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The relational database serves as the system of record, providing structured tabular storage, referential integrity through foreign keys, ACID-compliant transactions, and declarative query capabilities. 
-``` - ---- - -### Question 6.4 [MA] -What are the four categories of functional extensions in the DataJoint Platform? (Select all that apply) - -A) Interactions -B) Database migration -C) Infrastructure provisioning -D) Automation -E) Orchestration -F) File compression - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D, E - -**Explanation:** The four functional extension categories are: Interactions (visual tools, ELN, IDE), Infrastructure provisioning (security, compute resources), Automation (AI agents, populate), and Orchestration (ingest, collaboration, publishing). -``` - ---- - -### Question 6.5 [SA] -What is the purpose of the object store in the DataJoint architecture? - -A) To replace the relational database entirely -B) To manage large scientific datasets while maintaining relational integrity -C) To store only text files -D) To provide email functionality - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The object store handles large scientific data (images, recordings, videos) using scalable storage while the database maintains metadata and referential integrity—a hybrid approach combining scalability with query power. -``` - ---- - -### Question 6.6 [SA] -What does the book mean by "the schema is central" in scientific data pipelines? - -A) Schemas are stored in the center of the database -B) The schema defines data structures, dependencies, and computational flow as a single source of truth -C) Only database administrators can access schemas -D) Schemas must be written in a central location - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The schema is central because it's the single source of truth—defining not just data structures but also dependencies and computational flow. Every component from import scripts to dashboards operates on the same schema-defined structures. -``` - ---- - -### Question 6.7 [MA] -According to the comparison table, what advantages does a DataJoint pipeline have over file-based approaches? (Select all that apply) - -A) Data structure is explicit in schema definition -B) Provenance is automatic through referential integrity -C) Files are always smaller -D) Queries use composable algebra -E) Collaboration uses concurrent database access - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D, E - -**Explanation:** DataJoint pipelines offer explicit schemas (A), automatic provenance (B), composable queries (D), and concurrent access (E). File size is not inherently smaller with DataJoint (C is false). -``` - ---- - -### Question 6.8 [SA] -What triggers automated computations in the DataJoint pipeline workflow? - -A) Manual execution of each step -B) The `populate()` mechanism that identifies missing computations -C) Email notifications to researchers -D) Random scheduling - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** The `populate()` mechanism automatically identifies missing computations and executes them in dependency order. When new data enters the system, downstream computations propagate automatically. -``` - ---- - -### Question 6.9 [MA] -What does the "Interactions" category of functional extensions include? 
(Select all that apply) - -A) Pipeline Navigator for visual exploration -B) Electronic Lab Notebook integration -C) Database backup systems -D) Integrated Development Environment support -E) Visualization dashboards - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, D, E - -**Explanation:** Interactions include: Pipeline Navigator (schema exploration), ELN (lab documentation), IDE support (Jupyter, VS Code), and Visualization Dashboards (exploring results). Backups are part of infrastructure, not interactions. -``` - ---- - -### Question 6.10 [SA] -What does the DataJoint Specs 2.0 document define? - -A) Marketing materials for DataJoint -B) Standards, conventions, and best practices for designing DataJoint pipelines -C) Hardware requirements for servers -D) Pricing information - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** DataJoint Specs 2.0 formally defines standards for pipeline structure, table tiers, attribute types, query operators, computation models, and object storage—ensuring interoperability and consistent best practices. -``` - ---- - -## Comprehensive Questions - -### Question 7.1 [MA] -Which statements accurately describe the progression from metadata to schemas to workflows? (Select all that apply) - -A) Metadata describes relationships externally -B) Schemas enforce relationships through the database -C) Workflows add computational dependencies to schemas -D) Each level replaces the previous one -E) All three approaches can work together - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, B, C, E - -**Explanation:** Metadata describes (A), schemas enforce (B), workflows add computational semantics (C), and they complement each other (E). They don't replace each other (D is false)—they serve different purposes. -``` - ---- - -### Question 7.2 [SA] -A scientist discovers that a raw measurement file was corrupted. In a DataJoint workflow, what's the proper response? - -A) UPDATE all dependent results to mark them as questionable -B) Delete the corrupt measurement (cascading to all results), fix the file, reinsert, and repopulate -C) Keep the corrupt data and add a note -D) Only fix the measurement and hope results are still valid - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** DataJoint's approach: delete the corrupt data (cascading to maintain computational validity), insert corrected data, then recompute everything—ensuring all results reflect valid inputs. -``` - ---- - -### Question 7.3 [MA] -What advantages does DataJoint provide over traditional relational databases for scientific computing? (Select all that apply) - -A) Faster raw query performance -B) Explicit computational dependencies in schema -C) Automatic recomputation when inputs change -D) Built-in provenance tracking -E) No need for any documentation - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B, C, D - -**Explanation:** DataJoint adds computational dependencies (B), automatic recomputation (C), and provenance tracking (D) to relational foundations. Raw performance is similar (A), and documentation is still valuable (E is false). -``` - ---- - -### Question 7.4 [SA] -If you see this in a DataJoint diagram: `Session → Recording → Analysis`, what does it mean? 
- -A) These are three unrelated tables -B) Recording depends on Session; Analysis depends on Recording -C) You must query them in alphabetical order -D) They have the same structure - -```{admonition} Show Answer -:class: dropdown - -**Answer:** B - -**Explanation:** Arrows represent dependencies: Recording is created from Session data, Analysis is computed from Recording data—a workflow pipeline. -``` - ---- - -### Question 7.5 [MA] -Why is the mathematical foundation of relational databases important for science? (Select all that apply) - -A) It enables provable query optimization -B) It makes databases more expensive -C) It provides formal integrity guarantees -D) It supports declarative queries matching scientific questions -E) It eliminates all human judgment - -```{admonition} Show Answer -:class: dropdown - -**Answer:** A, C, D - -**Explanation:** Mathematical foundations enable proven optimization (A), formal guarantees (C), and declarative expression (D). They don't increase cost (B) or eliminate judgment (E). -``` - ---- - -## Scoring Guide - -```{list-table} Grade Scale -:header-rows: 1 - -* - Score Range - - Percentage - - Assessment -* - 74-82 points - - 90-100% - - Excellent mastery of database concepts -* - 66-73 points - - 80-89% - - Good understanding with minor gaps -* - 57-65 points - - 70-79% - - Adequate comprehension, review some topics -* - 49-56 points - - 60-69% - - Basic familiarity, significant review needed -* - Below 49 - - <60% - - Comprehensive review recommended -``` - -**Total Points:** 82 -- Single Answer: 46 questions × 1 point = 46 points -- Multiple Answer: 18 questions × 2 points = 36 points - -**Topic Coverage:** -- Chapter 0 (Databases): 5% (4 questions) -- Chapter 1 (Data Models): 20% (11 questions) -- Chapter 2 (Relational Model): 24% (13 questions) -- Chapter 3 (Practical Implementation): 15% (8 questions) -- Chapter 4 (Workflow Model): 29% (16 questions) -- Chapter 5 (Scientific Data Pipelines): 18% (10 questions) -- Synthesis Questions: 9% (5 questions) diff --git a/book/30-design/010-schema.ipynb b/book/30-design/010-schema.ipynb index 7393534..445d040 100644 --- a/book/30-design/010-schema.ipynb +++ b/book/30-design/010-schema.ipynb @@ -40,113 +40,12 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Using the `schema` Object\n", - "\n", - "The schema object groups related tables together and helps prevent naming conflicts.\n", - "\n", - "By convention, the object created by `dj.Schema` is named `schema`. Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.\n", - "\n", - "The schema object serves multiple purposes:\n", - "* **Creating Tables**: Used as a *class decorator* (`@schema`) to declare tables within the schema. \n", - "For details, see the next section, [Create Tables](010-table.ipynb)\n", - "* **Visualizing the Schema**: Generates diagrams to illustrate relationships between tables.\n", - "* **Exporting Data**: Facilitates exporting data for external use or backup.\n", - "\n", - "With this foundation, you are ready to begin declaring tables and building your data pipeline." - ] + "source": "# Using the `schema` Object\n\nThe schema object groups related tables together and helps prevent naming conflicts.\n\nBy convention, the object created by `dj.Schema` is named `schema`. 
Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.\n\nThe schema object serves multiple purposes:\n* **Creating Tables**: Used as a *class decorator* (`@schema`) to declare tables within the schema. \nFor details, see the next section, [Create Tables](015-table.ipynb)\n* **Visualizing the Schema**: Generates diagrams to illustrate relationships between tables.\n* **Exporting Data**: Facilitates exporting data for external use or backup.\n\nWith this foundation, you are ready to begin declaring tables and building your data pipeline." }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Working with Multi-Schema Databases\n", - "\n", - "Organizing larger databases into multiple smaller schemas (or modules) enhances clarity, modularity, and maintainability. In DataJoint, schemas serve as namespaces that group related tables together, while Python modules provide a corresponding organizational structure for the database code.\n", - "\n", - "## Convention: One Database Schema = One Python Module\n", - "\n", - "DataJoint projects are typically organized with each database schema mapped to a single Python module (`.py` file). This convention:\n", - "\n", - "* Promotes modularity by grouping all tables of a schema within one module.\n", - "* Ensures clarity by maintaining a single schema object per module.\n", - "* Avoids naming conflicts and simplifies dependency management.\n", - "\n", - "Each module declares its own schema object and defines all associated tables. Downstream schemas explicitly import upstream schemas to manage dependencies.\n", - "\n", - "## Dependency Management and Acyclic Design\n", - "\n", - "In multi-schema databases, dependencies between tables and schemas must form a Directed Acyclic Graph (DAG). Cyclic dependencies are not allowed. This ensures:\n", - "* Foreign key constraints maintain logical order without forming loops.\n", - "* Python module imports align with the dependency structure of the database.\n", - "\n", - "**Key Principles**:\n", - "1. Tables can reference each other within a schema or across schemas using foreign keys.\n", - "2. Dependencies should be topologically sorted, ensuring upstream schemas are imported into downstream schemas.\n", - "\n", - "# Advantages of Multi-Schema Design\n", - "1. **Modularity**: Each schema focuses on a specific aspect of the pipeline (e.g., acquisition, processing, analysis).\n", - "2. **Separation of Concerns**: Clear boundaries between schemas simplify navigation and troubleshooting.\n", - "3. **Scalability**: Isolated schemas enable easier updates and scaling as projects grow.\n", - "4. **Collaboration**: Teams can work on separate modules independently without conflicts.\n", - "5. **Maintainability**: Modular design facilitates version control and debugging.\n", - "\n", - "# Defining Complex Databases with Multiple Schemas in DataJoint\n", - "\n", - "In DataJoint, defining **multiple schemas across separate Python modules** ensures that large, complex projects remain well-organized, modular, and maintainable. Each schema should be defined in a **dedicated Python module** to adhere to best practices. This structure ensures that every module maintains **only one `schema` object**, and **downstream schemas import upstream schemas** to manage dependencies correctly. 
This approach improves code clarity, enables better version control, and simplifies collaboration across teams.\n", - "\n", - "The database schema and its Python module usually have similar names, although they need not be identical. \n", - "\n", - "Tables can form foreign key dependencies within modules and but also across modules. \n", - "In DataJoint, Such dependencies must be acyclic within each schema: dependencies cannot form closed cycles, so that the graph of dependences forms a DAG (directed acyclic graph). \n", - "Then also database modules form a directed acyclic graph at a higher level: the python modules should never form cyclic import dependences and their database schemas must be topologically sorted in the same way so that tables cannot make foreign key dependencies into tables that are in downstream schemas.\n", - "\n", - "\n", - "## Why Use Multiple Schemas in Separate Modules?\n", - "\n", - "Using multiple schemas across separate modules offers the following benefits:\n", - "\n", - "1. **Modularity and Code Organization**: Each module contains only the tables relevant to a specific schema, making the codebase easier to manage and navigate.\n", - "2. **Clear Boundaries Between Schemas**: Ensures a separation of concerns, where each schema focuses on a specific aspect of the pipeline (e.g., acquisition, processing, analysis).\n", - "3. **Dependency Management**: Downstream schemas explicitly **import upstream schemas** to manage table dependencies and data flow.\n", - "4. **Collaboration**: Multiple developers or teams can work on separate modules without conflicts.\n", - "5. **Scalability and Maintainability**: Isolating schemas into modules simplifies future updates and troubleshooting.\n", - "\n", - "\n", - "## How to Structure Modules for Multiple Schemas\n", - "\n", - "Below is an example that demonstrates how to organize multiple schemas in separate Python modules.\n", - "\n", - "# Example Project Structure\n", - "\n", - "Here’s an example of how to organize multiple schemas in a DataJoint project:\n", - "\n", - "```\n", - "my_pipeline/\n", - "│\n", - "├── subject.py # Defines subject_management schema\n", - "├── acquisition.py # Defines acquisition schema (depends on subject_management)\n", - "├── processing.py # Defines processing schema (depends on acquisition)\n", - "└── analysis.py # Defines analysis schema (depends on processing)\n", - "```\n", - "\n", - "## Step-by-Step Example\n", - "\n", - "1. `subject.py`:\n", - " * Defines the `subject_management` schema.\n", - " * Contains the Subject table and related entities.\n", - "2. `acquisition.py`:\n", - " * Defines the `acquisition` schema.\n", - " * Depends on subject_management for subject-related data.\n", - "3. `processing.py`:\n", - " * Defines the `processing` schema.\n", - " * Depends on `acquisition` for data to process.\n", - "4. `analysis.py`:\n", - " * Defines the `analysis` schema.\n", - " * Depends on `processing` for processed data to analyze.\n", - "\n", - "By adhering to these principles, large projects remain modular, scalable, and easy to maintain.\n" - ] + "source": "# Multi-Schema Pipelines\n\nAs pipelines grow, you will organize tables into multiple schemas. Each schema groups related tables together—for example, `subject`, `acquisition`, `processing`, and `analysis`.\n\n## Simple Scripts vs. Full Projects\n\nFor **learning, exploration, and simple pipelines**, you can define schemas directly in Python scripts or Jupyter notebooks, just like the examples throughout this book. 
This is the easiest way to get started:\n\n```python\n# Simple script: my_pipeline.py\nimport datajoint as dj\n\nschema = dj.Schema('my_experiment')\n\n@schema\nclass Subject(dj.Manual):\n definition = \"\"\"\n subject_id : int\n ---\n subject_name : varchar(100)\n \"\"\"\n\n@schema \nclass Session(dj.Manual):\n definition = \"\"\"\n -> Subject\n session_date : date\n ---\n notes : varchar(500)\n \"\"\"\n```\n\nFor **production deployment** with multiple collaborators, version control, and automated workers, you should organize the pipeline as a proper Python package. See [Pipeline Projects](090-pipeline-project.md) for the full project structure including:\n* Standard layout with `src/workflow/`\n* Configuration with `pyproject.toml`\n* Docker deployment\n\n## Convention: One Schema = One Module\n\nWhether using simple scripts or full projects, the fundamental convention is: **one database schema corresponds to one Python module** (or one script/notebook for simple cases).\n\nThis ensures:\n* Each module has exactly one `schema` object\n* Clear dependency management between schemas\n* No circular imports\n\n## Example Schema Module\n\nHere is a typical schema module defining the `subject` schema:" }, { "cell_type": "code", diff --git a/book/30-design/082-indexes.ipynb b/book/30-design/082-indexes.ipynb index cc14d2d..ab892e7 100644 --- a/book/30-design/082-indexes.ipynb +++ b/book/30-design/082-indexes.ipynb @@ -2,61 +2,100 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, + "id": "cell-0", + "metadata": {}, "source": [ - "# Indexes\n", + "---\n", + "title: Indexes\n", + "authors:\n", + " - name: Dimitri Yatsenko\n", + "date: 2024-10-22\n", + "---\n", + "\n", + "# Indexes: Accelerating Data Lookups\n", + "\n", + "As tables grow to thousands or millions of records, query performance becomes critical. **Indexes** are data structures that enable fast lookups by specific attributes, dramatically reducing query times from scanning every row to near-instantaneous retrieval.\n", + "\n", + "Think of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.\n", "\n", - "Table indexes are data structures that allow fast lookups by an indexed attribute or combination of attributes.\n", + "```{admonition} Learning Objectives\n", + ":class: note\n", + "\n", + "By the end of this chapter, you will:\n", + "- Understand how indexes accelerate database queries\n", + "- Recognize the three mechanisms that create indexes in DataJoint\n", + "- Declare explicit secondary indexes for frequently queried attributes\n", + "- Understand composite index ordering and its impact on queries\n", + "- Know when to use regular vs. 
unique indexes\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cell-1", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "This chapter assumes familiarity with:\n", + "- [Primary Keys](020-primary-key.md) — Understanding unique entity identification\n", + "- [Foreign Keys](030-foreign-keys.ipynb) — Understanding table relationships\n", + "- [Create Tables](015-table.ipynb) — Basic table declaration syntax" + ] + }, + { + "cell_type": "markdown", + "id": "cell-2", + "metadata": {}, + "source": [ + "## How Indexes Are Created in DataJoint\n", "\n", - "In DataJoint, indexes are created by one of the three mechanisms:\n", + "In DataJoint, indexes are created through three mechanisms:\n", "\n", - "1. Primary key \n", - "2. Foreign key \n", - "3. Explicitly defined indexes\n", + "| Mechanism | Index Type | Purpose |\n", + "|-----------|------------|--------|\n", + "| **Primary key** | Unique index (automatic) | Fast lookups by entity identifier |\n", + "| **Foreign key** | Secondary index (automatic) | Fast joins and referential integrity checks |\n", + "| **Explicit declaration** | Secondary index (manual) | Fast lookups by frequently queried attributes |\n", "\n", - "The first two mechanisms are obligatory. Every table has a primary key, which serves as an unique index. Therefore, restrictions by a primary key are very fast. Foreign keys create additional indexes unless a suitable index already exists." + "The first two mechanisms are **automatic**—every table has a primary key index, and foreign keys create indexes unless a suitable one already exists. The third mechanism gives you control over additional indexes for your specific query patterns." ] }, { "cell_type": "markdown", + "id": "cell-3", "metadata": {}, "source": [ - "Let's test this principle. Let's create a table with a 10,000 entries and compare lookup times:" + "## Demonstrating Index Performance\n", + "\n", + "Let's create a table with many entries and measure the performance difference between indexed and non-indexed lookups." ] }, { "cell_type": "code", "execution_count": null, + "id": "cell-4", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2024-10-22 23:59:07,736][INFO]: Connecting root@localhost:3306\n", - "[2024-10-22 23:59:07,776][INFO]: Connected root@localhost:3306\n" - ] - } - ], + "outputs": [], "source": [ "import datajoint as dj\n", + "import random\n", + "\n", "schema = dj.Schema('indexes')" ] }, { "cell_type": "markdown", + "id": "cell-5", "metadata": {}, "source": [ - "Let's say a mouse in the lab has a lab-specific ID but it also has a separate id issued by the animal facility." 
+ "Consider a mouse tracking scenario where each mouse has a lab-specific ID (primary key) and a separate tag ID issued by the animal facility:" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, + "id": "cell-6", "metadata": {}, "outputs": [], "source": [ @@ -71,211 +110,117 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, + "id": "cell-7", "metadata": {}, "outputs": [], "source": [ - "import random\n", "def populate_mice(table, n=200_000):\n", - " \"\"\"insert a bunch of mice\"\"\"\n", + " \"\"\"Insert random mouse records for testing.\"\"\"\n", " table.insert(\n", - " ((random.randint(1,1000_000_000), random.randint(1,1000_000_000)) \n", - " for i in range(n)), skip_duplicates=True)" + " ((random.randint(1, 1_000_000_000), random.randint(1, 1_000_000_000)) \n", + " for i in range(n)), \n", + " skip_duplicates=True\n", + " )\n", + "\n", + "populate_mice(Mouse())" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, + "id": "cell-8", "metadata": {}, "outputs": [], "source": [ - "populate_mice(Mouse())" + "Mouse()" ] }, { - "cell_type": "code", - "execution_count": 30, + "cell_type": "markdown", + "id": "cell-9", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*mouse_id tag_id \n", - "+----------+ +-----------+\n", - "1235 611073359 \n", - "1642 81899632 \n", - "1659 172762154 \n", - "1935 758844555 \n", - "2423 391592578 \n", - "2672 519030374 \n", - "4578 393772628 \n", - "5642 857624955 \n", - "5997 271256353 \n", - "7088 820229475 \n", - "7577 347597850 \n", - "8821 40618313 \n", - " ...\n", - " (Total: 999506)" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "Mouse()" + "### Primary Key Lookup (Fast)\n", + "\n", + "Searching by `mouse_id` uses the primary key index—this is extremely fast:" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, + "id": "cell-10", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.41 ms ± 483 μs per loop (mean ± std. dev. of 3 runs, 6 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n6 -r3\n", "\n", - "# efficient! Uses the primary key\n", + "# Fast: Uses the primary key index\n", "(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()" ] }, { - "cell_type": "code", - "execution_count": 33, + "cell_type": "markdown", + "id": "cell-11", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "210 ms ± 5.51 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)\n" - ] - } - ], "source": [ - "%%timeit -n6 -r3\n", + "### Non-Indexed Lookup (Slow)\n", "\n", - "# inefficient! Requires a full table scan\n", - "(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()" + "Searching by `tag_id` requires scanning every row in the table—this is slow:" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, + "id": "cell-12", "metadata": {}, + "outputs": [], "source": [ - "The indexed searches are much faster!" + "%%timeit -n6 -r3\n", + "\n", + "# Slow: Requires a full table scan\n", + "(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()" ] }, { "cell_type": "markdown", + "id": "cell-13", "metadata": {}, "source": [ - "To make searches faster on fields other than the primary key or a foreign key, you can add a secondary index explicitly. \n", - "\n", - "Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in the table declration (below the primary key divide). \n", + "```{admonition} Performance Impact\n", + ":class: important\n", "\n", - "Indexes can be declared with unique constraint as `unique index (attr1, ..., attrN)`." + "The indexed search is typically **100x faster** than the full table scan. This difference grows even larger as the table size increases. For tables with millions of records, unindexed searches can take seconds or minutes, while indexed searches remain nearly instantaneous.\n", + "```" ] }, { "cell_type": "markdown", + "id": "cell-14", "metadata": {}, "source": [ - "Let's redeclare the table with a unique index on `tag_id`." 
+ "## Declaring Secondary Indexes\n", + "\n", + "To speed up searches on non-primary-key attributes, you can declare **secondary indexes** explicitly in the table definition.\n", + "\n", + "### Syntax\n", + "\n", + "Indexes are declared below the `---` line in the table definition:\n", + "\n", + "```\n", + "index(attr1, ..., attrN) # Regular index\n", + "unique index(attr1, ..., attrN) # Unique index (enforces uniqueness)\n", + "```\n", + "\n", + "### Example: Adding a Unique Index\n", + "\n", + "Since each mouse should have a unique `tag_id`, we can add a unique index:" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, + "id": "cell-15", "metadata": {}, "outputs": [], "source": [ @@ -285,372 +230,127 @@ " mouse_id : int # lab-specific ID\n", " ---\n", " tag_id : int # animal facility ID\n", - " unique index (tag_id)\n", + " unique index(tag_id)\n", " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 39, + "execution_count": null, + "id": "cell-16", "metadata": {}, "outputs": [], "source": [ "populate_mice(Mouse2())" ] }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*mouse_id tag_id \n", - "+----------+ +-----------+\n", - "2662 735424844 \n", - "2851 647655500 \n", - "4056 274020 \n", - "4171 30761877 \n", - "4468 379575468 \n", - "4719 739577052 \n", - "4969 840248340 \n", - "5564 764263886 \n", - "5614 953650373 \n", - "7234 486537164 \n", - "7660 316951185 \n", - "7884 522730603 \n", - " ...\n", - " (Total: 998935)" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Mouse2()" - ] - }, { "cell_type": "markdown", + "id": "cell-17", "metadata": {}, "source": [ - "Now both types of searches are equally efficient!" + "Now both types of lookups are equally fast:" ] }, { "cell_type": "code", - "execution_count": 44, + "execution_count": null, + "id": "cell-18", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.31 ms ± 233 μs per loop (mean ± std. dev. of 3 runs, 6 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n6 -r3\n", "\n", - "#efficient! Uses the primary key\n", + "# Fast: Uses the primary key index\n", "(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()" ] }, { "cell_type": "code", - "execution_count": 46, + "execution_count": null, + "id": "cell-19", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.29 ms ± 403 μs per loop (mean ± std. dev. of 3 runs, 6 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n6 -r3\n", "\n", - "#efficient! Uses the seconary index on tag_id\n", + "# Fast: Uses the secondary index on tag_id\n", "(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()" ] }, { "cell_type": "markdown", + "id": "cell-20", "metadata": {}, "source": [ - "Let's now imagine that rats in the `Rat` table are identified by the combination of lab the `lab_name` and `rat_id` in each lab:" + "```{admonition} Regular vs. Unique Index\n", + ":class: tip\n", + "\n", + "- **Regular index** `index(attr)`: Speeds up lookups but allows duplicate values\n", + "- **Unique index** `unique index(attr)`: Speeds up lookups AND enforces that all values must be distinct\n", + "\n", + "Use `unique index` when the attribute should be unique (like facility tag IDs), and regular `index` when duplicates are allowed (like dates or categories).\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 7, + "cell_type": "markdown", + "id": "cell-21", "metadata": {}, - "outputs": [], "source": [ - "import random" + "## Composite Index Ordering\n", + "\n", + "When a primary key (or index) contains multiple attributes, the **order matters**. 
The index can only be used efficiently when searching from the leftmost attribute.\n", + "\n", + "This is analogous to searching in a dictionary that orders words alphabetically:\n", + "- Searching by the **first letters** is easy (use the index)\n", + "- Searching by the **last letters** requires scanning every word\n", + "\n", + "Let's demonstrate with a multi-attribute primary key:" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, + "id": "cell-22", "metadata": {}, "outputs": [], "source": [ "@schema\n", "class Rat(dj.Manual):\n", " definition = \"\"\"\n", - " lab_name : char(16) \n", - " rat_id : int unsigned # lab-specific ID\n", + " lab_name : char(16) # name of the lab\n", + " rat_id : int unsigned # lab-specific rat ID\n", " ---\n", - " date_of_birth = null : date\n", + " date_of_birth = null : date # birth date (optional)\n", " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, + "id": "cell-23", "metadata": {}, "outputs": [], "source": [ "def populate_rats(table):\n", + " \"\"\"Insert random rat records for testing.\"\"\"\n", " lab_names = (\"Cajal\", \"Kandel\", \"Moser\", \"Wiesel\")\n", - " for date_of_birth in (None, \"2024-10-01\", \n", - " \"2024-10-02\", \"2024-10-03\", \"2024-10-04\"):\n", - " table.insert((\n", - " (random.choice(lab_names), random.randint(1, 1_000_000_000), date_of_birth) \n", - " for i in range(100_000)), skip_duplicates=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ + " dates = (None, \"2024-10-01\", \"2024-10-02\", \"2024-10-03\", \"2024-10-04\")\n", + " for date_of_birth in dates:\n", + " table.insert(\n", + " ((random.choice(lab_names), random.randint(1, 1_000_000_000), date_of_birth) \n", + " for i in range(100_000)), \n", + " skip_duplicates=True\n", + " )\n", + "\n", "populate_rats(Rat)" ] }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*lab_name *rat_id date_of_birth \n", - "+----------+ +--------+ +------------+\n", - "Cajal 6494 None \n", - "Cajal 26940 2024-10-02 \n", - "Cajal 29126 None \n", - "Cajal 30792 None \n", - "Cajal 34446 2024-10-04 \n", - "Cajal 37934 2024-10-03 \n", - "Cajal 40620 2024-10-01 \n", - "Cajal 59767 2024-10-02 \n", - "Cajal 66056 2024-10-02 \n", - "Cajal 73809 None \n", - "Cajal 93984 2024-10-02 \n", - "Cajal 103254 2024-10-02 \n", - " ...\n", - " (Total: 499979)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Rat()" - ] - }, { "cell_type": "code", "execution_count": null, + "id": "cell-24", "metadata": {}, "outputs": [], "source": [ @@ -659,180 +359,100 @@ }, { "cell_type": "markdown", + "id": "cell-25", "metadata": {}, "source": [ - "Note that dispite the fact that `rat_id` is in the index, search by `rat_id` alone are not helped by the index because it is not first in the index. This is similar to search for a word in a dictionary that orders words alphabetically. Searching by the first letters of a word is easy but searching by the last few letters of a word requires scanning the whole dictionary." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this table, the primary key is a unique index on the combination `(lab_id, rat_id)`. Therefore searches on these attributes or on `lab_id` alone are fast. But this index cannot help searches on `rat_id` alone:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "141 ms ± 8.25 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)\n" - ] - } - ], - "source": [ - "%%timeit -n2 -r10\n", + "The primary key creates an index on `(lab_name, rat_id)`. This means:\n", "\n", - "# inefficient! Requires full table scan.\n", - "(Rat() & {'rat_id': 300}).fetch()" + "| Query Pattern | Uses Index? | Performance |\n", + "|---------------|-------------|-------------|\n", + "| `lab_name` only | Yes | Fast |\n", + "| `lab_name` + `rat_id` | Yes | Fast |\n", + "| `rat_id` only | No | Slow (full scan) |" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, + "id": "cell-26", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.42 ms ± 482 μs per loop (mean ± std. dev. of 10 runs, 2 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n2 -r10\n", "\n", - "# efficient! Uses the primary key\n", + "# Fast: Uses the primary key index (both attributes)\n", "(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, + "id": "cell-27", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "125 ms ± 10.1 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n2 -r10\n", "\n", - "# inefficient! Requires a full table scan\n", - "len(Rat & {'rat_id': 500})" + "# Slow: rat_id is not first in the index, requires full table scan\n", + "(Rat() & {'rat_id': 300}).fetch()" ] }, { "cell_type": "markdown", + "id": "cell-28", "metadata": {}, "source": [ - "Pattern searches in strings can benefit from an index when the starting characters are specified." 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "480 ms ± 32.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" - ] - } - ], - "source": [ - "%%timeit -n2 -r2\n", + "```{admonition} Composite Index Rule\n", + ":class: warning\n", "\n", - "# efficient! Uses the primary key\n", - "len(Rat & 'lab_name=\"Cajal\"')" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "545 ms ± 30.9 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)\n" - ] - } - ], - "source": [ - "%%timeit -n2 -r2\n", + "A composite index on `(A, B, C)` can efficiently search for:\n", + "- `A` alone\n", + "- `A` and `B` together \n", + "- `A`, `B`, and `C` together\n", "\n", - "# inefficient! requires a full table scan\n", - "len(Rat & 'lab_name LIKE \"%jal\"')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similarly, searching by the date requires an inefficient full-table scan:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "786 ms ± 14.3 ms per loop (mean ± std. dev. of 6 runs, 3 loops each)\n" - ] - } - ], - "source": [ - "%%timeit -n3 -r6\n", + "But it **cannot** efficiently search for:\n", + "- `B` alone\n", + "- `C` alone\n", + "- `B` and `C` together (without `A`)\n", "\n", - "len(Rat & 'date_of_birth > \"2024-10-02\"')" + "If you frequently search by these patterns, add explicit indexes.\n", + "```" ] }, { "cell_type": "markdown", + "id": "cell-29", "metadata": {}, "source": [ - "To speed up searches by the `rat_id` and `date_of_birth`, we can explicit indexes to `Rat`:" + "### Adding Indexes for Common Query Patterns\n", + "\n", + "If we frequently need to search by `rat_id` alone or by `date_of_birth`, we should add explicit indexes:" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, + "id": "cell-30", "metadata": {}, "outputs": [], "source": [ "@schema\n", "class Rat2(dj.Manual):\n", " definition = \"\"\"\n", - " lab_name : char(16) \n", - " rat_id : int unsigned # lab-specific ID\n", + " lab_name : char(16) # name of the lab\n", + " rat_id : int unsigned # lab-specific rat ID\n", " ---\n", - " date_of_birth = null : date\n", + " date_of_birth = null : date # birth date (optional)\n", "\n", - " index(rat_id)\n", - " index(date_of_birth)\n", + " index(rat_id) # enables fast lookup by rat_id alone\n", + " index(date_of_birth) # enables fast lookup by date\n", " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, + "id": "cell-31", "metadata": {}, "outputs": [], "source": [ @@ -841,367 +461,216 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, + "id": "cell-32", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2 ms ± 587 μs per loop (mean ± std. dev. of 6 runs, 3 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n3 -r6\n", "\n", - "# efficient! uses index on rat_id\n", + "# Fast: Uses the secondary index on rat_id\n", "(Rat2() & {'rat_id': 300}).fetch()" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, + "id": "cell-33", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "398 ms ± 27.9 ms per loop (mean ± std. dev. 
of 2 runs, 2 loops each)\n" - ] - } - ], + "outputs": [], "source": [ "%%timeit -n2 -r2\n", "\n", - "# efficient! uses index on date_of_birth\n", + "# Fast: Uses the secondary index on date_of_birth\n", "len(Rat2 & 'date_of_birth = \"2024-10-02\"')" ] }, { "cell_type": "markdown", + "id": "cell-34", "metadata": {}, "source": [ - "#### Quiz: How many indexes does the table `Rat` have?" + "## String Pattern Matching and Indexes\n", + "\n", + "Indexes on string columns follow similar rules. Pattern searches with `LIKE` can only use an index when the **starting characters** are specified:" ] }, { "cell_type": "code", "execution_count": null, + "id": "cell-35", "metadata": {}, "outputs": [], "source": [ - "Rat.describe();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Answer\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Three: primary key, rat_id, date_of_birth" + "%%timeit -n2 -r2\n", + "\n", + "# Fast: Exact match uses the index\n", + "len(Rat & 'lab_name=\"Cajal\"')" ] }, { "cell_type": "code", "execution_count": null, + "id": "cell-36", "metadata": {}, "outputs": [], "source": [ - "# To re-run the notebook, drop the schema to create anew\n", - "# schema.drop() " + "%%timeit -n2 -r2\n", + "\n", + "# Slow: Wildcard at start prevents index use\n", + "len(Rat & 'lab_name LIKE \"%jal\"')" ] }, { "cell_type": "markdown", + "id": "cell-37", "metadata": {}, "source": [ - "# Indexes in SQL" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "import pymysql\n", - "import os\n", - "pymysql.install_as_MySQLdb()\n", + "```{admonition} String Pattern Matching\n", + ":class: tip\n", "\n", - "connection_string = \"mysql://{user}:{password}@{host}\".format(\n", - " user=os.environ['DJ_USER'],\n", - " host=os.environ['DJ_HOST'],\n", - " password=os.environ['DJ_PASS']\n", - ")\n", + "- `LIKE \"Caj%\"` — **Can use index** (known prefix)\n", + "- `LIKE \"%jal\"` — **Cannot use index** (unknown prefix, requires full scan)\n", + "- `LIKE \"%aja%\"` — **Cannot use index** (unknown prefix)\n", "\n", - "%load_ext sql\n", - "%sql $connection_string" + "Design your queries to search by prefix when possible.\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 36, + "cell_type": "markdown", + "id": "cell-38", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "(pymysql.err.ProgrammingError) (1007, \"Can't create database 'dimitri_indexes'; database exists\")\n", - "[SQL: create database dimitri_indexes]\n", - "(Background on this error at: https://sqlalche.me/e/14/f405)\n" - ] - } - ], "source": [ - "%%sql\n", + "## Viewing Table Indexes\n", "\n", - "create database indexes" + "Use the `describe()` method to see all indexes defined on a table:" ] }, { "cell_type": "code", - "execution_count": 37, + "execution_count": null, + "id": "cell-39", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "5 rows affected.\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
" - ], - "text/plain": [ - "[('mouse',), ('mouse2',), ('rat',), ('rat2',), ('~log',)]" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "%%sql\n", - "\n", - "SHOW TABLES in indexes;" + "Rat2.describe();" ] }, { - "cell_type": "code", - "execution_count": 38, + "cell_type": "markdown", + "id": "cell-40", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "(pymysql.err.OperationalError) (1050, \"Table 'mouse' already exists\")\n", - "[SQL: CREATE TABLE dimitri_indexes.mouse(\n", - "mouse_id int NOT NULL,\n", - "tag_id int NOT NULL,\n", - "primary key(mouse_id)\n", - ")]\n", - "(Background on this error at: https://sqlalche.me/e/14/e3q8)\n" - ] - } - ], "source": [ - "%%sql\n", + "## Equivalent SQL Syntax\n", "\n", - "CREATE TABLE indexes.mouse (\n", - "mouse_id int NOT NULL,\n", - "tag_id int NOT NULL,\n", - "primary key(mouse_id)\n", - ")" + "For reference, here's how indexes are declared in standard SQL:\n", + "\n", + "**(DataJoint)**\n", + "```python\n", + "@schema\n", + "class Mouse(dj.Manual):\n", + " definition = \"\"\"\n", + " mouse_id : int\n", + " ---\n", + " tag_id : int\n", + " unique index(tag_id)\n", + " \"\"\"\n", + "```\n", + "\n", + "**(Equivalent SQL)**\n", + "```sql\n", + "CREATE TABLE mouse (\n", + " mouse_id INT NOT NULL,\n", + " tag_id INT NOT NULL,\n", + " PRIMARY KEY (mouse_id),\n", + " UNIQUE INDEX (tag_id)\n", + ");\n", + "```\n", + "\n", + "You can also add indexes to existing tables in SQL:\n", + "```sql\n", + "-- Add a regular index\n", + "CREATE INDEX idx_date ON rat (date_of_birth);\n", + "\n", + "-- Add a unique index\n", + "CREATE UNIQUE INDEX idx_tag ON mouse (tag_id);\n", + "\n", + "-- Remove an index\n", + "DROP INDEX idx_tag ON mouse;\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 39, + "cell_type": "markdown", + "id": "cell-41", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "0 rows affected.\n" - ] - }, - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "%%sql\n", + "## Quiz\n", + "\n", + "```{admonition} Question\n", + ":class: note\n", "\n", - "drop table dimitri_indexes.mouse" + "How many indexes does the table `Rat2` have? 
What are they?\n", + "```" ] }, { "cell_type": "code", - "execution_count": 40, + "execution_count": null, + "id": "cell-42", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "0 rows affected.\n" - ] - }, - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "%%sql\n", - "\n", - "CREATE TABLE dimitri_indexes.mouse(\n", - "mouse_id int NOT NULL,\n", - "tag_id int NOT NULL,\n", - "primary key(mouse_id),\n", - "unique index (tag_id)\n", - ")" + "# Check the table definition to see all indexes\n", + "Rat2.describe();" ] }, { - "cell_type": "code", - "execution_count": 41, + "cell_type": "markdown", + "id": "cell-43", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "0 rows affected.\n" - ] - }, - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "%%sql\n", + "```{admonition} Answer\n", + ":class: tip\n", + ":class: dropdown\n", "\n", - "CREATE UNIQUE INDEX mouse_idx ON dimitri_indexes.mouse (tag_id)" + "**Three indexes:**\n", + "1. Primary key index on `(lab_name, rat_id)` — automatic\n", + "2. Secondary index on `rat_id` — explicit\n", + "3. Secondary index on `date_of_birth` — explicit\n", + "```" ] }, { - "cell_type": "code", - "execution_count": 45, + "cell_type": "markdown", + "id": "cell-44", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " * mysql://dimitri:***@db.ust-data-sci.net\n", - "0 rows affected.\n" - ] - }, - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%sql\n", - "\n", - "DROP INDEX mouse_idx ON dimitri_indexes.mouse;" - ] + "source": "## Summary\n\nIndexes are essential for query performance in tables with many records:\n\n1. **Primary keys** automatically create unique indexes for fast entity lookups\n2. **Foreign keys** automatically create secondary indexes for fast joins\n3. **Explicit indexes** can be added for frequently queried non-key attributes\n4. **Composite index order matters** — only leftmost attributes benefit from the index\n5. **Unique indexes** enforce uniqueness in addition to speeding up lookups\n\n```{admonition} When to Add Indexes\n:class: tip\n\nAdd secondary indexes when:\n- You frequently query by a non-key attribute\n- Queries on large tables are slow\n- You need to enforce uniqueness on a non-primary-key attribute\n\nDon't over-index: Each index adds overhead to insert/update operations and uses storage space. 
Only index attributes that are actually queried frequently.\n```\n\n```{admonition} Next Steps\n:class: note\n\nNow that you understand how to optimize queries with indexes, explore:\n- [Queries](../50-queries/005-queries.ipynb) — Writing efficient database queries\n- [Pipeline Projects](090-pipeline-project.md) — Designing complete data pipelines\n```" }, { "cell_type": "code", "execution_count": null, + "id": "cell-45", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "# To re-run the notebook, drop the schema to create anew\n", + "# schema.drop()" + ] } ], "metadata": { "kernelspec": { - "display_name": "base", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.10" + "version": "3.9" } }, "nbformat": 4, - "nbformat_minor": 2 -} + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/book/30-design/090-pipeline-project.md b/book/30-design/090-pipeline-project.md new file mode 100644 index 0000000..57e2edb --- /dev/null +++ b/book/30-design/090-pipeline-project.md @@ -0,0 +1,320 @@ +--- +title: Pipeline Projects +authors: + - name: Dimitri Yatsenko +--- + +# Pipeline Projects + +DataJoint pipelines can range from simple scripts to full-fledged software projects. This chapter describes how to organize a pipeline for **production deployment**. + +## When Do You Need a Full Project? + +**Simple scripts and notebooks** work well for: +- Learning DataJoint +- Exploratory analysis +- Small pipelines with a single user +- Examples and tutorials (like those in this book) + +**A full project structure** is recommended when you need: +- Version control with Git for the pipeline code +- Multiple collaborators working on the same pipeline +- Automated computation workers +- Reproducible deployment with Docker +- Object storage configuration +- Installation as a Python package + +## Pipeline ≡ Git Repository + +A production DataJoint pipeline is implemented as a dedicated **Git repository** containing a Python package. 
This repository serves as the single source of truth for the entire pipeline definition: + +- **Schema definitions** — Table structures and relationships +- **Computation logic** — The `make` methods for automated tables +- **Configuration** — Object storage settings +- **Dependencies** — Required packages and environment specifications +- **Documentation** — Usage guides and API references +- **Containerization** — Docker configurations for reproducible environments + +## Standard Project Structure + +The pipeline code lives in `src/workflow/`, following the modern Python `src` layout: + +``` +my_pipeline/ +├── LICENSE # Project license (e.g., MIT, Apache 2.0) +├── README.md # Project documentation +├── pyproject.toml # Project metadata and configuration +├── .gitignore # Git ignore patterns +│ +├── src/ +│ └── workflow/ # Python package directory +│ ├── __init__.py # Package initialization +│ ├── subject.py # subject schema module +│ ├── acquisition.py # acquisition schema module +│ ├── processing.py # processing schema module +│ └── analysis.py # analysis schema module +│ +├── notebooks/ # Jupyter notebooks for exploration +│ ├── 01-data-entry.ipynb +│ └── 02-analysis.ipynb +│ +├── docs/ # Documentation sources +│ └── index.md +│ +├── docker/ # Docker configurations +│ ├── Dockerfile +│ ├── docker-compose.yaml +│ └── .env.example +│ +└── tests/ # Test suite + └── test_pipeline.py +``` + +### Directory Purposes + +| Directory | Purpose | +|-----------|---------| +| `src/workflow/` | Pipeline code — schema modules with table definitions | +| `notebooks/` | Interactive exploration and analysis notebooks | +| `docs/` | Documentation sources | +| `docker/` | Containerization for reproducible deployment | +| `tests/` | Unit and integration tests | + +## Database Schema ≡ Python Module + +Each database schema corresponds to a Python module within `src/workflow/`: + +| Database Construct | Python Construct | +|---|---| +| Database schema | Python module (`.py` file) | +| Database table | Python class | + +```{figure} ../95-reference/figures/schema-illustration.png +:width: 600px +:align: center + +Each database schema corresponds to a Python module containing related table definitions. +``` + +Each module defines a `schema` object and uses it to declare tables: + +```python +# src/workflow/subject.py +import datajoint as dj + +schema = dj.Schema('subject') + +@schema +class Subject(dj.Manual): + definition = """ + subject_id : int + --- + subject_name : varchar(100) + """ +``` + +## Pipeline as a DAG of Modules + +A pipeline forms a **Directed Acyclic Graph (DAG)** where: + +- **Nodes** are schema modules +- **Edges** represent dependencies (Python imports and foreign key bundles) + +```{figure} ../95-reference/figures/pipeline-illustration.png +:width: 600px +:align: center + +Schemas form a DAG where edges represent both Python imports and foreign key relationships. +``` + +Downstream modules import upstream modules: + +```python +# src/workflow/acquisition.py +import datajoint as dj +from . import subject # Import upstream module + +schema = dj.Schema('acquisition') + +@schema +class Session(dj.Manual): + definition = """ + -> subject.Subject # Foreign key to upstream schema + session_id : int + --- + session_date : date + """ +``` + +**Cyclic dependencies are prohibited** — both Python imports and foreign keys must form a DAG. 
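+
+One way to confirm that the schemas form a DAG is to render the combined pipeline diagram. The following is a minimal sketch, assuming the four example modules above are importable from the installed `workflow` package and that the optional graph-drawing dependencies are available:
+
+```python
+# Sketch: visualize the cross-schema DAG in a Jupyter session
+import datajoint as dj
+from workflow import subject, acquisition, processing, analysis
+
+# Diagrams combine with `+`; the result shows every table and the
+# foreign-key edges that connect the schemas.
+diagram = (
+    dj.Diagram(subject)
+    + dj.Diagram(acquisition)
+    + dj.Diagram(processing)
+    + dj.Diagram(analysis)
+)
+diagram  # renders the pipeline graph inline in a notebook
+```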
+ +## Project Configuration + +### pyproject.toml + +The `pyproject.toml` file defines project metadata, dependencies, and object storage configuration: + +```toml +[build-system] +requires = ["setuptools>=61.0"] +build-backend = "setuptools.build_meta" + +[project] +name = "my-pipeline" +version = "0.1.0" +description = "A DataJoint pipeline for experiments" +readme = "README.md" +license = {file = "LICENSE"} +requires-python = ">=3.9" +dependencies = [ + "datajoint>=0.14", + "numpy", +] + +[project.optional-dependencies] +dev = ["pytest", "jupyter"] + +[tool.setuptools.packages.find] +where = ["src"] + +# Object storage configuration +[tool.datajoint.stores.main] +protocol = "s3" +endpoint = "s3.amazonaws.com" +bucket = "my-pipeline-data" +location = "raw" +``` + +### Object Storage + +The `[tool.datajoint.stores]` section configures external storage for large data objects: + +| Setting | Description | +|---------|-------------| +| `protocol` | Storage protocol (`s3`, `file`, etc.) | +| `endpoint` | Storage server endpoint | +| `bucket` | Bucket or root directory name | +| `location` | Subdirectory within the bucket | + +Tables reference stores for `object` attributes: + +```python +@schema +class Recording(dj.Imported): + definition = """ + -> Session + --- + raw_data : object@main # Stored in 'main' store + """ +``` + +## Docker Deployment + +### Dockerfile + +```dockerfile +FROM python:3.11-slim + +WORKDIR /app + +RUN apt-get update && apt-get install -y \ + default-libmysqlclient-dev \ + build-essential \ + && rm -rf /var/lib/apt/lists/* + +COPY pyproject.toml README.md LICENSE ./ +COPY src/ ./src/ +RUN pip install -e . + +CMD ["python", "-m", "workflow.worker"] +``` + +### docker-compose.yaml + +```yaml +services: + db: + image: mysql:8.0 + environment: + MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD} + volumes: + - mysql_data:/var/lib/mysql + ports: + - "3306:3306" + + minio: + image: minio/minio + command: server /data --console-address ":9001" + environment: + MINIO_ROOT_USER: ${MINIO_ROOT_USER} + MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD} + volumes: + - minio_data:/data + ports: + - "9000:9000" + - "9001:9001" + + worker: + build: + context: .. + dockerfile: docker/Dockerfile + environment: + DJ_HOST: db + DJ_USER: root + DJ_PASS: ${MYSQL_ROOT_PASSWORD} + depends_on: + - db + - minio + +volumes: + mysql_data: + minio_data: +``` + +## Managed Deployment with DataJoint Platform + +For teams that prefer managed infrastructure over DIY deployment, the [DataJoint Platform](https://datajoint.com) is specifically designed for hosting and managing full DataJoint projects. The platform provides: + +- Managed databases and object storage +- Automated computation orchestration +- Web-based data exploration and visualization +- Team collaboration tools +- Enterprise support + +This eliminates the need to configure and maintain your own database servers, storage backends, and worker infrastructure while following the same project conventions described in this chapter. + +## Best Practices + +1. **One schema per module** — Never define multiple schemas in one module + +2. **Clear naming** — Schema names use lowercase with underscores; table classes use CamelCase + +3. **Explicit imports** — Import upstream modules at the top of each file: + ```python + from . import subject + from . import acquisition + ``` + +4. **Credentials in environment** — Keep database credentials in environment variables, not in code + +5. 
**Use the src layout** — Prevents accidental imports from the project root + +## Summary + +A production DataJoint pipeline project: + +1. Lives in a **Git repository** +2. Contains a **LICENSE** file +3. Places code in **`src/workflow/`** +4. Maps **one schema to one module** +5. Forms a **DAG** with no cyclic dependencies +6. Configures **object storage** in `pyproject.toml` +7. Includes **Docker** configurations for deployment + +For simple scripts and learning, see the examples throughout this book. Use this full project structure when you're ready for production deployment. + +:::{seealso} +- [Create Schemas](010-schema.ipynb) — Declaring schemas and tables +- [Orchestration](../40-operations/060-orchestration.ipynb) — Running pipelines at scale +- [DataJoint Specs](../95-reference/SPECS_2_0.md) — Complete specification reference +:::
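+
+As a final illustration of best practice 4, a worker or notebook can pull its connection settings from the same environment variables that `docker-compose.yaml` passes to the worker service. This is a minimal sketch under that assumption; recent DataJoint versions may also read these variables automatically, making the explicit assignments unnecessary:
+
+```python
+# Sketch: configure the DataJoint connection from environment variables
+# (DJ_HOST, DJ_USER, DJ_PASS, as set for the worker service in docker-compose.yaml)
+import os
+import datajoint as dj
+
+dj.config['database.host'] = os.environ['DJ_HOST']
+dj.config['database.user'] = os.environ['DJ_USER']
+dj.config['database.password'] = os.environ['DJ_PASS']
+
+dj.conn()  # connect using the configured credentials
+```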