[BUG]: Multi-table generation crashes/hangs on 3-level schemas when parents have no children (empty chunks) — local mode + DP #731

Description

@Anna-Pav

Bug Description

On a 3-level hierarchical schema (grandparent → parent → child) in LOCAL mode with differential privacy enabled, generation intermittently fails when many parent rows have no children. The same config can succeed on one run and fail on the next, because it depends on whether a generation batch happens to contain only childless parents (→ an empty chunk). The failure surfaces as one of several errors, all rooted in empty-chunk handling in the multi-table generation path. In one case the SDK also hung indefinitely (the wait-loop never returned after the worker subprocess died).
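The hang suggests the wait loop keeps polling without noticing that the worker has exited. Below is a minimal sketch of a liveness-aware wait loop using only the standard library. `wait_for_worker` and its parameters are hypothetical names for illustration and are not the SDK's actual API; the real loop may poll job status rather than a process handle.

```python
import subprocess
import sys
import time


def wait_for_worker(proc: subprocess.Popen, poll_interval: float = 0.1,
                    timeout: float = 30.0) -> int:
    """Block until the worker exits; raise instead of waiting forever.

    Hypothetical helper, not part of mostlyai.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            if code != 0:
                # Surface the crash instead of continuing to wait.
                raise RuntimeError(f"worker exited with code {code}")
            return code
        time.sleep(poll_interval)
    proc.kill()
    raise TimeoutError(f"worker did not finish within {timeout} s")


# Simulate a worker subprocess that dies with a non-zero exit code.
crashed = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
try:
    wait_for_worker(crashed)
except RuntimeError as exc:
    print(exc)
```

Checking the exit code on every iteration turns a dead worker into an immediate error, and the deadline bounds the wait even if the process never exits.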

Python Version

Python 3.11

Steps to Reproduce

Data shape that triggers it

A 3-level chain where a large fraction of grandparents have zero parent rows
(e.g. ~54% of "accounts" have no "mortgages"), so during generation some batches
produce zero rows for the parent table.
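
To check whether a dataset has this shape, the childless rate can be measured directly. This is a small sketch on toy data; `childless_rate` is a hypothetical helper, and the table and column names are placeholders rather than the real schema.

```python
import pandas as pd


def childless_rate(parent: pd.DataFrame, child: pd.DataFrame,
                   parent_key: str, child_fk: str) -> float:
    """Fraction of parent rows that no child row references."""
    has_child = parent[parent_key].isin(child[child_fk])
    return float((~has_child).mean())


# Toy data: 200 accounts, but only accounts 1..60 have mortgages.
accounts = pd.DataFrame({"account_id": range(1, 201)})
mortgages = pd.DataFrame({
    "mortgage_id": range(1, 121),
    "account_id": [(i % 60) + 1 for i in range(120)],
})

rate = childless_rate(accounts, mortgages, "account_id", "account_id")
print(f"{rate:.0%} of accounts have no mortgages")
```

The higher this rate, the more likely it is that a generation batch contains only childless rows.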

Errors observed (all on the same schema; which one appears varies)

  1. pyarrow.lib.ArrowTypeError: Array type doesn't match type of values set: string vs null
    • in mostlyai/sdk/_data/context.py::add_gpc_context, at the
      read_data_prefixed(..., where={gp_pk: chunk[parent_context_key]}) call.
    • Confirmed by instrumentation: at crash time the filter Series is
      len=0, dtype=string — an empty IN-filter, which pyarrow infers as
      null-typed and cannot match the string key column.
  2. KeyError: '<parent>$prev::<col>'
    • in mostlyai/sdk/_data/context.py::add_ns_context (~line 121), at
      ctx_data[[root_key, cur_name]].groupby(root_key).apply(add_previous_values_ctx)[prev_name]
    • the groupby/apply on empty data returns a frame without prev_name.
  3. RuntimeError: Cannot pack empty tensors
    • in DP-SGD training (opacus) via mostlyai/engine/_tabular/training.py,
      when value_protection=False (sequence-length protection no longer masks
      empty sequences).

Minimal reproduction (representative — generic data, high childless rate)

import pandas as pd
from mostlyai.sdk import MostlyAI

# grandparent: 200 customers, but only the first 60 ever have orders -> 140 childless
customers = pd.DataFrame({"customer_id": range(1, 201), "segment": (["A", "B"] * 100)})
orders = pd.DataFrame({
    "order_id": range(1, 121),
    "customer_id": [(i % 60) + 1 for i in range(120)],     # only customers 1..60
    "channel": (["web", "store"] * 60),
})
items = pd.DataFrame({
    "item_id": range(1, 301),
    "order_id": [(i % 120) + 1 for i in range(300)],
    "qty": [1 + (i % 4) for i in range(300)],
})

dp = {"differential_privacy": {"max_epsilon": 1.0, "delta": 1e-6}}
config = {"name": "repro", "tables": [
    {"name": "customers", "data": customers, "primary_key": "customer_id",
     "tabular_model_configuration": dp},
    {"name": "orders", "data": orders, "primary_key": "order_id",
     "foreign_keys": [{"column": "customer_id", "referenced_table": "customers", "is_context": True}],
     "tabular_model_configuration": dp},
    {"name": "items", "data": items, "primary_key": "item_id",
     "foreign_keys": [{"column": "order_id", "referenced_table": "orders", "is_context": True}],
     "tabular_model_configuration": dp},
]}

m = MostlyAI(local=True)
g = m.train(config=config, start=True, wait=True, progress_bar=False)
sd = m.generate(g, size={"customers": 200}, start=True, wait=True, progress_bar=False)
print({k: len(v) for k, v in sd.data().items()})

Expected Behavior

A childless parent is a normal, common pattern (customers with no orders, accounts
with no loans, etc.). Generation should handle an empty chunk gracefully (produce
zero child rows for it) rather than crash or hang.
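
One way to get this behavior is to keep the output schema stable when the chunk is empty. The sketch below is a hypothetical stand-in for the previous-value context step, not the SDK's implementation: `add_prev_value` and the column name `orders$prev::channel` are illustrative only.

```python
import pandas as pd


def add_prev_value(ctx: pd.DataFrame, root_key: str, col: str,
                   prev_name: str) -> pd.DataFrame:
    """Add the previous value of `col` within each `root_key` group.

    Hypothetical helper; the result always contains `prev_name`.
    """
    out = ctx.copy()
    if out.empty:
        # Empty chunk: add the column explicitly so later lookups succeed.
        out[prev_name] = pd.Series(dtype=out[col].dtype)
        return out
    out[prev_name] = out.groupby(root_key)[col].shift(1)
    return out


# An empty chunk still yields a frame with the expected column and zero rows.
empty_chunk = pd.DataFrame({
    "customer_id": pd.Series(dtype="int64"),
    "channel": pd.Series(dtype="object"),
})
result = add_prev_value(empty_chunk, "customer_id", "channel", "orders$prev::channel")
print(list(result.columns), len(result))
```

Downstream code can then select the column by name regardless of whether the chunk had any rows.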

Additional Context

mostlyai 6.1.1
mostlyai-engine 2.6.2 
torch 2.11.0 
macOS (LOCAL mode; CPU)
