### Bug Description
On a 3-level hierarchical schema (grandparent → parent → child) in LOCAL mode with differential privacy enabled, generation intermittently fails when many parent rows have no children. The same config can succeed on one run and fail on the next, because it depends on whether a generation batch happens to contain only childless parents (→ an empty chunk). The failure surfaces as one of several errors, all rooted in empty-chunk handling in the multi-table generation path. In one case the SDK also hung indefinitely (the wait-loop never returned after the worker subprocess died).
### Python Version
Python 3.11
### Steps to Reproduce
Data shape that triggers it
A 3-level chain where a large fraction of grandparents have zero parent rows
(e.g. ~54% of "accounts" have no "mortgages"), so during generation some batches
produce zero rows for the parent table.
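To check whether a dataset has this shape before training, the childless rate can be computed from the key columns alone. The following is a minimal sketch in plain Python. The helper name `childless_rate` and the toy IDs are illustrative assumptions, not part of the SDK or of the original data.

```python
def childless_rate(parent_ids, child_foreign_keys):
    """Fraction of parent IDs that never appear as a foreign key in the child table."""
    parents = set(parent_ids)
    if not parents:
        return 0.0
    referenced = parents & set(child_foreign_keys)
    return (len(parents) - len(referenced)) / len(parents)


# Hypothetical toy data: 5 accounts, only accounts 1 and 4 have mortgages.
accounts = [1, 2, 3, 4, 5]
mortgage_account_ids = [1, 1, 4]

rate = childless_rate(accounts, mortgage_account_ids)
print(f"{rate:.0%} of accounts have no mortgages")
```

A high rate at the top of the chain makes it more likely that a generation batch contains only childless rows.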
Errors observed (all on the same schema; which one appears varies)
pyarrow.lib.ArrowTypeError: Array type doesn't match type of values set: string vs null
- in mostlyai/sdk/_data/context.py::add_gpc_context, at the read_data_prefixed(..., where={gp_pk: chunk[parent_context_key]}) call.
- Confirmed by instrumentation: at crash time the filter Series is len=0, dtype=string, i.e. an empty IN-filter, which pyarrow infers as null-typed and cannot match against the string key column.
KeyError: '<parent>$prev::<col>'
- in mostlyai/sdk/_data/context.py::add_ns_context (~line 121), at ctx_data[[root_key, cur_name]].groupby(root_key).apply(add_previous_values_ctx)[prev_name]
- the groupby/apply on empty data returns a frame without prev_name.
RuntimeError: Cannot pack empty tensors
- in DP-SGD training (opacus) via mostlyai/engine/_tabular/training.py, when value_protection=False (sequence-length protection no longer masks empty sequences).
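One possible shape for an empty-chunk guard around the KeyError case is sketched below. This is not the SDK's code: the helper `add_prev_column` is hypothetical, the internals of `add_previous_values_ctx` are not shown in this report, and a per-group `shift(1)` is used here only as an assumed stand-in for "previous value within the group".

```python
import pandas as pd


def add_prev_column(ctx_data, root_key, cur_name, prev_name):
    # Hypothetical stand-in for the groupby/apply step; not the SDK's code.
    out = ctx_data[[root_key, cur_name]].copy()
    if out.empty:
        # Empty chunk: add the expected column explicitly so that a later
        # out[prev_name] lookup cannot raise KeyError.
        out[prev_name] = out[cur_name]
        return out
    # Non-empty chunk: previous value of cur_name within each root_key group.
    out[prev_name] = out.groupby(root_key)[cur_name].shift(1)
    return out


empty_chunk = pd.DataFrame({"account_id": [], "status": []})
empty_result = add_prev_column(empty_chunk, "account_id", "status", "status_prev")

filled_chunk = pd.DataFrame(
    {"account_id": ["a", "a", "b"], "status": ["new", "open", "new"]}
)
filled_result = add_prev_column(filled_chunk, "account_id", "status", "status_prev")
```

The point of the sketch is only that the output schema stays the same whether or not the chunk has rows, so downstream column lookups do not depend on batch contents.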
Minimal reproduction (representative — generic data, high childless rate)
import pandas as pd
from mostlyai.sdk import MostlyAI
# grandparent: 200 customers, but only the first 60 ever have orders -> 140 childless
customers = pd.DataFrame({"customer_id": range(1, 201), "segment": (["A", "B"] * 100)})
orders = pd.DataFrame({
"order_id": range(1, 121),
"customer_id": [(i % 60) + 1 for i in range(120)], # only customers 1..60
"channel": (["web", "store"] * 60),
})
items = pd.DataFrame({
"item_id": range(1, 301),
"order_id": [(i % 120) + 1 for i in range(300)],
"qty": [1 + (i % 4) for i in range(300)],
})
dp = {"differential_privacy": {"max_epsilon": 1.0, "delta": 1e-6}}
config = {"name": "repro", "tables": [
{"name": "customers", "data": customers, "primary_key": "customer_id",
"tabular_model_configuration": dp},
{"name": "orders", "data": orders, "primary_key": "order_id",
"foreign_keys": [{"column": "customer_id", "referenced_table": "customers", "is_context": True}],
"tabular_model_configuration": dp},
{"name": "items", "data": items, "primary_key": "item_id",
"foreign_keys": [{"column": "order_id", "referenced_table": "orders", "is_context": True}],
"tabular_model_configuration": dp},
]}
m = MostlyAI(local=True)
g = m.train(config=config, start=True, wait=True, progress_bar=False)
sd = m.generate(g, size={"customers": 200}, start=True, wait=True, progress_bar=False)
print({k: len(v) for k, v in sd.data().items()})
### Expected Behavior
A childless parent is a normal, common pattern (customers with no orders, accounts
with no loans, etc.). Generation should handle an empty chunk gracefully (produce
zero child rows for it) rather than crash or hang.
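For the hang specifically, a wait-loop that polls the worker process would surface a crash as an error instead of blocking forever. The sketch below uses only the standard library. `wait_for_worker` and `WorkerDied` are hypothetical names, and how the SDK actually supervises its worker is an assumption here, not something confirmed from its source.

```python
import subprocess
import sys
import time


class WorkerDied(RuntimeError):
    """Raised when the worker exits with a non-zero code."""


def wait_for_worker(proc, poll_interval=0.05, timeout=30.0):
    # Hypothetical wait-loop: poll the worker process so that a dead
    # subprocess is reported rather than awaited indefinitely.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # None while the process is still running
        if code is None:
            time.sleep(poll_interval)
            continue
        if code != 0:
            raise WorkerDied(f"worker exited with code {code}")
        return code
    proc.kill()
    proc.wait()  # reap the killed process
    raise TimeoutError(f"worker still running after {timeout} seconds")


# Simulate a worker that crashes immediately (stand-in for an empty-chunk failure).
crashed = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
try:
    wait_for_worker(crashed)
    outcome = "finished"
except WorkerDied as exc:
    outcome = str(exc)
print(outcome)
```

The timeout is a second line of defense: even if the worker neither finishes nor dies, the caller gets control back.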
### Additional Context
mostlyai 6.1.1
mostlyai-engine 2.6.2
torch 2.11.0
macOS (LOCAL mode; CPU)