Rationalize Checkpoint Format Between Python and .NET
Overview
Python and .NET workflows use fundamentally different checkpoint structures and serialization approaches. This makes cross-language checkpoint interoperability impossible and creates inconsistent developer experiences when working with checkpointed workflows across languages.
Current State
Python Checkpoint Format
Structure: Flat, self-contained snapshot (WorkflowCheckpoint dataclass)
- checkpoint_id: UUID string
- workflow_id: String identifier
- timestamp: ISO 8601 string
- messages: Flat dict mapping source executor IDs to lists of message dicts
- shared_state: Single dict with all shared workflow state
- executor_states: Flat dict mapping executor IDs to their state dicts
- iteration_count: Current iteration number
- max_iterations: Configured limit
- metadata: User-provided metadata dict
- version: Format version string ("1.0")
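For orientation, the field list above can be sketched as a dataclass. This is a minimal illustration, not the actual definition in _checkpoint.py: the class name WorkflowCheckpointSketch, the type annotations, and every default value (including max_iterations = 100) are assumptions.

```python
from __future__ import annotations

import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class WorkflowCheckpointSketch:
    """Illustrative stand-in for WorkflowCheckpoint; types and defaults are assumed."""

    checkpoint_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    workflow_id: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    # source executor ID -> list of message dicts
    messages: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    # one dict holding all shared workflow state
    shared_state: dict[str, Any] = field(default_factory=dict)
    # executor ID -> that executor's state dict
    executor_states: dict[str, dict[str, Any]] = field(default_factory=dict)
    iteration_count: int = 0
    max_iterations: int = 100  # hypothetical default
    metadata: dict[str, Any] = field(default_factory=dict)
    version: str = "1.0"


checkpoint = WorkflowCheckpointSketch(workflow_id="demo")
snapshot = asdict(checkpoint)  # plain nested dict, ready for json.dumps
```

Everything lives at the top level of one dict, which is what "flat, self-contained" means here: there is no nested topology or runner section to navigate.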
Message Encoding:
{
"data": {"__af_model__": "module:Class", "value": {...}},
"source_id": "executor_1",
"target_id": "executor_2",
"trace_contexts": [{"traceparent": "..."}],
"source_span_ids": ["span_123"]
}
Serialization Approach:
- Direct JSON serialization via dataclasses.asdict()
- Custom type markers: __af_model__ for models, __af_dataclass__ for dataclasses
- Stores serialized message payload, not message metadata
- No workflow topology information
- No validation on restore
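The marker-based approach described above can be illustrated with a small round trip. This is a sketch under stated assumptions, not the encoder in _runner_context.py: the helper names encode_payload and decode_payload, the Greeting type, the KNOWN_TYPES registry, and the use of a "module:Class" string under __af_dataclass__ (mirroring the __af_model__ example) are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any

DATACLASS_MARKER = "__af_dataclass__"


@dataclass
class Greeting:  # hypothetical payload type
    text: str


# Hypothetical registry; a real decoder might import the module by name instead.
KNOWN_TYPES = {f"{Greeting.__module__}:{Greeting.__qualname__}": Greeting}


def encode_payload(value: Any) -> Any:
    """Wrap dataclass instances in a marker dict; pass other values through."""
    if is_dataclass(value) and not isinstance(value, type):
        key = f"{type(value).__module__}:{type(value).__qualname__}"
        return {DATACLASS_MARKER: key, "value": asdict(value)}
    return value


def decode_payload(data: Any) -> Any:
    """Rebuild a dataclass instance when the marker is present."""
    if isinstance(data, dict) and DATACLASS_MARKER in data:
        cls = KNOWN_TYPES[data[DATACLASS_MARKER]]
        return cls(**data["value"])
    return data


message = {
    "data": encode_payload(Greeting("hello")),
    "source_id": "executor_1",
    "target_id": "executor_2",
}
# Survives a JSON round trip because the marker dict is plain JSON.
round_tripped = decode_payload(json.loads(json.dumps(message))["data"])
```

The marker value is a Python import path, so a reader in another language can see which type was stored but cannot reconstruct it without a separate mapping.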
Storage: CheckpointStorage protocol with file/memory implementations
- Single JSON file per checkpoint: {checkpoint_id}.json
- No index or parent/child relationships
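A file-backed store with this layout might look roughly like the following. The class and method names are hypothetical and do not reproduce the CheckpointStorage protocol; the sketch only shows the one-file-per-checkpoint naming and the consequence of having no index.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class FileCheckpointStorageSketch:
    """Hypothetical store writing one {checkpoint_id}.json per checkpoint."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, checkpoint: dict[str, Any]) -> Path:
        path = self.directory / f"{checkpoint['checkpoint_id']}.json"
        path.write_text(json.dumps(checkpoint), encoding="utf-8")
        return path

    def load(self, checkpoint_id: str) -> dict[str, Any]:
        path = self.directory / f"{checkpoint_id}.json"
        return json.loads(path.read_text(encoding="utf-8"))

    def list_ids(self) -> list[str]:
        # With no index file, discovery means scanning the directory.
        return sorted(p.stem for p in self.directory.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    store = FileCheckpointStorageSketch(Path(tmp))
    store.save({"checkpoint_id": "abc", "version": "1.0"})
    loaded = store.load("abc")
    ids = store.list_ids()
```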
.NET Checkpoint Format
Structure: Hierarchical with workflow topology (Checkpoint class)
- StepNumber: Integer (-1 = initial checkpoint)
- Workflow: Complete workflow topology (WorkflowInfo)
  - Executors: Dictionary of ExecutorInfo (type, ID)
  - Edges: Dictionary of EdgeInfo (source -> targets)
  - RequestPorts: Set of RequestPortInfo
  - InputType: TypeId for workflow input
  - StartExecutorId: String
  - OutputExecutorIds: Set of output executor IDs
- RunnerData: Runtime execution state (RunnerStateData)
  - InstantiatedExecutors: HashSet of executor IDs
  - QueuedMessages: Dictionary of PortableMessageEnvelope lists
  - OutstandingRequests: List of ExternalRequest objects
- StateData: Dictionary of ScopeKey -> PortableValue (executor states)
- EdgeStateData: Dictionary of EdgeId -> PortableValue (edge-specific state)
- Parent: Optional CheckpointInfo reference (parent checkpoint)
Message Encoding:
{
"MessageType": { "Assembly": "...", "FullName": "..." },
"Message": { /* PortableValue with type metadata */ },
"Source": { "ExecutorId": "...", "InstanceId": "..." },
"TargetId": "executor_2"
}
Serialization Approach:
- System.Text.Json with custom converters
- Rich type system: TypeId, ExecutorIdentity, PortableValue
- Stores full message envelope with metadata
- Includes complete workflow topology for validation
- Validates workflow structure on restore via WorkflowInfo.IsMatch()
- Parent/child checkpoint hierarchy support
Storage: ICheckpointStore<TStoreObject> interface
- Separate files: {runId}_{checkpointId}.json for data, index.jsonl for checkpoint index
- Hierarchical relationships tracked via Parent property
Key Differences
Aspect | Python | .NET |
---|---|---|
Structure | Flat, self-contained | Hierarchical with topology |
Workflow Info | Not stored | Full topology in WorkflowInfo |
Message Format | Simple dict with markers | PortableMessageEnvelope with rich metadata |
Type System | String markers (__af_model__) | TypeId, PortableValue, ExecutorIdentity |
Executor Tracking | Just states | Tracks instantiated + states |
State Organization | Flat executor_states dict | Separate StateData + EdgeStateData |
Validation | None on restore | Full topology validation via IsMatch() |
Hierarchy | Single-level | Parent/child via CheckpointInfo |
Storage | Single file per checkpoint | File + index with runId prefix |
Outstanding Requests | In executor state | Explicit OutstandingRequests list |
Problem Scenarios
Scenario 1: Cross-Language Workflow Migration
Problem: A team builds a workflow in Python, checkpoints it during execution, then wants to migrate to .NET for better performance or integration requirements.
Current State: Impossible. The Python checkpoint cannot be loaded into .NET because:
- .NET expects WorkflowInfo topology that doesn't exist in Python checkpoints
- Message format is incompatible (PortableMessageEnvelope vs simple dicts)
- Type encoding is different (TypeId vs string markers)
- .NET validation will fail due to missing topology
Business Impact: Forces complete workflow reruns when switching languages, losing hours or days of progress for long-running workflows.
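To make the message-format gap concrete, the sketch below maps a Python message dict onto the field names of the .NET envelope shown earlier and reports what cannot be filled in. The function to_envelope_shape is hypothetical and is not a working converter: the output is not a valid PortableMessageEnvelope, and the Python "module:Class" marker is not a .NET type name.

```python
from typing import Any


def to_envelope_shape(message: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Best-effort mapping onto .NET envelope field names, plus unfilled fields."""
    data = message.get("data")
    marker = data.get("__af_model__") if isinstance(data, dict) else None
    envelope = {
        # The Python marker is an import path, not a .NET FullName.
        "MessageType": {"Assembly": None, "FullName": marker},
        # Still the Python payload dict, not a PortableValue.
        "Message": data,
        "Source": {"ExecutorId": message.get("source_id"), "InstanceId": None},
        "TargetId": message.get("target_id"),
    }
    # Fields with no counterpart in the Python message dict.
    gaps = ["MessageType.Assembly", "Source.InstanceId"]
    if marker is None:
        gaps.append("MessageType.FullName")
    return envelope, gaps


python_message = {
    "data": {"__af_model__": "module:Class", "value": {}},
    "source_id": "executor_1",
    "target_id": "executor_2",
}
envelope, gaps = to_envelope_shape(python_message)
```

Only the source and target IDs carry over directly; the type information is where a shared type system or an explicit mapping table would be needed.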
Scenario 2: Debugging Cross-Language Checkpoints
Problem: A developer debugging a workflow issue receives a checkpoint file from a colleague using a different language SDK.
Current State: Cannot inspect or analyze the checkpoint because:
- Different field names and structure
- Incompatible type encoding
- .NET checkpoints have extra metadata that Python doesn't understand
- No shared tooling for checkpoint inspection
Business Impact: Wastes developer time recreating issues, and checkpoints cannot be used for debugging across teams that prefer different languages.
Scenario 3: Workflow Topology Validation
Problem: A workflow's executor graph changes during development. In .NET, restored checkpoints fail with validation errors. In Python, they silently continue with mismatched topology.
Current State: Inconsistent behavior:
- .NET: WorkflowInfo.IsMatch() catches topology changes, fails fast
- Python: No validation, potentially runs with wrong graph structure
Business Impact: Python workflows may corrupt state or produce incorrect results when topology changes aren't detected.
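One lightweight way Python could detect topology drift is to store a fingerprint of the executor graph in the checkpoint's metadata dict and compare it on restore. This is a sketch of that idea only, not how WorkflowInfo.IsMatch() works; the function names and the topology_fingerprint metadata key are hypothetical.

```python
import hashlib
import json


def topology_fingerprint(executor_ids: list[str], edges: dict[str, list[str]]) -> str:
    """Hash executor IDs and edges in a canonical, order-independent form."""
    canonical = {
        "executors": sorted(executor_ids),
        "edges": {src: sorted(targets) for src, targets in edges.items()},
    }
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_topology(
    checkpoint: dict, executor_ids: list[str], edges: dict[str, list[str]]
) -> None:
    """Raise if the current graph differs from the one recorded at checkpoint time."""
    expected = checkpoint.get("metadata", {}).get("topology_fingerprint")
    if expected is None:
        return  # checkpoint written without a fingerprint: nothing to compare
    if expected != topology_fingerprint(executor_ids, edges):
        raise ValueError("workflow topology changed since checkpoint was written")


saved = {
    "metadata": {
        "topology_fingerprint": topology_fingerprint(["a", "b", "c"], {"a": ["b", "c"]})
    }
}
# Same graph listed in a different order passes the check.
check_topology(saved, ["c", "b", "a"], {"a": ["c", "b"]})
```

A hash only says that something changed, not what; a fuller design would store the topology itself, as the .NET format does, so the mismatch can be reported precisely.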
Scenario 4: Multi-Language Workflow Ecosystem
Problem: An organization wants to build a workflow marketplace or shared workflow library where users can use Python or .NET interchangeably.
Current State: Impossible due to checkpoint incompatibility. Each language creates isolated ecosystems. If a workflow needs to be paused and resumed, language choice is locked in.
Business Impact: Reduces ecosystem value, forces language lock-in, and limits flexibility.
Scenario 5: Hierarchical Checkpoint Analysis
Problem: A long-running workflow creates multiple checkpoints over days. A developer wants to understand the checkpoint lineage and potentially restore to an earlier point in the chain.
Current State:
- .NET: Parent property enables checkpoint tree navigation
- Python: No parent/child relationships, flat list only
Business Impact: Python users cannot leverage advanced checkpoint features like branching restore or tree visualization.
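Lineage navigation needs little more than a parent pointer per checkpoint. The sketch below walks such pointers over JSON-lines records; the record shape (checkpoint_id and parent_id keys) is hypothetical and is not the actual layout of the .NET index.jsonl file.

```python
import json


def load_index(jsonl_text: str) -> dict[str, dict]:
    """Parse JSON-lines records into a dict keyed by checkpoint ID."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return {record["checkpoint_id"]: record for record in records}


def lineage(index: dict[str, dict], checkpoint_id: str) -> list[str]:
    """Return checkpoint IDs from the given checkpoint back to the root."""
    chain: list[str] = []
    seen: set[str] = set()
    current = checkpoint_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = index[current].get("parent_id")
    return chain


index_text = """
{"checkpoint_id": "c1", "parent_id": null}
{"checkpoint_id": "c2", "parent_id": "c1"}
{"checkpoint_id": "c3", "parent_id": "c2"}
{"checkpoint_id": "c2b", "parent_id": "c1"}
"""
index = load_index(index_text)
chain = lineage(index, "c3")  # ['c3', 'c2', 'c1']
```

Here c2b branches from c1 alongside c2, which is the kind of tree a flat list of checkpoints cannot express.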
Scenario 6: Request/Response Tracking
Problem: Workflows with human-in-the-loop or external API calls need to track pending requests that haven't received responses yet.
Current State:
- .NET: Explicit OutstandingRequests list in RunnerData
- Python: Buried in executor state, no standard location
Business Impact: Python workflows may lose request tracking on restore, and .NET cannot consume Python checkpoints with pending requests.
Questions for Resolution
- Should we aim for full format compatibility (same JSON schema) or just semantic compatibility (convertible)?
- Does Python need workflow topology validation on checkpoint restore?
- Should Python adopt hierarchical checkpoints with parent/child relationships?
- Can we define a common type system that works in both languages?
- Should we version the checkpoint format and support multiple versions?
- Do we need a migration tool for existing checkpoints?
- Should outstanding requests be first-class in both languages?
- What's the migration path for users with existing checkpoints?
Related Code
Python:
- python/packages/core/agent_framework/_workflows/_checkpoint.py - WorkflowCheckpoint dataclass
- python/packages/core/agent_framework/_workflows/_runner_context.py - CheckpointState, encoding/decoding
- python/samples/getting_started/workflows/checkpoint/checkpoint_with_resume.py - checkpoint usage
.NET:
- dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/Checkpoint.cs - main checkpoint structure
- dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/WorkflowInfo.cs - topology validation
- dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/PortableMessageEnvelope.cs - message format
- dotnet/src/Microsoft.Agents.AI.Workflows/Execution/RunnerStateData.cs - runtime state
- dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/FileSystemJsonCheckpointStore.cs - storage