.NET: Rationalize Checkpoint Format Between Python and .NET #1443

@alliscode

Rationalize Checkpoint Format Between Python and .NET

Overview

Python and .NET workflows use fundamentally different checkpoint structures and serialization approaches. This makes cross-language checkpoint interoperability impossible and creates inconsistent developer experiences when working with checkpointed workflows across languages.

Current State

Python Checkpoint Format

Structure: Flat, self-contained snapshot (WorkflowCheckpoint dataclass)

  • checkpoint_id: UUID string
  • workflow_id: String identifier
  • timestamp: ISO 8601 string
  • messages: Flat dict mapping source executor IDs to list of message dicts
  • shared_state: Single dict with all shared workflow state
  • executor_states: Flat dict mapping executor IDs to their state dicts
  • iteration_count: Current iteration number
  • max_iterations: Configured limit
  • metadata: User-provided metadata dict
  • version: Format version string ("1.0")
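
As a rough illustration of the flat shape listed above, here is a minimal sketch of such a dataclass. The class name WorkflowCheckpointSketch, the field types, and the defaults (including max_iterations = 100) are assumptions made for this example; only the field names come from the list above, and the real WorkflowCheckpoint may differ.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


def _utc_now_iso() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class WorkflowCheckpointSketch:
    # Field names follow the list above; types and defaults are assumptions.
    checkpoint_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    workflow_id: str = ""
    timestamp: str = field(default_factory=_utc_now_iso)
    messages: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    shared_state: dict[str, Any] = field(default_factory=dict)
    executor_states: dict[str, dict[str, Any]] = field(default_factory=dict)
    iteration_count: int = 0
    max_iterations: int = 100  # assumed default, not taken from the SDK
    metadata: dict[str, Any] = field(default_factory=dict)
    version: str = "1.0"


checkpoint = WorkflowCheckpointSketch(
    workflow_id="demo_workflow",
    messages={
        "executor_1": [
            {"data": "hello", "source_id": "executor_1", "target_id": "executor_2"}
        ]
    },
    executor_states={"executor_2": {"seen": 1}},
    iteration_count=3,
)

# asdict() yields one flat, self-contained dict that maps directly to JSON.
flat = asdict(checkpoint)
serialized = json.dumps(flat, indent=2)
```

Everything needed to resume lives in this single dict, which is what makes the format easy to write but leaves no room for topology or lineage information.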

Message Encoding:

{
  "data": {"__af_model__": "module:Class", "value": {...}},
  "source_id": "executor_1",
  "target_id": "executor_2",
  "trace_contexts": [{"traceparent": "..."}],
  "source_span_ids": ["span_123"]
}

Serialization Approach:

  • Direct JSON serialization via dataclasses.asdict()
  • Custom type markers: __af_model__ for models, __af_dataclass__ for dataclasses
  • Stores serialized message payload, not message metadata
  • No workflow topology information
  • No validation on restore
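
The marker-based approach above could look roughly like the following sketch. It assumes that __af_dataclass__ wraps values in the same {marker: "module:Class", "value": {...}} shape shown for __af_model__; the helper names (type_key, encode_value, decode_value), the explicit registry, and the ReviewRequest payload are hypothetical and are not the SDK's actual encoder.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any

DATACLASS_MARKER = "__af_dataclass__"  # marker name taken from the issue text


def type_key(cls: type) -> str:
    """Build the 'module:Class' string stored next to the marker."""
    return f"{cls.__module__}:{cls.__qualname__}"


def encode_value(value: Any) -> Any:
    """Recursively replace dataclass instances with marker-tagged dicts."""
    if is_dataclass(value) and not isinstance(value, type):
        payload = {f.name: encode_value(getattr(value, f.name)) for f in fields(value)}
        return {DATACLASS_MARKER: type_key(type(value)), "value": payload}
    if isinstance(value, dict):
        return {k: encode_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [encode_value(v) for v in value]
    return value


def decode_value(value: Any, registry: dict[str, type]) -> Any:
    """Rebuild dataclass instances; the registry maps type keys to classes."""
    if isinstance(value, dict):
        if DATACLASS_MARKER in value:
            cls = registry[value[DATACLASS_MARKER]]
            kwargs = {k: decode_value(v, registry) for k, v in value["value"].items()}
            return cls(**kwargs)
        return {k: decode_value(v, registry) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_value(v, registry) for v in value]
    return value


@dataclass
class ReviewRequest:  # hypothetical message payload
    document_id: str
    reviewers: list[str]


original = ReviewRequest(document_id="doc-7", reviewers=["ana", "li"])
encoded = encode_value(original)

# Round-trip through JSON text to mimic writing and reading a checkpoint.
restored = decode_value(
    json.loads(json.dumps(encoded)),
    {type_key(ReviewRequest): ReviewRequest},
)
```

Note that the type key is a Python module path, which is one reason a .NET reader cannot resolve these payloads without a shared type system.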

Storage: CheckpointStorage protocol with file/memory implementations

  • Single JSON file per checkpoint: {checkpoint_id}.json
  • No index or parent/child relationships
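
A file-backed store following the one-file-per-checkpoint layout above might be sketched as below. The function names are hypothetical and do not reflect the actual CheckpointStorage protocol methods; the only detail taken from the description is the {checkpoint_id}.json file name.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_checkpoint(directory: Path, checkpoint: dict[str, Any]) -> Path:
    """Write one checkpoint to <directory>/<checkpoint_id>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{checkpoint['checkpoint_id']}.json"
    path.write_text(json.dumps(checkpoint), encoding="utf-8")
    return path


def load_checkpoint(directory: Path, checkpoint_id: str) -> dict[str, Any]:
    """Read a checkpoint back by ID."""
    path = directory / f"{checkpoint_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def list_checkpoint_ids(directory: Path) -> list[str]:
    """With no index file, discovery means scanning the directory."""
    return sorted(p.stem for p in directory.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    saved_path = save_checkpoint(
        store, {"checkpoint_id": "cp-001", "workflow_id": "demo", "version": "1.0"}
    )
    saved_name = saved_path.name
    loaded = load_checkpoint(store, "cp-001")
    listed = list_checkpoint_ids(store)
```

Because there is no index, listing checkpoints requires a directory scan, and nothing in the file name ties a checkpoint to a run or to a predecessor.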

.NET Checkpoint Format

Structure: Hierarchical with workflow topology (Checkpoint class)

  • StepNumber: Integer (-1 = initial checkpoint)
  • Workflow: Complete workflow topology (WorkflowInfo)
    • Executors: Dictionary of ExecutorInfo (type, ID)
    • Edges: Dictionary of EdgeInfo (source -> targets)
    • RequestPorts: Set of RequestPortInfo
    • InputType: TypeId for workflow input
    • StartExecutorId: String
    • OutputExecutorIds: Set of output executor IDs
  • RunnerData: Runtime execution state (RunnerStateData)
    • InstantiatedExecutors: HashSet of executor IDs
    • QueuedMessages: Dictionary of PortableMessageEnvelope lists
    • OutstandingRequests: List of ExternalRequest objects
  • StateData: Dictionary of ScopeKey -> PortableValue (executor states)
  • EdgeStateData: Dictionary of EdgeId -> PortableValue (edge-specific state)
  • Parent: Optional CheckpointInfo reference (parent checkpoint)

Message Encoding:

{
  "MessageType": { "Assembly": "...", "FullName": "..." },
  "Message": { /* PortableValue with type metadata */ },
  "Source": { "ExecutorId": "...", "InstanceId": "..." },
  "TargetId": "executor_2"
}

Serialization Approach:

  • System.Text.Json with custom converters
  • Rich type system: TypeId, ExecutorIdentity, PortableValue
  • Stores full message envelope with metadata
  • Includes complete workflow topology for validation
  • Validates workflow structure on restore via WorkflowInfo.IsMatch()
  • Parent/child checkpoint hierarchy support

Storage: ICheckpointStore<TStoreObject> interface

  • Separate files: {runId}_{checkpointId}.json for data
  • index.jsonl for checkpoint index
  • Hierarchical relationships tracked via Parent property

Key Differences

| Aspect | Python | .NET |
|---|---|---|
| Structure | Flat, self-contained | Hierarchical with topology |
| Workflow Info | Not stored | Full topology in WorkflowInfo |
| Message Format | Simple dict with markers | PortableMessageEnvelope with rich metadata |
| Type System | String markers (__af_model__) | TypeId, PortableValue, ExecutorIdentity |
| Executor Tracking | Just states | Tracks instantiated + states |
| State Organization | Flat executor_states dict | Separate StateData + EdgeStateData |
| Validation | None on restore | Full topology validation via IsMatch() |
| Hierarchy | Single-level | Parent/child via CheckpointInfo |
| Storage | Single file per checkpoint | File + index with runId prefix |
| Outstanding Requests | In executor state | Explicit OutstandingRequests list |

Problem Scenarios

Scenario 1: Cross-Language Workflow Migration

Problem: A team builds a workflow in Python, checkpoints it during execution, then wants to migrate to .NET for better performance or integration requirements.

Current State: Impossible. The Python checkpoint cannot be loaded into .NET because:

  • .NET expects WorkflowInfo topology that doesn't exist in Python checkpoints
  • Message format is incompatible (PortableMessageEnvelope vs simple dicts)
  • Type encoding is different (TypeId vs string markers)
  • .NET validation will fail due to missing topology

Business Impact: Forces complete workflow reruns when switching languages, losing hours or days of progress for long-running workflows.
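
To make the gap concrete, the sketch below maps a Python message dict onto the .NET envelope field names shown earlier and records what cannot be filled in. This is not a working converter: to_envelope_shape is a hypothetical helper, the sample payload type my_pkg.models:Review is invented, and the envelope is only a plain dict using the field names from the .NET example, not a real PortableMessageEnvelope.

```python
from typing import Any

# Sample message in the Python encoding; the payload type is hypothetical.
python_message = {
    "data": {"__af_model__": "my_pkg.models:Review", "value": {"score": 4}},
    "source_id": "executor_1",
    "target_id": "executor_2",
    "trace_contexts": [{"traceparent": "..."}],
    "source_span_ids": ["span_123"],
}


def to_envelope_shape(message: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Map what can be mapped and list every field left without a source."""
    envelope = {
        # Python stores a 'module:Class' marker, not a .NET assembly/type name.
        "MessageType": {"Assembly": None, "FullName": None},
        # The payload is copied as-is; it is NOT a PortableValue.
        "Message": message.get("data"),
        "Source": {"ExecutorId": message.get("source_id"), "InstanceId": None},
        "TargetId": message.get("target_id"),
    }
    gaps = [
        "MessageType.Assembly: no Python counterpart",
        "MessageType.FullName: Python marker is a module path, not a .NET type",
        "Source.InstanceId: not recorded in the Python message",
        "Message: payload uses Python markers instead of PortableValue metadata",
    ]
    # Python-only fields have nowhere to go in the envelope shape shown above.
    for key in ("trace_contexts", "source_span_ids"):
        if key in message:
            gaps.append(f"{key}: no slot in the envelope shape shown above")
    return envelope, gaps


envelope, gaps = to_envelope_shape(python_message)
for gap in gaps:
    print("unmapped:", gap)
```

Only the executor IDs carry over cleanly; every other field needs either a shared type system or a decision about where the information should live.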

Scenario 2: Debugging Cross-Language Checkpoints

Problem: A developer debugging a workflow issue receives a checkpoint file from a colleague using a different language SDK.

Current State: Cannot inspect or analyze the checkpoint because:

  • Different field names and structure
  • Incompatible type encoding
  • .NET checkpoints have extra metadata that Python doesn't understand
  • No shared tooling for checkpoint inspection

Business Impact: Developers waste time recreating issues and cannot use checkpoints for debugging across teams that use different language SDKs.

Scenario 3: Workflow Topology Validation

Problem: A workflow's executor graph changes during development. In .NET, restored checkpoints fail with validation errors. In Python, they silently continue with mismatched topology.

Current State: Inconsistent behavior:

  • .NET: WorkflowInfo.IsMatch() catches topology changes, fails fast
  • Python: No validation, potentially runs with wrong graph structure

Business Impact: Python workflows may corrupt state or produce incorrect results when topology changes aren't detected.
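
One way Python could fail fast here is to store a small topology snapshot with each checkpoint and compare it on restore. The sketch below is a hypothetical Python analogue, not a port of WorkflowInfo.IsMatch(); the TopologySnapshot class, its fields, and check_topology are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologySnapshot:
    """Hypothetical minimal topology record stored alongside a checkpoint."""

    executors: dict[str, str]         # executor ID -> executor type name
    edges: dict[str, frozenset[str]]  # source executor ID -> target IDs
    start_executor_id: str

    def diff(self, other: "TopologySnapshot") -> list[str]:
        """Return human-readable differences; an empty list means a match."""
        problems: list[str] = []
        if self.start_executor_id != other.start_executor_id:
            problems.append("start executor differs")
        for executor_id in sorted(set(self.executors) | set(other.executors)):
            if self.executors.get(executor_id) != other.executors.get(executor_id):
                problems.append(f"executor '{executor_id}' differs")
        for source in sorted(set(self.edges) | set(other.edges)):
            if self.edges.get(source, frozenset()) != other.edges.get(source, frozenset()):
                problems.append(f"edges from '{source}' differ")
        return problems


def check_topology(saved: TopologySnapshot, current: TopologySnapshot) -> None:
    """Raise instead of silently resuming on a changed graph."""
    problems = saved.diff(current)
    if problems:
        raise ValueError("checkpoint topology mismatch: " + "; ".join(problems))


saved = TopologySnapshot(
    executors={"a": "Splitter", "b": "Worker"},
    edges={"a": frozenset({"b"})},
    start_executor_id="a",
)
changed = TopologySnapshot(
    executors={"a": "Splitter", "b": "Worker", "c": "Auditor"},
    edges={"a": frozenset({"b", "c"})},
    start_executor_id="a",
)

check_topology(saved, saved)  # identical graphs pass silently
try:
    check_topology(saved, changed)
    mismatch_message = ""
except ValueError as exc:
    mismatch_message = str(exc)
```

Returning a list of differences rather than a bare boolean gives the developer something actionable when a restore is rejected.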

Scenario 4: Multi-Language Workflow Ecosystem

Problem: An organization wants to build a workflow marketplace or shared workflow library where users can use Python or .NET interchangeably.

Current State: Impossible due to checkpoint incompatibility. Each language creates isolated ecosystems. If a workflow needs to be paused and resumed, language choice is locked in.

Business Impact: Reduces ecosystem value and forces language lock-in: once a workflow has been checkpointed, it cannot be resumed in the other SDK.

Scenario 5: Hierarchical Checkpoint Analysis

Problem: A long-running workflow creates multiple checkpoints over days. A developer wants to understand the checkpoint lineage and potentially restore to an earlier point in the chain.

Current State:

  • .NET: Parent property enables checkpoint tree navigation
  • Python: No parent/child relationships, flat list only

Business Impact: Python users cannot leverage advanced checkpoint features like branching restore or tree visualization.
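
If Python checkpoints recorded a parent reference, lineage navigation would reduce to walking that chain. The sketch below assumes a hypothetical mapping from checkpoint ID to parent ID (None for the initial checkpoint); neither the mapping nor the lineage helper exists in the current Python format.

```python
from typing import Optional


def lineage(checkpoint_id: str, parents: dict[str, Optional[str]]) -> list[str]:
    """Return the chain of checkpoint IDs from the root down to checkpoint_id."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = checkpoint_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at checkpoint '{current}'")
        if current not in parents:
            raise KeyError(f"unknown checkpoint '{current}'")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    chain.reverse()  # collected leaf-to-root, reported root-to-leaf
    return chain


# Hypothetical parent links: cp-2 and cp-2b both branch from cp-1.
parents = {"cp-0": None, "cp-1": "cp-0", "cp-2": "cp-1", "cp-2b": "cp-1"}

main_chain = lineage("cp-2", parents)
branch_chain = lineage("cp-2b", parents)
```

The two chains share the prefix up to cp-1, which is the information a branching restore or a tree visualization would need.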

Scenario 6: Request/Response Tracking

Problem: Workflows with human-in-the-loop or external API calls need to track pending requests that haven't received responses yet.

Current State:

  • .NET: Explicit OutstandingRequests list in RunnerData
  • Python: Buried in executor state, no standard location

Business Impact: Python workflows may lose request tracking on restore, and .NET cannot consume Python checkpoints that contain pending requests.

Questions for Resolution

  1. Should we aim for full format compatibility (same JSON schema) or just semantic compatibility (convertible)?
  2. Does Python need workflow topology validation on checkpoint restore?
  3. Should Python adopt hierarchical checkpoints with parent/child relationships?
  4. Can we define a common type system that works in both languages?
  5. Should we version the checkpoint format and support multiple versions?
  6. Do we need a migration tool for existing checkpoints?
  7. Should outstanding requests be first-class in both languages?
  8. What's the migration path for users with existing checkpoints?

Related Code

Python:

  • python/packages/core/agent_framework/_workflows/_checkpoint.py - WorkflowCheckpoint dataclass
  • python/packages/core/agent_framework/_workflows/_runner_context.py - CheckpointState, encoding/decoding
  • python/samples/getting_started/workflows/checkpoint/checkpoint_with_resume.py - checkpoint usage

.NET:

  • dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/Checkpoint.cs - main checkpoint structure
  • dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/WorkflowInfo.cs - topology validation
  • dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/PortableMessageEnvelope.cs - message format
  • dotnet/src/Microsoft.Agents.AI.Workflows/Execution/RunnerStateData.cs - runtime state
  • dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/FileSystemJsonCheckpointStore.cs - storage


Labels: .NET, python, squad: workflows (Agent Framework Workflows Squad), workflows (Related to Workflows in agent-framework)
