Python: Define Workflow Concurrency Semantics #1444

@alliscode

Define Workflow Concurrency Semantics

Problem

Workflows run operations concurrently in several scenarios (fan-out edges, multiple source executors, shared state access), but the concurrency guarantees, ordering, and thread-safety characteristics are neither explicitly documented nor consistently defined across Python and .NET. This creates:

  • Ambiguity for developers - Unclear what ordering and atomicity guarantees exist
  • Risk of race conditions - Developers may write incorrect concurrent code without realizing it
  • Non-deterministic behavior - Workflows may behave differently on each run
  • Inconsistent implementations - Python and .NET may have different concurrency semantics
  • Debugging difficulties - Hard to reproduce issues without clear concurrency rules

Current State

Python:

  • Superstep-based execution with asyncio.gather() for concurrent delivery
  • SharedState uses asyncio.Lock with hold() context manager for multi-operation atomicity
  • Message ordering within supersteps: unspecified
  • Fan-out delivery: concurrent with no ordering guarantees
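To make the "no ordering guarantees" point concrete, here is a minimal sketch of concurrent delivery with asyncio.gather(). The deliver() helper, the target names, and the delays are hypothetical stand-ins, not the framework's edge runner. It shows that gather() returns results in the order the awaitables were passed, while the order in which the targets actually finish depends on how long each one takes.

```python
import asyncio


async def deliver(target: str, delay: float, completed: list[str]) -> str:
    # Hypothetical stand-in for delivering one message to one target executor.
    await asyncio.sleep(delay)
    completed.append(target)  # record the order in which targets finish
    return target


async def fan_out() -> tuple[list[str], list[str]]:
    completed: list[str] = []
    # Dispatch order is a, b, c; the delays make them finish in reverse.
    targets = [("a", 0.15), ("b", 0.10), ("c", 0.05)]
    results = await asyncio.gather(
        *(deliver(name, delay, completed) for name, delay in targets)
    )
    return list(results), completed


results, completed = asyncio.run(fan_out())
print("gather result order:", results)
print("completion order:   ", completed)
```

Any side effect a target performs while running (for example, writing to shared state) follows the completion order, not the dispatch order, which is why the ordering question matters for fan-out.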

.NET:

  • Superstep-based execution with configurable modes (OffThread, Lockstep, Subworkflow)
  • Task.WhenAll() for concurrent fan-out
  • Shared state synchronization: not explicitly documented
  • Message ordering: unspecified across execution modes

Desired End State

Well-Defined Concurrency Model

A documented concurrency model that clearly specifies:

  1. Message Ordering Guarantees

    • Within a single source → target path
    • From multiple sources to one target in the same superstep
    • During fan-out to multiple targets
    • During fan-in from multiple sources
  2. Shared State Semantics

    • What operations are atomic
    • What locking patterns are required for read-modify-write operations
    • Whether locks span multiple operations or only individual calls
    • How to perform multi-operation atomic blocks
  3. Parallelism Guarantees

    • Whether fan-out execution is guaranteed concurrent or implementation-defined
    • Whether executors in the same superstep can run in parallel
    • What happens when one executor blocks
  4. Superstep Boundary Semantics

    • When exactly a superstep completes
    • Whether all executors must finish before the boundary
    • What state is visible at boundaries
    • Checkpointing consistency guarantees
  5. Determinism vs Performance Trade-offs

    • Whether message processing order is deterministic
    • Whether workflows can be made reproducible
    • Performance implications of ordering guarantees
  6. Deadlock Prevention

    • Whether circular message dependencies are allowed
    • How to detect and handle deadlock scenarios
    • Runtime enforcement mechanisms
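One possible reading of the superstep boundary and deadlock items above can be sketched as a fully synchronous loop: every executor with pending messages runs concurrently, messages sent during superstep N only become visible in superstep N+1, and a superstep cap acts as a crude guard against message cycles. This is an illustration under those assumptions, not the behavior of _runner.py; run_supersteps, Handler, and the example executors are all hypothetical names.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

# Hypothetical handler type: takes the messages delivered to one executor
# and returns the (target, message) pairs it wants to send.
Handler = Callable[[list[str]], Awaitable[list[tuple[str, str]]]]


async def run_supersteps(
    handlers: dict[str, Handler],
    initial: dict[str, list[str]],
    max_supersteps: int = 10,
) -> int:
    inbox = {name: list(msgs) for name, msgs in initial.items()}
    steps = 0
    while inbox:
        if steps >= max_supersteps:
            raise RuntimeError("superstep limit reached; possible message cycle")
        names = sorted(inbox)
        # Barrier: gather() returns only after every executor has finished.
        outputs = await asyncio.gather(*(handlers[n](inbox[n]) for n in names))
        # Messages sent in this superstep are buffered for the next one.
        next_inbox: dict[str, list[str]] = defaultdict(list)
        for sent in outputs:
            for target, message in sent:
                next_inbox[target].append(message)
        inbox = dict(next_inbox)
        steps += 1
    return steps


log: list[str] = []


async def start(msgs: list[str]) -> list[tuple[str, str]]:
    return [("left", m) for m in msgs] + [("right", m) for m in msgs]


async def worker(msgs: list[str]) -> list[tuple[str, str]]:
    return [("sink", m.upper()) for m in msgs]


async def sink(msgs: list[str]) -> list[tuple[str, str]]:
    log.extend(sorted(msgs))
    return []


async def ping(msgs: list[str]) -> list[tuple[str, str]]:
    return [("ping", m) for m in msgs]  # sends to itself forever


handlers = {"start": start, "left": worker, "right": worker, "sink": sink}
steps = asyncio.run(run_supersteps(handlers, {"start": ["x"]}))
print(steps, log)
```

Running the same loop with only the ping executor never drains the inbox, so the cap raises RuntimeError. A cap cannot distinguish a long-running workflow from a genuine cycle, which is part of what the deadlock question needs to settle.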

Documentation Requirements

For Developers:

  • Clear API documentation for SharedState operations and locking requirements
  • Examples showing correct concurrent patterns
  • Anti-patterns and common pitfalls
  • Guidelines for when atomicity is needed

For Framework:

  • Specification of execution semantics that both languages must implement
  • Test cases demonstrating expected concurrent behavior
  • Decision on whether Python and .NET must have identical semantics or language-appropriate patterns

Real-World Impact Scenarios

Without clear concurrency semantics:

  1. Lost Updates - Multiple executors increment a shared counter, and some updates are lost to race conditions
  2. Non-Deterministic Results - An aggregator receives messages in arbitrary order, so its output can differ between runs
  3. Checkpoint Corruption - Checkpoint occurs mid-execution with partially updated shared state
  4. Workflow Hangs - Circular message dependencies cause deadlock
  5. Performance Confusion - Developer expects parallelism but sees serial execution
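The lost-update scenario can be reproduced in a few lines. The sketch below uses a hypothetical DemoSharedState class, not the framework's SharedState; its get(), set(), and hold() are simplified assumptions modeled on the asyncio.Lock plus hold() pattern described under Current State. The await points inside get() and set() let other tasks interleave between the read and the write.

```python
import asyncio
from contextlib import asynccontextmanager


class DemoSharedState:
    """Hypothetical stand-in for illustration; not the framework's SharedState."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}
        self._lock = asyncio.Lock()

    async def get(self, key: str, default: int = 0) -> int:
        await asyncio.sleep(0)  # yield point: other tasks may run here
        return self._data.get(key, default)

    async def set(self, key: str, value: int) -> None:
        await asyncio.sleep(0)  # yield point: other tasks may run here
        self._data[key] = value

    @asynccontextmanager
    async def hold(self):
        # Keeps the lock for the whole block, spanning several operations.
        async with self._lock:
            yield


async def unsafe_increment(state: DemoSharedState) -> None:
    value = await state.get("n")      # read
    await state.set("n", value + 1)   # write; another task may have read in between


async def safe_increment(state: DemoSharedState) -> None:
    async with state.hold():          # read-modify-write as one atomic block
        value = await state.get("n")
        await state.set("n", value + 1)


async def run(increment) -> int:
    state = DemoSharedState()
    await asyncio.gather(*(increment(state) for _ in range(10)))
    return await state.get("n")


unsafe_total = asyncio.run(run(unsafe_increment))
safe_total = asyncio.run(run(safe_increment))
print("without hold():", unsafe_total)
print("with hold():   ", safe_total)
```

In this sketch, making each individual get() or set() atomic would not help: the race is between the read and the write, so the lock has to span the whole read-modify-write sequence.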

Questions

  1. Should message ordering be guaranteed within supersteps? If so, how (insertion order, timestamp, explicit priority)?
  2. Should shared state operations document required locking patterns for atomic updates?
  3. Should fan-out guarantee concurrent execution, or allow runtime flexibility?
  4. Should superstep boundaries be fully synchronous (all executors complete) or allow streaming?
  5. Should workflows provide determinism modes (reproducible order vs maximum parallelism)?
  6. How should deadlock scenarios be detected and handled?
  7. Should .NET and Python have identical concurrency semantics, or language-appropriate patterns?
  8. Should concurrency semantics be part of the workflow specification or an implementation detail?
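For questions 1 and 5, one option worth weighing is to buffer the messages that arrive during a superstep and sort them by a stable key before delivery. The sketch below assumes each message carries a source id and a per-source sequence number; the Envelope type and deterministic_order function are hypothetical and do not exist in either implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    source_id: str  # hypothetical: id of the executor that sent the message
    seq: int        # hypothetical: per-source send counter
    payload: str


def deterministic_order(buffered: list[Envelope]) -> list[Envelope]:
    # Sort by (source, sequence) so delivery order does not depend on arrival order.
    return sorted(buffered, key=lambda e: (e.source_id, e.seq))


# The same three messages arriving in two different orders.
run_one = [Envelope("b", 0, "b0"), Envelope("a", 1, "a1"), Envelope("a", 0, "a0")]
run_two = [Envelope("a", 0, "a0"), Envelope("b", 0, "b0"), Envelope("a", 1, "a1")]

order_one = [e.payload for e in deterministic_order(run_one)]
order_two = [e.payload for e in deterministic_order(run_two)]
print(order_one, order_two)
```

This trades latency for reproducibility: the target cannot start on any message until the superstep's sends are complete, which is the determinism versus performance trade-off raised above.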

Related Code

Python:

  • python/packages/core/agent_framework/_workflows/_runner.py - superstep execution
  • python/packages/core/agent_framework/_workflows/_edge_runner.py - fan-out/fan-in runners
  • python/packages/core/agent_framework/_workflows/_shared_state.py - locking semantics

.NET:

  • dotnet/src/Microsoft.Agents.AI.Workflows/InProcessExecution.cs - execution modes
  • dotnet/src/Microsoft.Agents.AI.Workflows/Execution/FanOutEdgeRunner.cs - concurrent instantiation
  • dotnet/src/Microsoft.Agents.AI.Workflows/Execution/ - execution infrastructure

Labels

  • .NET
  • python
  • squad: workflows (Agent Framework Workflows Squad)
  • workflows (Related to Workflows in agent-framework)
