
Python: Add Support for Additional Workflow Runtimes, Especially Distributed Execution #1445

@alliscode

Description


Problem

Workflows currently support only in-process execution, where all executors run in a single process. This creates fundamental limitations in several areas:

  • Scalability - Cannot distribute compute-intensive executors across multiple machines
  • Resource isolation - All executors share the same process memory and resources
  • Fault tolerance - Single process failure brings down the entire workflow
  • Cloud-native deployment - Cannot leverage container orchestration platforms (Kubernetes, Azure Container Apps)
  • Heterogeneous environments - Cannot run Python and .NET executors in the same workflow
  • Cost optimization - Cannot scale individual executors independently based on load

Current State

Python:

  • Single Runner class executing all executors in one asyncio event loop
  • All executors instantiated in the same process memory
  • No abstraction for executor location or communication

.NET:

  • InProcessRunner explicitly documented as "without distributed coordination"
  • Three execution modes (OffThread, Lockstep, Subworkflow) all run in-process
  • Comments indicate distributed execution is a known gap: "primarily intended for testing, debugging, or scenarios where workflow execution does not require executor distribution"

Desired End State

Runtime Abstraction

A pluggable runtime model that supports multiple execution environments:

  1. In-Process Runtime (current, default)

    • Single process execution for development and simple scenarios
    • Fast, low-latency, minimal overhead
    • No additional infrastructure required
  2. Distributed Runtime (new, production)

    • Executors run across multiple processes/machines
    • Horizontal scalability for compute-intensive workflows
    • Fault isolation - executor failures don't crash the workflow
    • Location-transparent message delivery
  3. Hybrid Runtime (new, advanced)

    • Mix of local and remote executors in same workflow
    • Python and .NET executors in same workflow graph
    • Specialized executors run on GPU machines, agents run on CPU clusters

Core Runtime Capabilities

Executor Distribution:

  • Executors can run in separate processes, containers, or machines
  • Workflow builder specifies placement preferences (affinity, resource requirements)
  • Runtime handles executor instantiation and lifecycle

Message Delivery:

  • Transparent message routing between local and remote executors
  • Preserves superstep semantics across distributed executors
  • Efficient serialization for cross-process communication
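
One way to carry superstep semantics across a process boundary is to tag every message with its superstep number so the receiving side can order deliveries into discrete supersteps. The `Envelope` wire format below is a hypothetical sketch; JSON is used for readability, though a binary format like Protobuf would likely be chosen for efficiency.

```python
import json
from dataclasses import dataclass

@dataclass
class Envelope:
    """Hypothetical wire format for cross-process delivery. The superstep
    field lets the receiver preserve superstep semantics remotely."""
    target: str      # destination executor id
    superstep: int   # superstep in which the message was produced
    payload: dict    # application message body

    def to_wire(self) -> bytes:
        # Serialize to a compact JSON byte string for transport.
        return json.dumps(
            {"target": self.target, "superstep": self.superstep, "payload": self.payload}
        ).encode()

    @classmethod
    def from_wire(cls, data: bytes) -> "Envelope":
        # Reconstruct the envelope on the receiving process.
        obj = json.loads(data)
        return cls(obj["target"], obj["superstep"], obj["payload"])
```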

State Management:

  • Shared state works across distributed executors
  • Distributed locking or coordination for atomicity
  • Checkpoint/restore compatible with distributed execution

Discovery & Registry:

  • Executors register with runtime on startup
  • Workflow can discover available executor types
  • Dynamic executor scaling based on load

Fault Tolerance:

  • Executor failures detected and isolated
  • Workflow can retry failed executors or route around failures
  • Graceful degradation patterns
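
A simple building block for these behaviors is retry-with-backoff around remote executor calls, so transient failures are absorbed before the workflow has to route around a dead executor. This is a generic sketch, not the framework's retry policy; `call_with_retry` and its parameters are illustrative.

```python
import asyncio

async def call_with_retry(handler, message, *, attempts=3, base_delay=0.01):
    """Retry a remote executor call with exponential backoff. After the final
    attempt the error propagates, letting the workflow isolate the failure
    and route around it or degrade gracefully."""
    for attempt in range(attempts):
        try:
            return await handler(message)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```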

Real-World Scenarios Enabled

Scenario 1: Large-Scale Document Processing

  • Workflow processes 10,000 legal documents
  • 50 document analysis executors distributed across 10 machines
  • Each executor handles documents independently
  • Aggregator executor collects results
  • Current limitation: All executors must run in one process, limited by single-machine memory/CPU

Scenario 2: Multi-Language AI Pipeline

  • Python executors for data preprocessing and ML inference
  • .NET executors for business logic and database access
  • Agents in both languages collaborate in same workflow
  • Current limitation: Must choose one language for entire workflow

Scenario 3: Cloud-Native Deployment

  • Workflow deployed to Kubernetes cluster
  • Executors scale independently via HPA (Horizontal Pod Autoscaler)
  • GPU-intensive executors on GPU nodes, lightweight executors on CPU nodes
  • Workflow survives individual pod failures
  • Current limitation: Cannot distribute to Kubernetes, entire workflow in one pod

Scenario 4: Cost-Optimized Agent Swarm

  • Workflow with 100 agent executors for parallel research tasks
  • Scale up to 20 instances during high load
  • Scale down to 2 instances during idle periods
  • Pay only for active compute time
  • Current limitation: Cannot scale individual executors, all-or-nothing deployment

Scenario 5: Enterprise Integration

  • Workflow coordinates agents across department boundaries
  • Marketing team executors run in their Azure subscription
  • Legal team executors run in isolated compliance environment
  • Workflow orchestrates cross-department collaboration
  • Current limitation: All executors must be co-located, no isolation boundaries

Runtime Requirements

Runtime Interface:

  • Standard contract that all runtimes implement
  • Workflow code remains unchanged across runtimes
  • Runtime selected at workflow instantiation
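
The "unchanged workflow code" requirement might look like the following. `WorkflowBuilder`, `add_executor`, and the `runtime` keyword are hypothetical stand-ins for however the real builder ends up exposing runtime selection; the sketch only shows that the runtime choice is confined to instantiation.

```python
class WorkflowBuilder:
    """Toy builder illustrating runtime selection at instantiation time."""

    def __init__(self) -> None:
        self._executors: dict[str, object] = {}

    def add_executor(self, name: str, executor: object) -> "WorkflowBuilder":
        # Workflow topology is declared identically for every runtime.
        self._executors[name] = executor
        return self

    def build(self, *, runtime: str = "in-process"):
        # The only runtime-specific decision is this one argument.
        return {"runtime": runtime, "executors": dict(self._executors)}

# Same workflow definition, two different runtimes:
local = WorkflowBuilder().add_executor("analyze", object()).build()
remote = WorkflowBuilder().add_executor("analyze", object()).build(runtime="distributed")
```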

Executor Deployment Models:

  • Process-per-executor (isolation)
  • Shared executor pools (efficiency)
  • Sidecar containers (Kubernetes)
  • Serverless functions (Azure Functions, AWS Lambda)

Communication Patterns:

  • Point-to-point message delivery
  • Fan-out broadcast
  • Fan-in aggregation
  • Request-response for human-in-the-loop
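
The fan-out/fan-in pair in particular maps naturally onto concurrent dispatch plus aggregation. A minimal in-process sketch (the distributed version would replace the direct coroutine calls with remote sends, but the pattern is the same):

```python
import asyncio

async def fan_out_fan_in(message: str, workers):
    """Fan-out: send the same message to every worker concurrently.
    Fan-in: aggregate once all branches complete (sorted for determinism)."""
    results = await asyncio.gather(*(w(message) for w in workers))
    return sorted(results)

# Two toy executors standing in for real workflow executors.
async def shout(msg):
    return msg.upper()

async def reverse(msg):
    return msg[::-1]

out = asyncio.run(fan_out_fan_in("abc", [shout, reverse]))
```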

State Coordination:

  • Distributed shared state with consistency guarantees
  • Checkpoint store accessible from all executors
  • State migration when executors restart
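
A checkpoint store "accessible from all executors" is essentially a small storage contract that any shared backend (Redis, blob storage, a database) can satisfy. The `CheckpointStore` protocol below is a hypothetical sketch; the in-memory implementation is only suitable for tests, but it shows how a restarted executor could recover its latest state.

```python
from typing import Protocol

class CheckpointStore(Protocol):
    """Contract for a store reachable by every executor in the workflow."""

    def save(self, workflow_id: str, superstep: int, state: bytes) -> None: ...
    def load_latest(self, workflow_id: str) -> "tuple[int, bytes] | None": ...

class InMemoryCheckpointStore:
    """Dict-backed sketch; a real runtime would use shared, durable storage."""

    def __init__(self) -> None:
        self._data: dict[str, list[tuple[int, bytes]]] = {}

    def save(self, workflow_id: str, superstep: int, state: bytes) -> None:
        self._data.setdefault(workflow_id, []).append((superstep, state))

    def load_latest(self, workflow_id: str):
        # A restarting executor resumes from the highest recorded superstep.
        entries = self._data.get(workflow_id)
        return max(entries) if entries else None
```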

Observability:

  • Distributed tracing across executor boundaries
  • Metrics for message latency, executor throughput
  • Workflow visualization showing physical deployment topology

Questions

  1. Should distributed runtime be in-core or a separate package/extension?
  2. What distributed coordination technologies should we support (Orleans, Dapr, Temporal, custom)?
  3. Should runtime be transparent to workflow authors or require explicit opt-in?
  4. How do we handle cross-language executors (Python ↔ .NET communication)?
  5. What serialization format for remote messages (JSON, Protobuf, custom)?
  6. Should executors be stateless (state in shared store) or stateful (local state + replication)?
  7. How do we test distributed workflows without complex infrastructure?
  8. What's the migration path from in-process to distributed?

Related Code

Python:

  • python/packages/core/agent_framework/_workflows/_runner.py - current in-process runner
  • python/packages/redis/ - distributed storage infrastructure (chat store, context provider)

.NET:

  • dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunner.cs - current in-process runner
  • dotnet/src/Microsoft.Agents.AI.Workflows/InProcessExecution.cs - execution modes
  • dotnet/src/Microsoft.Agents.AI.Workflows/Execution/ - execution infrastructure to abstract

Labels: `.NET`, `python`, `squad: workflows`, `workflows`
