
Python: Add Support for Additional Workflow Runtimes, Especially Distributed Execution #1445

@alliscode

Description


Problem

Workflows currently support only in-process execution, where all executors run in a single process. This creates fundamental limitations in several areas:

  • Scalability - Cannot distribute compute-intensive executors across multiple machines
  • Resource isolation - All executors share the same process memory and resources
  • Fault tolerance - Single process failure brings down the entire workflow
  • Cloud-native deployment - Cannot leverage container orchestration platforms (Kubernetes, Azure Container Apps)
  • Heterogeneous environments - Cannot run Python and .NET executors in the same workflow
  • Cost optimization - Cannot scale individual executors independently based on load

Current State

Python:

  • Single Runner class executing all executors in one asyncio event loop
  • All executors instantiated in the same process memory
  • No abstraction for executor location or communication

.NET:

  • InProcessRunner explicitly documented as "without distributed coordination"
  • Three execution modes (OffThread, Lockstep, Subworkflow) all run in-process
  • Comments indicate distributed execution is a known gap: "primarily intended for testing, debugging, or scenarios where workflow execution does not require executor distribution"

Desired End State

Runtime Abstraction

A pluggable runtime model that supports multiple execution environments:

  1. In-Process Runtime (current, default)

    • Single process execution for development and simple scenarios
    • Fast, low-latency, minimal overhead
    • No additional infrastructure required
  2. Distributed Runtime (new, production)

    • Executors run across multiple processes/machines
    • Horizontal scalability for compute-intensive workflows
    • Fault isolation - executor failures don't crash the workflow
    • Location-transparent message delivery
  3. Hybrid Runtime (new, advanced)

    • Mix of local and remote executors in same workflow
    • Python and .NET executors in same workflow graph
    • Specialized executors run on GPU machines, agents run on CPU clusters

Core Runtime Capabilities

Executor Distribution:

  • Executors can run in separate processes, containers, or machines
  • Workflow builder specifies placement preferences (affinity, resource requirements)
  • Runtime handles executor instantiation and lifecycle

Message Delivery:

  • Transparent message routing between local and remote executors
  • Preserves superstep semantics across distributed executors
  • Efficient serialization for cross-process communication
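
One way to carry superstep semantics across a process boundary is to tag every message with its superstep number so the receiving side can order deliveries into discrete supersteps. The `Envelope` wire format below is a hypothetical sketch; JSON is used for readability, though a binary format like Protobuf would likely be chosen for efficiency.

```python
import json
from dataclasses import dataclass

@dataclass
class Envelope:
    """Hypothetical wire format for cross-process delivery. The superstep
    field lets the receiver preserve superstep semantics remotely."""
    target: str      # destination executor id
    superstep: int   # superstep in which the message was produced
    payload: dict    # application message body

    def to_wire(self) -> bytes:
        # Serialize to a compact JSON byte string for transport.
        return json.dumps(
            {"target": self.target, "superstep": self.superstep, "payload": self.payload}
        ).encode()

    @classmethod
    def from_wire(cls, data: bytes) -> "Envelope":
        # Reconstruct the envelope on the receiving process.
        obj = json.loads(data)
        return cls(obj["target"], obj["superstep"], obj["payload"])
```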

State Management:

  • Shared state works across distributed executors
  • Distributed locking or coordination for atomicity
  • Checkpoint/restore compatible with distributed execution

Discovery & Registry:

  • Executors register with runtime on startup
  • Workflow can discover available executor types
  • Dynamic executor scaling based on load

Fault Tolerance:

  • Executor failures detected and isolated
  • Workflow can retry failed executors or route around failures
  • Graceful degradation patterns
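
A simple building block for these behaviors is retry-with-backoff around remote executor calls, so transient failures are absorbed before the workflow has to route around a dead executor. This is a generic sketch, not the framework's retry policy; `call_with_retry` and its parameters are illustrative.

```python
import asyncio

async def call_with_retry(handler, message, *, attempts=3, base_delay=0.01):
    """Retry a remote executor call with exponential backoff. After the final
    attempt the error propagates, letting the workflow isolate the failure
    and route around it or degrade gracefully."""
    for attempt in range(attempts):
        try:
            return await handler(message)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```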

Real-World Scenarios Enabled

Scenario 1: Large-Scale Document Processing

  • Workflow processes 10,000 legal documents
  • 50 document analysis executors distributed across 10 machines
  • Each executor handles documents independently
  • Aggregator executor collects results
  • Current limitation: All executors must run in one process, limited by single-machine memory/CPU

Scenario 2: Multi-Language AI Pipeline

  • Python executors for data preprocessing and ML inference
  • .NET executors for business logic and database access
  • Agents in both languages collaborate in same workflow
  • Current limitation: Must choose one language for entire workflow

Scenario 3: Cloud-Native Deployment

  • Workflow deployed to Kubernetes cluster
  • Executors scale independently via HPA (Horizontal Pod Autoscaler)
  • GPU-intensive executors on GPU nodes, lightweight executors on CPU nodes
  • Workflow survives individual pod failures
  • Current limitation: Cannot distribute to Kubernetes, entire workflow in one pod

Scenario 4: Cost-Optimized Agent Swarm

  • Workflow with 100 agent executors for parallel research tasks
  • Scale up to 20 instances during high load
  • Scale down to 2 instances during idle periods
  • Pay only for active compute time
  • Current limitation: Cannot scale individual executors, all-or-nothing deployment

Scenario 5: Enterprise Integration

  • Workflow coordinates agents across department boundaries
  • Marketing team executors run in their Azure subscription
  • Legal team executors run in isolated compliance environment
  • Workflow orchestrates cross-department collaboration
  • Current limitation: All executors must be co-located, no isolation boundaries

Runtime Requirements

Runtime Interface:

  • Standard contract that all runtimes implement
  • Workflow code remains unchanged across runtimes
  • Runtime selected at workflow instantiation
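
The "unchanged workflow code" requirement might look like the following. `WorkflowBuilder`, `add_executor`, and the `runtime` keyword are hypothetical stand-ins for however the real builder ends up exposing runtime selection; the sketch only shows that the runtime choice is confined to instantiation.

```python
class WorkflowBuilder:
    """Toy builder illustrating runtime selection at instantiation time."""

    def __init__(self) -> None:
        self._executors: dict[str, object] = {}

    def add_executor(self, name: str, executor: object) -> "WorkflowBuilder":
        # Workflow topology is declared identically for every runtime.
        self._executors[name] = executor
        return self

    def build(self, *, runtime: str = "in-process"):
        # The only runtime-specific decision is this one argument.
        return {"runtime": runtime, "executors": dict(self._executors)}

# Same workflow definition, two different runtimes:
local = WorkflowBuilder().add_executor("analyze", object()).build()
remote = WorkflowBuilder().add_executor("analyze", object()).build(runtime="distributed")
```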

Executor Deployment Models:

  • Process-per-executor (isolation)
  • Shared executor pools (efficiency)
  • Sidecar containers (Kubernetes)
  • Serverless functions (Azure Functions, AWS Lambda)

Communication Patterns:

  • Point-to-point message delivery
  • Fan-out broadcast
  • Fan-in aggregation
  • Request-response for human-in-the-loop
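
The fan-out/fan-in pair in particular maps naturally onto concurrent dispatch plus aggregation. A minimal in-process sketch (the distributed version would replace the direct coroutine calls with remote sends, but the pattern is the same):

```python
import asyncio

async def fan_out_fan_in(message: str, workers):
    """Fan-out: send the same message to every worker concurrently.
    Fan-in: aggregate once all branches complete (sorted for determinism)."""
    results = await asyncio.gather(*(w(message) for w in workers))
    return sorted(results)

# Two toy executors standing in for real workflow executors.
async def shout(msg):
    return msg.upper()

async def reverse(msg):
    return msg[::-1]

out = asyncio.run(fan_out_fan_in("abc", [shout, reverse]))
```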

State Coordination:

  • Distributed shared state with consistency guarantees
  • Checkpoint store accessible from all executors
  • State migration when executors restart
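
A checkpoint store "accessible from all executors" is essentially a small storage contract that any shared backend (Redis, blob storage, a database) can satisfy. The `CheckpointStore` protocol below is a hypothetical sketch; the in-memory implementation is only suitable for tests, but it shows how a restarted executor could recover its latest state.

```python
from typing import Protocol

class CheckpointStore(Protocol):
    """Contract for a store reachable by every executor in the workflow."""

    def save(self, workflow_id: str, superstep: int, state: bytes) -> None: ...
    def load_latest(self, workflow_id: str) -> "tuple[int, bytes] | None": ...

class InMemoryCheckpointStore:
    """Dict-backed sketch; a real runtime would use shared, durable storage."""

    def __init__(self) -> None:
        self._data: dict[str, list[tuple[int, bytes]]] = {}

    def save(self, workflow_id: str, superstep: int, state: bytes) -> None:
        self._data.setdefault(workflow_id, []).append((superstep, state))

    def load_latest(self, workflow_id: str):
        # A restarting executor resumes from the highest recorded superstep.
        entries = self._data.get(workflow_id)
        return max(entries) if entries else None
```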

Observability:

  • Distributed tracing across executor boundaries
  • Metrics for message latency, executor throughput
  • Workflow visualization showing physical deployment topology

Questions

  1. Should distributed runtime be in-core or a separate package/extension?
  2. What distributed coordination technologies should we support (Orleans, Dapr, Temporal, custom)?
  3. Should runtime be transparent to workflow authors or require explicit opt-in?
  4. How do we handle cross-language executors (Python ↔ .NET communication)?
  5. What serialization format for remote messages (JSON, Protobuf, custom)?
  6. Should executors be stateless (state in shared store) or stateful (local state + replication)?
  7. How do we test distributed workflows without complex infrastructure?
  8. What's the migration path from in-process to distributed?

Related Code

Python:

  • python/packages/core/agent_framework/_workflows/_runner.py - current in-process runner
  • python/packages/redis/ - distributed storage infrastructure (chat store, context provider)

.NET:

  • dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunner.cs - current in-process runner
  • dotnet/src/Microsoft.Agents.AI.Workflows/InProcessExecution.cs - execution modes
  • dotnet/src/Microsoft.Agents.AI.Workflows/Execution/ - execution infrastructure to abstract

Labels: `.NET`, `python`, `squad: workflows`, `workflows`
