Open
Labels: .NET, python, squad: workflows (Agent Framework Workflows Squad), workflows (Related to Workflows in agent-framework)
Description
Add Support for Additional Workflow Runtimes, Especially Distributed Execution
Problem
Workflows currently only support in-process execution where all executors run in a single process. This creates fundamental limitations for:
- Scalability - Cannot distribute compute-intensive executors across multiple machines
- Resource isolation - All executors share the same process memory and resources
- Fault tolerance - Single process failure brings down the entire workflow
- Cloud-native deployment - Cannot leverage container orchestration platforms (Kubernetes, Azure Container Apps)
- Heterogeneous environments - Cannot run Python and .NET executors in the same workflow
- Cost optimization - Cannot scale individual executors independently based on load
Current State
Python:
- Single `Runner` class executing all executors in one asyncio event loop
- All executors instantiated in the same process memory
- No abstraction for executor location or communication
.NET:
- `InProcessRunner` explicitly documented as "without distributed coordination"
- Three execution modes (OffThread, Lockstep, Subworkflow) all run in-process
- Comments indicate distributed execution is a known gap: "primarily intended for testing, debugging, or scenarios where workflow execution does not require executor distribution"
Desired End State
Runtime Abstraction
A pluggable runtime model that supports multiple execution environments:
- In-Process Runtime (current, default)
  - Single process execution for development and simple scenarios
  - Fast, low-latency, minimal overhead
  - No additional infrastructure required
- Distributed Runtime (new, production)
  - Executors run across multiple processes/machines
  - Horizontal scalability for compute-intensive workflows
  - Fault isolation - executor failures don't crash workflow
  - Location-transparent message delivery
- Hybrid Runtime (new, advanced)
  - Mix of local and remote executors in same workflow
  - Python and .NET executors in same workflow graph
  - Specialized executors run on GPU machines, agents run on CPU clusters
Core Runtime Capabilities
Executor Distribution:
- Executors can run in separate processes, containers, or machines
- Workflow builder specifies placement preferences (affinity, resource requirements)
- Runtime handles executor instantiation and lifecycle
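As a rough illustration of placement preferences, the sketch below matches an executor's resource requirements and affinity against a set of nodes. Everything here (`Placement`, `Node`, `place`, the capability strings) is a hypothetical name invented for this example, not an existing agent-framework API, and a real scheduler would likely defer to the orchestration platform.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement preferences attached to an executor."""
    requires: frozenset = frozenset()      # capabilities the node must have, e.g. {"gpu"}
    colocate_with: Optional[str] = None    # soft affinity to another executor id


@dataclass
class Node:
    """Hypothetical machine or container that can host executors."""
    name: str
    capabilities: frozenset
    executors: list = field(default_factory=list)


def place(executor_id: str, placement: Placement, nodes: list) -> str:
    """Pick a node that satisfies the requirements, preferring affinity, then low load."""
    candidates = [n for n in nodes if placement.requires <= n.capabilities]
    if not candidates:
        raise LookupError(f"no node satisfies {set(placement.requires)} for {executor_id}")

    def score(node: Node) -> tuple:
        # False sorts before True, so nodes hosting the affinity target win first;
        # ties are broken by the number of executors already placed on the node.
        return (placement.colocate_with not in node.executors, len(node.executors))

    chosen = min(candidates, key=score)
    chosen.executors.append(executor_id)
    return chosen.name
```

Requirements are treated as hard constraints and affinity as a soft preference, which is one possible reading of "placement preferences"; the issue does not specify which should win.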
Message Delivery:
- Transparent message routing between local and remote executors
- Preserves superstep semantics across distributed executors
- Efficient serialization for cross-process communication
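One way to picture superstep-preserving delivery is a loop with a barrier: every message sent during round N is held back and delivered only in round N+1, regardless of whether the target is local or remote. The sketch below is an assumption-laden toy, not the framework's runner; `run_supersteps`, `remote`, and the reserved `"output"` target are hypothetical, and the "remote" hop is simulated with `asyncio.sleep(0)`.

```python
import asyncio
from collections import defaultdict


def remote(executor):
    """Stand-in for a remote proxy: same call signature as a local executor."""
    async def proxy(messages):
        await asyncio.sleep(0)  # placeholder for serialization and transport
        return await executor(messages)
    return proxy


async def run_supersteps(executors, initial, max_steps=10):
    """Run rounds until no messages remain.

    executors: dict of executor id -> async callable taking the list of messages
               received this round and returning (target_id, payload) pairs.
    initial:   list of (target_id, payload) pairs to seed the first round.
    """
    inboxes = defaultdict(list)
    for target, payload in initial:
        inboxes[target].append(payload)

    outputs = []
    for _ in range(max_steps):
        if not inboxes:
            break
        active = list(inboxes.items())
        # Barrier: every active executor finishes before anything is routed.
        results = await asyncio.gather(*(executors[t](msgs) for t, msgs in active))
        inboxes = defaultdict(list)
        for sent in results:
            for target, payload in sent:
                if target == "output":
                    outputs.append(payload)
                else:
                    inboxes[target].append(payload)
    return outputs


async def splitter(messages):
    return [("worker", word) for text in messages for word in text.split()]


async def worker(messages):
    return [("output", word.upper()) for word in messages]
```

Calling `asyncio.run(run_supersteps({"splitter": splitter, "worker": remote(worker)}, [("splitter", "a b")]))` runs the splitter locally and the worker behind the simulated proxy; the loop itself does not know which is which.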
State Management:
- Shared state works across distributed executors
- Distributed locking or coordination for atomicity
- Checkpoint/restore compatible with distributed execution
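Distributed locking is one option for atomic shared-state updates; optimistic concurrency (versioned compare-and-set) is another. The sketch below shows the second approach against an in-memory store that stands in for a real distributed store. `VersionedStore` and `atomic_update` are hypothetical names, and the `threading.Lock` only emulates the atomicity a real backend would have to provide itself.

```python
import threading


class VersionConflict(Exception):
    """Raised when an update keeps losing the race to other writers."""


class VersionedStore:
    """In-memory stand-in for a shared store with compare-and-set semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        """Write only if nobody else has written since expected_version was read."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True


def atomic_update(store, key, fn, max_attempts=5):
    """Read, transform, and write back; retry if another writer got in first."""
    for _ in range(max_attempts):
        version, value = store.get(key)
        new_value = fn(value)
        if store.compare_and_set(key, version, new_value):
            return new_value
    raise VersionConflict(f"gave up updating {key!r} after {max_attempts} attempts")
```

The trade-off is that writers never block each other, but heavily contended keys may need more retries or a real lock.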
Discovery & Registry:
- Executors register with runtime on startup
- Workflow can discover available executor types
- Dynamic executor scaling based on load
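A registry along these lines could be as simple as a table of executor instances with heartbeats, where instances that stop reporting are treated as unavailable. The following is a minimal in-memory sketch; `ExecutorRegistry`, its methods, and the TTL-based liveness rule are assumptions made for illustration, and a real deployment would presumably back this with a shared service.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecutorInstance:
    instance_id: str
    executor_type: str
    address: str
    last_heartbeat: float


class ExecutorRegistry:
    """Hypothetical registry: executors register on startup and send heartbeats."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so liveness can be tested without sleeping
        self._instances = {}

    def register(self, instance_id: str, executor_type: str, address: str) -> None:
        self._instances[instance_id] = ExecutorInstance(
            instance_id, executor_type, address, self._clock()
        )

    def heartbeat(self, instance_id: str) -> None:
        self._instances[instance_id].last_heartbeat = self._clock()

    def live_instances(self, executor_type: Optional[str] = None) -> list:
        """Instances seen within the TTL, optionally filtered by executor type."""
        now = self._clock()
        return [
            inst for inst in self._instances.values()
            if now - inst.last_heartbeat <= self._ttl
            and (executor_type is None or inst.executor_type == executor_type)
        ]

    def available_types(self) -> set:
        return {inst.executor_type for inst in self.live_instances()}
```

The count returned by `live_instances(executor_type)` is also the kind of signal a scaling policy could compare against queue depth, though the issue leaves the scaling mechanism open.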
Fault Tolerance:
- Executor failures detected and isolated
- Workflow can retry failed executors or route around failures
- Graceful degradation patterns
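The retry and route-around behaviors could be layered as two small helpers: retry a single executor with backoff, then fall through to the next replica if it stays down. This is a sketch under assumptions, not a proposed API; `call_with_retry`, `call_with_failover`, and `ExecutorFailed` are hypothetical, and executors are modeled as plain async callables.

```python
import asyncio


class ExecutorFailed(Exception):
    """Raised when an executor (or every replica of it) cannot handle a message."""


async def call_with_retry(executor, message, retries=2, base_delay=0.01):
    """Try one executor up to retries + 1 times with exponential backoff."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await executor(message)
        except Exception as exc:  # a real runtime would catch narrower error types
            last_error = exc
            if attempt < retries:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise ExecutorFailed(f"executor failed after {retries + 1} attempts") from last_error


async def call_with_failover(replicas, message, **retry_options):
    """Route around a failed executor by trying each replica in order."""
    for replica in replicas:
        try:
            return await call_with_retry(replica, message, **retry_options)
        except ExecutorFailed:
            continue  # isolate the failure and move on to the next replica
    raise ExecutorFailed("all replicas failed")
```

Blind retries are only safe when executors are idempotent or when delivery is deduplicated, which ties this back to the stateless-versus-stateful question raised later in the issue.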
Real-World Scenarios Enabled
Scenario 1: Large-Scale Document Processing
- Workflow processes 10,000 legal documents
- 50 document analysis executors distributed across 10 machines
- Each executor handles documents independently
- Aggregator executor collects results
- Current limitation: All executors must run in one process, limited by single-machine memory/CPU
Scenario 2: Multi-Language AI Pipeline
- Python executors for data preprocessing and ML inference
- .NET executors for business logic and database access
- Agents in both languages collaborate in same workflow
- Current limitation: Must choose one language for entire workflow
Scenario 3: Cloud-Native Deployment
- Workflow deployed to Kubernetes cluster
- Executors scale independently via HPA (Horizontal Pod Autoscaler)
- GPU-intensive executors on GPU nodes, lightweight executors on CPU nodes
- Workflow survives individual pod failures
- Current limitation: Cannot distribute to Kubernetes, entire workflow in one pod
Scenario 4: Cost-Optimized Agent Swarm
- Workflow with 100 agent executors for parallel research tasks
- Scale up to 20 instances during high load
- Scale down to 2 instances during idle periods
- Pay only for active compute time
- Current limitation: Cannot scale individual executors, all-or-nothing deployment
Scenario 5: Enterprise Integration
- Workflow coordinates agents across department boundaries
- Marketing team executors run in their Azure subscription
- Legal team executors run in isolated compliance environment
- Workflow orchestrates cross-department collaboration
- Current limitation: All executors must be co-located, no isolation boundaries
Runtime Requirements
Runtime Interface:
- Standard contract that all runtimes implement
- Workflow code remains unchanged across runtimes
- Runtime selected at workflow instantiation
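To make the "standard contract" idea concrete, here is a minimal sketch of what a runtime interface might look like in Python. All names (`WorkflowRuntime`, `InProcessRuntime`, `Workflow`, `invoke`) are hypothetical and chosen for this example; they do not correspond to existing agent-framework types.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol

# Hypothetical handler type: an executor is modeled as an async callable.
Handler = Callable[[Any], Awaitable[Any]]


class WorkflowRuntime(Protocol):
    """Illustrative contract that every runtime would implement."""

    async def invoke(self, executor_id: str, message: Any) -> Any: ...


class InProcessRuntime:
    """Runs every executor as a coroutine in the current event loop."""

    def __init__(self, handlers: dict) -> None:
        self._handlers = handlers  # executor id -> Handler

    async def invoke(self, executor_id: str, message: Any) -> Any:
        return await self._handlers[executor_id](message)


class Workflow:
    """Workflow definition that does not know where its executors run."""

    def __init__(self, steps: list, runtime: WorkflowRuntime) -> None:
        self._steps = steps
        self._runtime = runtime  # selected once, at instantiation

    async def run(self, message: Any) -> Any:
        for executor_id in self._steps:
            message = await self._runtime.invoke(executor_id, message)
        return message


async def upper(text: str) -> str:
    return text.upper()


async def exclaim(text: str) -> str:
    return text + "!"


def build_workflow(runtime: WorkflowRuntime) -> Workflow:
    # The workflow definition is identical no matter which runtime is passed in.
    return Workflow(["upper", "exclaim"], runtime)
```

A distributed runtime would provide the same `invoke` method but forward the message over the network, so `build_workflow` would not need to change.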
Executor Deployment Models:
- Process-per-executor (isolation)
- Shared executor pools (efficiency)
- Sidecar containers (Kubernetes)
- Serverless functions (Azure Functions, AWS Lambda)
Communication Patterns:
- Point-to-point message delivery
- Fan-out broadcast
- Fan-in aggregation
- Request-response for human-in-the-loop
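The first three patterns compose naturally if executors are modeled as async callables, as in the sketch below. The helper names (`send`, `broadcast`, `fan_in`) are hypothetical, and request-response for human-in-the-loop is left out because it depends on how the runtime suspends and resumes a workflow.

```python
import asyncio


async def send(executor, message):
    """Point-to-point: deliver one message to one executor."""
    return await executor(message)


async def broadcast(executors, message):
    """Fan-out: deliver the same message to every executor concurrently.

    Results come back in the same order as the executors list.
    """
    return await asyncio.gather(*(send(executor, message) for executor in executors))


async def fan_in(executors, message, aggregate):
    """Fan-in: combine the fan-out results with an aggregation function."""
    return aggregate(await broadcast(executors, message))


async def count_words(text):
    return len(text.split())


async def count_chars(text):
    return len(text)
```

For example, `asyncio.run(fan_in([count_words, count_chars], "a bc", sum))` sends the same text to both executors and sums their results.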
State Coordination:
- Distributed shared state with consistency guarantees
- Checkpoint store accessible from all executors
- State migration when executors restart
Observability:
- Distributed tracing across executor boundaries
- Metrics for message latency, executor throughput
- Workflow visualization showing physical deployment topology
Questions
- Should distributed runtime be in-core or a separate package/extension?
- What distributed coordination technologies should we support (Orleans, Dapr, Temporal, custom)?
- Should runtime be transparent to workflow authors or require explicit opt-in?
- How do we handle cross-language executors (Python ↔ .NET communication)?
- What serialization format for remote messages (JSON, Protobuf, custom)?
- Should executors be stateless (state in shared store) or stateful (local state + replication)?
- How do we test distributed workflows without complex infrastructure?
- What's the migration path from in-process to distributed?
Related Code
Python:
- `python/packages/core/agent_framework/_workflows/_runner.py` - current in-process runner
- `python/packages/redis/` - distributed storage infrastructure (chat store, context provider)
.NET:
- `dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunner.cs` - current in-process runner
- `dotnet/src/Microsoft.Agents.AI.Workflows/InProcessExecution.cs` - execution modes
- `dotnet/src/Microsoft.Agents.AI.Workflows/Execution/` - execution infrastructure to abstract
rbrtrg