feat(code_executors): add opt-in sandbox flag to LocalCommandLineCodeExecutor (#7462) #7611
xr843 wants to merge 3 commits into
@hanselhansel Reframe the `sandbox=True` posture from "feature with fallback" to "contract with explicit failure" per microsoft#7611 review feedback.

* Add `PlatformExecutionScopeError(RuntimeError)`, raised when `sandbox=True` is requested but the platform cannot honor the in-process isolation guarantees (no `preexec_fn` / no `resource` module on Windows).
* Replace the silent `UserWarning` + env-scrub-only Windows fallback with an explicit raise at `__init__` that names which guarantees are unavailable (`RLIMIT_AS` / `RLIMIT_CPU` / `RLIMIT_NOFILE` / `RLIMIT_NPROC` + `preexec_fn`).
* Defense-in-depth: also raise at the subprocess-exec site so a future refactor cannot silently spawn without the rlimit contract.
* Rewrite the sandbox docstring as a per-platform contract: every POSIX guarantee is enumerated (env-scrub patterns, each rlimit, and its cap value), and the Windows path documents "raises PlatformExecutionScopeError" rather than "falls back to".
* Add test coverage: the Windows raise path (with `sys.platform` mocked), plus an assertion that the error subclasses `RuntimeError` so existing handlers keep working.

Procurement and threat-model reviewers can now audit the failure surface from the docstring and exception message alone, without inferring intent from "falls back to" prose.
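A minimal sketch of the "contract with explicit failure" check described in this commit, assuming the check runs at construction time. `PlatformExecutionScopeError` is the name from the commit; `missing_sandbox_guarantees`, `check_sandbox_platform`, and the exact message text are hypothetical illustrations, not the PR's code.

```python
import importlib.util
import sys
from typing import List


class PlatformExecutionScopeError(RuntimeError):
    """Raised when sandbox=True is requested but the platform cannot honor it.

    Subclassing RuntimeError keeps existing ``except RuntimeError`` handlers working.
    """


def missing_sandbox_guarantees(platform: str = sys.platform) -> List[str]:
    """Return the POSIX guarantees this platform cannot provide (empty if all are available)."""
    missing: List[str] = []
    # setrlimit lives in the POSIX-only ``resource`` module.
    if platform.startswith("win") or importlib.util.find_spec("resource") is None:
        missing.extend(["RLIMIT_AS", "RLIMIT_CPU", "RLIMIT_NOFILE", "RLIMIT_NPROC"])
    # subprocess does not support preexec_fn on Windows.
    if platform.startswith("win"):
        missing.append("preexec_fn")
    return missing


def check_sandbox_platform(sandbox: bool, platform: str = sys.platform) -> None:
    """Hypothetical __init__-time check: fail loudly instead of silently degrading."""
    if not sandbox:
        return
    missing = missing_sandbox_guarantees(platform)
    if missing:
        raise PlatformExecutionScopeError(
            f"sandbox=True cannot be honored on {platform!r}; unavailable guarantees: "
            + ", ".join(missing)
        )
```

Naming every missing guarantee in the message is what lets a reviewer audit the failure surface from the exception alone, as the commit message intends.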
@hanselhansel addressed your contract framing in 4df5f7a: replaced the silent platform fallback (Windows previously emitted a …)

What changed:
Test coverage added:
Local run: 19 passed, 1 skipped on …
Replacing the silent platform fallback with explicit behavior is the right shape: the original Issue #7462 concern was that "sandboxed" was a property the caller couldn't reliably observe. Explicit-over-silent makes the security boundary inspectable, which is what an enterprise procurement review is going to ask for. Thanks for picking this up.
…odeExecutor

Refs microsoft#7462. Supersedes closed PR microsoft#7467.

The legacy `UserWarning` at construction was easily suppressed by production warning filters and `python -W ignore`, leaving unsandboxed execution of LLM-generated code as the silent default. This change introduces an explicit three-state sandbox posture parameter:

- `sandbox=None` (default; legacy behavior for backward compatibility): a `DeprecationWarning` plus `logger.warning()` surface the risk in both the Python warnings channel and structured logging pipelines. A future release will make this parameter required.
- `sandbox=False`: the caller explicitly acknowledges unsandboxed execution; no warning is emitted.
- `sandbox=True`: best-effort in-process hardening:
  * Environment entries whose name contains a credential pattern (TOKEN, SECRET, API_KEY, PASSWORD, PRIVATE_KEY, CREDENTIAL, SESSION, COOKIE, AUTH) are stripped from the child process's environment.
  * On POSIX, per-child rlimits (RLIMIT_CPU, RLIMIT_AS, RLIMIT_NOFILE, RLIMIT_NPROC) are applied via preexec_fn so runaway-memory and fork-bomb payloads are capped.
  * On Windows, the env scrub applies but preexec_fn is unavailable; a UserWarning directs callers to the Docker executor for strong isolation.

The docstring and LocalCommandLineCodeExecutorConfig are updated to round-trip the posture through serialization, so declarative deployments cannot silently downgrade.

This is NOT a substitute for DockerCommandLineCodeExecutor: adversarial payloads can still read files, make outbound connections, and write to work_dir. The docstring states this explicitly.

Tests cover: default DeprecationWarning, explicit opt-out silence, env scrubbing on POSIX, and config round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
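The three-state posture above can be sketched as a small resolver that emits on both channels only in the `None` case. This is an illustrative assumption, not the executor's code: `resolve_sandbox_posture`, the logger name, and the message wording are hypothetical.

```python
import logging
import warnings
from typing import Optional

logger = logging.getLogger("sandbox_posture_sketch")  # hypothetical logger name

_UNSANDBOXED_MSG = (
    "Executing LLM-generated code without a sandbox. Pass sandbox=False to "
    "acknowledge this, or sandbox=True for best-effort hardening. "
    "A future release will make this parameter required."
)


def resolve_sandbox_posture(sandbox: Optional[bool]) -> bool:
    """Return True if in-process hardening should be applied.

    None  -> legacy default: warn on BOTH channels, run unsandboxed.
    False -> explicit opt-out: silent, run unsandboxed.
    True  -> apply hardening (env scrub, plus rlimits on POSIX).
    """
    if sandbox is None:
        # Channel 1: Python warnings; removable via -W ignore or warnings filters.
        warnings.warn(_UNSANDBOXED_MSG, DeprecationWarning, stacklevel=2)
        # Channel 2: logging; unaffected by warnings filters.
        logger.warning(_UNSANDBOXED_MSG)
        return False
    return bool(sandbox)
```

Running under `python -W ignore` silences only the first channel. Python's default filters also hide `DeprecationWarning` unless it is attributed to code in `__main__`, which is one reason the logging channel matters for visibility.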
(force-pushed 5b4d26e to 5a0b2c8)
Thanks @msaleme: "explicit-over-silent makes the security boundary inspectable" is exactly the framing I want this PR to land on. The …

@ekzhu, flagging for your eyes since this extends the security posture you introduced in #7035 (Docker-default + warnings). The PR keeps backward compatibility (default …
@msaleme, thanks again for the review feedback; glad the "explicit-over-silent makes the security boundary inspectable" framing landed. I believe the PR is now in its final shape: the opt-in sandbox flag with …
Pushed …

The "will become required" intent stays in the message (that's the actionable channel); the category is the visibility channel. Tests updated and passing locally (5/5 sandbox tests). The orthogonal follow-up, flipping the default so bare …
Pushed …
On the follow-on question: I am treating this PR as strong best-effort visibility without adding an unconditional direct-to-stderr side effect. If an application deliberately suppresses both Python warnings and this logger, the non-bypassable enforcement step should be the follow-up breaking change where bare …
Closing this one myself. autogen has been in maintenance mode since early April (the most recent merge, #7521 on 2026-04-06, was the maintenance-mode README banner itself), so a new executor feature like this opt-in sandbox flag is out of scope for the project's current direction. The design discussion in #7462 and in this thread may still be a useful reference if the silent-platform-fallback concern is ever revisited. Thanks @msaleme for the thoughtful review feedback earlier.
Summary
Draft PR addressing #7462: adds an opt-in `sandbox` parameter to `LocalCommandLineCodeExecutor` so users who cannot run Docker still get best-effort in-process hardening. This supersedes #7467 with broader scope (env-scrub + rlimits + Windows degrade path + config round-trip).

Opening as draft because I'd like maintainer input on the Windows strategy (see below) before investing further.
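The env-scrub half of the hardening can be sketched with the standard-library `fnmatch` module. This is a minimal illustration under stated assumptions, not the PR's code: `is_credential_name` and `scrubbed_env` are hypothetical helpers, and `SCRUB_PATTERNS` reproduces the case-insensitive glob list from the "Env-var scrub patterns" section of this description.

```python
import fnmatch
import os
from typing import Dict, Mapping, Optional

# Glob patterns copied from the "Env-var scrub patterns" list in this PR description.
SCRUB_PATTERNS = (
    "*_API_KEY", "*_TOKEN", "*_SECRET", "*PASSWORD*",
    "AWS_*", "AZURE_*", "GCP_*", "GOOGLE_*",
    "OPENAI_*", "ANTHROPIC_*", "HF_*", "HUGGINGFACE_*",
    "GITHUB_TOKEN", "GH_TOKEN", "NPM_TOKEN", "PYPI_*",
)


def is_credential_name(name: str) -> bool:
    """Case-insensitive glob match: upper-case the name (patterns are already upper case)."""
    upper = name.upper()
    # fnmatchcase never applies OS-specific case folding, so behavior is identical on all platforms.
    return any(fnmatch.fnmatchcase(upper, pattern) for pattern in SCRUB_PATTERNS)


def scrubbed_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with credential-looking entries removed."""
    source = os.environ if base is None else base
    return {key: value for key, value in source.items() if not is_credential_name(key)}
```

The same `env=scrubbed_env()` mapping would be passed to both the code-execution subprocess and the pip-install subprocess. Prefix patterns such as `HF_*` or `GOOGLE_*` also remove non-secret configuration variables that share the prefix, which errs on the side of over-scrubbing.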
Three modes for `sandbox: Optional[bool]`

| Value | Behavior |
|---|---|
| `None` (default) | `DeprecationWarning` + `logger.warning` that a future release will make the parameter required. Execution unchanged, fully backward compatible. Replaces the existing `UserWarning`. |
| `False` | Explicit opt-out: unsandboxed execution, no warning emitted. |
| `True` | POSIX: `preexec_fn` applying `RLIMIT_CPU` (= timeout + 5 s) and `RLIMIT_AS` (default 512 MiB, configurable via `sandbox_memory_bytes`), plus credential env-var scrub on both the code-execution subprocess and the pip-install subprocess in `_setup_functions`. Windows: logs a warning that rlimits are unavailable; env-scrub still applies. |

Env-var scrub patterns (case-insensitive)
`*_API_KEY`, `*_TOKEN`, `*_SECRET`, `*PASSWORD*`, `AWS_*`, `AZURE_*`, `GCP_*`, `GOOGLE_*`, `OPENAI_*`, `ANTHROPIC_*`, `HF_*`, `HUGGINGFACE_*`, `GITHUB_TOKEN`, `GH_TOKEN`, `NPM_TOKEN`, `PYPI_*`.

What is intentionally out of scope for this draft
- … `SetInformationJobObject`). The Windows path here is degrade-only. I'd like maintainer direction on whether you want a proper Windows implementation in this PR, in a follow-up, or left documented as-is.
- … `DockerCommandLineCodeExecutor`; the docstring says so explicitly.

Tests
New `tests/code_executors/test_local_sandbox.py`:

- `sandbox=None` emits `DeprecationWarning` (not `UserWarning`)
- `sandbox=False` emits no warning
- `sandbox=True` emits no `DeprecationWarning`
- Config round-trip of `sandbox` + `sandbox_memory_bytes` for all three values
- `FAKE_API_KEY` in env is scrubbed, harmless var survives
- With `sandbox=False`, `FAKE_API_KEY` reaches the subprocess
- `sandbox=True` constructs cleanly (Windows smoke test)

Locally:
`uv run pytest tests/code_executors/test_local_sandbox.py`: 8 passed. The existing `test_commandline_code_executor.py`: 13 passed, 1 skipped (Windows-only). One unrelated pre-existing failure in `test_user_defined_functions.py::test_can_load_function_with_reqs` is due to `pip` missing in my uv-created venv and is not touched by this PR.

Decisions worth a second look
- `RLIMIT_AS` = 512 MiB. Large enough for most small Python/shell snippets, tight enough to catch runaway allocations. Configurable via `sandbox_memory_bytes`.
- `RLIMIT_CPU` = `timeout + 5`. A 5 s grace above the async wait-for timeout so we don't race it and mask the existing timeout behavior.
- `RLIMIT_AS` is refused on some platforms (e.g. macOS). The error is swallowed silently; the CPU cap and env-scrub still apply. Let me know if you'd rather hard-fail there.
- `preexec_fn` runs in the child between fork and exec. This is the standard pattern for `subprocess` rlimit enforcement, and `asyncio.create_subprocess_exec` passes it through.

Links: issue #7462 · supersedes #7467
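The rlimit decisions listed above can be illustrated with a POSIX-only sketch. It mirrors the stated choices (`RLIMIT_CPU` = timeout + 5 s, `RLIMIT_AS` = 512 MiB, a refused `RLIMIT_AS` swallowed), but `make_rlimit_preexec`, `run_limited`, and the constants are hypothetical names, not the PR's implementation.

```python
import asyncio
import sys
from typing import Callable

DEFAULT_SANDBOX_MEMORY_BYTES = 512 * 1024 * 1024  # 512 MiB default from the PR description
CPU_GRACE_SECONDS = 5  # headroom so RLIMIT_CPU does not race the asyncio timeout


def make_rlimit_preexec(
    timeout: int, memory_bytes: int = DEFAULT_SANDBOX_MEMORY_BYTES
) -> Callable[[], None]:
    """Build a preexec_fn that caps CPU seconds and address space (POSIX only)."""
    import resource  # POSIX-only; imported lazily so this module still imports on Windows

    cpu_seconds = timeout + CPU_GRACE_SECONDS

    def _apply_limits() -> None:
        # Executed in the child after fork() and before exec().
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        try:
            resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
        except (ValueError, OSError):
            # Some platforms (e.g. macOS) refuse RLIMIT_AS: keep the CPU cap and continue.
            pass

    return _apply_limits


async def run_limited(code: str, timeout: int) -> str:
    """Run `code` in a child Python interpreter under the rlimits above."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable,
        "-c",
        code,
        stdout=asyncio.subprocess.PIPE,
        preexec_fn=make_rlimit_preexec(timeout),  # forwarded to subprocess.Popen
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # the wall-clock timeout fires first; RLIMIT_CPU is only the backstop
        await proc.wait()
        raise
    return stdout.decode()
```

On Windows, `subprocess` rejects `preexec_fn`, which is why these caps cannot be offered there by this mechanism.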
cc @msaleme @hanselhansel @polterguy from the issue collaboration trail.
🤖 Generated with Claude Code