Skip to content

feat(code_executors): add opt-in sandbox flag to LocalCommandLineCodeExecutor (#7462)#7611

Closed
xr843 wants to merge 3 commits into
microsoft:mainfrom
xr843:fix/lce-sandbox-flag
Closed

feat(code_executors): add opt-in sandbox flag to LocalCommandLineCodeExecutor (#7462)#7611
xr843 wants to merge 3 commits into
microsoft:mainfrom
xr843:fix/lce-sandbox-flag

Conversation

@xr843

@xr843 xr843 commented Apr 20, 2026

Copy link
Copy Markdown

Summary

Draft PR addressing #7462 — adds an opt-in sandbox parameter to LocalCommandLineCodeExecutor so users who cannot run Docker still get best-effort in-process hardening. This supersedes #7467 with broader scope (env-scrub + rlimits + Windows degrade path + config round-trip).

Opening as draft because I'd like maintainer input on the Windows strategy (see below) before investing further.

Three modes for sandbox: Optional[bool]

Value Behavior
None Default. Emits DeprecationWarning + logger.warning that a future release will make the parameter required. Execution unchanged — fully backward compatible. Replaces the existing UserWarning.
False Explicit acknowledgement of unsandboxed execution. No warning.
True POSIX: preexec_fn applying RLIMIT_CPU (= timeout + 5s) and RLIMIT_AS (default 512 MiB, configurable via sandbox_memory_bytes) + credential env-var scrub on both the code-execution subprocess and the pip-install subprocess in _setup_functions. Windows: logs a warning that rlimits are unavailable; env-scrub still applies.

Env-var scrub patterns (case-insensitive)

*_API_KEY, *_TOKEN, *_SECRET, *PASSWORD*, AWS_*, AZURE_*, GCP_*, GOOGLE_*, OPENAI_*, ANTHROPIC_*, HF_*, HUGGINGFACE_*, GITHUB_TOKEN, GH_TOKEN, NPM_TOKEN, PYPI_*.

What is intentionally out of scope for this draft

  • Windows resource limits (Job Objects / SetInformationJobObject). The Windows path here is degrade-only. I'd like maintainer direction on whether you want a proper Windows implementation in this PR, a follow-up, or left documented as-is.
  • README changes. Docstring-only for now to keep scope tight.
  • Network isolation, filesystem chroot, seccomp — all correctly belong in DockerCommandLineCodeExecutor; the docstring says so explicitly.

Tests

New tests/code_executors/test_local_sandbox.py:

  1. sandbox=None emits DeprecationWarning (not UserWarning)
  2. sandbox=False emits no warning
  3. sandbox=True emits no DeprecationWarning
  4. Config round-trip preserves sandbox + sandbox_memory_bytes for all three values
  5. Config model defaults
  6. POSIX behavioral: FAKE_API_KEY in env is scrubbed, harmless var survives
  7. POSIX behavioral: with sandbox=False, FAKE_API_KEY reaches the subprocess
  8. sandbox=True constructs cleanly (Windows smoke test)

Locally: uv run pytest tests/code_executors/test_local_sandbox.py — 8 passed. The existing test_commandline_code_executor.py — 13 passed, 1 skipped (Windows-only). One unrelated pre-existing failure in test_user_defined_functions.py::test_can_load_function_with_reqs due to pip missing in my uv-created venv, not touched by this PR.

Decisions worth a second look

  • Default RLIMIT_AS = 512 MiB. Large enough for most small Python/shell, tight enough to catch runaway allocations. Configurable via sandbox_memory_bytes.
  • RLIMIT_CPU = timeout + 5. 5s grace above the async wait-for timeout so we don't race it and mask the existing timeout behavior.
  • RLIMIT_AS refused on some platforms (e.g. macOS) — swallowed silently, CPU cap + env-scrub still apply. Let me know if you'd rather hard-fail there.
  • preexec_fn runs in the child between fork and exec. Standard pattern for subprocess rlimit enforcement; asyncio.create_subprocess_exec passes it through.

Links: issue #7462 · supersedes #7467

cc @msaleme @hanselhansel @polterguy from the issue collaboration trail.

🤖 Generated with Claude Code

@xr843

xr843 commented May 6, 2026

Copy link
Copy Markdown
Author

Marking ready for review. cc @ekzhu — this is the opt-in sandbox flag implementation discussed in #7462 (alternative direction to #7467). Happy to iterate if you'd like a different surface for the flag, or if it should be wired through differently in the agent factory.

xr843 added a commit to xr843/autogen that referenced this pull request May 9, 2026
@hanselhansel

Reframe the sandbox=True posture from "feature with fallback" to "contract
with explicit failure" per microsoft#7611 review feedback.

* Add PlatformExecutionScopeError(RuntimeError) — raised when sandbox=True
  is requested but the platform cannot honor the in-process isolation
  guarantees (no preexec_fn / no resource module on Windows).
* Replace the silent UserWarning + env-scrub-only Windows fallback with an
  explicit raise at __init__ that names which guarantees are unavailable
  (RLIMIT_AS / RLIMIT_CPU / RLIMIT_NOFILE / RLIMIT_NPROC + preexec_fn).
* Defense-in-depth: also raise at the subprocess-exec site so a future
  refactor cannot silently spawn without the rlimit contract.
* Rewrite the sandbox docstring as a per-platform contract: every POSIX
  guarantee is enumerated (env-scrub patterns + each rlimit + cap value),
  and the Windows path documents "raises PlatformExecutionScopeError" —
  not "falls back to".
* Add test coverage: Windows raise path (sys.platform mocked) plus a
  RuntimeError-subclass assertion so existing handlers keep working.

Procurement / threat-model reviewers can now audit the failure surface
from the docstring and exception message alone, without inferring intent
from "falls back to" prose.
@xr843

xr843 commented May 9, 2026

Copy link
Copy Markdown
Author

@hanselhansel addressed your contract framing in 4df5f7a: replaced the silent platform fallback (Windows previously emitted a UserWarning and degraded to env-scrub-only) with an explicit PlatformExecutionScopeError raise at __init__, and rewrote the sandbox: docstring per-platform as a contract rather than a feature note.

What changed:

  • New exception PlatformExecutionScopeError(RuntimeError) lives alongside the executor and is exported from autogen_ext.code_executors.local.
  • POSIX docstring section enumerates every guarantee that's part of the contract: env-scrub patterns + RLIMIT_CPU / RLIMIT_AS (2 GiB) / RLIMIT_NOFILE (256) / RLIMIT_NPROC (64).
  • Windows docstring section reads: "Raises PlatformExecutionScopeErrorpreexec_fn and the resource-module rlimits used to enforce the POSIX hardening contract are not available on Windows." No "falls back to".
  • The error message names which guarantees are unavailable (RLIMIT_AS / RLIMIT_CPU / RLIMIT_NOFILE / RLIMIT_NPROC + preexec_fn) and points callers at DockerCommandLineCodeExecutor.
  • Defense-in-depth: also raises at the subprocess-exec site so a future refactor can't regress the construction-time guard.

Test coverage added:

  • test_sandbox_true_on_windows_raises_platform_scope_error (uses sys.platform mock).
  • test_platform_execution_scope_error_is_runtime_error (subclass invariant — keeps existing except RuntimeError: handlers working).

Local run: 19 passed, 1 skipped on test_commandline_code_executor.py. Ready for another look.

@msaleme

msaleme commented May 9, 2026

Copy link
Copy Markdown

Replacing the silent platform fallback with explicit behavior is the right shape — the original Issue #7462 concern was that "sandboxed" was a property the caller couldn't reliably observe. Explicit-over-silent makes the security boundary inspectable, which is what an enterprise procurement review is going to ask for. Thanks for picking this up.

xr843 and others added 2 commits May 10, 2026 10:43
…odeExecutor

Refs microsoft#7462. Supersedes closed PR microsoft#7467.

The legacy `UserWarning` at construction was easily suppressed by production
warning filters and `python -W ignore`, leaving unsandboxed execution of
LLM-generated code as the silent default. This change introduces an explicit
three-state sandbox posture parameter:

- sandbox=None   (default, legacy behavior for backward compatibility):
                 DeprecationWarning + logger.warning() surface the risk in
                 both Python warning channels and structured logging
                 pipelines. A future release will make this parameter
                 required.
- sandbox=False  Caller explicitly acknowledges unsandboxed execution;
                 no warning is emitted.
- sandbox=True   Best-effort in-process hardening:
                 * Environment entries whose name contains credential
                   patterns (TOKEN, SECRET, API_KEY, PASSWORD,
                   PRIVATE_KEY, CREDENTIAL, SESSION, COOKIE, AUTH) are
                   stripped from the child process.
                 * On POSIX, per-child rlimits (RLIMIT_CPU, RLIMIT_AS,
                   RLIMIT_NOFILE, RLIMIT_NPROC) are applied via
                   preexec_fn so runaway memory/fork-bomb payloads are
                   capped.
                 * On Windows, env scrub applies but preexec is
                   unavailable; a UserWarning directs callers to the
                   Docker executor for strong isolation.

Docstring and LocalCommandLineCodeExecutorConfig are updated to round-trip
the posture through serialization so declarative deployments cannot silently
downgrade.

This is NOT a substitute for DockerCommandLineCodeExecutor — adversarial
payloads can still read files, make outbound connections, and write to
work_dir. The docstring states this explicitly.

Tests cover: default DeprecationWarning, explicit opt-out silence, env
scrubbing on POSIX, and config round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hanselhansel

Reframe the sandbox=True posture from "feature with fallback" to "contract
with explicit failure" per microsoft#7611 review feedback.

* Add PlatformExecutionScopeError(RuntimeError) — raised when sandbox=True
  is requested but the platform cannot honor the in-process isolation
  guarantees (no preexec_fn / no resource module on Windows).
* Replace the silent UserWarning + env-scrub-only Windows fallback with an
  explicit raise at __init__ that names which guarantees are unavailable
  (RLIMIT_AS / RLIMIT_CPU / RLIMIT_NOFILE / RLIMIT_NPROC + preexec_fn).
* Defense-in-depth: also raise at the subprocess-exec site so a future
  refactor cannot silently spawn without the rlimit contract.
* Rewrite the sandbox docstring as a per-platform contract: every POSIX
  guarantee is enumerated (env-scrub patterns + each rlimit + cap value),
  and the Windows path documents "raises PlatformExecutionScopeError" —
  not "falls back to".
* Add test coverage: Windows raise path (sys.platform mocked) plus a
  RuntimeError-subclass assertion so existing handlers keep working.

Procurement / threat-model reviewers can now audit the failure surface
from the docstring and exception message alone, without inferring intent
from "falls back to" prose.
@xr843

xr843 commented May 13, 2026

Copy link
Copy Markdown
Author

Thanks @msaleme — "explicit-over-silent makes the security boundary inspectable" is exactly the framing I want this PR to land on. The PlatformExecutionScopeError raise-at-construction behavior in 5a0b2c8 should match that contract; happy to tighten the docstring further if you want it to name the failure mode in the class header rather than just __init__.

@ekzhu — flagging for your eyes since this extends the security posture you introduced in #7035 (Docker-default + warnings). The PR keeps backward compatibility (default sandbox=None preserves current behavior), and only opts callers into hardening when they explicitly request it. Would appreciate a maintainer review pass when you have a window.

@xr843

xr843 commented May 19, 2026

Copy link
Copy Markdown
Author

@msaleme thanks again for the review feedback — glad the "explicit-over-silent makes the security boundary inspectable" framing landed. I believe the PR is now in its final shape: the opt-in sandbox flag with PlatformExecutionScopeError raised at construction addresses the original #7462 concern that "sandboxed" was an unobservable property. CI is green. Would you be able to give a formal review approval, or let me know if anything else is needed before this can land? Happy to make adjustments.

@xr843

xr843 commented May 19, 2026

Copy link
Copy Markdown
Author

Pushed cd58017da reverting the sandbox notice from DeprecationWarning back to UserWarning, per @Ilya0527's PEP 565 analysis on #7462. Library-level DeprecationWarning is silenced by stock filters (default::DeprecationWarning:__main__ + ignore::DeprecationWarning), so the factory-module instantiation shape that #7462 targets never saw the warning. UserWarning has no ignore filter; logger.warning(...) still runs in parallel.

The "will become required" intent stays in the message — that's the actionable channel; the category is the visibility channel. Tests updated and passing locally (5/5 sandbox tests).

The orthogonal follow-up — flipping the default so bare LocalCommandLineCodeExecutor() requires an explicit sandbox= and raises otherwise — I'll file as a separate issue once this lands, since it's a breaking change that wants its own deprecation cycle.

@xr843

xr843 commented May 21, 2026

Copy link
Copy Markdown
Author

Pushed ce4c538 to make the branch match the discussion here:

  • the default unsandboxed path now emits UserWarning again, not DeprecationWarning, so library-level construction remains visible under Python's stock filters;
  • logger.warning(...) remains a parallel visibility lane;
  • added a regression test for warnings.simplefilter("ignore") to pin that the log record still emits even when Python warnings are globally suppressed.

On the follow-on question: I am treating this PR as strong best-effort visibility without adding an unconditional stderr-direct side effect. If an application deliberately suppresses both Python warnings and this logger, the non-bypassable enforcement step should be the follow-up breaking change where bare LocalCommandLineCodeExecutor() requires an explicit sandbox=... posture and raises otherwise. That keeps this PR backward-compatible while still making the current default risk visible in normal library usage.

@xr843

xr843 commented Jun 27, 2026

Copy link
Copy Markdown
Author

Closing this one myself. autogen has been in maintenance mode since early April (the most recent merge, #7521 on 2026-04-06, was the maintenance-mode README banner itself), so a new executor feature like this opt-in sandbox flag is out of scope for the project's current direction.

The design discussion in #7462 and in this thread may still be a useful reference if the silent-platform-fallback concern is ever revisited. Thanks @msaleme for the thoughtful review feedback earlier.

@xr843 xr843 closed this Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants