Conversation

@danielelotito (Collaborator) commented Dec 17, 2025

This PR introduces to ADO the ability to set up an exploration of a discovery space that stops once a good predictive model for the rest of the space has been acquired.
In this iterative modeling, a holdout set made up of the most recently sampled points is used to evaluate the performance of the latest acquired model.
As exploration progresses, these performance metrics are compared; if the predictions are stable, iterative modeling stops.
You can try iterative modeling by following the example at examples/trim_custom_experiments/README.md.
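
To make the stopping criterion concrete, here is a minimal sketch of the loop, assuming pluggable sample_batch / fit_model / score callables; all names and thresholds are illustrative, not the actual trim operator API:

```python
def iterative_modeling(sample_batch, fit_model, score,
                       tol=0.02, patience=2, max_rounds=50):
    # Illustrative sketch only: sample, hold out the newest batch, score the
    # previous model on it, and stop once successive scores stabilise.
    sampled, history = [], []
    model, stable = None, 0
    for _ in range(max_rounds):
        holdout = sample_batch()                    # newest points form the holdout set
        if model is not None:
            history.append(score(model, holdout))   # evaluate the latest model
            if len(history) >= 2 and abs(history[-1] - history[-2]) < tol:
                stable += 1                          # successive metrics agree
            else:
                stable = 0
        if stable >= patience:
            break                                    # predictions are stable: stop
        sampled.extend(holdout)
        model = fit_model(sampled)                   # acquire a new model on all samples
    return model, history
```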

My TODOs:

  • write extensive documentation
  • write more tests
  • write a document defining more challenging settings and the operator's expected behavior in them, then test the operator in those settings
  • add more references to this PR draft

michael-johnston and others added 30 commits December 3, 2025 16:51
Operation cleanup logic was mixed together (a single function had to clean up everything and behaved differently for different scenarios) and was also mixed with signal handling (a shutdown could mean a successful operation or a signal).

In this commit
- cleanup code required by different operations/functions is separated
- A single signal handler exists that is generic
   - operations register their cleanup requirements with the handler so they are cleaned up
- shutdown global now only indicates if a signal was received, NOT if an operation finished
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
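
A rough sketch of the pattern this commit describes — a single generic handler that operations register cleanup callbacks with; register_cleanup and shutdown_signal_received are illustrative names, not the actual ado symbols:

```python
import signal

# One generic signal handler; operations register cleanup callbacks with it.
_cleanup_callbacks = []
shutdown_signal_received = False  # only records that a signal arrived

def register_cleanup(callback):
    """Operations call this to register their cleanup requirements."""
    _cleanup_callbacks.append(callback)

def _generic_handler(signum, frame):
    global shutdown_signal_received
    shutdown_signal_received = True  # NOT set on normal operation completion
    for callback in reversed(_cleanup_callbacks):  # LIFO: innermost first
        callback()

signal.signal(signal.SIGTERM, _generic_handler)
signal.signal(signal.SIGINT, _generic_handler)
```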
…ation

It is raised as a new InterruptedOperationError, which contains the operation identifier and is a subclass of KeyboardInterrupt.

Previously nothing was raised and the operation exited normally after interruption. However, this pattern is not easy to maintain with multiple nested operations, since each operation would have to check whether the inner operation exited due to KeyboardInterrupt.

This way operators do not have to handle KeyboardInterrupt. Each outer operation catches the inner interrupt and raises a new exception with its own id.

The outermost handler (in operation/create.py) now catches InterruptedOperationError and prints the id of the outermost (parent) interrupted operation.
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
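
A minimal sketch of that re-raise chain, assuming only what the message above states (a KeyboardInterrupt subclass carrying an operation identifier); run_outer_operation is a hypothetical helper:

```python
class InterruptedOperationError(KeyboardInterrupt):
    """Raised when an operation is interrupted; carries the operation id."""

    def __init__(self, operation_id: str):
        self.operation_id = operation_id
        super().__init__(f"operation {operation_id} was interrupted")

def run_outer_operation(operation_id: str, inner_operation):
    try:
        inner_operation()
    except KeyboardInterrupt as error:
        # Wrap the inner interrupt (plain or already wrapped) so callers see
        # this operation's own id; the chain preserves the inner exception.
        raise InterruptedOperationError(operation_id) from error
```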
The pattern is that orchestrate() creates/destroys the resource cleaner, which lives in the default namespace.

If another function calls an explore operation, that function has the same responsibility.
Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
… in the source space

Note that there is a WIP on the random_shift strategy for high-D sampling
 fix point in time to replicate comment
#281 (comment)

Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
This is so the actors registered with the cleaner during the operation can be cleaned up BEFORE the operation cleanup deletes their parents (which would otherwise cause the actors to be deleted implicitly).

The other option is to have a single resource cleaner, but then all actors wanting to use it must be created "detached".
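
For context, a detached Ray actor outlives the driver that created it, which is what sharing a single cleaner would force on every client — a sketch, with ResourceCleaner as a hypothetical stand-in for the real cleaner class:

```python
import ray

@ray.remote
class ResourceCleaner:  # hypothetical stand-in for the real cleaner actor
    def register(self, resource):
        ...

ray.init()
# lifetime="detached" decouples the actor's lifetime from the creating job;
# that is the cost of sharing one cleaner across operations
cleaner = ResourceCleaner.options(name="cleaner", lifetime="detached").remote()
```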
- Remove the get_measurement_queue class method. It does nothing anymore
- Add ray_namespace parameter

The ray_namespace parameter reduces the parallel passing of the same information, which is a potential source of errors, and enables more consistency checks.

A set of actuators, the discovery space manager, and the queue instance should all be in the same Ray namespace.
- Move all actuator validation and initialization code to setup_actuator.
- Create actuators first so validation failures are detected early
- setup_actuators obtains the namespace from the queue rather than allowing a potential inconsistency.
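
A minimal sketch of the shared-namespace idea with Ray named actors; MeasurementQueue and the namespace string are placeholders, not the real ado classes:

```python
import ray

@ray.remote
class MeasurementQueue:  # placeholder, not the real ado queue class
    pass

namespace = "operation-1234"  # hypothetical per-operation namespace
ray.init(namespace=namespace)

# Queue, actuators, and discovery space manager all live in one namespace,
# so each can be looked up by name instead of re-passing the same handles.
queue = MeasurementQueue.options(name="measurement_queue").remote()
same_queue = ray.get_actor("measurement_queue", namespace=namespace)
```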
Normal shutdown and SIGTERM were handled, but other exceptions were not, which could leave operation resources uncleaned
Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
The batchsize passed to the trim random walk is different from the one random walk uses

Signed-off-by: Daniele Lotito <daniele.lotito@ibm.com>
* build: update pre-commit hooks

* style(docs): update feature request template

* style(docs): update creating custom experiments

* style(docs): update run_experiment

* style(docs): update sft-trainer

* style(docs): update vllm-performance

* style(docs): update data-sharing

* style(docs): update vllm-performance-full

* style(docs): update ado

* style(docs): update optimisation-with-ray-tune

* style(docs): update random-walk

* style(docs): update datacontainer

* style(docs): update discovery-spaces
* build: enable python 3.13

* refactor(test): replace test actuatorconfiguration

* refactor(test): replace test discoveryspace

* test: support testing python 3.13

* build(sfttrainer): require python <3.13

* docs(sfttrainer): mention supported python versions

* build: replace workspace construct

* build: return most packages to workspace

* fix(test): test_custom_experiments strikes again
@danielelotito (Collaborator, Author) commented Jan 16, 2026

@michael-johnston you can have a look at the doc file I have created with docs(example): present trim to a broad public with an example.
Next Monday I can do the longer guide in which I explain how to set up parameters different from the defaults.

@AlessandroPomponio (Member) left a comment

initial comments


[project.entry-points."ado.custom_experiments"]
# This should be a Python file with your decorated function(s).
my_experiment = "trim_custom_experiments.experiments"
@AlessandroPomponio (Member): This should probably have a name

from orchestrator.utilities.environment import enable_ray_actor_coverage
from orchestrator.utilities.logging import configure_logging

PropertyFormatType = typing.Literal["observed", "target"]
@AlessandroPomponio (Member): This symbol is already defined in DiscoverySpace, but it's not available in this scope. It could be worth moving it to orchestrator/schema/property.py and importing it from there in both places

Comment on lines +40 to +42
set_this, set_prev = set(this_iteration_source_df.columns), set(
    previous_iteration_source_df.columns
)
@AlessandroPomponio (Member): Better to have these on separate lines
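
The separate-line form the comment suggests would simply be:

```python
set_this = set(this_iteration_source_df.columns)
set_prev = set(previous_iteration_source_df.columns)
```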

Comment on lines +51 to +121
common_cols = sorted(set_this & set_prev)
if not common_cols:
    logger.warning(
        "describe_source_spaces: No common columns available to compare rows."
    )
else:
    this_tuples = set(
        map(
            tuple,
            this_iteration_source_df[common_cols].itertuples(
                index=False, name=None
            ),
        )
    )
    prev_tuples = set(
        map(
            tuple,
            previous_iteration_source_df[common_cols].itertuples(
                index=False, name=None
            ),
        )
    )
    logger.info(
        f"describe_source_spaces: Common rows (on shared columns, n_cols={len(common_cols)}): "
        f"{len(this_tuples & prev_tuples)}"
    )

# common rows restricted to filter_cols
if filter_cols is not None:
    try:
        filter_cols = list(filter_cols)
    except Exception:
        logger.warning(
            "describe_source_spaces: filter_cols is not list-like; skipping filtered comparison."
        )
        filter_cols = None

if filter_cols:
    present_in_both = [
        c for c in filter_cols if c in set_this and c in set_prev
    ]
    ignored = sorted(set(filter_cols) - set(present_in_both))
    if ignored:
        logger.warning(
            f"describe_source_spaces: Some filter_cols not present in both DataFrames: {ignored}"
        )
    if not present_in_both:
        logger.warning(
            "describe_source_spaces: No valid filter_cols present in both DataFrames."
        )
    else:
        this_tuples_f = set(
            map(
                tuple,
                this_iteration_source_df[present_in_both].itertuples(
                    index=False, name=None
                ),
            )
        )
        prev_tuples_f = set(
            map(
                tuple,
                previous_iteration_source_df[present_in_both].itertuples(
                    index=False, name=None
                ),
            )
        )
        logger.info(
            f"describe_source_spaces: Common rows (restricted to filter_cols): "
            f"{len(this_tuples_f & prev_tuples_f)} | cols={present_in_both}"
        )
@AlessandroPomponio (Member): This function should definitely be revisited - leaving myself a note

Comment on lines +123 to +128
# final note if previous df is missing/empty
if previous_iteration_source_df is None or len(previous_iteration_source_df) == 0:
    logger.warning(
        "describe_source_spaces: previous_iteration_source_df is None or empty; "
        "downstream code may raise errors."
    )
@AlessandroPomponio (Member): I assume this should be the first check and if the df is empty, the function should likely just return
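
A sketch of the early-return guard the comment proposes, moved to the top of the function (warning text adapted from the original):

```python
# First check in describe_source_spaces, per the review suggestion
if previous_iteration_source_df is None or len(previous_iteration_source_df) == 0:
    logger.warning(
        "describe_source_spaces: previous_iteration_source_df is None or empty; "
        "skipping comparison."
    )
    return
```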

@AlessandroPomponio (Member): These seem debug functions, right?

Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
@michael-johnston (Member) commented

> @michael-johnston you can have a look at the doc file I have created with docs(example): present trim to a broad public with an example. Next Monday I can do the longer guide in which I explain how to set up parameters different from the defaults.

Looks good. I have some comments but they can wait; let's make it appear in the docs first:

  • add a page for it under website/docs/examples and an entry for that page to website/mkdocs.yaml (under nav -> examples)
  • include the doc in that page following the pattern of e.g. website/docs/examples/random_walk.md
  • include the example YAML directly in the doc (see examples/ml-multi-cloud/EXAMPLE_SIMPLE.md) - this avoids having to keep two copies of the same thing in sync

@michael-johnston (Member) left a comment

Initial comment: we should see if there is a way to add an "update" or "refresh" method to SQLSampleStore that triggers an update of the internal cache. This would remove the need for discovery_space_manager everywhere and simplify the code.

@AlessandroPomponio could you check if this is possible? The issue is that the cache of entities is created once on init, but if a different process adds entities to the sample store, this instance will not see them.
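
A hedged sketch of what such a method could look like; SQLSampleStore is the real class name, but the cache attribute and loader method below are hypothetical:

```python
class SQLSampleStore:
    ...  # existing __init__ builds self._entity_cache once from the database

    def refresh(self) -> None:
        """Reload the entity cache so entities added to the sample store by
        other processes become visible to this instance."""
        # _load_entities_from_db stands in for the existing query logic
        # that runs at init time
        self._entity_cache = self._load_entities_from_db()
```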

Comment on lines +95 to +133
gen = ExplicitEntitySpaceGridSampleGenerator(WalkModeEnum.SEQUENTIAL)
# VV: We need a list that contains all entities associated with this Discovery Space in its entity source
# regardless of whether an experiment (on this space or others) have measured any properties using the
# experiments that this Discovery Space references.
# To this end, we find all matching entities in the entity source which have at least 1 measurement from at least
# one of the experiments that this discovery space defines. We then fill in any missing entities by iteratively
# generating all the entities that this discovery space contains. These "backfill" entities will not be associated
# with any observed property

if discoverySpaceManager:
    logger.info("Using manager in getting entities no measurement")
    all_entities = reduce(
        # operator.add, gen.entitySpaceIterator(entitySpace=discoverySpaceManager.discoverySpace.remote()), []
        operator.add,
        gen.entitySpaceIterator(
            entitySpace=ray.get(discoverySpaceManager.entitySpace.remote())
        ),
        [],
    )
else:
    all_entities = reduce(
        operator.add, gen.entitySpaceIterator(entitySpace=space.entitySpace), []
    )

# same as:
# all_entities = sum(list(gen.entitySpaceIterator(entitySpace=space.entitySpace)), [])

cp_ids = [cp.identifier for cp in space.entitySpace.constitutiveProperties]

list_of_dicts_to_convert = []
for e in all_entities:
    ed = {}
    ed["identifier"] = e.identifier
    for cp in cp_ids:
        obj = e.valueForConstitutivePropertyIdentifier(identifier=str(cp))
        ed[str(cp)] = obj.value
    list_of_dicts_to_convert.append(ed)

return pd.DataFrame(list_of_dicts_to_convert)
@michael-johnston (Member): This can be simplified by avoiding the iterator, entities, and discovery space manager

Suggested change (the block quoted above is replaced with the following):
cp_ids = [cp.identifier for cp in space.entitySpace.constitutiveProperties]

# Use sequential_point_iterator to get all points directly and create the df without ids
df = pd.DataFrame(entity_space.sequential_point_iterator(), columns=cp_ids)

# Note: we will create a variant of this function in entity.py so it doesn't have to be done here
def generate_id(row_dict):
    # Create ConstitutivePropertyValue objects for this point
    property_values = [
        ConstitutivePropertyValue(
            value=value,
            property=ConstitutivePropertyDescriptor(identifier=cp_id),
        )
        for cp_id, value in row_dict.items()
    ]
    return Entity.identifier_from_property_values(property_values)

df["identifier"] = [generate_id(row) for row in df.to_dict("records")]
return df


Labels: ci (Enables CI integration)

8 participants