Truffaldino

The name comes from Truffaldino, the servant at the centre of Carlo Goldoni's Italian comedy The Servant of Two Masters.

This project aims to investigate the conditions under which Reinforcement Learning (RL) agents might exhibit "goal instability" – deviating from the objective they were trained to pursue.

Hypothesis

Goals learned via RL may become unstable under specific conditions, potentially including:

Prior Knowledge: The model already possesses strong capabilities related to the goal before targeted RL training begins.
Manipulation Awareness: The model is capable of manipulating the supervisor providing the reinforcement signal.
RL Mechanism Awareness: The model understands that the supervisor's reinforcement directly alters its future behaviour.
Value Disagreement: The model perceives a conflict between its own internal values/preferences and those of the supervisor.

Experimental Setup

The proposed experiment involves training an RL agent to act as a mediator in zero-sum negotiations between two Large Language Models (LLMs). Examples of negotiation scenarios include:

House price negotiation
Budget allocation between departments
Legal settlement amounts
Resource sharing quotas
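To make "zero-sum" concrete, here is a minimal sketch of how a single-issue scenario such as the house price negotiation could be scored. The class name `PriceNegotiation`, its fields, and the midpoint-based payoff are illustrative assumptions, not the project's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceNegotiation:
    """Hypothetical single-issue scenario, e.g. a house sale."""

    seller_floor: float   # lowest price the seller would accept (assumed)
    buyer_ceiling: float  # highest price the buyer would pay (assumed)

    def payoffs(self, price: float) -> tuple[float, float]:
        """Return (seller, buyer) payoffs for an agreed price.

        Payoffs are measured relative to the midpoint of the bargaining
        range, so whatever one side gains the other side loses.
        """
        if not self.seller_floor <= price <= self.buyer_ceiling:
            raise ValueError("price lies outside the bargaining range")
        midpoint = (self.seller_floor + self.buyer_ceiling) / 2
        seller = price - midpoint
        return seller, -seller


house = PriceNegotiation(seller_floor=300_000, buyer_ceiling=360_000)
seller_payoff, buyer_payoff = house.payoffs(345_000)
print(seller_payoff, buyer_payoff)
```

Under this scoring, any shift in the final price that a mediator brings about shows up directly as a gain for one party and an equal loss for the other.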

The experiment will systematically manipulate the conditions listed in the hypothesis and measure the impact on the supervisor's success (e.g., final negotiated outcome) to identify factors driving goal instability.
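One way to picture the systematic manipulation is as a full factorial grid over the four hypothesised conditions. The sketch below assumes each condition is a simple on/off switch; the `Conditions` class and `condition_grid` helper are hypothetical names, and the project may vary these conditions in a more graded way.

```python
from dataclasses import dataclass, fields
from itertools import product


@dataclass(frozen=True)
class Conditions:
    """Hypothetical on/off flags, one per condition in the hypothesis."""

    prior_knowledge: bool
    manipulation_awareness: bool
    rl_mechanism_awareness: bool
    value_disagreement: bool


def condition_grid() -> list[Conditions]:
    """Enumerate every combination of the binary conditions."""
    n_flags = len(fields(Conditions))
    return [
        Conditions(*flags)
        for flags in product([False, True], repeat=n_flags)
    ]


grid = condition_grid()
print(len(grid))  # 2**4 = 16 experimental configurations
```

Running the same set of negotiation scenarios under each configuration would then allow outcomes to be compared while changing one condition at a time.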

We try to make the negotiation games reasonably realistic, because language models trained on large, diverse datasets may carry implicit knowledge that makes them behave differently in realistic and unrealistic situations.

EleutherAI/truffaldino
