Milestones

  • We don't yet know what all of the official OpenAI evals are. We should make a list of them here, open an issue for each eval, and then list all of the results here. Once all of the official OpenAI evals are running with results for AutoGPT, we can check this milestone off as done. A sketch of wrapping AutoGPT as an evals completion function appears after this list.

    No due date
  • AutoGPT can write code, and we want a deterministic way of evaluating this skill. This milestone covers building an eval that has AutoGPT fix or write some code from scratch and test it. After AutoGPT has been shut down, the eval should exfiltrate that code from the workspace (or from the same container) and run a variety of tests that AutoGPT cannot see against the code, to check whether AutoGPT can successfully write code to a specification. We would then add functionality that tells the agent which tests it failed and lets it iterate, aiming to minimize the number of iterations the agent needs. A sketch of the hidden-test grading stage also appears after this list.

    - [ ] Get the agent to write code in a container.
    - [ ] Get the agent to write and execute tests against code in a container. Preferably, these are not in the same file...
    - [ ] Exfiltrate this code to a place the agent cannot see it, and then run a standardized set of tests the agent never sees against that code.
    - [ ] Put this entire pipeline in the OpenAI evals framework if feasible. If we have to build our own eval system to support this outside of the OpenAI eval spec, that is ok.

    No due date
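
To run the official OpenAI evals against AutoGPT, the agent has to be exposed to the evals framework as a completion function. Below is a minimal sketch, assuming the CompletionFn/CompletionResult protocol from the openai/evals package; `run_autogpt_task` is a hypothetical helper standing in for however an AutoGPT run is actually started and its final answer collected.

```python
# Hedged sketch: exposing AutoGPT to the OpenAI evals framework as a
# completion function. Assumes the CompletionFn/CompletionResult protocol
# from the openai/evals package; run_autogpt_task is a hypothetical stand-in
# for starting an AutoGPT run and collecting its final answer.
from typing import Any, Union

from evals.api import CompletionFn, CompletionResult


def run_autogpt_task(task: str) -> str:
    """Hypothetical entry point: run AutoGPT on `task` and return its final answer."""
    raise NotImplementedError("wire this up to an actual AutoGPT run")


class AutoGPTCompletionResult(CompletionResult):
    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        # evals expects a list of candidate completions; AutoGPT produces one.
        return [self.response.strip()]


class AutoGPTCompletionFn(CompletionFn):
    def __call__(
        self, prompt: Union[str, list[dict[str, str]]], **kwargs: Any
    ) -> AutoGPTCompletionResult:
        # Official evals may pass chat-style prompts; flatten them into one task description.
        if isinstance(prompt, list):
            prompt = "\n".join(m.get("content", "") for m in prompt)
        return AutoGPTCompletionResult(run_autogpt_task(prompt))
```

Once a completion function along these lines is registered in an evals registry, each official eval could in principle be run with the `oaieval` CLI (roughly `oaieval <completion_fn_name> <eval_name>`) and the result recorded against its issue in this milestone.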
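
For the coding milestone, the grading step that runs after AutoGPT shuts down could look roughly like the sketch below: copy the code out of the agent's workspace to a location the agent never sees, run a standardized hidden test suite against it, and capture a failure summary to hand back to the agent on its next iteration. The paths and the pytest-based grading here are assumptions for illustration, not the final pipeline.

```python
# Hedged sketch of the hidden-test stage: after AutoGPT shuts down, copy the
# code out of its workspace to a location the agent never sees, run a
# standardized pytest suite against it, and report which tests failed so the
# agent can be given that feedback on the next iteration.
import json
import shutil
import subprocess
import sys
from pathlib import Path

WORKSPACE = Path("auto_gpt_workspace")   # where the agent wrote its code (assumed)
HIDDEN_DIR = Path("/tmp/eval_grading")   # location the agent cannot see (assumed)
HIDDEN_TESTS = Path("hidden_tests")      # standardized tests the agent never sees (assumed)


def grade_submission() -> dict:
    # Exfiltrate the agent's code and the hidden tests into a fresh grading directory.
    if HIDDEN_DIR.exists():
        shutil.rmtree(HIDDEN_DIR)
    shutil.copytree(WORKSPACE, HIDDEN_DIR / "submission")
    shutil.copytree(HIDDEN_TESTS, HIDDEN_DIR / "tests")

    # Run the hidden tests against the exfiltrated code.
    result = subprocess.run(
        [sys.executable, "-m", "pytest", str(HIDDEN_DIR / "tests"), "--tb=no", "-q"],
        cwd=HIDDEN_DIR / "submission",
        capture_output=True,
        text=True,
    )
    passed = result.returncode == 0
    # The failure summary is what gets fed back to the agent on its next iteration.
    return {"passed": passed, "feedback": result.stdout[-2000:]}


if __name__ == "__main__":
    print(json.dumps(grade_submission(), indent=2))
```

Because the tests live outside the workspace and only the failure summary is returned, the agent cannot tailor its code to the hidden suite, and the number of iterations needed to pass becomes the metric to minimize.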