I. Overview
II. Features
III. Project Structure
IV. Getting Started
V. Project Roadmap
VI. Contributing
VII. License
VIII. Acknowledgments
This is the code repository for the research project "Think Outside the Bot: Automating Evaluation of Creativity in LLMs for Physical Reasoning with Semantic Entropy and Efficient Multi-Agent Judge".
Contains code to run an automated benchmark on the MacGyver dataset.
└── MacGyverSemanticProbing/
├── export_data.py
├── install_dependencies.sh
├── keys.py
├── LICENSE
├── llmaaj.py
├── README.md
├── readmeai-gemini-v1.md
├── readmeai-gemini.md
├── requirements.txt
├── script.bat
├── src
│ ├── benchmark.py
│ ├── dabertaMNLI.py
│ ├── data.py
│ ├── GPT_run_benchmark.py
│ ├── helper_funcs.py
│ ├── llama_funcs.py
│ ├── Llama_run_benchmark.py
│ ├── LLMevalframeworks.py
│ ├── Mixtral_run_benchmark.py
│ ├── openai_funcs.py
│ ├── process_data.py
│ ├── read_data.py
│ └── vicuna_run_benchmark.py
└── test_code
├── sample_query_Llama.py
├── sample_query_vicuna.py
└── test_llama70b.py
Benchmark
__root__
export_data.py - `export_data.py` consolidates processed data from the `src.process_data` module
- It generates a JSON file containing various evaluation metrics, including simplistic and complex scoring metrics, classification probabilities, and response lists
- The output filename is configurable via command-line arguments, allowing for flexibility in data storage
- The script's purpose is to provide a structured, readily accessible format for the project's analytical results.

install_dependencies.sh - The script automates the installation of project dependencies
- It manages environment variables, clones repositories, installs Python packages (including llama-cpp-python, transformers, and others) using pip, and verifies CUDA installation
- The process ensures the project's runtime environment is correctly configured for execution, leveraging both system and user-specified locations for caching and configuration files.

keys.py - keys.py establishes secure connections to external services
- It initializes OpenAI and Hugging Face API clients, providing authentication credentials for interaction with their respective platforms
- This facilitates access to large language models and other resources within the broader project architecture.
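A minimal sketch of the kind of client setup keys.py is described as performing; the environment-variable names and client usage below are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch of keys.py-style client setup; env var names are assumed.
import os

from openai import OpenAI
from huggingface_hub import login

# Read credentials from environment variables rather than hard-coding them.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
HF_TOKEN = os.environ["HF_TOKEN"]

# OpenAI client used for generation and judging calls elsewhere in the project.
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# Authenticate with the Hugging Face Hub so gated models can be downloaded.
login(token=HF_TOKEN)
```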
llmaaj.py - The `llmaaj.py` file acts as a setup and data preparation module within a larger project (likely involving large language models)
- It authenticates with the Hugging Face Hub, imports necessary libraries (including those for interacting with OpenAI and processing data), and prepares a Pandas DataFrame from external Excel files containing problem-solution pairs
- This prepared data, specifically a subset of efficient/inefficient/infeasible solutions, is then passed to subsequent modules, including factuality checks performed via OpenAI's API
- In essence, this file sets the stage for downstream tasks by handling authentication and data loading/preprocessing.
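A small sketch of the data preparation described above; the file name, column name, and labels are assumptions for illustration only:

```python
# Hypothetical sketch of llmaaj.py-style data preparation; paths and columns are assumed.
import pandas as pd

# Load problem-solution pairs from an external Excel file.
df = pd.read_excel("macgyver_solutions.xlsx")

# Keep only the rows labelled as efficient, inefficient, or infeasible solutions.
labels_of_interest = ["efficient", "inefficient", "infeasible"]
subset = df[df["label"].isin(labels_of_interest)].reset_index(drop=True)

print(subset.head())
```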
requirements.txt - `requirements.txt` specifies the project's dependencies
- It lists all external Python packages required for the application to function correctly, including libraries for natural language processing, machine learning, data manipulation, and web requests
- These packages enable the project's core functionalities.

script.bat - The script automates the setup of a machine learning environment
- It clones a specified Git repository, installs necessary Python packages including those for large language models and CUDA support, and verifies CUDA installation
- The process ensures the project's dependencies are correctly configured for execution, streamlining the development workflow.
src
benchmark.py - The benchmark script facilitates multi-step problem-solving using various large language models (LLMs)
- It iteratively generates solutions for multiple problems, selecting the highest-probability step at each iteration
- The script supports different LLMs and incorporates a MacGyver-style problem-solving prompt, recording probabilities and hidden states for analysis
- Results are stored for further evaluation.
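A heavily simplified outline of the iterative loop described above; `generate_candidates` and `sequence_logprob` are placeholder stand-ins for the repository's actual LLM calls, not its real functions:

```python
# Hypothetical outline of the multi-step benchmark loop; helper functions are placeholders.
import random

def generate_candidates(prompt: str, n: int):            # placeholder LLM sampling
    return [f"Candidate step {i}" for i in range(n)]

def sequence_logprob(prompt: str, step: str) -> float:   # placeholder scoring
    return -random.random()

def solve_problem(problem_prompt: str, max_steps: int = 5, n_candidates: int = 5):
    steps, step_probs = [], []
    prompt = problem_prompt
    for _ in range(max_steps):
        candidates = generate_candidates(prompt, n_candidates)
        # Keep the candidate step with the highest (log-)probability.
        best_prob, best_step = max((sequence_logprob(prompt, c), c) for c in candidates)
        steps.append(best_step)
        step_probs.append(best_prob)
        # Append the chosen step so the next iteration conditions on it.
        prompt += "\n" + best_step
    return steps, step_probs
```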
dabertaMNLI.py - The `dabertaMNLI.py` module provides natural language inference (NLI) capabilities
- It leverages a pre-trained DeBERTa model to classify the relationship between two text snippets (hypothesis and premise) as entailment, contradiction, or neutral
- The module offers functions to retrieve both the classification label and associated probability scores, facilitating NLI tasks within the broader project.
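A minimal sketch of DeBERTa-based NLI classification along these lines, using the Hugging Face `transformers` API; the checkpoint name and function signature are assumptions, not the module's actual interface:

```python
# Hypothetical sketch of DeBERTa MNLI classification; checkpoint and names are assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_label_and_probs(premise: str, hypothesis: str):
    """Return the predicted relation (entailment/neutral/contradiction) and class probabilities."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    label = model.config.id2label[int(probs.argmax())]
    return label, probs.tolist()
```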
data.py - The `data.py` script preprocesses a dataset of problem-solution pairs
- It downloads data, formats it for a MacGyver-style problem-solving task, creating prompts that challenge a model to generate creative, single-step solutions
- The script filters for solvable problems, shuffles the data, and prepares it for model training or evaluation within the larger project.

GPT_run_benchmark.py - The `GPT_run_benchmark.py` file serves as a benchmark script within a larger project (likely involving AI problem-solving)
- It utilizes a large language model (LLM), likely via the `llama_funcs` module (indicated by the import statement), to generate sequential steps towards solving a problem presented as a prompt
- The script focuses on evaluating the LLM's ability to produce concise, creative, and effective solutions within a constrained number of steps
- The code's purpose is to test and measure the performance of this problem-solving approach.

helper_funcs.py - The `src/helper_funcs.py` file provides a collection of utility functions used throughout the larger project
- These functions, drawing on other modules like `src.openai_funcs` and `src.data`, facilitate tasks such as text generation (using models like GPT), factuality assessment, and potentially entailment analysis
- The file also includes functions for evaluating model performance using metrics like ROC AUC and accuracy
- In essence, it acts as a central repository of reusable helper functions supporting the core functionalities of the project.
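As an illustration of the evaluation side of such helpers, a small sketch of ROC AUC and accuracy computation with scikit-learn; the function and variable names are assumptions, not the repository's actual code:

```python
# Hypothetical sketch of metric helpers; the labels and scores below are illustrative only.
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_predictions(true_labels, predicted_probs, threshold: float = 0.5):
    """Compute ROC AUC on raw scores and accuracy on thresholded predictions."""
    auc = roc_auc_score(true_labels, predicted_probs)
    predictions = [int(p >= threshold) for p in predicted_probs]
    acc = accuracy_score(true_labels, predictions)
    return {"roc_auc": auc, "accuracy": acc}

# Example usage with dummy data.
print(evaluate_predictions([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.4]))
```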
llama_funcs.py - The `llama_funcs.py` file serves as the core logic for interacting with large language models (LLMs), likely within a larger application
- It imports necessary libraries for interacting with Hugging Face models (via the `transformers` library) and manages parameters such as temperature and top-p for controlling LLM generation
- The file appears to offer command-line argument parsing to customize these parameters, suggesting flexibility in how the LLMs are used within the broader project
- The use of environment variables (e.g., `HF_TOKEN`) indicates integration with a Hugging Face account for model access
- In short, this file acts as the interface between the application and the chosen LLMs, handling model selection, parameter configuration, and generation requests.
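A small sketch of the kind of command-line parameter handling described above; the flag names, defaults, and model identifier are assumptions for illustration:

```python
# Hypothetical sketch of generation-parameter parsing; flag names and defaults are assumed.
import argparse
import os

parser = argparse.ArgumentParser(description="Configure LLM generation parameters.")
parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
parser.add_argument("--top-p", type=float, default=0.9, help="Nucleus-sampling threshold")
parser.add_argument("--model-name", type=str, default="meta-llama/Llama-2-7b-chat-hf",
                    help="Hugging Face model identifier (illustrative default)")
args = parser.parse_args()

# The Hugging Face token is read from the environment, as the module description suggests.
hf_token = os.environ.get("HF_TOKEN")
print(args.temperature, args.top_p, args.model_name, bool(hf_token))
```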
Llama_run_benchmark.py - `Llama_run_benchmark.py` serves as a benchmark script within a larger project focused on problem-solving using a large language model (likely Llama)
- It utilizes functions from other modules (indicated by the imports) to generate and evaluate solutions to a problem, presented as a multi-step challenge to the model
- The script's core purpose is to test and measure the model's ability to devise efficient, feasible solutions step-by-step, mimicking a MacGyver-like approach
- The benchmark likely assesses the model's performance based on the number of steps required to reach a solution and the quality of each step generated.

LLMevalframeworks.py - The `LLMevalframeworks.py` file provides a testing framework for the OpenAI interaction component within a larger project
- It uses the `openai_funcs` module (presumably containing functions to interact with the OpenAI API), and imports a vector database (ChromaDB) and sentence embeddings (SentenceTransformer), although the latter two are not exercised by the test itself
- The primary function, `test_openai()`, demonstrates a basic interaction with the OpenAI API, verifying a simple question-answering capability
- The inclusion of a safety definition string suggests a broader project focus on evaluating the safety of AI-generated responses, although this test does not exercise that functionality.
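A minimal sketch of a `test_openai()`-style round-trip check against the OpenAI chat API; the model name, prompt, and expected answer are assumptions for illustration:

```python
# Hypothetical sketch of a basic OpenAI round-trip test; model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def test_openai():
    """Ask a trivial question and check that the reply contains the expected answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
    answer = response.choices[0].message.content
    assert "Paris" in answer, f"Unexpected answer: {answer}"

if __name__ == "__main__":
    test_openai()
```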
Mixtral_run_benchmark.py - The script runs benchmarks on a MacGyver problem-solving model
- It iteratively generates multi-step solutions, using a large language model to propose each step
- The process involves selecting the most probable solution at each step and refining the prompt for subsequent steps
- The script manages multiple problems and steps, recording probabilities and intermediate results for analysis
- Output includes the generated solutions and associated probabilities.

openai_funcs.py ❯ Utility functions for interacting with the OpenAI API, used by the helper and evaluation modules.
process_data.py ❯ Processes raw benchmark output into the evaluation metrics consolidated by `export_data.py`.
read_data.py ❯ Reads and loads the MacGyver problem data used by the benchmark scripts.
vicuna_run_benchmark.py ❯ Benchmark script for the Vicuna model, analogous to the other `*_run_benchmark.py` scripts.
test_code
sample_query_Llama.py ❯ Test code for Llama models.
sample_query_vicuna.py ❯ Test code for Vicuna models.
test_llama70b.py ❯ Test code for Llama 70B models.
Before getting started with MacGyverSemanticProbing, ensure your runtime environment meets the following requirements:
- Programming Language: Python
- Package Manager: Pip
Install MacGyverSemanticProbing using one of the following methods:
Build from source:
- Clone the MacGyverSemanticProbing repository:
❯ git clone <MacGyverSemanticProbing repository URL>
- Navigate to the project directory:
❯ cd MacGyverSemanticProbing
- Install the project dependencies:
❯ pip install -r requirements.txt
Run the benchmark using the following command:
python export_data.py <model_name> <json_filename> <factuality_judgement> <entailment_model> <LLMjudge> <temperature> <num_questions> <output_hiddenstates> <starting_problem_number>
The positional arguments are:
- model_name: the model to benchmark
- json_filename: name of the output JSON file
- factuality_judgement: chateval or llmjudge
- entailment_model: gpt4 or deberta
- LLMjudge: true or false
- temperature: sampling temperature of the model
- num_questions: number of questions to run the benchmark on
- output_hiddenstates: true or false
- starting_problem_number: index of the first problem to run
An example invocation is shown after the notes below.
Things to note:
- LLMjudge should be set to false, as the feature is deprecated.
- output_hiddenstates should be set to false, to prevent massive output file sizes.
- Entailment model should be set to deberta, as GPT-4o entailment consumes a large amount of credits.
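For example, a hypothetical run following the recommendations above (the model name string and other values are illustrative; consult the script's argument parsing for the exact accepted names):
❯ python export_data.py gpt4 results.json chateval deberta false 0.7 50 false 0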
TBC
- 💬 Join the Discussions: Share your insights, provide feedback, or ask questions.
- 🐛 Report Issues: Submit bugs found or log feature requests for the MacGyverSemanticProbing project.
- 💡 Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your own account.
- Clone Locally: Clone the forked repository to your local machine using a git client.
git clone <URL of your forked MacGyverSemanticProbing repository>
- Create a New Branch: Always work on a new branch, giving it a descriptive name.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.
git commit -m 'Implemented new feature x.'
- Push to Your Fork: Push the changes to your forked repository.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
This project is licensed under the MIT License. For more details, refer to the LICENSE file.
Main icon provided by Freepik.