I. Overview
II. Features
III. Project Structure
IV. Getting Started
V. Project Roadmap
VI. Contributing
VII. License
VIII. Acknowledgments
This is the code repository for the research project "Think Outside the Bot: Automating Evaluation of Creativity in LLMs for Physical Reasoning with Semantic Entropy and Efficient Multi-Agent Judge".
Contains code to run an automated benchmark on the MacGyver dataset.
└── MacGyverSemanticProbing/
├── export_data.py
├── install_dependencies.sh
├── keys.py
├── LICENSE
├── llmaaj.py
├── README.md
├── readmeai-gemini-v1.md
├── readmeai-gemini.md
├── requirements.txt
├── script.bat
├── src
│ ├── benchmark.py
│ ├── dabertaMNLI.py
│ ├── data.py
│ ├── GPT_run_benchmark.py
│ ├── helper_funcs.py
│ ├── llama_funcs.py
│ ├── Llama_run_benchmark.py
│ ├── LLMevalframeworks.py
│ ├── Mixtral_run_benchmark.py
│ ├── openai_funcs.py
│ ├── process_data.py
│ ├── read_data.py
│ └── vicuna_run_benchmark.py
└── test_code
├── sample_query_Llama.py
├── sample_query_vicuna.py
└── test_llama70b.py
Benchmark
__root__
export_data.py - `export_data.py` consolidates processed data from the `src.process_data` module
- It generates a JSON file containing various evaluation metrics, including simplistic and complex scoring metrics, classification probabilities, and response lists
- The output filename is configurable via command-line arguments, allowing for flexibility in data storage
- The script's purpose is to provide a structured, readily accessible format for the project's analytical results.

install_dependencies.sh - The script automates the installation of project dependencies
- It manages environment variables, clones repositories, installs Python packages (including llama-cpp-python, transformers, and others) using pip, and verifies CUDA installation
- The process ensures the project's runtime environment is correctly configured for execution, leveraging both system and user-specified locations for caching and configuration files.

keys.py - keys.py establishes secure connections to external services
- It initializes OpenAI and Hugging Face API clients, providing authentication credentials for interaction with their respective platforms
- This facilitates access to large language models and other resources within the broader project architecture.
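A minimal sketch of the kind of client setup keys.py is described as performing; the environment-variable names and client usage below are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch of keys.py-style client setup; env var names are assumed.
import os

from openai import OpenAI
from huggingface_hub import login

# Read credentials from environment variables rather than hard-coding them.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
HF_TOKEN = os.environ["HF_TOKEN"]

# OpenAI client used for generation and judging calls elsewhere in the project.
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# Authenticate with the Hugging Face Hub so gated models can be downloaded.
login(token=HF_TOKEN)
```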
llmaaj.py - The `llmaaj.py` file acts as a setup and data preparation module within a larger project (likely involving large language models)
- It authenticates with the Hugging Face Hub, imports necessary libraries (including those for interacting with OpenAI and processing data), and prepares a Pandas DataFrame from external Excel files containing problem-solution pairs
- This prepared data, specifically a subset of efficient/inefficient/infeasible solutions, is then passed to subsequent modules, including factuality checks performed via OpenAI's API
- In essence, this file sets the stage for downstream tasks by handling authentication and data loading/preprocessing.
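A small sketch of the data preparation described above; the file name, column name, and labels are assumptions for illustration only:

```python
# Hypothetical sketch of llmaaj.py-style data preparation; paths and columns are assumed.
import pandas as pd

# Load problem-solution pairs from an external Excel file.
df = pd.read_excel("macgyver_solutions.xlsx")

# Keep only the rows labelled as efficient, inefficient, or infeasible solutions.
labels_of_interest = ["efficient", "inefficient", "infeasible"]
subset = df[df["label"].isin(labels_of_interest)].reset_index(drop=True)

print(subset.head())
```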
requirements.txt - `requirements.txt` specifies the project's dependencies
- It lists all external Python packages required for the application to function correctly, including libraries for natural language processing, machine learning, data manipulation, and web requests
- These packages enable the project's core functionalities.

script.bat - The script automates the setup of a machine learning environment
- It clones a specified Git repository, installs necessary Python packages including those for large language models and CUDA support, and verifies CUDA installation
- The process ensures the project's dependencies are correctly configured for execution, streamlining the development workflow.
src
benchmark.py - The benchmark script facilitates multi-step problem-solving using various large language models (LLMs)
- It iteratively generates solutions for multiple problems, selecting the highest-probability step at each iteration
- The script supports different LLMs and incorporates a MacGyver-style problem-solving prompt, recording probabilities and hidden states for analysis
- Results are stored for further evaluation.
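A heavily simplified outline of the iterative loop described above; `generate_candidates` and `sequence_logprob` are placeholder stand-ins for the repository's actual LLM calls, not its real functions:

```python
# Hypothetical outline of the multi-step benchmark loop; helper functions are placeholders.
import random

def generate_candidates(prompt: str, n: int):            # placeholder LLM sampling
    return [f"Candidate step {i}" for i in range(n)]

def sequence_logprob(prompt: str, step: str) -> float:   # placeholder scoring
    return -random.random()

def solve_problem(problem_prompt: str, max_steps: int = 5, n_candidates: int = 5):
    steps, step_probs = [], []
    prompt = problem_prompt
    for _ in range(max_steps):
        candidates = generate_candidates(prompt, n_candidates)
        # Keep the candidate step with the highest (log-)probability.
        best_prob, best_step = max((sequence_logprob(prompt, c), c) for c in candidates)
        steps.append(best_step)
        step_probs.append(best_prob)
        # Append the chosen step so the next iteration conditions on it.
        prompt += "\n" + best_step
    return steps, step_probs
```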
dabertaMNLI.py - The `dabertaMNLI.py` module provides natural language inference (NLI) capabilities
- It leverages a pre-trained DeBERTa model to classify the relationship between two text snippets (hypothesis and premise) as entailment, contradiction, or neutral
- The module offers functions to retrieve both the classification label and associated probability scores, facilitating NLI tasks within the broader project.
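A minimal sketch of DeBERTa-based NLI classification along these lines, using the Hugging Face `transformers` API; the checkpoint name and function signature are assumptions, not the module's actual interface:

```python
# Hypothetical sketch of DeBERTa MNLI classification; checkpoint and names are assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_label_and_probs(premise: str, hypothesis: str):
    """Return the predicted relation (entailment/neutral/contradiction) and class probabilities."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    label = model.config.id2label[int(probs.argmax())]
    return label, probs.tolist()
```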
data.py - The `data.py` script preprocesses a dataset of problem-solution pairs
- It downloads data, formats it for a MacGyver-style problem-solving task, creating prompts that challenge a model to generate creative, single-step solutions
- The script filters for solvable problems, shuffles the data, and prepares it for model training or evaluation within the larger project.

GPT_run_benchmark.py - The `GPT_run_benchmark.py` file serves as a benchmark script within a larger project (likely involving AI problem-solving)
- It utilizes a large language model (LLM), likely via the `llama_funcs` module (indicated by the import statement), to generate sequential steps towards solving a problem presented as a prompt
- The script focuses on evaluating the LLM's ability to produce concise, creative, and effective solutions within a constrained number of steps
- The code's purpose is to test and measure the performance of this problem-solving approach.

helper_funcs.py - The `src/helper_funcs.py` file provides a collection of utility functions used throughout the larger project
- These functions, drawing on other modules like `src.openai_funcs` and `src.data`, facilitate tasks such as text generation (using models like GPT), factuality assessment, and potentially entailment analysis
- The file also includes functions for evaluating model performance using metrics like ROC AUC and accuracy
- In essence, it acts as a central repository of reusable helper functions supporting the core functionalities of the project.
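As an illustration of the evaluation side of such helpers, a small sketch of ROC AUC and accuracy computation with scikit-learn; the function and variable names are assumptions, not the repository's actual code:

```python
# Hypothetical sketch of metric helpers; the labels and scores below are illustrative only.
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_predictions(true_labels, predicted_probs, threshold: float = 0.5):
    """Compute ROC AUC on raw scores and accuracy on thresholded predictions."""
    auc = roc_auc_score(true_labels, predicted_probs)
    predictions = [int(p >= threshold) for p in predicted_probs]
    acc = accuracy_score(true_labels, predictions)
    return {"roc_auc": auc, "accuracy": acc}

# Example usage with dummy data.
print(evaluate_predictions([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.4]))
```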
llama_funcs.py - The `llama_funcs.py` file serves as the core logic for interacting with large language models (LLMs), likely within a larger application
- It imports necessary libraries for interacting with Hugging Face models (via the `transformers` library) and manages parameters such as temperature and top-p for controlling LLM generation
- The file appears to offer command-line argument parsing to customize these parameters, suggesting flexibility in how the LLMs are used within the broader project
- The use of environment variables (e.g., `HF_TOKEN`) indicates integration with a Hugging Face account for model access
- In short, this file acts as the interface between the application and the chosen LLMs, handling model selection, parameter configuration, and generation requests.
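A small sketch of the kind of command-line parameter handling described above; the flag names, defaults, and model identifier are assumptions for illustration:

```python
# Hypothetical sketch of generation-parameter parsing; flag names and defaults are assumed.
import argparse
import os

parser = argparse.ArgumentParser(description="Configure LLM generation parameters.")
parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
parser.add_argument("--top-p", type=float, default=0.9, help="Nucleus-sampling threshold")
parser.add_argument("--model-name", type=str, default="meta-llama/Llama-2-7b-chat-hf",
                    help="Hugging Face model identifier (illustrative default)")
args = parser.parse_args()

# The Hugging Face token is read from the environment, as the module description suggests.
hf_token = os.environ.get("HF_TOKEN")
print(args.temperature, args.top_p, args.model_name, bool(hf_token))
```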
Llama_run_benchmark.py - `Llama_run_benchmark.py` serves as a benchmark script within a larger project focused on problem-solving using a large language model (likely Llama)
- It utilizes functions from other modules (indicated by the imports) to generate and evaluate solutions to a problem, presented as a multi-step challenge to the model
- The script's core purpose is to test and measure the model's ability to devise efficient, feasible solutions step-by-step, mimicking a MacGyver-like approach
- The benchmark likely assesses the model's performance based on the number of steps required to reach a solution and the quality of each step generated.

LLMevalframeworks.py - The `LLMevalframeworks.py` file provides a testing framework for the OpenAI interaction component within a larger project
- It uses the `openai_funcs` module (presumably containing functions to interact with the OpenAI API), and imports a vector database (ChromaDB) and sentence embeddings (SentenceTransformer), although the latter two are not exercised by the test itself
- The primary function, `test_openai()`, demonstrates a basic interaction with the OpenAI API, verifying a simple question-answering capability
- The inclusion of a safety definition string suggests a broader project focus on evaluating the safety of AI-generated responses, although this test does not exercise that functionality.
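A minimal sketch of a `test_openai()`-style round-trip check against the OpenAI chat API; the model name, prompt, and expected answer are assumptions for illustration:

```python
# Hypothetical sketch of a basic OpenAI round-trip test; model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def test_openai():
    """Ask a trivial question and check that the reply contains the expected answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
    answer = response.choices[0].message.content
    assert "Paris" in answer, f"Unexpected answer: {answer}"

if __name__ == "__main__":
    test_openai()
```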
Mixtral_run_benchmark.py - The script runs benchmarks on a MacGyver problem-solving model
- It iteratively generates multi-step solutions, using a large language model to propose each step
- The process involves selecting the most probable solution at each step and refining the prompt for subsequent steps
- The script manages multiple problems and steps, recording probabilities and intermediate results for analysis
- Output includes the generated solutions and associated probabilities.

openai_funcs.py ❯ Utility functions for interacting with the OpenAI API, used by the helper and evaluation modules.
process_data.py ❯ Processes raw benchmark output into the evaluation metrics consolidated by `export_data.py`.
read_data.py ❯ Reads and loads the MacGyver problem data used by the benchmark scripts.
vicuna_run_benchmark.py ❯ Benchmark script for the Vicuna model, analogous to the other `*_run_benchmark.py` scripts.
test_code
sample_query_Llama.py ❯ Test code for Llama models.
sample_query_vicuna.py ❯ Test code for Vicuna models.
test_llama70b.py ❯ Test code for Llama 70B models.
Before getting started with MacGyverSemanticProbing, ensure your runtime environment meets the following requirements:
- Programming Language: Python
- Package Manager: Pip
Install MacGyverSemanticProbing using one of the following methods:
Build from source:
- Clone the MacGyverSemanticProbing repository:
❯ git clone <MacGyverSemanticProbing repository URL>
- Navigate to the project directory:
❯ cd MacGyverSemanticProbing
- Install the project dependencies:
❯ pip install -r requirements.txt
Run the benchmark using the following command:
python export_data.py <model_name> <json_filename> <factuality_judgement> <entailment_model> <LLMjudge> <temperature> <num_questions> <output_hiddenstates> <starting_problem_number>
The positional arguments are:
- model_name: the model to benchmark
- json_filename: name of the output JSON file
- factuality_judgement: chateval or llmjudge
- entailment_model: gpt4 or deberta
- LLMjudge: true or false
- temperature: sampling temperature of the model
- num_questions: number of questions to run the benchmark on
- output_hiddenstates: true or false
- starting_problem_number: index of the first problem to run
An example invocation is shown after the notes below.
Things to note:
- LLMjudge should be set to false, as the feature is deprecated.
- output_hiddenstates should be set to false, to prevent massive output file sizes.
- Entailment model should be set to deberta, as GPT-4o entailment consumes a large amount of credits.
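For example, a hypothetical run following the recommendations above (the model name string and other values are illustrative; consult the script's argument parsing for the exact accepted names):
❯ python export_data.py gpt4 results.json chateval deberta false 0.7 50 false 0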
TBC
- 💬 Join the Discussions: Share your insights, provide feedback, or ask questions.
- 🐛 Report Issues: Submit bugs found or log feature requests for the MacGyverSemanticProbing project.
- 💡 Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your own account.
- Clone Locally: Clone the forked repository to your local machine using a git client.
git clone <URL of your forked MacGyverSemanticProbing repository>
- Create a New Branch: Always work on a new branch, giving it a descriptive name.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.
git commit -m 'Implemented new feature x.'
- Push to Your Fork: Push the changes to your forked repository.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
This project is licensed under the MIT License. For more details, refer to the LICENSE file.
Main icon provided by Freepik.