Teaching Large Language Models how to Code

Author: Pierre Loertscher
Course: MACS 37005 – AI Agents for Social Science and Society, University of Chicago
Date: March 13, 2026

Overview

This project investigates how multi-agent LLM systems and automated prompt optimisation techniques can teach large language models to perform structured information extraction, specifically the extraction of subject-verb relationships ("motifs") from social-media texts about Donald Trump.

The pipeline proceeds through four stages:

AutoGen multi-agent prompt engineering (Stages 1–3): Four specialised prompt-engineer agents each analyse an exclusive slice of annotated data, produce extraction prompts, then converge (with human-in-the-loop feedback) on a single best prompt via a coordinator agent.
GRPO reinforcement learning (separate notebook): A Llama-3.1-8B-Instruct model is fine-tuned with Group Relative Policy Optimisation to reward structured JSON output and penalise verbose completions.
DSPy optimisation: The AutoGen-derived prompt seeds BootstrapFewShot, MIPROv2, BootstrapFinetune, and BetterTogether optimisers on both OpenAI and the RL-trained Llama backend.
SpaCy MCP tools: NER and dependency-parsing tools are exposed to the LLM via a ReAct loop (2-call-per-document safety limit, full audit logging), allowing the model to ground extraction decisions in syntactic structure.
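The per-document call limit and audit logging mentioned for the SpaCy tools can be pictured with a small wrapper. This is a minimal sketch, not the code in `autogen_pipeline.py` or `dspy_pipeline.py`: the names `AuditedToolbox`, `ToolBudgetExceeded`, and `toy_ner` are hypothetical, and `toy_ner` is a stand-in for a real spaCy-backed tool.

```python
import time
from typing import Any, Callable, Dict, List


class ToolBudgetExceeded(RuntimeError):
    """Raised when a document has used up its allowed tool calls."""


class AuditedToolbox:
    """Wraps tool functions with a per-document call limit and an audit log."""

    def __init__(self, tools: Dict[str, Callable[[str], Any]], max_calls_per_doc: int = 2):
        self.tools = tools
        self.max_calls_per_doc = max_calls_per_doc
        self.calls_used: Dict[str, int] = {}
        self.audit_log: List[Dict[str, Any]] = []

    def call(self, doc_id: str, tool_name: str, text: str) -> Any:
        used = self.calls_used.get(doc_id, 0)
        allowed = used < self.max_calls_per_doc
        # Every attempt is logged, including the ones that are refused.
        entry: Dict[str, Any] = {
            "ts": time.time(),
            "doc_id": doc_id,
            "tool": tool_name,
            "allowed": allowed,
        }
        self.audit_log.append(entry)
        if not allowed:
            raise ToolBudgetExceeded(
                f"{doc_id}: limit of {self.max_calls_per_doc} tool calls reached"
            )
        tool = self.tools[tool_name]  # KeyError for unknown tools, budget untouched
        self.calls_used[doc_id] = used + 1
        result = tool(text)
        entry["result"] = result
        return result


def toy_ner(text: str) -> List[str]:
    """Stand-in for a real NER tool: returns capitalised tokens only."""
    return [tok for tok in text.split() if tok[:1].isupper()]
```

In an agent loop, a `ToolBudgetExceeded` error would be the signal to stop calling tools and produce the final extraction from what has already been gathered.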

Repository Structure

ai_agents_project/
├── full_project.ipynb       # Main notebook: Stages A–D, evaluation, and visualisations
├── llama_RL.ipynb           # Standalone GRPO RL training (run on a separate Colab A100 instance)
├── modules/
│   ├── __init__.py
│   ├── autogen_pipeline.py  # AutoGen multi-agent pipeline (Stages 1–3) + SpaCy tool-assisted
│   │                        #   extraction, audit logging, and all related visualisations
│   ├── dspy_pipeline.py     # DSPy pure-optimisation pipeline (Part A) + DSPy + MCP ReAct
│   │                        #   pipeline with SpaCy tools (Part B) and MCP visualisations
│   └── llama_RL.py          # GRPO RL training utilities (data loading, reward function,
│                            #   trainer setup, evaluation); used exclusively by llama_RL.ipynb
└── README.md

Module responsibilities

| Module | Used by | Key exports |
| --- | --- | --- |
| `autogen_pipeline.py` | `full_project.ipynb` | `run_stage1/2/3`, `run_stage3_with_tools`, `run_stage3_with_tools_standalone`, visualisation helpers, `WORKER_BEHAVIORAL_PROMPT` |
| `dspy_pipeline.py` | `full_project.ipynb` | `setup_dspy`, `run_bootstrap_fewshot`, `run_mipro`, `run_mcp_`, `plot_`, `MotifReActModule` |
| `llama_RL.py` | `llama_RL.ipynb` | `load_motifs_for_rl`, `make_motif_reward_func`, `RLConfig`, GRPO training utilities |

Running the Project

Both notebooks are designed for Google Colab (GPU recommended for llama_RL.ipynb).

`full_project.ipynb`

Mount Google Drive and place original_content_trump_motifs_en_10k.csv at drive/MyDrive/.
Upload the modules/ directory via the VS Code ↔ Colab extension (or the Colab file browser).
Run cells in order. Section 0 installs all dependencies.
Enter your OpenAI API key when prompted.

Section 2 – Llama worker via SGLang (optional; required only for the Llama baseline):

Section 2 swaps the OpenAI worker for a local Llama-3.1-8B-Instruct model served through SGLang. Before running its cells:

A HuggingFace token with access to meta-llama/Llama-3.1-8B-Instruct is required. You will be prompted to enter it; it is written to os.environ["HF_TOKEN"] so that SGLang can download the weights.
SGLang is reinstalled at the start of Section 2 (pip install --force-reinstall sglang backoff) to pin the correct version.
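The token prompt described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual cell: the helper name `ensure_env_secret` is hypothetical, and it simply reads the value without echoing it and stores it in `os.environ`.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env_secret(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return os.environ[name], prompting for it (without echo) if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = value
    return value


# Typical use in a notebook cell (prompts only if HF_TOKEN is not already set):
# ensure_env_secret("HF_TOKEN")
```

Because the value lives in the process environment, child processes started afterwards from the same session inherit it.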

The server is launched as a background process on port 7501:

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct > sglang.log 2>&1 &

Any existing process on port 7501 is killed automatically before launch to avoid port-conflict errors.
After launch, wait for the server to become ready (watch sglang.log via !tail sglang.log) before proceeding to the evaluation cells.
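As a complement to watching the log, the wait can be scripted by polling the port. The sketch below is an assumption-laden helper (`wait_for_port` is a hypothetical name) that only checks whether something accepts TCP connections on port 7501. Whether the model has finished loading at that moment should still be confirmed in sglang.log.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 7501,
                  timeout_s: float = 600.0, poll_s: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            time.sleep(poll_s)  # not up yet; retry until the deadline
    return False
```

Calling `wait_for_port()` before the evaluation cells lets a notebook stop early with a clear message instead of failing on the first request.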

`llama_RL.ipynb`

Run on a separate Colab instance with an A100 GPU.
Paste the final prompt produced by Stage 2 of full_project.ipynb when prompted.
After training, the model is saved to drive/MyDrive/llama_motif_grpo.
Update rl_model_path in full_project.ipynb to load the RL model for Stage C evaluation.
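For orientation, the Overview describes the GRPO reward as favouring structured JSON output and penalising verbose completions. A minimal sketch of that idea follows. It is not the implementation behind `make_motif_reward_func`: the function names, the 400-character threshold, and the reward magnitudes are all assumptions chosen for illustration. The second function shows the group-relative step that gives GRPO its name, standardising rewards across completions sampled for the same prompt.

```python
import json
from statistics import mean, pstdev
from typing import List


def json_brevity_reward(completion: str, max_chars: int = 400) -> float:
    """+1 for a parseable JSON object or array, -1 otherwise, minus a capped length penalty."""
    try:
        parsed = json.loads(completion)
        structured = 1.0 if isinstance(parsed, (dict, list)) else -1.0
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        structured = -1.0
    overflow = max(0, len(completion) - max_chars)
    length_penalty = min(1.0, overflow / max_chars)  # hypothetical penalty shape
    return structured - length_penalty


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this shape, a short valid JSON answer scores highest, a valid but very long one loses up to one point, and free-text answers score lowest. Completions are then pushed up or down relative to their own group rather than against an absolute baseline.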

Data

The dataset (original_content_trump_motifs_en_10k.csv) contains ~10 000 social-media posts with hand-coded entity/action annotations. It is available via Google Drive; the notebooks download it automatically if not already present.
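The Overview notes that each of the four prompt-engineer agents analyses an exclusive slice of the annotated data. How the project actually partitions the rows is not stated here, so the following is only one plausible sketch: `exclusive_slices` is a hypothetical helper that shuffles reproducibly and deals rows into disjoint groups.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def exclusive_slices(rows: Sequence[T], n_agents: int = 4, seed: int = 0) -> List[List[T]]:
    """Shuffle rows reproducibly and deal them into n_agents disjoint slices."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    # Agent k receives every n_agents-th shuffled index, starting at offset k.
    return [[rows[i] for i in order[k::n_agents]] for k in range(n_agents)]
```

Every row lands in exactly one slice, and slice sizes differ by at most one, so no agent sees another agent's examples.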

Dependencies

pyautogen / autogen-agentchat~=0.2
openai
dspy-ai
sglang + backoff          # Section 2 only – SGLang inference server for Llama
trl
transformers
datasets
accelerate
spacy + en_core_web_lg
sentence-transformers
scikit-learn
matplotlib
pandas
numpy
gdown

Install in Colab via the !pip install cell at the top of each notebook.
