Cognitive Manipulation via Learned HDC Operators

Codebase Review (Sept 18)

The current HDC space setup uses non-unitary transformations that do not preserve vector norms (run_tl.py:87-90). The projection matrix is normalized, but the sign binarization destroys length preservation (complete_demo.py:82-88).
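
A minimal sketch, using assumed shapes (768-dim activations, 10K-dim HD space), of why sign binarization cannot preserve norms: every cogit lands on a corner of the {-1, +1} hypercube, so its length is fixed at sqrt(10,000) regardless of how long or short the input activation is.

# Sketch only (not the repository's code): sign binarization maps every projected
# vector onto a corner of the {-1, +1}^D hypercube, so all cogits share the same
# norm sqrt(D) regardless of the input activation's length.
import torch

torch.manual_seed(42)
D_MODEL, D_HD = 768, 10_000                                 # assumed dimensions
projection = torch.randn(D_MODEL, D_HD) / D_MODEL ** 0.5    # normalized projection

small = torch.randn(D_MODEL) * 0.1
large = torch.randn(D_MODEL) * 10.0

for activation in (small, large):
    hd_vector = activation @ projection
    cogit = torch.sign(hd_vector)
    # cogit norm is always ~sqrt(10_000) = 100, no matter the activation norm
    print(activation.norm().item(), cogit.norm().item())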

The cognitive manipulation operators currently use residual connections and bounded outputs (run.py:82), which are not unitary transformations. The current setup resembles activation steering in the style of Anthropic's work, so the codebase should first be formalized around that goal. We can organize it cleanly for activation steering, and then branch to a more mathematically grounded version based on hyperdimensional computing.

Study Design and Findings

Understanding how to mathematically manipulate cognitive states in artificial neural networks has direct applications to brain-computer interfaces, where similar operators might modify biological neural activity to achieve targeted cognitive changes. We tested whether we could learn mathematical operators to steer GPT-2 sentiment while keeping text readable.

Data Generation Phase

We created paired positive and negative prompts to extract contrasting cognitive states:

  • Positive prompts: "I love my new puppy, he is so...", "That was a wonderful and happy..."
  • Negative prompts: "The traffic was horrible this morning, it was...", "I had a terrible and awful..."

We fed these prompts through GPT-2 and used TransformerLens to capture the model's internal activations at Layer 6. Each captured activation is then projected into a hyperdimensional computing (HDC) space; we refer to the resulting vectors as cogit vectors. Cogit vectors derived from positive and negative prompts should be directionally opposed in sentiment (a minimal capture-and-encode sketch follows).
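
A minimal sketch of the capture-and-encode step, assuming the Layer 6 residual-stream hook blocks.6.hook_resid_post, the last token position, and the dimensions described later in this README (768-dim activation, 10K-dim cogit, seed 42). Function names such as encode_cogit are illustrative, not the repository's actual code; the real encoding lives in sentiment_experiment_improved.py.

# Sketch only: capture Layer 6 residual-stream activations with TransformerLens
# and project them into a binary HDC ("cogit") space. Names are illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

torch.manual_seed(42)                                            # deterministic projection
projection = torch.randn(768, 10_000, device=model.cfg.device)   # 768-dim -> 10K-dim

def encode_cogit(prompt: str) -> torch.Tensor:
    """Run the prompt, grab the last token's Layer 6 activation, binarize it."""
    _, cache = model.run_with_cache(prompt)
    activation = cache["blocks.6.hook_resid_post"][0, -1]   # shape: (768,)
    hd_vector = activation @ projection                     # random projection
    return torch.sign(hd_vector)                            # binary cogit vector

positive_cogit = encode_cogit("I love my new puppy, he is so")
negative_cogit = encode_cogit("The traffic was horrible this morning, it was")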

Next, we use these vectors to learn a mathematical "dial" that can turn a positive thought into a negative one.

  1. Train a Transformation: We train a simple machine learning model. Its only job is to learn the mathematical recipe (the "operator") that transforms a positive cogit vector into its corresponding negative cogit vector (a minimal training sketch follows this list).
  2. The Result: At the end of this phase, we have a single, learned mathematical function: make_negative_operator. This function represents the abstract concept of "making something more negative."
  3. We then test this operator on sentences the model was not trained on, namely neutral, unfinished thoughts.
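
A minimal training sketch, assuming the operator is a small residual MLP with a bounded (tanh) update fit by mean-squared error on paired cogits; the architecture and hyperparameters here are assumptions, not the exact configuration used in sentiment_phase2_improved.py.

# Sketch only: learn an operator that maps positive cogits to negative cogits.
# Architecture and hyperparameters are assumptions, not the repository's values.
import torch
import torch.nn as nn

class MakeNegativeOperator(nn.Module):
    def __init__(self, dim: int = 10_000, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim), nn.Tanh()
        )

    def forward(self, cogit: torch.Tensor) -> torch.Tensor:
        # residual connection with a bounded update, as noted in the review above
        return cogit + self.net(cogit)

def train_operator(pos_cogits: torch.Tensor, neg_cogits: torch.Tensor) -> MakeNegativeOperator:
    """pos_cogits, neg_cogits: (n_pairs, 10_000) tensors of paired training cogits."""
    operator = MakeNegativeOperator()
    optimizer = torch.optim.Adam(operator.parameters(), lr=1e-3)
    for _ in range(200):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(operator(pos_cogits), neg_cogits)
        loss.backward()
        optimizer.step()
    return operator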

When the training data is small, the operator overfits and the intervention produces incoherent text. When the training data is increased, say from 5 pairs of prompts to 50, both coherence and directionality are maintained. This is an early, proof-of-concept indication that small, subtle interventions are a viable attack vector for cognitive security. We expect to replicate these findings more robustly in stronger models.

⚠️ Codebase Warning: Very Messy Pilot

This repository contains a mix of working experimental code and broken/unused files from development. Most directories and files should be ignored.

✅ WORKING FILES - Look at these:

Core Architecture:

  • src/model_adapter_tl.py - TransformerLens implementation that actually works

Sentiment Manipulation Experiment (the actual research):

  • generate_diverse_prompts.py - Creates the 50 balanced training prompts
  • sentiment_experiment.py - Baseline 10-example experiment (demonstrates overfitting)
  • sentiment_experiment_improved.py - Phase 1: Collects diverse data from 50 prompts
  • fix_balance.py - Fixes data imbalance (50 positive, 55 negative → 50/50)
  • sentiment_phase2_improved.py - Phase 2: Trains the mathematical operator
  • sentiment_phase3_improved.py - Phase 3: Tests intervention with blending

Data and Results:

  • data/sentiment_experiment/diverse_prompts_50.json - The 50 training prompts
  • data/sentiment_experiment/balanced_cogits_20250903_035408.json - Final training data
  • models/sentiment_operator/robust_sentiment_operator.pt - Trained operator (64MB)
  • results/sentiment_intervention/improved_results.json - Final experimental results

❌ IGNORE THESE FILES - Broken/unused development artifacts:

Broken Infrastructure:

  • src/model_adapter.py - Manual hook implementation (caused infinite hanging)
  • src/stage1_simulation/ - Original unused framework
  • src/stage2_encoding/ - Original unused framework
  • src/stage3_learning/ - Original unused framework

Failed Development Attempts:

  • test_*.py, debug_*.py, minimal_test.py - Hook debugging attempts
  • working_injection.py, fix_injection.py - Manual hook failures
  • sentiment_phase2_efficient.py - Earlier training version
  • sentiment_phase3_intervention.py - Earlier intervention version
  • compare_implementations.py - Development comparison

Empty/Unused Directories:

  • embeddings/ - Empty
  • notebooks/ - Empty
  • scripts/ - Empty
  • docs/ - Empty
  • infra/ - Empty
  • models/operators/ - Original framework operators (unused)
  • data/raw/, data/processed/, data/models/ - Original pipeline data (unused)

How to Run the Working Experiment

# 1. Generate training prompts (if needed)
python generate_diverse_prompts.py

# 2. Phase 1: Collect activation data  
python sentiment_experiment_improved.py

# 3. Fix data balance (if needed)
python fix_balance.py

# 4. Phase 2: Train the operator
python sentiment_phase2_improved.py

# 5. Phase 3: Test intervention
python sentiment_phase3_improved.py

For HDC Verification - Exact Code Locations

HDC Encoding Process (verify the math here):

  • sentiment_experiment_improved.py:118-146 - HDC encoding function with random projection
  • sentiment_experiment_improved.py:93-97 - Projection matrix creation (deterministic, seed=42)
  • sentiment_experiment_improved.py:107-110 - Sign binarization: cogit = torch.sign(hd_vector)

HDC Decoding Process:

  • sentiment_phase3_improved.py:194-206 - HDCDecoder class with pseudoinverse
  • sentiment_phase3_improved.py:201-205 - Decode function: activation = torch.matmul(cogit, self.inverse_projection)

Critical HDC Operations:

  • Encoding: 768-dim activation → multiply by random projection matrix → sign() → 10K-dim binary cogit
  • Transform: Neural network operates directly on 10K-dim cogit space
  • Decoding: cogit → multiply by pseudoinverse of projection → 768-dim activation
  • Blending: (1-ratio) * original + ratio * modified in the original activation space (see the decode-and-blend sketch below)
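
A minimal decode-and-blend sketch, assuming a pseudoinverse-based decoder as described above. The HDCDecoder name mirrors the class listed in sentiment_phase3_improved.py, but the interface shown here and the example blend_ratio are assumptions.

# Sketch only: decode a 10K-dim cogit back to a 768-dim activation via the
# pseudoinverse of the projection, then blend it with the original activation.
import torch

class HDCDecoder:
    def __init__(self, projection: torch.Tensor):
        # projection: (768, 10_000); its pseudoinverse maps cogits back to 768-dim
        self.inverse_projection = torch.linalg.pinv(projection)   # (10_000, 768)

    def decode(self, cogit: torch.Tensor) -> torch.Tensor:
        return torch.matmul(cogit, self.inverse_projection)       # (768,)

def blended_intervention(original_activation, modified_cogit, decoder, blend_ratio=0.3):
    """Blend the decoded, operator-modified activation into the original one."""
    modified_activation = decoder.decode(modified_cogit)
    return (1 - blend_ratio) * original_activation + blend_ratio * modified_activation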

Verify These Lines Specifically:

  • sentiment_experiment_improved.py:108 - hd_vector = torch.matmul(activation, self.projection)
  • sentiment_experiment_improved.py:110 - cogit = torch.sign(hd_vector)
  • sentiment_phase3_improved.py:204 - activation = torch.matmul(cogit, self.inverse_projection)

Key Finding

Math can steer models. It is likely to steer minds.

Research Team

Research by: Bryce-Allen Bagley & Austin Morrissey
Implementation assistance: Claude Opus 4.1
