The goal of this project is to fine-tune an LLM to continue Kafka's novel The Castle, which the author never finished. It is designed so that French is the model's working language.
It is designed for an NVIDIA GPU with at least 12 GB of VRAM. The NVIDIA Container Toolkit is required. A minimum of 32 GB of RAM is recommended for the data preparation step.
It takes as input:
- a free-to-operate base model: TinyLlama, a 1.1B-parameter model with a 2048-token context window.
- a copyleft French translation of The Castle by Kafka, available here.
- a French literature dataset: Gallica, used for French-language enrichment and narrative consistency.
Build & launch the Docker container:
```bash
docker build -t kafka .

docker run -d \
--name kafka \
-v ./project:/project \
-p 8000:8000 \
--gpus all \
kafka
```

The project is supported by a notebook, available on port 8000 once the container is running. This notebook can be used to monitor the training and test the models.
The curriculum of this project is presented in a logical order. However, it is also possible to prepare all the data beforehand (0, then 1a, 1b and 1c) and then run all the training steps (2a, 2b and 2c).
This step collects the HuggingFace resources: the model and the training dataset.
```bash
docker exec -it kafka python 0_data_collection.py
```
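As an illustration, here is a minimal sketch of what this collection step could look like. The model checkpoint and the dataset identifier below are assumptions (the chat variant of TinyLlama and a placeholder Gallica-derived public-domain corpus), not necessarily the ones the script actually uses.

```python
# Hypothetical sketch of 0_data_collection.py: download the base model and a
# French literature corpus into the shared /project volume.
from huggingface_hub import snapshot_download
from datasets import load_dataset

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed TinyLlama checkpoint
GALLICA_DATASET = "PleIAs/French-PD-Books"          # placeholder Gallica-derived corpus

# Download the base model weights locally so training can run offline.
snapshot_download(repo_id=BASE_MODEL, local_dir="/project/models/tinyllama")

# Download and cache the French literature dataset.
gallica = load_dataset(GALLICA_DATASET, split="train", cache_dir="/project/data/gallica")
print(gallica)
```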
The goal of this step is to reorient the base model towards a generator of French literature. It consists of full-weight training over 1M samples of 512 tokens from the Gallica collection. The model should forget its chatbot abilities, its multilingualism and its coding knowledge, and instead learn the content and form of French classical literature, boudoir intrigues included.
```bash
docker exec -it kafka python 1a_prepare_gallica_fullweight.py
docker exec -it kafka python 2a_train_gallica_fullweight.py
```
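For illustration, a hedged sketch of the 512-token packing that the preparation script could perform, assuming the Gallica corpus was exported to plain-text files; the paths and output format are assumptions, not the repo's actual layout.

```python
# Hypothetical sketch of the 512-token packing for the full-weight stage.
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 512
tokenizer = AutoTokenizer.from_pretrained("/project/models/tinyllama")  # assumed path

def pack(batch):
    # Concatenate all tokenized texts, then cut them into fixed 512-token blocks.
    ids = sum(tokenizer(batch["text"])["input_ids"], [])
    usable = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    chunks = [ids[i:i + BLOCK_SIZE] for i in range(0, usable, BLOCK_SIZE)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

raw = load_dataset("text", data_files="/project/data/gallica/*.txt", split="train")
packed = raw.map(pack, batched=True, remove_columns=raw.column_names)
packed.save_to_disk("/project/data/gallica_512")
```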
This step aims to teach the model long (2048-token) and consistent narrative arcs, which is essential for a literature project. However, because the VRAM requirement grows quadratically with the context window, a QLoRA approach is adopted from this step onwards. LoRA adapters are trained over 250k samples of 2048 tokens, still from the Gallica collection. At the end of this step, the model has seen about 1B tokens of French literature.
```bash
docker exec -it kafka python 1b_prepare_gallica_QLoRA.py
docker exec -it kafka python 2b_train_gallica_QLoRA.py
```
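A possible QLoRA setup for this stage is sketched below: the base weights are loaded in 4-bit NF4 and LoRA adapters are attached to the attention projections. The adapter rank, alpha, dropout, target modules and model path are illustrative assumptions, not the values used by the training script.

```python
# Hypothetical QLoRA setup: 4-bit quantized base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit the 12 GB VRAM budget
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/project/models/gallica_fullweight",  # assumed output path of step 2a
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```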
Finally, the French literature model, and more specifically its previously trained LoRA adapters, is fine-tuned on the target book: the French translation of The Castle by Kafka. This step takes as input 2048-token sequences of the book, with a stride of 512, yielding 4 shuffled "pseudo-epochs" (each token of the book is seen 4 times).
```bash
docker exec -it kafka python 1c_prepare_kafka.py
docker exec -it kafka python 2c_train_kafka_QLoRA.py
```
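As a closing illustration, a minimal sketch of the 2048/512 strided windowing described above; the book path and tokenizer location are assumptions. With a stride of 512, each token of the book falls into roughly 4 overlapping windows, which is what produces the 4 "pseudo-epochs".

```python
# Hypothetical sketch of the strided windowing over the book.
import random
from transformers import AutoTokenizer

CONTEXT, STRIDE = 2048, 512
tokenizer = AutoTokenizer.from_pretrained("/project/models/tinyllama")  # assumed path

with open("/project/data/le_chateau.txt", encoding="utf-8") as f:       # assumed path
    ids = tokenizer(f.read())["input_ids"]

# Overlapping 2048-token windows taken every 512 tokens: each token is seen ~4 times.
windows = [ids[i:i + CONTEXT] for i in range(0, len(ids) - CONTEXT + 1, STRIDE)]
random.shuffle(windows)
print(f"{len(windows)} shuffled training sequences of {CONTEXT} tokens")
```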
