The goal of this project is to fine-tune an LLM to continue Kafka's novel The Castle, which the author never finished. It is designed so that French is the model's working language.
It is designed for an NVIDIA GPU with at least 12 GB of VRAM. The NVIDIA Container Toolkit is required. A minimum of 32 GB of RAM is recommended for the data preparation step.
It takes as input:
- a free-to-operate base model: TinyLlama, a 1.1B-parameter model with a 2048-token context window.
- a copyleft French translation of The Castle by Kafka, available here.
- a French literature dataset: Gallica, used for French-language enrichment and narrative consistency.
Build & launch the Docker container:
```bash
docker build -t kafka .

docker run -d \
--name kafka \
-v ./project:/project \
-p 8000:8000 \
--gpus all \
kafka
```

The project is supported by a notebook, available on port 8000 once the container is running. This notebook can be used to monitor the training and test the models.
The curriculum of this project is presented in a logical order. However, it is also possible to prepare all the data beforehand (0, then 1a, 1b and 1c) and then run all the training steps (2a, 2b and 2c).
This step collects the HuggingFace resources: the model and the training dataset.
```bash
docker exec -it kafka python 0_data_collection.py
```
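As an illustration, here is a minimal sketch of what this collection step could look like. The model checkpoint and the dataset identifier below are assumptions (the chat variant of TinyLlama and a placeholder Gallica-derived public-domain corpus), not necessarily the ones the script actually uses.

```python
# Hypothetical sketch of 0_data_collection.py: download the base model and a
# French literature corpus into the shared /project volume.
from huggingface_hub import snapshot_download
from datasets import load_dataset

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed TinyLlama checkpoint
GALLICA_DATASET = "PleIAs/French-PD-Books"          # placeholder Gallica-derived corpus

# Download the base model weights locally so training can run offline.
snapshot_download(repo_id=BASE_MODEL, local_dir="/project/models/tinyllama")

# Download and cache the French literature dataset.
gallica = load_dataset(GALLICA_DATASET, split="train", cache_dir="/project/data/gallica")
print(gallica)
```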
The goal of this step is to reorient the base model towards a generator of French literature. It consists of full-weight training over 1M samples of 512 tokens from the Gallica collection. The model should forget its chatbot abilities, its multilingualism and its coding knowledge, and instead learn the content and form of French classical literature, boudoir intrigues included.
```bash
docker exec -it kafka python 1a_prepare_gallica_fullweight.py
docker exec -it kafka python 2a_train_gallica_fullweight.py
```
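For illustration, a hedged sketch of the 512-token packing that the preparation script could perform, assuming the Gallica corpus was exported to plain-text files; the paths and output format are assumptions, not the repo's actual layout.

```python
# Hypothetical sketch of the 512-token packing for the full-weight stage.
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 512
tokenizer = AutoTokenizer.from_pretrained("/project/models/tinyllama")  # assumed path

def pack(batch):
    # Concatenate all tokenized texts, then cut them into fixed 512-token blocks.
    ids = sum(tokenizer(batch["text"])["input_ids"], [])
    usable = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    chunks = [ids[i:i + BLOCK_SIZE] for i in range(0, usable, BLOCK_SIZE)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

raw = load_dataset("text", data_files="/project/data/gallica/*.txt", split="train")
packed = raw.map(pack, batched=True, remove_columns=raw.column_names)
packed.save_to_disk("/project/data/gallica_512")
```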
This step aims to teach the model long (2048-token) and consistent narrative arcs, which is essential for a literature project. However, because the VRAM requirement grows quadratically with the context window, a QLoRA approach is adopted from this step onwards. LoRA adapters are trained over 250k samples of 2048 tokens, still from the Gallica collection. At the end of this step, the model has seen about 1B tokens of French literature.
```bash
docker exec -it kafka python 1b_prepare_gallica_QLoRA.py
docker exec -it kafka python 2b_train_gallica_QLoRA.py
```
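A possible QLoRA setup for this stage is sketched below: the base weights are loaded in 4-bit NF4 and LoRA adapters are attached to the attention projections. The adapter rank, alpha, dropout, target modules and model path are illustrative assumptions, not the values used by the training script.

```python
# Hypothetical QLoRA setup: 4-bit quantized base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit the 12 GB VRAM budget
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/project/models/gallica_fullweight",  # assumed output path of step 2a
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```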
Finally, the French literature model, and more specifically its previously trained LoRA adapters, is fine-tuned on the target book: the French translation of The Castle by Kafka. This step takes as input 2048-token sequences of the book, with a stride of 512, yielding 4 shuffled "pseudo-epochs" (each token of the book is seen 4 times).
```bash
docker exec -it kafka python 1c_prepare_kafka.py
docker exec -it kafka python 2c_train_kafka_QLoRA.py
```
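As a closing illustration, a minimal sketch of the 2048/512 strided windowing described above; the book path and tokenizer location are assumptions. With a stride of 512, each token of the book falls into roughly 4 overlapping windows, which is what produces the 4 "pseudo-epochs".

```python
# Hypothetical sketch of the strided windowing over the book.
import random
from transformers import AutoTokenizer

CONTEXT, STRIDE = 2048, 512
tokenizer = AutoTokenizer.from_pretrained("/project/models/tinyllama")  # assumed path

with open("/project/data/le_chateau.txt", encoding="utf-8") as f:       # assumed path
    ids = tokenizer(f.read())["input_ids"]

# Overlapping 2048-token windows taken every 512 tokens: each token is seen ~4 times.
windows = [ids[i:i + CONTEXT] for i in range(0, len(ids) - CONTEXT + 1, STRIDE)]
random.shuffle(windows)
print(f"{len(windows)} shuffled training sequences of {CONTEXT} tokens")
```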
