pseugc

pseugc (pseudonymize user-generated content) is the project developed for the Master's thesis "De-identification techniques with large language models" by Saurav Kumar Saha, submitted as a requirement for the M.Sc. in Data Science at Berliner Hochschule für Technik (BHT). The thesis was supervised by Prof. Dr. rer. nat. Felix Bießmann and reviewed by Prof. Dr. Selcan Ipek-Ugay.

The thesis explores and develops several NER (Named Entity Recognition) models and pseudonymization techniques for detecting private entities in unstructured German-language text documents and replacing them with type-compliant, format-preserving pseudonyms. The approaches range from training a BiLSTM-CRF based NER model, through fine-tuning transformer-based NER models (e.g. GELECTRA) or a bi-directional encoder-decoder text-to-text language model, to prompting LLMs (Large Language Models). A novel unified approach fine-tunes an mT5-base text-to-text model so that a single model both identifies private entities and generates their pseudonyms, using training data prepared from the two versions of the German-language email corpus CodE Alltag.
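To illustrate what "type-compliant and format-preserving" replacement can mean in practice, here is a minimal rule-based sketch in Python. It is not the project's implementation (the thesis relies on trained models to detect entities and generate pseudonyms). The entity labels `PER` and `PHONE`, the surrogate pool, and the function names are all hypothetical, and the entity spans are assumed to come from some upstream NER step.

```python
import random
import string

# Hypothetical surrogate pool keyed by entity type (not taken from the project).
SURROGATES = {
    "PER": ["Anna", "Jonas", "Lena"],
}


def format_preserving(value, rng):
    """Swap ASCII digits and letters for random ones of the same class.

    Punctuation, whitespace, and non-ASCII characters are kept unchanged,
    so the overall shape of the value (e.g. a phone number) is preserved.
    """
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def pseudonymize(text, entities, seed=0):
    """Replace (start, end, label) character spans in text with pseudonyms.

    Spans are assumed to be non-overlapping. Types with a surrogate pool get
    a surrogate of the same type; all other types get a same-format value.
    """
    rng = random.Random(seed)
    result = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        original = text[start:end]
        pool = [s for s in SURROGATES.get(label, []) if s != original]
        replacement = rng.choice(pool) if pool else format_preserving(original, rng)
        result = result[:start] + replacement + result[end:]
    return result


if __name__ == "__main__":
    sample = "Hallo Peter, ruf mich unter 030-1234567 an."
    spans = [(6, 11, "PER"), (28, 39, "PHONE")]
    print(pseudonymize(sample, spans, seed=1))
```

A fixed seed makes the output reproducible for testing. A real deployment would more likely want consistent replacements for a given entity within a document and unpredictable ones across documents.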

A paper covering all experiments and findings of this project, titled "End-to-end Pseudonymization of German Texts with Deep Learning – An Empirical Comparison of Classical and Modern Approaches", has been accepted for publication at the First International Conference on AI in Medicine and Healthcare (AiMH 2025), 8-10 April 2025, Innsbruck, Austria.

Pseudonymization App

A containerized image of the developed pseudonymization tool, with a simple user interface and an API endpoint, is available at:

https://hub.docker.com/r/sksdotsauravs/pseugc-app