This repository contains ICD-10-CM annotations for the paper: "Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation", EMNLP 2025, Industry Track
This repository contains only the double expert-annotated ICD-10-CM annotations used in the paper. To derive the full training and test data, including the corresponding notes, follow the steps below:
- Download https://github.com/wyim/aci-bench to ACI_BENCH_PATH
- Run
python merge_aci_annotations.py --aci_data_dir ${ACI_BENCH_PATH}/data/challenge_data --annotation_dir annotation --output_dir merged_data
- In merged_data, you should find JSONL files with the merged data:
  - train.jsonl (67 records)
  - valid.jsonl (20 records)
  - test.jsonl (120 records, combines all test files)
Each record contains the dialogue, the clinical note, and the associated ICD-10-CM codes.
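
As a quick sanity check after merging, the record counts can be compared against the numbers listed above. The snippet below is a minimal sketch, not part of this repository: `load_jsonl` and `summarize` are hypothetical helper names, and it assumes only that each output file holds one JSON object per line. It does not assume any particular field names, so inspect a record's keys before relying on them.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(output_dir="merged_data"):
    """Return {split name: record count} for the splits found in output_dir."""
    counts = {}
    for split in ("train", "valid", "test"):
        path = Path(output_dir) / f"{split}.jsonl"
        if path.exists():  # splits that are missing are simply left out
            counts[split] = len(load_jsonl(path))
    return counts


if __name__ == "__main__":
    # Assumes the merge script was run with --output_dir merged_data.
    for split, n in summarize().items():
        print(f"{split}: {n} records")
```

If a count differs from the list above, the most likely cause is a different `--aci_data_dir` or `--annotation_dir` than the one shown in the merge command.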
If you find this data useful or use it for research and development, please cite:
@inproceedings{toward-reliable-clinical-coding-verification-adaptation,
title = "Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation",
author = "Yuan, Zhangdie and
Shing, Han-Chin and
Strong, Mitch and
Shivade, Chaitanya",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
publisher = "Association for Computational Linguistics",
}
This library is licensed under the CC-BY-NC-4.0 License.
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: CC-BY-NC-4.0