@SnowMasaya

Overview

This PR adds a notebook that implements a persona-seeded synthetic data generation pipeline for Japanese CommonsenseQA (jcommonsenseqa) using NeMo Data Designer and Nemotron Personas Japan.


What this PR adds

  • A complete notebook for synthetic commonsense QA data generation
  • Integration of Nemotron Personas Japan as structured generation seeds
  • Generation of jcommonsenseqa-style multiple-choice questions with strict format constraints
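As a rough illustration of what a strict format constraint could look like, the sketch below checks one generated record. It is not taken from the notebook: the field names (`question`, `choice0` to `choice4`, `label`) and the five-choice layout are assumptions based on the public JCommonsenseQA release, and `validate_record` is a hypothetical helper.

```python
from typing import Any

# Assumption: five answer options per question, as in the public JCommonsenseQA release.
NUM_CHOICES = 5


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of format problems; an empty list means the record passes."""
    problems: list[str] = []

    # The question must be a non-empty string.
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be a non-empty string")

    # Every choice must be present, non-empty, and distinct from the others.
    choices = [record.get(f"choice{i}") for i in range(NUM_CHOICES)]
    if any(not isinstance(c, str) or not c.strip() for c in choices):
        problems.append(f"all {NUM_CHOICES} choices must be non-empty strings")
    elif len({c.strip() for c in choices}) != NUM_CHOICES:
        problems.append("choices must be distinct")

    # The label is the index of the correct choice (bool is rejected explicitly
    # because it is a subclass of int in Python).
    label = record.get("label")
    if isinstance(label, bool) or not isinstance(label, int) or not 0 <= label < NUM_CHOICES:
        problems.append(f"label must be an integer in [0, {NUM_CHOICES - 1}]")

    return problems
```

Records that return a non-empty list could be dropped or regenerated before the judge step, so that the judge only scores well-formed items.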

Key design decisions

  • Clear separation of roles
    • A large model (GPT-OSS-120B) is used exclusively for:
      • Synthetic data generation
      • Quality evaluation (judge)
    • Target models (e.g., Nemotron-Nano-9B-v2) are evaluated separately using llm-jp-eval
  • Persona-aware but non-leaking seeds
    • Persona attributes are used to stabilize generation distribution
    • Seeds are not exposed directly in the generated questions
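One simple way to check the non-leaking property is a verbatim substring scan over the generated text. The sketch below is an assumption about how such a check might look, not the notebook's method: `find_leaked_attributes` is a hypothetical helper, and the persona keys in the usage example are illustrative rather than the actual Nemotron Personas Japan schema.

```python
def find_leaked_attributes(
    persona: dict[str, str],
    generated_texts: list[str],
    min_length: int = 2,
) -> list[str]:
    """Return the persona keys whose values appear verbatim in any generated text.

    Values shorter than `min_length` characters are skipped, since very short
    strings tend to match by coincidence.
    """
    leaked: list[str] = []
    for key, value in persona.items():
        value = value.strip()
        if len(value) < min_length:
            continue
        if any(value in text for text in generated_texts):
            leaked.append(key)
    return leaked


# Usage with hypothetical persona fields and a generated question plus choices.
persona = {"occupation": "看護師", "region": "北海道"}
texts = ["北海道で冬に道路に積もるものは何?", "雪", "砂", "落ち葉", "灰", "花びら"]
leaked = find_leaked_attributes(persona, texts)
```

A verbatim match only catches the most direct form of leakage; paraphrased or implied persona details would need a model-based check, for example as part of the judge step.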

@kirit93 kirit93 self-requested a review January 12, 2026 00:34
@kirit93 kirit93 merged commit 4aee4e9 into NVIDIA:main Jan 12, 2026