Structured LLM JSON Output → Dynamic DataFrame Columns for Topic & Document Analysis #2466

@kloniphani

Feature request

Add support for LLM-generated structured JSON outputs that can be automatically parsed into a pandas DataFrame, enabling dynamic column creation for topic- and document-level analysis.

Instead of returning only a single topic title or fixed schema, allow users to define or receive arbitrary structured fields generated by an LLM, such as:

  • Topic Title
  • Topic Summary
  • Key Insights
  • Positive Feedback
  • Negative Feedback
  • Risks / Opportunities
  • Sentiment Analysis
  • Custom LLM-derived attributes

These fields would be returned as JSON and expanded into DataFrame columns, allowing users to dynamically choose which attributes they want per topic or per document.

Example output schema (LLM-generated):
    {
      "topic_id": 3,
      "title": "Customer Support Issues",
      "summary": "Users report delays and inconsistent responses from support teams.",
      "positive_comments": ["Helpful agents when reachable"],
      "negative_comments": ["Long response times", "Ticket closures without resolution"],
      "sentiment": "Mostly Negative"
    }
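
To illustrate how a schema like the one above could become DataFrame columns, here is a minimal sketch using only `json` and pandas. It is not an existing BERTopic API. The second record (topic 7) and the variable names are made up for illustration, to show how topics with different fields would line up.

```python
import json

import pandas as pd

# Hypothetical raw LLM responses, one JSON string per topic.
# The first mirrors the example schema; the second is invented to show
# what happens when topics return different sets of fields.
raw_outputs = [
    '{"topic_id": 3, "title": "Customer Support Issues", '
    '"summary": "Users report delays and inconsistent responses from support teams.", '
    '"positive_comments": ["Helpful agents when reachable"], '
    '"negative_comments": ["Long response times", "Ticket closures without resolution"], '
    '"sentiment": "Mostly Negative"}',
    '{"topic_id": 7, "title": "Pricing", "risks": ["Churn"]}',
]

# Parse each response into a dict, then let pandas build one column per key.
records = [json.loads(text) for text in raw_outputs]
topic_df = pd.DataFrame.from_records(records).set_index("topic_id")

# Keys missing for a topic become NaN; list values stay as Python lists in a cell.
print(topic_df.columns.tolist())
print(topic_df.loc[3, "negative_comments"])
```

The columns are the union of all keys seen across topics, so no schema has to be fixed in advance. List-valued fields could later be split into rows with `DataFrame.explode` if a long format is preferred.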

Motivation

BERTopic already supports LLM-based representations, but the current outputs are largely limited to labels, keywords, or static fields. In real-world applications, users often want flexible, semantically rich insights that vary by use case.

Common needs include:

  • Choosing only Title + Summary for lightweight reporting
  • Adding Positive / Negative feedback for customer review analysis
  • Adding Risks / Opportunities for strategic analysis
  • Adding Custom LLM analysis fields without modifying core BERTopic code
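
One way these use cases might be expressed is as named field lists that drive both the prompt and the resulting columns. The sketch below is purely illustrative: `PRESETS`, `build_schema_instruction`, and `select_fields` are hypothetical names, not part of BERTopic.

```python
import pandas as pd

# Hypothetical presets mapping a use case to the fields it needs.
PRESETS = {
    "report": ["title", "summary"],
    "reviews": ["title", "positive_comments", "negative_comments"],
    "strategy": ["title", "risks", "opportunities"],
}


def build_schema_instruction(fields):
    """Build a prompt fragment asking the LLM for exactly the given keys."""
    return "Return a JSON object with exactly these keys: " + ", ".join(fields)


def select_fields(df, fields):
    """Keep only the requested columns, in order; absent ones are filled with NaN."""
    return df.reindex(columns=fields)


# Stand-in for an already expanded per-topic DataFrame.
expanded = pd.DataFrame(
    {
        "title": ["Customer Support Issues"],
        "summary": ["Users report delays and inconsistent responses from support teams."],
        "sentiment": ["Mostly Negative"],
    },
    index=pd.Index([3], name="topic_id"),
)

report_view = select_fields(expanded, PRESETS["report"])
print(build_schema_instruction(PRESETS["report"]))
print(report_view)
```

Using `reindex` rather than plain column selection means a preset can name a field the LLM did not return without raising a `KeyError`.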

By allowing structured JSON outputs from LLMs to be automatically expanded into DataFrames, BERTopic would:

  • Enable highly customizable topic and document analysis
  • Remove the need for manual JSON parsing and post-processing
  • Fit naturally into pandas-based workflows
  • Scale across different analytical use cases without rigid schemas

This approach keeps BERTopic model-agnostic, extensible, and aligned with its modular architecture while significantly improving downstream usability for LLM-driven insight generation.

Your contribution

I have already implemented this pattern externally by:

  • Using an LLM to return structured JSON per topic
  • Parsing the output with the orjson package
  • Dynamically expanding the parsed fields into new DataFrame columns
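
The three steps above could look roughly like the following. This is a sketch under assumptions, not the exact external implementation: the helper names are hypothetical, `topic_info` is a hand-built stand-in for a per-topic frame with a `Topic` column (such as the one `get_topic_info()` returns), and its values are placeholders. It uses `orjson` when installed and falls back to the standard `json` module otherwise.

```python
import json

import pandas as pd

try:
    import orjson  # optional dependency, as used in the external implementation

    _loads = orjson.loads
except ImportError:
    _loads = json.loads

FENCE = "`" * 3  # Markdown code fence that LLMs often wrap around JSON


def parse_llm_json(text):
    """Return the parsed JSON object, or None if the text is not a valid JSON object."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        parsed = _loads(cleaned)
    except ValueError:  # both json and orjson decode errors subclass ValueError
        return None
    return parsed if isinstance(parsed, dict) else None


def expand_llm_fields(topic_info, llm_outputs):
    """Left-join parsed LLM fields onto topic_info, keyed on the 'Topic' column."""
    rows = []
    for topic_id, text in llm_outputs.items():
        parsed = parse_llm_json(text) or {}
        parsed.pop("topic_id", None)  # the dict key is the source of truth for the id
        rows.append({**parsed, "Topic": topic_id})
    if not rows:
        return topic_info.copy()
    return topic_info.merge(pd.DataFrame(rows), on="Topic", how="left")


# Stand-in data: placeholder counts and one deliberately malformed response.
topic_info = pd.DataFrame({"Topic": [-1, 3], "Count": [10, 42]})
llm_outputs = {
    3: '{"title": "Customer Support Issues", "sentiment": "Mostly Negative"}',
    -1: "not json",
}
result = expand_llm_fields(topic_info, llm_outputs)
print(result)
```

Responses that fail to parse leave their new columns empty instead of raising, so one bad generation does not break the whole table. Field names that collide with existing columns would receive pandas' default `_x`/`_y` merge suffixes, so a real utility might prefix LLM-derived columns.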

This approach has worked well in practice and enabled:

  • Rapid dashboard development
  • Flexible cluster-level insight generation
  • Easy experimentation with different LLM-derived attributes (e.g. sentiment, positives/negatives, summaries)

I believe this could be generalized into BERTopic as an optional utility or extension, potentially leveraging existing LLM representation models while keeping the core lightweight.
