Feature request
Add support for LLM-generated structured JSON outputs that can be automatically parsed into a pandas DataFrame, enabling dynamic column creation for topic- and document-level analysis.
Instead of returning only a single topic title or a fixed schema, allow users to define or receive arbitrary structured fields generated by an LLM, such as:
- Topic Title
- Topic Summary
- Key Insights
- Positive Feedback
- Negative Feedback
- Risks / Opportunities
- Sentiment Analysis
- Custom LLM-derived attributes
These fields would be returned as JSON and expanded into DataFrame columns, allowing users to dynamically choose which attributes they want per topic or per document.
Example output schema (LLM-generated):
{ "topic_id": 3, "title": "Customer Support Issues", "summary": "Users report delays and inconsistent responses from support teams.", "positive_comments": ["Helpful agents when reachable"], "negative_comments": ["Long response times", "Ticket closures without resolution"], "sentiment": "Mostly Negative" }
Motivation
BERTopic already supports LLM-based representations, but the current outputs are largely limited to labels, keywords, or static fields. In real-world applications, users often want flexible, semantically rich insights that vary by use case.
Common needs include:
- Choosing only Title + Summary for lightweight reporting
- Adding Positive / Negative feedback for customer review analysis
- Adding Risks / Opportunities for strategic analysis
- Adding Custom LLM analysis fields without modifying core BERTopic code
By allowing structured JSON outputs from LLMs to be automatically expanded into DataFrames, BERTopic would:
- Enable highly customizable topic and document analysis
- Remove the need for manual JSON parsing and post-processing
- Fit naturally into pandas-based workflows
- Scale across different analytical use cases without rigid schemas
This approach keeps BERTopic model-agnostic, extensible, and aligned with its modular architecture while significantly improving downstream usability for LLM-driven insight generation.
Your contribution
I have already implemented this pattern externally by:
- Using an LLM to return structured JSON per topic
- Parsing the output with the orjson package
- Dynamically expanding the parsed fields into new DataFrame columns
This approach has worked well in practice and enabled:
- Rapid dashboard development
- Flexible cluster-level insight generation
- Easy experimentation with different LLM-derived attributes (e.g. sentiment, positives/negatives, summaries)
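
A condensed sketch of that external pattern follows. The `expand_llm_json` helper and the `llm_responses` mapping are assumptions for illustration; `get_topic_info()` and its `Topic` column are BERTopic's existing API:

```python
import orjson
import pandas as pd


def expand_llm_json(topic_info: pd.DataFrame, llm_responses: dict) -> pd.DataFrame:
    """Parse per-topic JSON strings with orjson and attach the fields as new columns.

    `topic_info` is the DataFrame returned by BERTopic's get_topic_info();
    `llm_responses` maps topic IDs to raw JSON strings produced by an LLM.
    """
    parsed = {topic_id: orjson.loads(text) for topic_id, text in llm_responses.items()}
    extra = pd.DataFrame.from_dict(parsed, orient="index")
    # Each LLM-derived field (summary, sentiment, ...) becomes its own column,
    # joined on the topic ID.
    return topic_info.merge(extra, left_on="Topic", right_index=True, how="left")
```

Calling it as `expand_llm_json(topic_model.get_topic_info(), llm_responses)` then yields a topic-level DataFrame in which the LLM-derived attributes can be selected, filtered, or plotted like any other column.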
I believe this could be generalized into BERTopic as an optional utility or extension, potentially leveraging existing LLM representation models while keeping the core lightweight.