Feature request
Add support for LLM-generated structured JSON outputs that can be automatically parsed into a pandas DataFrame, enabling dynamic column creation for topic- and document-level analysis.
Instead of returning only a single topic title or a fixed schema, allow users to define or receive arbitrary structured fields generated by an LLM, such as:
- Topic Title
- Topic Summary
- Key Insights
- Positive Feedback
- Negative Feedback
- Risks / Opportunities
- Sentiment Analysis
- Custom LLM-derived attributes
These fields would be returned as JSON and expanded into DataFrame columns, allowing users to dynamically choose which attributes they want per topic or per document.
Example output schema (LLM-generated):
{ "topic_id": 3, "title": "Customer Support Issues", "summary": "Users report delays and inconsistent responses from support teams.", "positive_comments": ["Helpful agents when reachable"], "negative_comments": ["Long response times", "Ticket closures without resolution"], "sentiment": "Mostly Negative" }
Motivation
BERTopic already supports LLM-based representations, but the current outputs are largely limited to labels, keywords, or static fields. In real-world applications, users often want flexible, semantically rich insights that vary by use case.
Common needs include:
- Choosing only Title + Summary for lightweight reporting
- Adding Positive / Negative feedback for customer review analysis
- Adding Risks / Opportunities for strategic analysis
- Adding Custom LLM analysis fields without modifying core BERTopic code
By allowing structured JSON outputs from LLMs to be automatically expanded into DataFrames, BERTopic would:
- Enable highly customizable topic and document analysis
- Remove the need for manual JSON parsing and post-processing
- Fit naturally into pandas-based workflows
- Scale across different analytical use cases without rigid schemas
This approach keeps BERTopic model-agnostic, extensible, and aligned with its modular architecture while significantly improving downstream usability for LLM-driven insight generation.
Your contribution
I have already implemented this pattern externally by:
- Using an LLM to return structured JSON per topic
- Parsing the output with the orjson package
- Dynamically expanding the parsed fields into new DataFrame columns
This approach has worked well in practice and enabled:
- Rapid dashboard development
- Flexible cluster-level insight generation
- Easy experimentation with different LLM-derived attributes (e.g. sentiment, positives/negatives, summaries)
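
A condensed sketch of that external pattern follows. The `expand_llm_json` helper and the `llm_responses` mapping are assumptions for illustration; `get_topic_info()` and its `Topic` column are BERTopic's existing API:

```python
import orjson
import pandas as pd


def expand_llm_json(topic_info: pd.DataFrame, llm_responses: dict) -> pd.DataFrame:
    """Parse per-topic JSON strings with orjson and attach the fields as new columns.

    `topic_info` is the DataFrame returned by BERTopic's get_topic_info();
    `llm_responses` maps topic IDs to raw JSON strings produced by an LLM.
    """
    parsed = {topic_id: orjson.loads(text) for topic_id, text in llm_responses.items()}
    extra = pd.DataFrame.from_dict(parsed, orient="index")
    # Each LLM-derived field (summary, sentiment, ...) becomes its own column,
    # joined on the topic ID.
    return topic_info.merge(extra, left_on="Topic", right_index=True, how="left")
```

Calling it as `expand_llm_json(topic_model.get_topic_info(), llm_responses)` then yields a topic-level DataFrame in which the LLM-derived attributes can be selected, filtered, or plotted like any other column.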
I believe this could be generalized into BERTopic as an optional utility or extension, potentially leveraging existing LLM representation models while keeping the core lightweight.