Replies: 1 comment
-
Representative documents are extracted based on the similarity of their c-TF-IDF representations to the topic's c-TF-IDF representation. They can then be further reranked using MMR to select the most appropriate ones (for example, to remove duplicates). The code can be found here:
I personally had great success with the approach that is already in BERTopic: selecting the top n documents and reducing them to k.
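For intuition, here is a minimal sketch of that selection scheme, not BERTopic's actual implementation. It assumes the c-TF-IDF vectors are available as numpy arrays, and the helper names (`representative_docs`, `mmr`) are illustrative only:

```python
# Sketch: score each document in a topic by cosine similarity between its
# c-TF-IDF vector and the topic's c-TF-IDF vector, keep the top n, then
# rerank with MMR to end up with k diverse representative documents.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def mmr(doc_vectors, topic_vector, top_n_indices, k, diversity=0.3):
    """Greedy Maximal Marginal Relevance over the pre-selected top-n documents."""
    candidates = list(top_n_indices)
    topic_vector = topic_vector.reshape(1, -1)

    # Start with the single most relevant document.
    relevance = cosine_similarity(doc_vectors[candidates], topic_vector).ravel()
    selected = [candidates.pop(int(np.argmax(relevance)))]

    while candidates and len(selected) < k:
        # Relevance to the topic vs. similarity to documents already selected.
        sim_topic = cosine_similarity(doc_vectors[candidates], topic_vector).ravel()
        sim_selected = cosine_similarity(doc_vectors[candidates], doc_vectors[selected]).max(axis=1)
        mmr_scores = (1 - diversity) * sim_topic - diversity * sim_selected
        selected.append(candidates.pop(int(np.argmax(mmr_scores))))
    return selected


def representative_docs(doc_ctfidf, topic_ctfidf, n=10, k=3):
    """Select the top-n docs by c-TF-IDF similarity to the topic, then reduce to k with MMR."""
    sims = cosine_similarity(doc_ctfidf, topic_ctfidf.reshape(1, -1)).ravel()
    top_n = np.argsort(sims)[::-1][:n]
    return mmr(doc_ctfidf, topic_ctfidf, top_n, k)
```

The `diversity` weight trades off closeness to the topic against redundancy among the chosen documents; a higher value pushes the k representatives further apart.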
-
Hi BERTopic team, I encountered a discrepancy when generating cluster labels using LLMs, depending on which quotes are provided as examples. For context, I'm analyzing user feedback about our app.
Upon reviewing the 300 randomly selected quotes, I noticed that "Returning to the app (after technical issues were resolved)" is indeed one of the most "coherent" sub-themes, as it accounted for one of the highest percentages of quotes. However, that was still only 20% of the quotes. The remaining quotes cover a long tail of topics, including various technical issues and users mentioning taking a break from the app.
My questions:
Are Representative Docs selected to maximize theme "coherence", or do they represent the most frequent theme within the cluster? (I read the wiki on Representative Docs but would appreciate a more intuitive understanding.)
Any advice on which method to choose, such that I generate a label that best reflects the overall distribution of themes within a cluster?
Thanks for your help and for developing such a great tool!