Description
Have you searched existing issues? 🔎
- I have searched and found no existing issues
Describe the bug
Topic names are not set to the zero-shot topic names when fit_transform produces only zero-shot topics.
If the result contains both zero-shot and non-zero-shot topics, the zero-shot topics get their names from zeroshot_topic_list. However, when there are only zero-shot topics (for example, because zeroshot_min_similarity is very low), all documents are assigned to zero-shot topics, but the names are not taken from the list.
Name for topic when running reproduction code: 0_optimization_prediction_algorithms_optimal
Expected name for topic: Clustering
Looking at the code: when not all documents are assigned to zero-shot topics, len(documents) > 0 is true and _combine_zeroshot_topics is called, which updates the topic/label mapping. When all documents have been assigned, this function is not called and the mapping is not updated.
if len(documents) > 0:
    # Cluster reduced embeddings
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    if self._is_zeroshot() and len(assigned_documents) > 0:
        documents, embeddings = self._combine_zeroshot_topics(
            documents, embeddings, assigned_documents, assigned_embeddings
        )
else:
    # All documents matches zero-shot topics
    documents = assigned_documents
    embeddings = assigned_embeddings
Reproduction
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
# We select a subsample of 100 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:100]
# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering"]
# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.000001,  # Low value to make sure all documents are assigned to zero-shot topics
    representation_model=KeyBERTInspired()
)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
BERTopic Version
0.17.3
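Toy illustration
To make the branch described under "Describe the bug" easier to follow, here is a minimal sketch in plain Python. It is not BERTopic's implementation: fit_branch and its arguments are hypothetical stand-ins, and the dictionary only models the topic/label mapping. It assumes, as the analysis above suggests, that the zero-shot names are written only during the merge step.

```python
# Toy model of the branch discussed in this issue. All names are hypothetical
# stand-ins and do not correspond to BERTopic's real internals.

def fit_branch(unassigned_docs, assigned_docs, zeroshot_names):
    """Return the topic-id -> label mapping this simplified flow would produce."""
    labels = {}  # stands in for the topic/label mapping
    if len(unassigned_docs) > 0:
        # Clustering path: leftover documents are clustered and then merged
        # with the zero-shot assignments. In this model, the merge step is
        # the only place where the zero-shot names are written.
        if len(assigned_docs) > 0:
            labels = {i: name for i, name in enumerate(zeroshot_names)}
    else:
        # Every document matched a zero-shot topic: the merge step is skipped,
        # so nothing copies the zero-shot names into the mapping.
        pass
    return labels


mixed = fit_branch(["doc_c"], ["doc_a", "doc_b"], ["Clustering"])
all_zeroshot = fit_branch([], ["doc_a", "doc_b", "doc_c"], ["Clustering"])

print(mixed)         # {0: 'Clustering'}
print(all_zeroshot)  # {}
```

In this model the second call returns an empty mapping, which mirrors the reported symptom: the topic keeps its generated name instead of "Clustering".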