Hello, I am loading a saved BERTopic model to append the calculated topics to my training dataset. To ensure reproducibility, I previously saved the document embeddings and the probabilities. When inspecting the model after loading it (topic_model.get_topic_freq()), I observe that the number of documents assigned to each topic is consistent with what I had calculated before. Yet when inspecting the topics returned by .transform (np.sum(topics == -1)), the counts per topic are different: the number of outliers is reduced and the topic assignments shift. Despite this shift, the number of topics is the same, the assignments still make sense when looking at the documents, and the results are still coherent.

I am making sure that the probabilities, the training dataset, and the embeddings all have the same length, and I want to emphasize that the training dataset and the dataset I pass to .transform are exactly the same. I suspect that calling .transform performs a new assignment, even when given the same data as the original .fit_transform, and the new results actually look better. Why does this improvement happen? Is it possible to ensure reproducibility when calling .transform? Thank you in advance.

This is the code I am using:

import numpy as np
import pandas as pd
from bertopic import BERTopic
# cfg is a project config module exposing MODELS_PATH and PROCESSED_DATA_PATH

def load_model(df_of_choice, docs, embeddings, custom_probs_path=None):
    """
    Load a pretrained BERTopic model and apply it to a list of documents.

    Parameters:
    -----------
    df_of_choice : str
        The name of the trained BERTopic model directory.
    docs : list of str
        The documents to apply the model on.
    embeddings : np.ndarray
        Precomputed document embeddings.
    custom_probs_path : str, optional
        Path to previously saved probabilities (not used in this function body).

    Returns:
    --------
    topic_model : BERTopic
        The loaded topic model.
    topics : list
        Topic assignments for each document.
    probs : list
        Topic probabilities for each document.
    """
    topic_model = BERTopic.load(
        cfg.MODELS_PATH / f'{df_of_choice}',
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )
    print("[INFO] Computing topics and probs with model.transform()")
    topics, probs = topic_model.transform(docs, embeddings=embeddings)
    return topic_model, topics, probs

embeddings = np.load(cfg.MODELS_PATH / "embeddings_float32.npy")
COL_OF_INTEREST = 'NAME'
df = pd.read_parquet(cfg.PROCESSED_DATA_PATH / "data.parquet")
df = df.dropna(subset=[COL_OF_INTEREST])
docs = df[COL_OF_INTEREST].tolist()

topic_model, topics, probs = load_model('saved_model', docs, embeddings)
print(topic_model.get_topic_freq())      # reports x documents assigned to topic -1
print(np.sum(np.array(topics) == -1))    # reports y documents assigned to topic -1, with y != x
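To make the mismatch concrete, this is roughly the check I run to compare the training-time assignments with the ones recomputed by .transform (the saved-topics file name is hypothetical; I saved the fit_transform output at training time):

import numpy as np

# Hypothetical file holding the topics returned by fit_transform at training time.
train_topics = np.load(cfg.MODELS_PATH / "train_topics.npy")
new_topics = np.array(topics)

# Count documents whose assignment changed, and outliers before/after.
print("changed assignments:", np.sum(train_topics != new_topics))
print("outliers before:", np.sum(train_topics == -1))
print("outliers after:", np.sum(new_topics == -1))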
Replies: 1 comment
The functionality of .transform depends on whether you saved the model using pickle or using pytorch/safetensors. In the former, .transform uses the underlying cluster model, which will behave exactly as the cluster model you have chosen; the HDBSCAN model, for instance, uses an approximate prediction. In the latter, it uses a different technique to get the predictions, namely finding the cosine similarities between the document embeddings and the topic embeddings.
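As a rough illustration of that latter path (this is not the exact library internals; the function and variable names here are mine), the assignment reduces to a nearest-topic search in embedding space:

from sklearn.metrics.pairwise import cosine_similarity

def assign_by_similarity(doc_embeddings, topic_embeddings):
    # Similarity of every document to every topic embedding,
    # shape (n_docs, n_topics). Names are illustrative, not BERTopic's API.
    sims = cosine_similarity(doc_embeddings, topic_embeddings)
    # Each document gets its most similar topic. Note there is no -1
    # outlier label in this scheme, unlike HDBSCAN's cluster predictions.
    return sims.argmax(axis=1), sims

If your model was saved with safetensors, this would also be consistent with the reduced number of -1 assignments you observed: documents that HDBSCAN left as outliers are instead matched to their closest topic.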