Hello, I am loading a saved BERTopic model to append the calculated topics to my training dataset. To ensure reproducibility, I previously saved the document embeddings and the probabilities. When inspecting the model after loading it (topic_model.get_topic_freq()), I observe that the number of documents assigned to each topic is consistent with what I had calculated before. Yet when inspecting the topics returned by .transform (np.sum(topics == -1)), the counts per topic are different: the number of outliers is reduced and the topic assignments shift. Despite this shift, the number of topics is the same, the assignments still make sense when looking at the documents, and the results are still coherent.

I am making sure that the probabilities, the training dataset, and the embeddings all have the same length, and I want to emphasize that the training dataset and the dataset I pass to .transform are exactly the same. I suspect that calling .transform performs a new assignment, even when given the same data as the original .fit_transform, and the new results actually look better. Why does this improvement happen? Is it possible to ensure reproducibility when calling .transform? Thank you in advance.

This is the code I am using:

import numpy as np
import pandas as pd
from bertopic import BERTopic
# cfg is a project config module exposing MODELS_PATH and PROCESSED_DATA_PATH

def load_model(df_of_choice, docs, embeddings, custom_probs_path=None):
    """
    Load a pretrained BERTopic model and apply it to a list of documents.

    Parameters:
    -----------
    df_of_choice : str
        The name of the trained BERTopic model directory.
    docs : list of str
        The documents to apply the model on.
    embeddings : np.ndarray
        Precomputed document embeddings.
    custom_probs_path : str, optional
        Path to previously saved probabilities (not used in this function body).

    Returns:
    --------
    topic_model : BERTopic
        The loaded topic model.
    topics : list
        Topic assignments for each document.
    probs : list
        Topic probabilities for each document.
    """
    topic_model = BERTopic.load(
        cfg.MODELS_PATH / f'{df_of_choice}',
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )
    print("[INFO] Computing topics and probs with model.transform()")
    topics, probs = topic_model.transform(docs, embeddings=embeddings)
    return topic_model, topics, probs

embeddings = np.load(cfg.MODELS_PATH / "embeddings_float32.npy")
COL_OF_INTEREST = 'NAME'
df = pd.read_parquet(cfg.PROCESSED_DATA_PATH / "data.parquet")
df = df.dropna(subset=[COL_OF_INTEREST])
docs = df[COL_OF_INTEREST].tolist()

topic_model, topics, probs = load_model('saved_model', docs, embeddings)
print(topic_model.get_topic_freq())      # reports x documents assigned to topic -1
print(np.sum(np.array(topics) == -1))    # reports y documents assigned to topic -1, with y != x
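To make the mismatch concrete, this is roughly the check I run to compare the training-time assignments with the ones recomputed by .transform (the saved-topics file name is hypothetical; I saved the fit_transform output at training time):

import numpy as np

# Hypothetical file holding the topics returned by fit_transform at training time.
train_topics = np.load(cfg.MODELS_PATH / "train_topics.npy")
new_topics = np.array(topics)

# Count documents whose assignment changed, and outliers before/after.
print("changed assignments:", np.sum(train_topics != new_topics))
print("outliers before:", np.sum(train_topics == -1))
print("outliers after:", np.sum(new_topics == -1))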
Replies: 1 comment
The functionality of .transform depends on whether you saved the model using pickle or using pytorch/safetensors. In the former, .transform uses the underlying cluster model, which will behave exactly as the cluster model you have chosen; the HDBSCAN model, for instance, uses an approximate prediction. In the latter, it uses a different technique to get the predictions, namely finding the cosine similarities between the document embeddings and the topic embeddings.
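As a rough illustration of that latter path (this is not the exact library internals; the function and variable names here are mine), the assignment reduces to a nearest-topic search in embedding space:

from sklearn.metrics.pairwise import cosine_similarity

def assign_by_similarity(doc_embeddings, topic_embeddings):
    # Similarity of every document to every topic embedding,
    # shape (n_docs, n_topics). Names are illustrative, not BERTopic's API.
    sims = cosine_similarity(doc_embeddings, topic_embeddings)
    # Each document gets its most similar topic. Note there is no -1
    # outlier label in this scheme, unlike HDBSCAN's cluster predictions.
    return sims.argmax(axis=1), sims

If your model was saved with safetensors, this would also be consistent with the reduced number of -1 assignments you observed: documents that HDBSCAN left as outliers are instead matched to their closest topic.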