[ENH] Add TextFeatures transformer for text feature extraction #880

Open
ankitlade12 wants to merge 8 commits into feature-engine:main from ankitlade12:add-text-features

Conversation

@ankitlade12
Contributor

  • Add TextFeatures class to extract features from text columns
  • Support for features: char_count, word_count, digit_count, uppercase_count, etc.
  • Add comprehensive tests with pytest parametrize
  • Add user guide documentation
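
For readers skimming the PR, the count features listed above could be computed roughly as follows. This is a minimal sketch using pandas string methods, not the PR's implementation; the column name `text` and the exact definitions (for example, ASCII-only uppercase matching) are assumptions.

```python
import pandas as pd

# Toy data; the column name "text" is illustrative only.
X = pd.DataFrame({"text": ["Hello World 42", "ok", ""]})
s = X["text"]

# Each feature is a vectorised operation on the text column.
features = pd.DataFrame({
    "char_count": s.str.len(),                 # characters, spaces included
    "word_count": s.str.split().str.len(),     # whitespace-separated tokens
    "digit_count": s.str.count(r"\d"),         # digit characters
    "uppercase_count": s.str.count(r"[A-Z]"),  # ASCII uppercase letters (assumption)
})

print(features)
```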

Collaborator

@solegalli solegalli left a comment

Hi @ankitlade12

Thanks a lot!

This transformer, function-wise, I'd say it's ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.

Other than that, we need the various docs files and we'll be good to go :)

Thanks again!

"""
This transformer does not learn parameters.

Stores feature names and validates that the specified variables are
Collaborator

Please remove this sentence

@ankitlade12 ankitlade12 requested a review from solegalli January 23, 2026 16:40
@solegalli
Collaborator

We need to rebase main so the 2 remaining tests pass.

Methods
-------
fit:
This transformer does not learn parameters. It stores the feature names
Collaborator

Suggested change
This transformer does not learn parameters. It stores the feature names
This transformer does not learn parameters.

-------
fit:
This transformer does not learn parameters. It stores the feature names
and validates input.
Collaborator

Suggested change
and validates input.

Collaborator

@solegalli solegalli left a comment

Hi @ankitlade12

I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.

I think, for the first version of the transformer, let's require the user to pass the names of the text variables. They can pass one or more variables if there is more than one text column.

Other than that, we need to add the transformer in the docs/index file, in the readme, and in the docs/api, and adjust the tests and the demo to the newer functionality. Then it is good to merge.

Thank you very much for this great addition.

- Test string variable auto-conversion to list
- Test invalid features type error
- Test multiple text columns
- Test transform on new data after fit
- Test punctuation, ratio, avg_word_length, and lowercase features
Collaborator

@solegalli solegalli left a comment

Hi @ankitlade12

Thank you very much for adding the doc files back to the PR.

Would you mind going over the comments again? Many are marked as resolved, but haven't been resolved.

We are getting closer to the final version :)

Thanks a lot!

- **starts_with_uppercase**: Binary indicator if text starts with uppercase
- **ends_with_punctuation**: Binary indicator if text ends with .!?
- **unique_word_count**: Number of unique words (case-insensitive)
- **unique_word_ratio**: Ratio of unique words to total words
Collaborator

Suggested change
- **unique_word_ratio**: Ratio of unique words to total words
- **lexical_diversity**: Ratio of unique words to total words

(e.g., 'Dr.', 'U.S.', 'e.g.', 'i.e.') or text without punctuation. These abbreviations
will be counted as sentence endings, resulting in an overestimate of the actual sentence count.

The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater.
Collaborator

@solegalli solegalli Feb 13, 2026

Suggested change
The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater.
The features **number of unique words** and **lexical diversity** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the lexical diversity are greater.
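
To make the idea concrete, here is a small sketch of how lexical diversity (unique words divided by total words) might be computed for a single string. The function name and the tokenisation (lowercasing and splitting on whitespace, with punctuation left attached to words) are assumptions for illustration, not necessarily what the transformer does.

```python
def lexical_diversity(text: str) -> float:
    """Ratio of unique words to total words (case-insensitive).

    Hypothetical helper: tokens are whitespace-separated and punctuation
    is not stripped, which is a simplifying assumption.
    """
    words = text.lower().split()
    if not words:
        # Avoid dividing by zero for empty text.
        return 0.0
    return len(set(words)) / len(words)


# A repetitive text scores lower than a varied one.
repetitive = lexical_diversity("the the the the")
varied = lexical_diversity("The cat and the hat")
print(repetitive, varied)
```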

By default, :class:`TextFeatures()` raises an error if the variables contain missing values.
This behavior can be changed by setting the parameter ``missing_values`` to ``'ignore'``.
In this case, missing values will be treated as empty strings, and the numerical features
will be calculated accordingly (e.g., word count and character count will be 0).
Collaborator

Suggested change
will be calculated accordingly (e.g., word count and character count will be 0).
will be calculated accordingly (e.g., word count and character count will be 0) as shown in the following example:
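
As a rough illustration of the behaviour described in this passage, the sketch below mimics treating missing values as empty strings with plain pandas. It does not call :class:`TextFeatures()` itself; the column name and the two count definitions are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value; the column name is illustrative.
X = pd.DataFrame({"text": ["Good product", np.nan, "Bad"]})

# Treat missing values as empty strings before computing features.
filled = X["text"].fillna("")

char_count = filled.str.len()
word_count = filled.str.split().str.len()

# The row that held NaN yields 0 for both counts.
print(pd.DataFrame({"char_count": char_count, "word_count": word_count}))
```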


print(X_transformed)

Output:
Collaborator

Suggested change
Output:
In the resulting dataframe, we see that the row with NaN returned 0 in the character count:

]
})

Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:
Collaborator

Suggested change
Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:
Now let's extract 5 specific text features: the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:

'Awful',
]
})

Collaborator

It would be great to show the input dataframe here.


print(X_transformed)

Output:
Collaborator

Suggested change
Output:
In the following output, we see the resulting dataframe containing the numerical features extracted from the pieces of text:

1 Not great. Would not recommend. Disappointed 5 31 2 0 0.032258
2 OK for the price. 3 out of 5 stars. Average 8 35 2 3 0.057143
3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696

Collaborator

This dataframe doesn't look nice. I think it needs the 4-space indentation to display correctly in the docs. Also, if it does not fit in the doc, I tend to copy the last columns underneath, instead of having the double row system.

3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696

Extracting all features
-----------------------
Collaborator

This is a subheading of the Python demo, so the underline should use tildes.

Feature-engine allows you to quickly extract numerical features from short pieces of text, to complement your predictive models. These features aim to capture a piece of text’s complexity by looking at some statistical parameters of the text, such as the word length and count, the number of words and unique words used, the number of sentences, and so on. :class:`TextFeatures()` extracts many numerical features from text out-of-the-box.

TextFeatures
============
Collaborator

Suggested change
============
------------

3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696

Extracting all features
-----------------------
Collaborator

Suggested change
-----------------------
~~~~~~~~~~~~~~~~~~~~~~~

X_transformed = tf.transform(X)

print(X_transformed.head())

Collaborator

Could we display the output?

tf.fit(X)
X_transformed = tf.transform(X)

print(X_transformed)
Collaborator

could we display the output?

df = pd.DataFrame({'text': data.data, 'target': data.target})
X_train, X_test, y_train, y_test = train_test_split(
df[['text']], df['target'], test_size=0.3, random_state=42
)
Collaborator

If I may continue to be picky: I would break the code here, to show the resulting input dataframe, and then continue setting the pipeline. It's a pain for us, but super useful for the readers of the docs :)

combined_pipe.fit(X_train, y_train)
print(f"Combined Accuracy: {combined_pipe.score(X_test, y_test):.3f}")

Output:
Collaborator

Suggested change
Output:
Below we see the accuracy of a model trained using only the bag of words, compared with a model trained using both the bag of words and the additional metadata:

lambda s: len(set(s)) if isinstance(s, list) else 0
)
),
"unique_word_ratio": lambda x: (
Collaborator

Suggested change
"unique_word_ratio": lambda x: (
"lexical_diversity": lambda x: (

- 'starts_with_uppercase': Binary indicator if text starts with uppercase
- 'ends_with_punctuation': Binary indicator if text ends with .!?
- 'unique_word_count': Number of unique words (case-insensitive)
- 'unique_word_ratio': Ratio of unique words to total words
Collaborator

Suggested change
- 'unique_word_ratio': Ratio of unique words to total words
- 'lexical_diversity': Ratio of unique words to total words


Parameters
----------
variables: list
Collaborator

Suggested change
variables: list
variables: string, list
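
Accepting either a string or a list usually means normalising the input to a list up front. A minimal sketch of that idea follows; the helper name `as_variable_list` and the error messages are hypothetical and not taken from the PR or from feature-engine.

```python
from typing import List, Union


def as_variable_list(variables: Union[str, List[str]]) -> List[str]:
    """Return the variable names as a list (hypothetical helper)."""
    if isinstance(variables, str):
        # A single column name is wrapped into a one-element list.
        return [variables]
    if (
        isinstance(variables, list)
        and len(variables) > 0
        and all(isinstance(v, str) for v in variables)
    ):
        return list(variables)
    raise ValueError(
        "variables must be a string or a non-empty list of strings. "
        f"Got {variables!r} instead."
    )


print(as_variable_list("review"))            # single name
print(as_variable_list(["title", "review"]))  # several names
```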

X = X[self.feature_names_in_]

# Extract features for each text variable
for var in self.variables_:
Collaborator

I think we can do this without the loop:

X[self.variables_] = X[self.variables_].fillna("")

It should be more efficient.
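
A quick way to convince oneself that the two forms are equivalent is to run both on a toy dataframe. In this sketch the local list `variables` stands in for `self.variables_`, and the data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "title": ["Great", np.nan],
    "review": [np.nan, "Would buy again"],
    "rating": [5, 1],
})
variables = ["title", "review"]  # stands in for self.variables_

# Loop version: fill one text column at a time.
X_loop = X.copy()
for var in variables:
    X_loop[var] = X_loop[var].fillna("")

# Vectorised version: fill all text columns in a single assignment.
X_vec = X.copy()
X_vec[variables] = X_vec[variables].fillna("")

# Both approaches give the same dataframe; "rating" is left untouched.
print(X_loop.equals(X_vec))
```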

from feature_engine.text import TextFeatures


class TestTextFeatures:
Collaborator

We don't test within a class. Could we refactor?
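
The refactor requested here could look something like the sketch below: the methods of the test class become plain module-level functions, which pytest discovers through the `test_` prefix. To keep the example self-contained, `word_count` is a hypothetical stand-in for the code under test, not part of the PR.

```python
def word_count(text: str) -> int:
    """Stand-in for the code under test (hypothetical helper)."""
    return len(text.split())


# Before: tests grouped as methods of a class.
# class TestTextFeatures:
#     def test_word_count(self): ...

# After: one plain function per behaviour, no class and no `self`.
def test_word_count_simple_sentence():
    assert word_count("Not great at all") == 4


def test_word_count_empty_string():
    assert word_count("") == 0
```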

Collaborator

@solegalli solegalli left a comment

Hey @ankitlade12

Thank you very much for addressing all the comments. We are getting to the final version. It's looking really good!

I made a few suggestions for the user guide.

We changed the name of the last feature to lexical diversity at some point, and then the change was lost, so I am bringing it back.

We also changed the tests from having the class to having single tests. That change was also lost, could you please bring it back?

The init file in the tests needs to be deleted.

I think that's pretty much it! Then we should be good to merge!

Thanks a lot for this great addition!



2 participants