[ENH] Add TextFeatures transformer for text feature extraction #880
ankitlade12 wants to merge 8 commits into feature-engine:main
ankitlade12 commented on Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli left a comment
Hi @ankitlade12
Thanks a lot!
Function-wise, I'd say this transformer is ready. I made a few suggestions on how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs files and we'll be good to go :)
Thanks again!
| """ | ||
| This transformer does not learn parameters. | ||
|
|
||
| Stores feature names and validates that the specified variables are |
Please remove this sentence
|
We need to rebase main so the 2 remaining tests pass.
| Methods | ||
| ------- | ||
| fit: | ||
| This transformer does not learn parameters. It stores the feature names |
| This transformer does not learn parameters. It stores the feature names | |
| This transformer does not learn parameters. |
| ------- | ||
| fit: | ||
| This transformer does not learn parameters. It stores the feature names | ||
| and validates input. |
| and validates input. |
solegalli left a comment
Hi @ankitlade12
I am very sorry for the delayed review. I am travelling until the end of April, so I am a bit slower than usual.
I think, for the first version of the transformer, let's require the user to pass the names of the text variables. They can pass one or more variables if there is more than one text column.
Other than that, we need to add the transformer to the docs/index file, the readme, and docs/api, and adjust the tests and the demo to the newer functionality. Then it is good to merge.
Thank you very much for this great addition.
- Test string variable auto-conversion to list
- Test invalid features type error
- Test multiple text columns
- Test transform on new data after fit
- Test punctuation, ratio, avg_word_length, and lowercase features
877550f to 0b8a183
solegalli left a comment
Hi @ankitlade12
Thank you very much for adding the doc files back to the PR.
Would you mind going over the comments again? Many are marked as resolved, but haven't been resolved.
We are getting closer to the final version :)
Thanks a lot!
| - **starts_with_uppercase**: Binary indicator if text starts with uppercase | ||
| - **ends_with_punctuation**: Binary indicator if text ends with .!? | ||
| - **unique_word_count**: Number of unique words (case-insensitive) | ||
| - **unique_word_ratio**: Ratio of unique words to total words |
| - **unique_word_ratio**: Ratio of unique words to total words | |
| - **lexical_diversity**: Ratio of unique words to total words |
| (e.g., 'Dr.', 'U.S.', 'e.g.', 'i.e.') or text without punctuation. These abbreviations | ||
| will be counted as sentence endings, resulting in an overestimate of the actual sentence count. | ||
|
|
||
| The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater. |
| The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater. | |
| The features **number of unique words** and **lexical diversity** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the lexical diversity are greater. |
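To make the definition above concrete, here is a minimal sketch of how lexical diversity (unique words divided by total words, case-insensitive) could be computed with pandas. The function name `lexical_diversity` and the whitespace tokenisation are assumptions for illustration; the transformer in this PR may tokenise differently, for example by stripping punctuation.

```python
import pandas as pd


def lexical_diversity(text: pd.Series) -> pd.Series:
    """Ratio of unique words to total words (hypothetical helper)."""
    # Lowercase so that 'The' and 'the' count as the same word,
    # then split on whitespace (an assumed tokenisation).
    words = text.str.lower().str.split()
    total = words.str.len()
    unique = words.apply(lambda w: len(set(w)))
    # Mask zero totals to avoid dividing by zero; empty texts get 0.0.
    return (unique / total.where(total > 0)).fillna(0.0)


s = pd.Series(["the cat and the cat", "every word here differs", ""])
result = lexical_diversity(s)
print(result)
```

The repetitive first text scores lower than the second, where no word repeats, which matches the intuition described in the user guide paragraph.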
| By default, :class:`TextFeatures()` raises an error if the variables contain missing values. | ||
| This behavior can be changed by setting the parameter ``missing_values`` to ``'ignore'``. | ||
| In this case, missing values will be treated as empty strings, and the numerical features | ||
| will be calculated accordingly (e.g., word count and character count will be 0). |
| will be calculated accordingly (e.g., word count and character count will be 0). | |
| will be calculated accordingly (e.g., word count and character count will be 0) as shown in the following example: |
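As a rough illustration of the behaviour described here, the sketch below uses plain pandas rather than `TextFeatures` itself: missing values are replaced with empty strings, so the counts for that row come out as 0. The output column names `text_char_count` and `text_word_count` are assumptions, not necessarily the names the transformer produces.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"text": ["Hello world", np.nan, "Bye"]})

# Treat missing values as empty strings before computing features.
filled = X["text"].fillna("")

features = pd.DataFrame(
    {
        # Hypothetical feature names, for illustration only.
        "text_char_count": filled.str.len(),
        "text_word_count": filled.str.split().str.len(),
    }
)
print(features)
```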
|
|
||
| print(X_transformed) | ||
|
|
||
| Output: |
| Output: | |
| In the resulting dataframe, we see that the row with NaN returned 0 in the character count: |
| ] | ||
| }) | ||
|
|
||
| Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase: |
| Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase: | |
| Now let's extract 5 specific text features: the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase: |
| 'Awful', | ||
| ] | ||
| }) | ||
|
|
It would be great to show the input dataframe here.
|
|
||
| print(X_transformed) | ||
|
|
||
| Output: |
| Output: | |
| In the following output, we see the resulting dataframe containing the numerical features extracted from the pieces of text: |
| 1 Not great. Would not recommend. Disappointed 5 31 2 0 0.032258 | ||
| 2 OK for the price. 3 out of 5 stars. Average 8 35 2 3 0.057143 | ||
| 3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696 | ||
|
|
This dataframe doesn't look nice. I think it needs the 4-space indentation to display correctly in the docs. Also, if it does not fit in the doc, I tend to copy the last columns underneath, instead of having the double row system.
| 3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696 | ||
|
|
||
| Extracting all features | ||
| ----------------------- |
This is a subheading of the Python demo, so the underline should be tildes.
| Feature-engine allows you to quickly extract numerical features from short pieces of text, to complement your predictive models. These features aim to capture a piece of text’s complexity by looking at some statistical parameters of the text, such as the word length and count, the number of words and unique words used, the number of sentences, and so on. :class:`TextFeatures()` extracts many numerical features from text out-of-the-box. | ||
|
|
||
| TextFeatures | ||
| ============ |
| ============ | |
| ------------ |
| 3 TERRIBLE!!! DO NOT BUY! Awful 4 23 2 0 0.608696 | ||
|
|
||
| Extracting all features | ||
| ----------------------- |
| ----------------------- | |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| X_transformed = tf.transform(X) | ||
|
|
||
| print(X_transformed.head()) | ||
|
|
Could we display the output?
| tf.fit(X) | ||
| X_transformed = tf.transform(X) | ||
|
|
||
| print(X_transformed) |
Could we display the output?
| df = pd.DataFrame({'text': data.data, 'target': data.target}) | ||
| X_train, X_test, y_train, y_test = train_test_split( | ||
| df[['text']], df['target'], test_size=0.3, random_state=42 | ||
| ) |
If I may continue to be picky: I would break the code here, to show the resulting input dataframe, and then continue setting the pipeline. It's a pain for us, but super useful for the readers of the docs :)
| combined_pipe.fit(X_train, y_train) | ||
| print(f"Combined Accuracy: {combined_pipe.score(X_test, y_test):.3f}") | ||
|
|
||
| Output: |
| Output: | |
| Below we see the accuracy of a model trained using only the bag of words, compared with a model trained using both the bag of words and the additional metadata: |
| lambda s: len(set(s)) if isinstance(s, list) else 0 | ||
| ) | ||
| ), | ||
| "unique_word_ratio": lambda x: ( |
| "unique_word_ratio": lambda x: ( | |
| "lexical_diversity": lambda x: ( |
| - 'starts_with_uppercase': Binary indicator if text starts with uppercase | ||
| - 'ends_with_punctuation': Binary indicator if text ends with .!? | ||
| - 'unique_word_count': Number of unique words (case-insensitive) | ||
| - 'unique_word_ratio': Ratio of unique words to total words |
| - 'unique_word_ratio': Ratio of unique words to total words | |
| - 'lexical_diversity': Ratio of unique words to total words |
|
|
||
| Parameters | ||
| ---------- | ||
| variables: list |
| variables: list | |
| variables: string, list |
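Accepting either a string or a list for `variables` implies normalising the input to a list at some point. A minimal sketch of such a check follows; the helper name `as_variable_list` and the error message are hypothetical and are not taken from the PR or from Feature-engine's existing validation utilities.

```python
from typing import List, Union


def as_variable_list(variables: Union[str, List[str]]) -> List[str]:
    """Normalise a user-supplied variable spec to a list of names."""
    # A single column name is wrapped in a list.
    if isinstance(variables, str):
        return [variables]
    # A non-empty list of strings is accepted as is (copied).
    if (
        isinstance(variables, list)
        and variables
        and all(isinstance(v, str) for v in variables)
    ):
        return list(variables)
    # Anything else, including None, is rejected because the text
    # variables must be named explicitly.
    raise ValueError(
        "variables must be a string or a non-empty list of strings. "
        f"Got {variables!r} instead."
    )


print(as_variable_list("review"))
print(as_variable_list(["title", "review"]))
```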
| X = X[self.feature_names_in_] | ||
|
|
||
| # Extract features for each text variable | ||
| for var in self.variables_: |
I think we can do this without the loop:
X[self.variables_] = X[self.variables_].fillna("")
It should be more efficient.
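The suggestion can be checked on a toy dataframe. The sketch below compares the per-column loop with the single multi-column assignment and confirms that both give the same result; the dataframe and the `variables` list are made up for illustration, and no timing claim is made here.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "title": ["Great", np.nan],
        "body": [np.nan, "Works well"],
        "price": [1.0, np.nan],  # non-text column, must stay untouched
    }
)
variables = ["title", "body"]  # stand-in for self.variables_

# Loop version: fill one text column at a time.
looped = X.copy()
for var in variables:
    looped[var] = looped[var].fillna("")

# Loop-free version: fill all text columns in one assignment.
vectorised = X.copy()
vectorised[variables] = vectorised[variables].fillna("")

same = looped.equals(vectorised)
print(same)
```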
| from feature_engine.text import TextFeatures | ||
|
|
||
|
|
||
| class TestTextFeatures: |
We don't test within a class. Could we refactor?
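For reference, the requested layout is plain module-level test functions that pytest collects directly, with no enclosing `Test...` class. The sketch below shows the shape only; `word_count` is a hypothetical stand-in for the transformer under test, so that the example runs without the code from this PR.

```python
import pandas as pd


def word_count(text: pd.Series) -> pd.Series:
    # Hypothetical stand-in for the behaviour under test.
    return text.str.split().str.len()


# Each test is a top-level function; pytest discovers it by its
# 'test_' prefix, so no class wrapper is needed.
def test_word_count_single_sentence():
    assert word_count(pd.Series(["one two three"])).tolist() == [3]


def test_word_count_empty_string():
    assert word_count(pd.Series([""])).tolist() == [0]
```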
solegalli left a comment
Hey @ankitlade12
Thank you very much for addressing all the comments. We are getting to the final version. It's looking really good!
I made a few suggestions for the user guide.
We changed the name of the last feature to lexical diversity at some point, and then the change was lost, so I am bringing it back.
We also changed the tests from having the class to having single tests. That change was also lost, could you please bring it back?
The init file in the tests needs to be deleted.
I think that's pretty much it! Then we should be good to merge!
Thanks a lot for this great addition!