[ENH] Add TextFeatures transformer for text feature extraction by ankitlade12 · Pull Request #880 · feature-engine/feature_engine

ankitlade12 · 2026-01-08T15:52:24Z

Add TextFeatures class to extract features from text columns
Support for features: char_count, word_count, digit_count, uppercase_count, etc.
Add comprehensive tests with pytest parametrize
Add user guide documentation
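
For readers skimming the thread, here is a minimal sketch of how counts like these can be derived with pandas string methods. It is not the PR's implementation: the column name `review`, the sample rows, and the exact definitions (for example, whether `char_count` includes whitespace) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name "review" is an assumption.
df = pd.DataFrame({"review": ["Great product!", "OK, 3 of 5 stars"]})

text = df["review"]

# Each feature is one vectorized string operation on the text column.
features = pd.DataFrame(
    {
        "char_count": text.str.len(),                  # all characters, spaces included
        "word_count": text.str.split().str.len(),      # whitespace-separated tokens
        "digit_count": text.str.count(r"\d"),          # characters matching a digit
        "uppercase_count": text.str.count(r"[A-Z]"),   # ASCII uppercase letters only
    }
)

print(features)
```

The transformer in the PR may tokenize or count differently; the point is only that each feature maps to a single column-wise operation.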

solegalli

Thanks a lot!

Function-wise, I'd say this transformer is ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.

Other than that, we need the various docs files and we'll be good to go :)

Thanks again!

solegalli · 2026-01-11T15:23:37Z

feature_engine/text/text_features.py

+        """
+        This transformer does not learn parameters.
+
+        Stores feature names and validates that the specified variables are


Please remove this sentence

feature_engine/text/text_features.py

tests/test_text/test_text_features.py

solegalli · 2026-01-26T03:39:08Z

We need to rebase main so the 2 remaining tests pass.

solegalli · 2026-01-26T04:06:12Z

feature_engine/text/text_features.py

+    Methods
+    -------
+    fit:
+        This transformer does not learn parameters. It stores the feature names


Suggested change

This transformer does not learn parameters. It stores the feature names

This transformer does not learn parameters.

solegalli · 2026-01-26T04:06:24Z

feature_engine/text/text_features.py

+    -------
+    fit:
+        This transformer does not learn parameters. It stores the feature names
+        and validates input.


Suggested change

and validates input.

feature_engine/text/text_features.py

solegalli

Hi @ankitlade12

I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.

I think, for the first version of the transformer, let's require the user to pass the names of the text variables. They can pass one or more variables if there is more than one text column.

Other than that, we need to add the transformer to the docs/index file, the readme, and docs/api, and adjust the tests and the demo to the newer functionality. Then it is good to merge.

Thank you very much for this great addition.

tests/test_text/test_text_features.py

- Test string variable auto-conversion to list
- Test invalid features type error
- Test multiple text columns
- Test transform on new data after fit
- Test punctuation, ratio, avg_word_length, and lowercase features

solegalli

Hi @ankitlade12

Thank you very much for adding the doc files back to the PR.

Would you mind going over the comments again? Many are marked as resolved, but haven't been resolved.

We are getting closer to the final version :)

Thanks a lot!

docs/user_guide/text/index.rst

solegalli · 2026-02-13T12:35:37Z

docs/user_guide/text/TextFeatures.rst

+- **starts_with_uppercase**: Binary indicator if text starts with uppercase
+- **ends_with_punctuation**: Binary indicator if text ends with .!?
+- **unique_word_count**: Number of unique words (case-insensitive)
+- **unique_word_ratio**: Ratio of unique words to total words


Suggested change

- **unique_word_ratio**: Ratio of unique words to total words

- **lexical_diversity**: Ratio of unique words to total words

solegalli · 2026-02-13T12:36:55Z

docs/user_guide/text/TextFeatures.rst

+(e.g., 'Dr.', 'U.S.', 'e.g.', 'i.e.') or text without punctuation. These abbreviations
+will be counted as sentence endings, resulting in an overestimate of the actual sentence count.
+
+The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater.


Suggested change

The features **number of unique words** and **unique word ratio** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the unique word ratio are greater.

The features **number of unique words** and **lexical diversity** are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the lexical diversity are greater.
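
To make the definition concrete, here is a small sketch of lexical diversity as described in this thread: unique words divided by total words, compared case-insensitively. The function name and the whitespace tokenization are assumptions; the transformer may split words differently.

```python
def lexical_diversity(text: str) -> float:
    """Ratio of unique words to total words (case-insensitive).

    Hypothetical helper for illustration; words are split on whitespace.
    """
    words = text.lower().split()
    if not words:
        # Avoid dividing by zero for empty text.
        return 0.0
    return len(set(words)) / len(words)


# A repetitive text scores lower than one that never repeats a word.
repetitive = lexical_diversity("the cat saw the cat")
varied = lexical_diversity("every word here is different")
print(repetitive, varied)
```

Returning 0.0 for empty text is a design choice of this sketch, not something confirmed by the PR.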

solegalli · 2026-02-13T12:38:48Z

docs/user_guide/text/TextFeatures.rst

+By default, :class:`TextFeatures()` raises an error if the variables contain missing values.
+This behavior can be changed by setting the parameter ``missing_values`` to ``'ignore'``.
+In this case, missing values will be treated as empty strings, and the numerical features
+will be calculated accordingly (e.g., word count and character count will be 0).


Suggested change

will be calculated accordingly (e.g., word count and character count will be 0).

will be calculated accordingly (e.g., word count and character count will be 0) as shown in the following example:
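
The behaviour described above can be approximated outside the transformer with plain pandas. This is a sketch under the assumption that "ignore" amounts to replacing missing values with empty strings before counting; the sample series is made up.

```python
import pandas as pd

# Hypothetical text column with one missing value.
s = pd.Series(["Good value", None, "Bad"])

# Treat missing values as empty strings, then count as usual.
filled = s.fillna("")
char_count = filled.str.len()
word_count = filled.str.split().str.len()

print(pd.DataFrame({"text": s, "char_count": char_count, "word_count": word_count}))
```

The row that held the missing value gets zero characters and zero words, because an empty string splits into an empty list.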

solegalli · 2026-02-13T12:39:52Z

docs/user_guide/text/TextFeatures.rst

+
+    print(X_transformed)
+
+Output:


Suggested change

Output:

In the resulting dataframe, we see that the row with NaN returned 0 in the character count:

solegalli · 2026-02-13T12:40:52Z

docs/user_guide/text/TextFeatures.rst

+        ]
+    })
+
+Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:


Suggested change

Now let's extract 5 specific text features, the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:

Now let's extract 5 specific text features: the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:

solegalli · 2026-02-13T12:42:14Z

docs/user_guide/text/TextFeatures.rst

+            'Awful',
+        ]
+    })
+


It would be great to show the input dataframe here.

solegalli · 2026-02-13T12:42:56Z

docs/user_guide/text/TextFeatures.rst

+
+    print(X_transformed)
+
+Output:


Suggested change

Output:

In the following output, we see the resulting dataframe containing the numerical features extracted from the pieces of text:

solegalli · 2026-02-13T12:44:22Z

docs/user_guide/text/TextFeatures.rst

+    1             Not great. Would not recommend.   Disappointed                  5                 31                      2                  0                0.032258
+    2       OK for the price. 3 out of 5 stars.        Average                  8                 35                      2                  3                0.057143
+    3                     TERRIBLE!!! DO NOT BUY!          Awful                  4                 23                      2                  0                0.608696
+


This dataframe doesn't look nice. I think it needs the 4-space indentation to display correctly in the docs. Also, if it does not fit in the doc, I tend to copy the last columns underneath, instead of having the double row system.

solegalli · 2026-02-13T12:44:41Z

docs/user_guide/text/TextFeatures.rst

+    3                     TERRIBLE!!! DO NOT BUY!          Awful                  4                 23                      2                  0                0.608696
+
+Extracting all features
+-----------------------


This is a subheading of the Python demo, so the underline should use tildes.

solegalli · 2026-02-13T12:45:32Z

docs/user_guide/text/TextFeatures.rst

+Feature-engine allows you to quickly extract numerical features from short pieces of text, to complement your predictive models. These features aim to capture a piece of text’s complexity by looking at some statistical parameters of the text, such as the word length and count, the number of words and unique words used, the number of sentences, and so on. :class:`TextFeatures()` extracts many numerical features from text out-of-the-box.
+
+TextFeatures
+============


Suggested change

============

------------

solegalli · 2026-02-13T12:47:24Z

docs/user_guide/text/TextFeatures.rst

+    3                     TERRIBLE!!! DO NOT BUY!          Awful                  4                 23                      2                  0                0.608696
+
+Extracting all features
+-----------------------


Suggested change

-----------------------

~~~~~~~~~~~~~~~~~~~~~~~

solegalli · 2026-02-13T12:48:04Z

docs/user_guide/text/TextFeatures.rst

+    X_transformed = tf.transform(X)
+
+    print(X_transformed.head())
+


Could we display the output?

solegalli · 2026-02-13T12:48:29Z

docs/user_guide/text/TextFeatures.rst

+    tf.fit(X)
+    X_transformed = tf.transform(X)
+
+    print(X_transformed)


Could we display the output?

solegalli · 2026-02-13T12:50:17Z

docs/user_guide/text/TextFeatures.rst

+    df = pd.DataFrame({'text': data.data, 'target': data.target})
+    X_train, X_test, y_train, y_test = train_test_split(
+        df[['text']], df['target'], test_size=0.3, random_state=42
+    )


If I may continue to be picky: I would break the code here, to show the resulting input dataframe, and then continue setting the pipeline. It's a pain for us, but super useful for the readers of the docs :)

solegalli · 2026-02-13T12:53:09Z

docs/user_guide/text/TextFeatures.rst

+    combined_pipe.fit(X_train, y_train)
+    print(f"Combined Accuracy: {combined_pipe.score(X_test, y_test):.3f}")
+
+Output:


Suggested change

Output:

Below we see the accuracy of a model trained using only the bag of words, compared with a model trained using both the bag of words and the additional metadata:

solegalli · 2026-02-13T12:55:19Z

feature_engine/text/text_features.py

+            lambda s: len(set(s)) if isinstance(s, list) else 0
+        )
+    ),
+    "unique_word_ratio": lambda x: (


Suggested change

"unique_word_ratio": lambda x: (

"lexical_diversity": lambda x: (

solegalli · 2026-02-13T12:56:04Z

feature_engine/text/text_features.py

+        - 'starts_with_uppercase': Binary indicator if text starts with uppercase
+        - 'ends_with_punctuation': Binary indicator if text ends with .!?
+        - 'unique_word_count': Number of unique words (case-insensitive)
+        - 'unique_word_ratio': Ratio of unique words to total words


Suggested change

- 'unique_word_ratio': Ratio of unique words to total words

- 'lexical_diversity': Ratio of unique words to total words

solegalli · 2026-02-13T12:58:10Z

feature_engine/text/text_features.py

+
+    Parameters
+    ----------
+    variables: list


Suggested change

variables: list

variables: string, list
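
Accepting either a string or a list usually means normalizing the input to a list early on. A minimal sketch of that idea follows; `normalize_variables` is a hypothetical helper, not Feature-engine's own validation code, and the error message is illustrative.

```python
def normalize_variables(variables):
    """Return the text variable names as a list of strings.

    Hypothetical helper: a single string becomes a one-element list,
    a non-empty list of strings is copied, anything else is rejected.
    """
    if isinstance(variables, str):
        return [variables]
    if (
        isinstance(variables, list)
        and variables
        and all(isinstance(v, str) for v in variables)
    ):
        return list(variables)
    raise ValueError(
        "variables must be a string or a non-empty list of strings. "
        f"Got {variables!r} instead."
    )


print(normalize_variables("review"))
print(normalize_variables(["title", "body"]))
```

Rejecting `None` here matches the request in this thread that users name the text variables explicitly.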

solegalli · 2026-02-13T13:02:43Z

feature_engine/text/text_features.py

+        X = X[self.feature_names_in_]
+
+        # Extract features for each text variable
+        for var in self.variables_:


I think we can do this without the loop:

X[self.variables_] = X[self.variables_].fillna("")

It should be more efficient.
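
A quick sketch of the loop-free version, to show that it gives the same result as filling column by column. The dataframe and the `text_vars` list are made up for illustration and stand in for `X` and `self.variables_`.

```python
import pandas as pd

# Hypothetical data: two text columns with missing values, one numeric column.
X = pd.DataFrame(
    {
        "title": ["Hi", None],
        "body": [None, "Some text"],
        "price": [1.0, 2.0],
    }
)
text_vars = ["title", "body"]  # stands in for self.variables_

# Loop version: fill one text column at a time.
X_loop = X.copy()
for var in text_vars:
    X_loop[var] = X_loop[var].fillna("")

# Loop-free version: fill all text columns in one assignment.
X_vec = X.copy()
X_vec[text_vars] = X_vec[text_vars].fillna("")

print(X_loop.equals(X_vec))
```

Only the selected text columns are touched; the numeric column is left as it was.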

solegalli · 2026-02-13T13:05:02Z

tests/test_text/__init__.py

+from feature_engine.text import TextFeatures
+
+
+class TestTextFeatures:


We don't test within a class. Could we refactor?
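
For reference, the requested layout is plain module-level test functions, which pytest collects by their `test_` prefix. The sketch below uses a hypothetical `count_words` helper in place of the transformer so that it runs on its own.

```python
def count_words(text):
    """Hypothetical stand-in for the behaviour under test."""
    return len(text.split())


# Module-level functions instead of methods on a TestTextFeatures class.
def test_count_words_simple():
    assert count_words("one two three") == 3


def test_count_words_empty_string():
    assert count_words("") == 0
```

Each function is collected and reported independently, so no class wrapper is needed.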

solegalli

Hey @ankitlade12

Thank you very much for addressing all the comments. We are getting to the final version. It's looking really good!

I made a few suggestions for the user guide.

We changed the name of the last feature to lexical diversity at some point, and then the change was lost, so I am bringing it back.

We also changed the tests from having the class to having single tests. That change was also lost, could you please bring it back?

The init file in the tests needs to be deleted.

I think that's pretty much it! Then we should be good to merge!

Thanks a lot for this great addition!

solegalli mentioned this pull request Jan 11, 2026

Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875

Closed

solegalli reviewed Jan 11, 2026

ankitlade12 requested a review from solegalli January 23, 2026 16:40