Commit c7ebb57

Expand on what this library can be used for
1 parent c5cc532 commit c7ebb57

1 file changed

+8
-0
lines changed

README.rst

Lines changed: 8 additions & 0 deletions
@@ -26,6 +26,14 @@ or ``.get_text()`` from Beautiful Soup?
 Text extracted with ``html_text`` does not contain inline styles,
 javascript, comments and other text that is not normally visible to the users.
 
+Apart from just getting text from the page (e.g. for display or search),
+one intended usage of this library is for machine learning (feature extraction).
+If you want to use the text of the html page as a feature (e.g. for classification),
+this library gives you plain text that you can later feed into a standard text
+classification pipeline.
+If you feel that you need html structure as well, check out
+`webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library.
+
 
 Install
 -------
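
The feature-extraction use case described in the added paragraph could look roughly like the sketch below. It is an illustration, not part of this commit: ``page_text`` and ``bag_of_words`` are hypothetical helper names, the sample page is made up, and the stdlib fallback is only a crude stand-in so the snippet runs without ``html_text`` installed (it is not equivalent to the library's extraction).

```python
import re
from collections import Counter

try:
    import html_text  # the library this README describes

    def page_text(html):
        # extract_text returns the text a user would normally see on the page.
        return html_text.extract_text(html)

except ImportError:
    # Crude stand-in (assumption, for illustration only): drop the contents
    # of <script> and <style>, keep all other text nodes.
    from html.parser import HTMLParser

    class _VisibleText(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip:
                self.parts.append(data)

    def page_text(html):
        parser = _VisibleText()
        parser.feed(html)
        parser.close()
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(parser.parts).split())


def bag_of_words(html):
    """Hypothetical helper: lowercase word counts of the page's visible text."""
    return Counter(re.findall(r"[a-z0-9]+", page_text(html).lower()))


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Cheap flights</h1><p>Book cheap flights today.</p>"
    "<script>var x = 1;</script></body></html>"
)
features = bag_of_words(page)
```

Word counts are only the simplest possible feature; in practice the string returned by ``page_text`` would more likely be passed to whatever vectorizer and classifier the text classification pipeline already uses.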
