Commit 208c7ed

rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_csv()`. Produce minified `.text_as_html` consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
1 parent c85f29e commit 208c7ed
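
To make the two terms in the commit message concrete, here is a minimal sketch of what "minified" table HTML and clean-concatenated-text could look like for a small CSV. It is not the `partition_csv()` implementation: the helper names `rows_to_minified_html` and `rows_to_cct` are hypothetical, and the sketch uses only the Python standard library. The empty-cell form `<td/>` is taken from the expected constants in this commit's tests.

```python
import csv
import io
from html import escape


def rows_to_minified_html(rows: list[list[str]]) -> str:
    """Render rows as table HTML with no whitespace, attributes, or thead/tbody/tfoot."""

    def cell(text: str) -> str:
        text = text.strip()
        # Empty cells collapse to a self-closing tag, as in the expected test constants.
        return f"<td>{escape(text, quote=False)}</td>" if text else "<td/>"

    body = "".join(f"<tr>{''.join(cell(c) for c in row)}</tr>" for row in rows)
    return f"<table>{body}</table>"


def rows_to_cct(rows: list[list[str]]) -> str:
    """Join every word of every cell with single spaces (hypothetical CCT helper)."""
    return " ".join(word for row in rows for c in row for word in c.split())


# Assumed sample input: the first three rows of a "Stanley Cups" style CSV.
SAMPLE_CSV = "Stanley Cups,,\nTeam,Location,Stanley Cups\nBlues,STL,1\n"
rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
html = rows_to_minified_html(rows)
text = rows_to_cct(rows)
```

With this input, `html` contains no newlines or indentation, and `text` needs no further whitespace cleanup before comparison.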

38 files changed: +905 -926 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.16.1-dev2
+## 0.16.1-dev3

 ### Enhancements

@@ -11,6 +11,7 @@
 * **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
 * **Fall back to filename extension-based file-type detection for unidentified OLE files.** Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
 * **Minify text_as_html from XLSX.** Previously `.metadata.text_as_html` for XLSX tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
+* **Minify text_as_html from CSV.** Previously `.metadata.text_as_html` for CSV tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.

 ## 0.16.0

test_unstructured/partition/test_constants.py

Lines changed: 29 additions & 97 deletions
@@ -1,32 +1,32 @@
-EXPECTED_TABLE = """<table border="1" class="dataframe">
-  <tbody>
-    <tr>
-      <td>Stanley Cups</td>
-      <td></td>
-      <td></td>
-    </tr>
-    <tr>
-      <td>Team</td>
-      <td>Location</td>
-      <td>Stanley Cups</td>
-    </tr>
-    <tr>
-      <td>Blues</td>
-      <td>STL</td>
-      <td>1</td>
-    </tr>
-    <tr>
-      <td>Flyers</td>
-      <td>PHI</td>
-      <td>2</td>
-    </tr>
-    <tr>
-      <td>Maple Leafs</td>
-      <td>TOR</td>
-      <td>13</td>
-    </tr>
-  </tbody>
-</table>"""
+EXPECTED_TABLE = (
+    "<table>"
+    "<tr><td>Stanley Cups</td><td/><td/></tr>"
+    "<tr><td>Team</td><td>Location</td><td>Stanley Cups</td></tr>"
+    "<tr><td>Blues</td><td>STL</td><td>1</td></tr>"
+    "<tr><td>Flyers</td><td>PHI</td><td>2</td></tr>"
+    "<tr><td>Maple Leafs</td><td>TOR</td><td>13</td></tr>"
+    "</table>"
+)
+
+EXPECTED_TABLE_SEMICOLON_DELIMITER = (
+    "<table>"
+    "<tr><td>Year</td><td>Month</td><td>Revenue</td><td>Costs</td><td/></tr>"
+    "<tr><td>2022</td><td>1</td><td>123</td><td>-123</td><td/></tr>"
+    "<tr><td>2023</td><td>2</td><td>143,1</td><td>-814,38</td><td/></tr>"
+    "<tr><td>2024</td><td>3</td><td>215,32</td><td>-11,08</td><td/></tr>"
+    "</table>"
+)
+
+EXPECTED_TABLE_WITH_EMOJI = (
+    "<table>"
+    "<tr><td>Stanley Cups</td><td/><td/></tr>"
+    "<tr><td>Team</td><td>Location</td><td>Stanley Cups</td></tr>"
+    "<tr><td>Blues</td><td>STL</td><td>1</td></tr>"
+    "<tr><td>Flyers</td><td>PHI</td><td>2</td></tr>"
+    "<tr><td>Maple Leafs</td><td>TOR</td><td>13</td></tr>"
+    "<tr><td>👨\\U+1F3FB🔧</td><td>TOR</td><td>15</td></tr>"
+    "</table>"
+)

 EXPECTED_TABLE_XLSX = (
     "<table>"
@@ -54,74 +54,6 @@
     "Year Month Revenue Costs 2022 1 123 -123 2023 2 143,1 -814,38 2024 3 215,32 -11,08"
 )

-EXPECTED_TABLE_SEMICOLON_DELIMITER = """<table border="1" class="dataframe">
-  <tbody>
-    <tr>
-      <td>Year</td>
-      <td>Month</td>
-      <td>Revenue</td>
-      <td>Costs</td>
-      <td></td>
-    </tr>
-    <tr>
-      <td>2022</td>
-      <td>1</td>
-      <td>123</td>
-      <td>-123</td>
-      <td></td>
-    </tr>
-    <tr>
-      <td>2023</td>
-      <td>2</td>
-      <td>143,1</td>
-      <td>-814,38</td>
-      <td></td>
-    </tr>
-    <tr>
-      <td>2024</td>
-      <td>3</td>
-      <td>215,32</td>
-      <td>-11,08</td>
-      <td></td>
-    </tr>
-  </tbody>
-</table>"""
-
-EXPECTED_TABLE_WITH_EMOJI = """<table border="1" class="dataframe">
-  <tbody>
-    <tr>
-      <td>Stanley Cups</td>
-      <td></td>
-      <td></td>
-    </tr>
-    <tr>
-      <td>Team</td>
-      <td>Location</td>
-      <td>Stanley Cups</td>
-    </tr>
-    <tr>
-      <td>Blues</td>
-      <td>STL</td>
-      <td>1</td>
-    </tr>
-    <tr>
-      <td>Flyers</td>
-      <td>PHI</td>
-      <td>2</td>
-    </tr>
-    <tr>
-      <td>Maple Leafs</td>
-      <td>TOR</td>
-      <td>13</td>
-    </tr>
-    <tr>
-      <td>👨\\U+1F3FB🔧</td>
-      <td>TOR</td>
-      <td>15</td>
-    </tr>
-  </tbody>
-</table>"""
-
 EXPECTED_XLS_TABLE = (
     "<table><tr>"
     "<td>MC</td>"

test_unstructured/partition/test_csv.py

Lines changed: 1 addition & 4 deletions
@@ -200,11 +200,8 @@ def test_partition_csv_header():
     )

     table = elements[0]
-    assert clean_extra_whitespace(table.text) == (
-        "Stanley Cups Unnamed: 1 Unnamed: 2 " + EXPECTED_TEXT_XLSX
-    )
+    assert table.text == "Stanley Cups Unnamed: 1 Unnamed: 2 " + EXPECTED_TEXT_XLSX
     assert table.metadata.text_as_html is not None
-    assert "<thead>" in table.metadata.text_as_html


 # ================================================================================================

test_unstructured/partition/test_tsv.py

Lines changed: 19 additions & 24 deletions
@@ -14,7 +14,6 @@
 )
 from test_unstructured.unit_utils import assert_round_trips_through_JSON, example_doc_path
 from unstructured.chunking.title import chunk_by_title
-from unstructured.cleaners.core import clean_extra_whitespace
 from unstructured.documents.elements import Table
 from unstructured.partition.tsv import partition_tsv

@@ -31,21 +30,20 @@
 def test_partition_tsv_from_filename(filename: str, expected_text: str, expected_table: str):
     elements = partition_tsv(example_doc_path(filename), include_header=False)

-    assert clean_extra_whitespace(elements[0].text) == expected_text
-    assert elements[0].metadata.text_as_html == expected_table
-    assert elements[0].metadata.filetype == EXPECTED_FILETYPE
-    for element in elements:
-        assert element.metadata.filename == filename
+    table = elements[0]
+    assert table.text == expected_text
+    assert table.metadata.text_as_html == expected_table
+    assert table.metadata.filetype == EXPECTED_FILETYPE
+    assert all(e.metadata.filename == filename for e in elements)


 def test_partition_tsv_from_filename_with_metadata_filename():
     elements = partition_tsv(
         example_doc_path("stanley-cups.tsv"), metadata_filename="test", include_header=False
     )

-    assert clean_extra_whitespace(elements[0].text) == EXPECTED_TEXT
-    for element in elements:
-        assert element.metadata.filename == "test"
+    assert elements[0].text == EXPECTED_TEXT
+    assert all(e.metadata.filename == "test" for e in elements)


 @pytest.mark.parametrize(
@@ -59,21 +57,20 @@ def test_partition_tsv_from_file(filename: str, expected_text: str, expected_tab
     with open(example_doc_path(filename), "rb") as f:
         elements = partition_tsv(file=f, include_header=False)

-    assert clean_extra_whitespace(elements[0].text) == expected_text
-    assert isinstance(elements[0], Table)
-    assert elements[0].metadata.text_as_html == expected_table
-    assert elements[0].metadata.filetype == EXPECTED_FILETYPE
-    for element in elements:
-        assert element.metadata.filename is None
+    table = elements[0]
+    assert isinstance(table, Table)
+    assert table.text == expected_text
+    assert table.metadata.text_as_html == expected_table
+    assert table.metadata.filetype == EXPECTED_FILETYPE
+    assert all(e.metadata.filename is None for e in elements)


 def test_partition_tsv_from_file_with_metadata_filename():
     with open(example_doc_path("stanley-cups.tsv"), "rb") as f:
         elements = partition_tsv(file=f, metadata_filename="test", include_header=False)

-    assert clean_extra_whitespace(elements[0].text) == EXPECTED_TEXT
-    for element in elements:
-        assert element.metadata.filename == "test"
+    assert elements[0].text == EXPECTED_TEXT
+    assert all(element.metadata.filename == "test" for element in elements)


 # -- .metadata.last_modified ---------------------------------------------------------------------
@@ -142,12 +139,10 @@ def test_partition_tsv_header():
         example_doc_path("stanley-cups.tsv"), strategy="fast", include_header=True
     )

-    e = elements[0]
-    assert (
-        clean_extra_whitespace(e.text) == "Stanley Cups Unnamed: 1 Unnamed: 2 " + EXPECTED_TEXT_XLSX
-    )
-    assert e.metadata.text_as_html is not None
-    assert "<thead>" in e.metadata.text_as_html
+    table = elements[0]
+    assert table.text == "Stanley Cups Unnamed: 1 Unnamed: 2 " + EXPECTED_TEXT_XLSX
+    assert table.metadata.text_as_html is not None
+    assert "<table>" in table.metadata.text_as_html


 def test_partition_tsv_supports_chunking_strategy_while_partitioning():
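
The test changes above drop the `clean_extra_whitespace(...)` wrapper and compare `table.text` directly, which only works if the text is already whitespace-normalized. A minimal sketch of that property follows. The helper `is_whitespace_normalized` is hypothetical and is not part of `unstructured`; it is also stricter and simpler than whatever `clean_extra_whitespace` does internally.

```python
def is_whitespace_normalized(text: str) -> bool:
    """Return True when text has no leading/trailing whitespace and only single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining with single spaces is a no-op only for normalized text.
    return text == " ".join(text.split())


# Illustrative inputs (assumed, not taken from the test suite).
normalized = "Blues STL 1 Flyers PHI 2"
padded = "Blues  STL\n1   Flyers PHI 2 "
```

A check like this could sit alongside the exact-match assertions to document why no cleanup step is needed before comparing.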

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-0.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-1.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-2.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-3.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-4.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-5.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-6.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-7.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-8.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

test_unstructured_ingest/expected-structured-output/delta-table/0-9d594ee0-ad36-4e7e-a6be-f53975fe3d10-9.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.
