Commit 3240e3d

rfctr(pptx): minify HTML and table.text is cct (#3734)
**Summary**

Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_pptx()`. Produce minified `.text_as_html` consistent with that formed by chunking.

**Additional Context**

- PPTX `.metadata.text_as_html` is minified (no extra whitespace or `thead`, `tbody`, `tfoot` elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is removed from `base.in`.
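
As a rough illustration of the two outputs the message describes, the sketch below builds minified table HTML and clean-concatenated-text from a matrix of cell strings. The helper names `minify_table_html` and `clean_concatenated_text` are hypothetical and rely only on the standard library; they are not the `unstructured` implementation, and the whitespace-normalization and space-joining rules are assumptions.

```python
from html import escape
from typing import Sequence


def _clean(cell: str) -> str:
    """Collapse runs of whitespace and trim both ends (assumed normalization)."""
    return " ".join(cell.split())


def minify_table_html(rows: Sequence[Sequence[str]]) -> str:
    """Hypothetical helper: emit <table> HTML with no whitespace between tags
    and no thead/tbody/tfoot wrappers."""
    if not rows:
        return ""
    trs = "".join(
        "<tr>" + "".join(f"<td>{escape(_clean(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{trs}</table>"


def clean_concatenated_text(rows: Sequence[Sequence[str]]) -> str:
    """Hypothetical helper: join the non-empty cell texts with single spaces."""
    cells = (_clean(cell) for row in rows for cell in row)
    return " ".join(c for c in cells if c)


rows = [["Column 1", "Column 2"], ["Red ", " Green"]]
print(minify_table_html(rows))
print(clean_concatenated_text(rows))
```

Every row is emitted as `<tr>` with `<td>` cells, including the first, which matches the shape of the updated expectation in `test_partition_pptx_grabs_tables`.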
1 parent 3dea723 commit 3240e3d

7 files changed

+59
-94
lines changed

CHANGELOG.md

Lines changed: 5 additions & 4 deletions
@@ -1,4 +1,4 @@
-## 0.16.1-dev3
+## 0.16.1-dev4

 ### Enhancements

@@ -8,10 +8,11 @@

 * **Remove unsupported chipper model**
 * **Rewrite of `partition.email` module and tests.** Use modern Python stdlib `email` module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
-* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
+* **Minify text_as_html from DOCX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
 * **Fall back to filename extension-based file-type detection for unidentified OLE files.** Resolves a problem where a DOC file that could not be detected as such by `filetype` was incorrectly identified as a MSG file.
-* **Minify text_as_html from XLSX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
-* **Minify text_as_html from CSV.** Previously `.metadata.text_as_html` for CSV tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
+* **Minify text_as_html from XLSX.** Previously `.metadata.text_as_html` for DOCX tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
+* **Minify text_as_html from CSV.** Previously `.metadata.text_as_html` for CSV tables was "bloated" with whitespace and noise elements introduced by `pandas` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
+* **Minify text_as_html from PPTX.** Previously `.metadata.text_as_html` for PPTX tables was "bloated" with whitespace and noise elements introduced by `tabulate` that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.

 ## 0.16.0

requirements/base.in

Lines changed: 0 additions & 1 deletion
@@ -4,7 +4,6 @@ filetype
 python-magic
 lxml
 nltk
-tabulate
 requests
 beautifulsoup4
 emoji

test_unstructured/partition/common/test_common.py

Lines changed: 0 additions & 11 deletions
@@ -388,17 +388,6 @@ def test_convert_office_docs_respects_wait_timeout():
     assert np.sum([(path / "simple.docx").is_file() for path in paths_to_save]) < 3


-class MockDocxEmptyTable:
-    def __init__(self):
-        self.rows = []
-
-
-def test_convert_ms_office_table_to_text_works_with_empty_tables():
-    table = MockDocxEmptyTable()
-    assert common.convert_ms_office_table_to_text(table, as_html=True) == ""
-    assert common.convert_ms_office_table_to_text(table, as_html=False) == ""
-
-
 @pytest.mark.parametrize(
     ("text", "expected"),
     [

test_unstructured/partition/test_pptx.py

Lines changed: 6 additions & 10 deletions
@@ -247,15 +247,11 @@ def test_partition_pptx_grabs_tables():
     assert elements[1].text.startswith("Column 1")
     assert elements[1].text.strip().endswith("Aqua")
     assert elements[1].metadata.text_as_html == (
-        "<table>\n"
-        "<thead>\n"
-        "<tr><th>Column 1 </th><th>Column 2 </th><th>Column 3 </th></tr>\n"
-        "</thead>\n"
-        "<tbody>\n"
-        "<tr><td>Red </td><td>Green </td><td>Blue </td></tr>\n"
-        "<tr><td>Purple </td><td>Orange </td><td>Yellow </td></tr>\n"
-        "<tr><td>Tangerine </td><td>Pink </td><td>Aqua </td></tr>\n"
-        "</tbody>\n"
+        "<table>"
+        "<tr><td>Column 1</td><td>Column 2</td><td>Column 3</td></tr>"
+        "<tr><td>Red</td><td>Green</td><td>Blue</td></tr>"
+        "<tr><td>Purple</td><td>Orange</td><td>Yellow</td></tr>"
+        "<tr><td>Tangerine</td><td>Pink</td><td>Aqua</td></tr>"
         "</table>"
     )
     assert elements[1].metadata.filename == "fake-power-point-table.pptx"
@@ -516,7 +512,7 @@ def test_partition_pptx_hierarchy_sample_document():
         (2, "6ec455f5f19782facf184886876c9a66", "5614b00c3f6bff23ebba1360e10f6428"),
         (0, "8319096532fe2e55f66c491ea8313150", "2f57a8d4182e6fd5bd5842b0a2d9841b"),
         (None, None, "4120066d251ba675ade42e8a167ca61f"),
-        (None, None, "2ed3bd10daace79ac129cbf8faf22bfc"),
+        (None, None, "efb9d74b4f8be6308c9a9006da994e12"),
         (0, None, "fd08cacbaddafee5cbacc02528536ee5"),
     ]

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.16.1-dev3" # pragma: no cover
+__version__ = "0.16.1-dev4" # pragma: no cover

unstructured/partition/common/common.py

Lines changed: 0 additions & 25 deletions
@@ -9,7 +9,6 @@

 import emoji
 import psutil
-from tabulate import tabulate

 from unstructured.documents.coordinates import CoordinateSystem, PixelSpace
 from unstructured.documents.elements import (
@@ -29,9 +28,6 @@
 from unstructured.partition.utils.constants import SORT_MODE_DONT, SORT_MODE_XY_CUT
 from unstructured.utils import dependency_exists, first

-if dependency_exists("pptx") and dependency_exists("pptx.table"):
-    from pptx.table import Table as PptxTable
-
 if dependency_exists("numpy") and dependency_exists("cv2"):
     from unstructured.partition.utils.sorting import sort_page_elements

@@ -396,27 +392,6 @@ def convert_to_bytes(file: bytes | IO[bytes]) -> bytes:
     raise ValueError("Invalid file-like object type")


-def convert_ms_office_table_to_text(table: PptxTable, as_html: bool = True) -> str:
-    """Convert a PPTX table object to an HTML table string using the tabulate library.
-
-    Args:
-        table (Table): A pptx.table.Table object.
-        as_html (bool): Whether to return the table as an HTML string (True) or a
-            plain text string (False)
-
-    Returns:
-        str: An table string representation of the input table.
-    """
-    rows = list(table.rows)
-
-    if not rows:
-        return ""
-
-    headers = [cell.text for cell in rows[0].cells]
-    data = [[cell.text for cell in row.cells] for row in rows[1:]]
-    return tabulate(data, headers=headers, tablefmt="html" if as_html else "plain")
-
-
 def contains_emoji(s: str) -> bool:
     """
     Check if the input string contains any emoji characters.

unstructured/partition/pptx.py

Lines changed: 47 additions & 42 deletions
@@ -22,6 +22,7 @@
 from pptx.text.text import _Paragraph # pyright: ignore [reportPrivateUsage]

 from unstructured.chunking import add_chunking_strategy
+from unstructured.common.html_table import HtmlTable, htmlify_matrix_of_cell_texts
 from unstructured.documents.elements import (
     Element,
     ElementMetadata,
@@ -34,7 +35,6 @@
     Title,
 )
 from unstructured.file_utils.model import FileType
-from unstructured.partition.common.common import convert_ms_office_table_to_text
 from unstructured.partition.common.metadata import apply_metadata, get_last_modified_date
 from unstructured.partition.text_type import (
     is_email_address,
@@ -213,38 +213,6 @@ def _iter_picture_elements(self, picture: Picture) -> Iterator[Element]:
         PicturePartitionerCls = self._opts.picture_partitioner
         yield from PicturePartitionerCls.iter_elements(picture, self._opts)

-    def _iter_title_shape_element(self, shape: Shape) -> Iterator[Element]:
-        """Generate Title element for each paragraph in title `shape`.
-
-        Text is most likely a title, but in the rare case that the title shape was used
-        for the slide body text, also check for bulleted paragraphs."""
-        if self._shape_is_off_slide(shape):
-            return
-
-        depth = 0
-        for paragraph in shape.text_frame.paragraphs:
-            text = paragraph.text
-            if text.strip() == "":
-                continue
-
-            if self._is_bulleted_paragraph(paragraph):
-                bullet_depth = paragraph.level or 0
-                yield ListItem(
-                    text=text,
-                    metadata=self._opts.text_metadata(category_depth=bullet_depth),
-                    detection_origin=DETECTION_ORIGIN,
-                )
-            elif is_email_address(text):
-                yield EmailAddress(text=text, detection_origin=DETECTION_ORIGIN)
-            else:
-                # increment the category depth by the paragraph increment in the shape
-                yield Title(
-                    text=text,
-                    metadata=self._opts.text_metadata(category_depth=depth),
-                    detection_origin=DETECTION_ORIGIN,
-                )
-                depth += 1 # Cannot enumerate because we want to skip empty paragraphs
-
     def _iter_shape_elements(self, shape: Shape) -> Iterator[Element]:
         """Generate Text or subtype element for each paragraph in `shape`."""
         if self._shape_is_off_slide(shape):
@@ -280,18 +248,55 @@ def _iter_table_element(self, graphfrm: GraphicFrame) -> Iterator[Table]:

         An empty table does not produce an element.
         """
-        text_table = convert_ms_office_table_to_text(graphfrm.table, as_html=False).strip()
-        if not text_table:
+        if not (rows := list(graphfrm.table.rows)):
+            return
+
+        html_text = htmlify_matrix_of_cell_texts(
+            [[cell.text for cell in row.cells] for row in rows]
+        )
+        html_table = HtmlTable.from_html_text(html_text)
+
+        if not html_table.text:
             return
-        html_table = None
-        if self._opts.infer_table_structure:
-            html_table = convert_ms_office_table_to_text(graphfrm.table, as_html=True)
-        yield Table(
-            text=text_table,
-            metadata=self._opts.table_metadata(html_table),
-            detection_origin=DETECTION_ORIGIN,
+
+        metadata = self._opts.table_metadata(
+            html_table.html if self._opts.infer_table_structure else None
         )

+        yield Table(text=html_table.text, metadata=metadata, detection_origin=DETECTION_ORIGIN)
+
+    def _iter_title_shape_element(self, shape: Shape) -> Iterator[Element]:
+        """Generate Title element for each paragraph in title `shape`.
+
+        Text is most likely a title, but in the rare case that the title shape was used
+        for the slide body text, also check for bulleted paragraphs."""
+        if self._shape_is_off_slide(shape):
+            return
+
+        depth = 0
+        for paragraph in shape.text_frame.paragraphs:
+            text = paragraph.text
+            if text.strip() == "":
+                continue
+
+            if self._is_bulleted_paragraph(paragraph):
+                bullet_depth = paragraph.level or 0
+                yield ListItem(
+                    text=text,
+                    metadata=self._opts.text_metadata(category_depth=bullet_depth),
+                    detection_origin=DETECTION_ORIGIN,
+                )
+            elif is_email_address(text):
+                yield EmailAddress(text=text, detection_origin=DETECTION_ORIGIN)
+            else:
+                # increment the category depth by the paragraph increment in the shape
+                yield Title(
+                    text=text,
+                    metadata=self._opts.text_metadata(category_depth=depth),
+                    detection_origin=DETECTION_ORIGIN,
+                )
+                depth += 1 # Cannot enumerate because we want to skip empty paragraphs
+
     def _order_shapes(self, slide: Slide) -> tuple[Shape | None, Sequence[BaseShape]]:
         """Orders the shapes on `slide` from top to bottom and left to right.
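
The revised control flow of `_iter_table_element` can be sketched in isolation: no rows yields nothing, an all-blank table yields nothing, and HTML is attached only when table-structure inference is enabled. This is a minimal sketch under stated assumptions, not the code in `unstructured`: plain lists of strings stand in for python-pptx rows, `TableSketch` is a hypothetical stand-in for the `Table` element, and the inline HTML and text building only approximates what `htmlify_matrix_of_cell_texts` and `HtmlTable` do.

```python
from __future__ import annotations

from dataclasses import dataclass
from html import escape
from typing import Iterator, Sequence


@dataclass
class TableSketch:
    """Hypothetical stand-in for unstructured's `Table` element."""

    text: str
    text_as_html: str | None


def iter_table_element(
    rows: Sequence[Sequence[str]], infer_table_structure: bool = True
) -> Iterator[TableSketch]:
    """Yield zero or one table element for a matrix of cell texts."""
    # -- a table with no rows produces no element --
    if not rows:
        return

    # -- normalize whitespace in each cell (assumed cleaning rule) --
    cells = [[" ".join(cell.split()) for cell in row] for row in rows]
    text = " ".join(c for row in cells for c in row if c)

    # -- a table whose cells are all blank also produces no element --
    if not text:
        return

    # -- HTML is only attached when structure inference is requested --
    html = None
    if infer_table_structure:
        trs = "".join(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
            for row in cells
        )
        html = f"<table>{trs}</table>"

    yield TableSketch(text=text, text_as_html=html)
```

Deriving both the text and the HTML from one normalized matrix keeps the two representations consistent, which is the effect the diff achieves by reading `.text` and `.html` from a single `HtmlTable` object.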
