Hi, I'm using the `hdr_info` parameter of `pdf4llm.to_markdown()` to customize the section headers (add a given number of `#`). But when I use the following method (a test version, just to see what it receives):
some sentences are never passed to it. That is, the sentences do appear in the generated Markdown, but when I print every span, not all of them show up (some lines are never "returned"). In the following PDF, for example, the full title of the section is not returned, and the first sentence of the first list is not printed either. (I'm using pdf4llm version 0.0.9.)
Replies: 1 comment 1 reply
I suggest you make a list of the spans that are actually created to see what the situation is:

    # `page` is an already opened pymupdf.Page
    blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
    spans = [s for b in blocks for l in b["lines"] for s in l["spans"]]
    for s in spans:
        print(f"{s['text']=}")
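Once you have that list, one way to pin down exactly which spans are missing is to diff it against the texts your `hdr_info` test printed. The following is only a rough sketch: `flatten_spans`, `missing_spans`, `sample_blocks` and `seen_by_callback` are hypothetical names, and the sample data is hand-made to mimic the shape of `page.get_text("dict")["blocks"]`, so it runs without a PDF.

```python
def flatten_spans(blocks):
    """Collect the text of every span in a get_text("dict")-style block list."""
    return [
        s["text"]
        for b in blocks
        for l in b.get("lines", [])  # blocks without a "lines" key are skipped
        for s in l["spans"]
    ]


def missing_spans(blocks, seen_texts):
    """Return span texts that are on the page but were never seen by the callback."""
    seen = set(seen_texts)
    return [t for t in flatten_spans(blocks) if t not in seen]


# Hand-made sample (an assumption, not real PDF output).
sample_blocks = [
    {"lines": [{"spans": [{"text": "1. Introduction"}, {"text": " and scope"}]}]},
    {"lines": [{"spans": [{"text": "First list item"}]}]},
]
# Hypothetical: the texts your hdr_info test function printed.
seen_by_callback = ["1. Introduction"]

print(missing_spans(sample_blocks, seen_by_callback))
```

With real data you would pass the `blocks` list from the snippet above and the texts collected in your own test function. Note that this compares texts only, so repeated identical spans are treated as one.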