Header of a table that spans pages is lost #1876

Closed
ubbiiy opened this issue Mar 8, 2025 · 3 comments
Labels
bug Something isn't working

Comments

ubbiiy commented Mar 8, 2025

Description of the bug

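A table spans two pages: only the header row is on the first page, and the entire table body is on the next page. The header row cannot be found in content_list. See the screenshots below.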

Image

Image

How to reproduce the bug

Configuration used:
    "device-mode": "cuda",
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": false
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true,
        "max_time": 400
    },

Parse the following file. The table is at the very bottom of page 17.

2024年度河南航空港投资集团有限公司信用评级报告.pdf

Output obtained:

content_list.json

layout.json

model.json

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

1.0.x

Device mode

cuda

ubbiiy added the bug label Mar 8, 2025

myhloli (Collaborator) commented Mar 9, 2025

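The header of Table 12 on page 17 is too close to the bottom of the page, so it was recognized as a page footer. This is not easy to fix; you will have to handle it manually.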
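
To find rows at risk of this misclassification, one option is to flag any block that sits in the bottom band of a page. This is a minimal sketch, assuming each block is a dict with a `bbox` of `[x0, y0, x1, y1]` in page coordinates with the origin at the top-left. The function name, the `margin_ratio` default, and the dict layout are hypothetical and are not MinerU's API.

```python
def blocks_near_page_bottom(blocks, page_height, margin_ratio=0.08):
    """Return blocks whose top edge falls inside the bottom margin band.

    Assumption: bbox is [x0, y0, x1, y1] with y growing downward.
    """
    threshold = page_height * (1.0 - margin_ratio)
    return [b for b in blocks if b["bbox"][1] >= threshold]


# Hypothetical sample data for an 842-point-high page.
sample_blocks = [
    {"bbox": [70, 100, 520, 130], "text": "body paragraph"},
    {"bbox": [70, 790, 520, 810], "text": "table header row"},
]
suspects = blocks_near_page_bottom(sample_blocks, page_height=842)
for block in suspects:
    print(block["text"], block["bbox"])
```

Blocks flagged this way are only candidates; each one still needs to be checked against the PDF by eye.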

ubbiiy (Author) commented Mar 9, 2025

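How should I fix this manually? The first row of Table 12, that is, the row containing “产业园名称” (industrial park name) and “项目状态” (project status), does not appear in the content_list, middle, or md files, so it is absent from every output file. The middle file contains only a table_caption for it, with no table_body. So I don't know how to handle this. Is there any other way to read this row? Thanks.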

Image

myhloli (Collaborator) commented Mar 9, 2025

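Manual handling means adding the header row and the table caption by hand to the table entry on the following page. When the output is not directly usable, manual fine-tuning is necessary.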
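
The manual fix described above can be sketched as a small post-processing step. This is only an illustration, assuming a table entry is a dict with `type`, `table_caption` (a list of strings), and `table_body` (an HTML string containing a `<table>` tag). The helper name `prepend_header_row`, the sample entry, and the caption text are hypothetical, and the header cells must be typed in by hand from the PDF.

```python
import html
import re

TABLE_OPEN = re.compile(r"<table[^>]*>", re.IGNORECASE)


def prepend_header_row(entry, header_cells, caption=None):
    """Return a copy of a table entry with a header row inserted first."""
    if entry.get("type") != "table":
        raise ValueError("entry is not a table")
    body = entry.get("table_body", "")
    match = TABLE_OPEN.search(body)
    if match is None:
        raise ValueError("no <table> tag in table_body")
    cells = "".join(f"<td>{html.escape(c)}</td>" for c in header_cells)
    row = f"<tr>{cells}</tr>"
    patched = dict(entry)  # shallow copy; the original entry is not modified
    patched["table_body"] = body[:match.end()] + row + body[match.end():]
    # Only fill in the caption when the entry does not already have one.
    if caption and not patched.get("table_caption"):
        patched["table_caption"] = [caption]
    return patched


# Hypothetical entry standing in for the table body found on the next page.
entry = {
    "type": "table",
    "table_caption": [],
    "table_body": "<html><body><table><tr><td>r1c1</td><td>r1c2</td></tr></table></body></html>",
}
# Only the two header cells named in this thread are shown; add the rest from the PDF.
patched = prepend_header_row(entry, ["产业园名称", "项目状态"], caption="Table 12 caption text")
print(patched["table_body"])
```

Returning a copy keeps the original parsed output untouched, so the patch can be reapplied or reverted after re-running the parser.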

ubbiiy closed this as completed Mar 9, 2025