PDF formula recognition problem #1877

Open
moro0v0 opened this issue Mar 8, 2025 · 4 comments

moro0v0 commented Mar 8, 2025

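We need to process a large number of academic documents that contain many mathematical formulas. The tool already recognizes many formulas, but quite a few are still recognized incorrectly. I would like to know two things: 1. Is it possible to improve formula recognition accuracy by modifying the formula recognition model? 2. If not, can formulas be extracted as images? I am not very familiar with the overall project or the technologies it uses, so I would appreciate any guidance.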


moro0v0 commented Mar 8, 2025

In cut_image.py I also added a branch to ocr_cut_image_and_table that saves interline equations as images:

    def ocr_cut_image_and_table(spans, page, page_id, pdf_bytes_md5, imageWriter):
        def return_path(type):
            return join_path(pdf_bytes_md5, type)

        for span in spans:
            span_type = span['type']
            if span_type == ContentType.Image:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('images'),
                                               imageWriter=imageWriter)
            elif span_type == ContentType.Table:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('tables'),
                                               imageWriter=imageWriter)
            # Added: extract interline equations and save them as images
            elif span_type == ContentType.InterlineEquation:
                if not check_img_bbox(span['bbox']) or not imageWriter:
                    continue
                span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('interline_equation'),
                                               imageWriter=imageWriter)
        return spans

That is the code for saving interline equations. With it, will they be saved under the folder named images, the same way pictures are? And will a link then appear in the Markdown?
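To make the path question concrete, here is a minimal, self-contained sketch of the same dispatch written as a lookup table. It is not the project's code: the type strings, SUBDIR_BY_TYPE, has_valid_bbox, assign_image_paths and fake_cut are hypothetical stand-ins for ContentType, check_img_bbox, cut_image and imageWriter, and the sketch assumes the return path is simply the PDF hash joined with the subfolder name, as return_path does in the snippet above.

```python
import posixpath

# Hypothetical mapping from span type to output subfolder. The type strings
# are assumed stand-ins for ContentType.Image / Table / InterlineEquation.
SUBDIR_BY_TYPE = {
    "image": "images",
    "table": "tables",
    "interline_equation": "interline_equation",
}


def has_valid_bbox(bbox):
    """Simplified stand-in for check_img_bbox: require positive width and height."""
    x0, y0, x1, y1 = bbox
    return x1 > x0 and y1 > y0


def assign_image_paths(spans, pdf_bytes_md5, cut):
    """Attach an image_path to every span whose type has a subfolder.

    `cut` is a hypothetical callable (bbox, return_path) -> str that stands
    in for the project's cut_image together with its imageWriter.
    """
    for span in spans:
        subdir = SUBDIR_BY_TYPE.get(span["type"])
        if subdir is None or not has_valid_bbox(span["bbox"]):
            continue
        span["image_path"] = cut(span["bbox"], posixpath.join(pdf_bytes_md5, subdir))
    return spans


def fake_cut(bbox, return_path):
    # Stand-in cropper: only builds a file name and writes nothing to disk.
    name = "_".join(str(int(v)) for v in bbox)
    return f"{return_path}/{name}.jpg"


spans = [
    {"type": "interline_equation", "bbox": [10, 20, 110, 60]},
    {"type": "text", "bbox": [0, 0, 50, 10]},
]
result = assign_image_paths(spans, "abc123", fake_cut)
print(result[0]["image_path"])  # abc123/interline_equation/10_20_110_60.jpg
```

Under these assumptions the equation crop lands in an interline_equation subfolder below the PDF hash, relative to wherever the writer stores its files. Setting image_path on the span only records the location; whether a link shows up in the Markdown depends on the code that renders the span.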


moro0v0 commented Mar 8, 2025

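I can now see the image files for the interline equations in images, but after running the tool the interline equations are still recognized as formulas in the output. If I want each one to appear as a link, like an ordinary image, what should I modify?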


moro0v0 commented Mar 8, 2025

How should I change the elif para_type == BlockType.InterlineEquation: branch of ocr_mk_markdown_with_para_core_v2 in ocr_mkcontent.py so that the formula image is placed at that position?

    elif para_type == BlockType.InterlineEquation:
        # para_text = merge_para_with_text(para_block)
        para_text += "被我找到了吧!"  # debug marker ("Found you!")
        # for block in para_block['blocks']:  # 1st: assemble image_body
        #     if block['type'] == BlockType.ImageBody:
        #         for line in block['lines']:
        #             for span in line['spans']:
        #                 if span['type'] == ContentType.Image:
        #                     if span.get('image_path', ''):
        #                         para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"

Changing it directly like this raises an error saying there is no blocks parameter.
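One possible reading of that error is that an interline-equation block keeps its spans directly under lines, without the nested blocks list that image blocks have. The sketch below illustrates that idea on plain dictionaries. It is an assumption, not the project's confirmed data layout: the function name interline_equation_markdown, the "interline_equation" type string and the sample block are all hypothetical, and posixpath.join stands in for join_path.

```python
import posixpath


def interline_equation_markdown(para_block, img_bucket_path):
    """Render an interline-equation block as image links.

    Assumes spans sit under para_block['lines'] (no 'blocks' level) and that
    an earlier step stored a cropped image under span['image_path'].
    Falls back to the recognized LaTeX when no image path is present.
    """
    parts = []
    for line in para_block.get("lines", []):
        for span in line.get("spans", []):
            if span.get("type") != "interline_equation":
                continue
            image_path = span.get("image_path", "")
            if image_path:
                link = posixpath.join(img_bucket_path, image_path)
                parts.append(f"\n![]({link})\n")
            else:
                parts.append(f"\n$$\n{span.get('content', '')}\n$$\n")
    return "".join(parts)


# Hypothetical block shaped the way the sketch assumes.
block = {
    "type": "interline_equation",
    "lines": [
        {
            "spans": [
                {
                    "type": "interline_equation",
                    "content": "E = mc^2",
                    "image_path": "abc123/interline_equation/0.jpg",
                }
            ]
        }
    ],
}
md = interline_equation_markdown(block, "images")
print(md)
```

If the real block has this shape, the branch could append the result of such a helper to para_text instead of calling merge_para_with_text. Using .get with defaults keeps the helper from failing on blocks that lack lines or spans.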

@jonny4589

Also, when a formula contains text, the parsed result comes out garbled. Does anyone have a solution?
