fix(ppstructurev3): escape html-sensitive OCR text in table markdown output#17924

Open
jimmyzhuu wants to merge 2 commits into PaddlePaddle:main from jimmyzhuu:fix/ppstructurev3-table-markdown-escaping

Conversation

@jimmyzhuu
Addresses #16096

Summary

This PR fixes an HTML escaping issue in PPStructureV3 table markdown export.

When OCR text inside table cells contains HTML-sensitive content such as <recv .../> or <pause .../>, the current table HTML assembly path may inject raw OCR text directly into <td> nodes. This makes the markdown output render incorrectly and can blur the boundary between original OCR content and generated HTML structure.

This PR narrows the fix to the PPStructureV3 table export path only.

Changes

  • add a local PaddleOCR patch for PaddleX table post-processing
  • escape HTML-sensitive table cell OCR text with html.escape(..., quote=True)
  • preserve a single outer <b>...</b> wrapper so bold formatting is not lost
  • apply the patch to both table post-processing modules
  • add regression tests for:
    • raw HTML-sensitive OCR text such as <recv .../>
    • bold-wrapped content
    • patch application on both post-processing modules
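
The escaping rule described in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `escape_cell_text` and the regex are hypothetical, and the only behavior taken from the description is the use of `html.escape(..., quote=True)` and the preservation of a single outer `<b>...</b>` wrapper.

```python
import html
import re

# Hypothetical pattern: the whole cell is wrapped in exactly one outer <b>...</b>.
_OUTER_BOLD = re.compile(r"<b>(.*)</b>", re.DOTALL)


def escape_cell_text(text: str) -> str:
    """Escape HTML-sensitive OCR text for use inside a <td> node.

    If the text is wrapped in one outer <b>...</b> pair, keep that wrapper
    and escape only the inner content. Everything else is escaped as-is.
    """
    match = _OUTER_BOLD.fullmatch(text)
    if match:
        inner = html.escape(match.group(1), quote=True)
        return f"<b>{inner}</b>"
    return html.escape(text, quote=True)


if __name__ == "__main__":
    # Raw OCR text that looks like a tag is rendered as literal text.
    print(escape_cell_text('<recv id="1"/>'))
    # The outer bold wrapper survives; the inner content is escaped.
    print(escape_cell_text("<b>a < b</b>"))
```

Under this sketch, only the outermost wrapper is treated as generated markup. Any further tags inside it, including a nested `<b>`, are escaped as OCR content.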

Scope

This PR does not change PaddleOCRVL.
It only fixes the PPStructureV3 table markdown export path.

Tests

pytest tests/pipelines/test_patch_layout_parsing.py tests/pipelines/test_patch_table_markdown.py -q

@CLAassistant
CLAassistant commented Apr 15, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot
paddle-bot bot commented Apr 15, 2026

Thanks for your contribution!

@jimmyzhuu
Author

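I have added a formatting-only commit to resolve the black / pre-commit rewrites flagged in this CI run. The test-pr job still fails at the dependency installation stage (paddlepaddle version resolution fails), which does not appear to be caused by the table markdown escaping change itself.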