This repository was archived by the owner on Mar 6, 2026. It is now read-only.

Commit bc44dab

fix: Change ocr_line <span> to include all ocr_word (#169)
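Fixes the hOCR XML for ocr_line. The ocr_line span should enclose all of the word (ocrx_word) spans that belong to that line; previously it was closed before the word spans were emitted.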
1 parent 3c3f09d commit bc44dab

2 files changed

Lines changed: 77 additions & 76 deletions


google/cloud/documentai_toolbox/templates/hocr_document_template.xml.j2

Lines changed: 3 additions & 2 deletions
@@ -18,9 +18,10 @@
 {% set paridx = loop.index0 -%}
 <span class='ocr_par' id='par_{{ page_number }}_{{ bidx }}_{{ paridx }}' title='{{ paragraph.hocr_bounding_box -}}'>{% for line in paragraph.lines -%}
 {% set lidx = loop.index0 -%}
-<span class='ocr_line' id='line_{{ page_number }}_{{ bidx }}_{{ paridx }}_{{ lidx }}' title='{{ line.hocr_bounding_box }}'>{{ line.text }}</span>{% for token in line.tokens -%}
+<span class='ocr_line' id='line_{{ page_number }}_{{ bidx }}_{{ paridx }}_{{ lidx }}' title='{{ line.hocr_bounding_box }}'>{{ line.text }}{% for token in line.tokens -%}
 {% set tidx = loop.index0 -%}
-<span class='ocrx_word' id='word_{{ page_number }}_{{ bidx }}_{{ paridx }}_{{ lidx }}_{{ tidx }}' title='{{ token.hocr_bounding_box }}'>{{ token.text }}</span>{% endfor -%}{% endfor -%}
+<span class='ocrx_word' id='word_{{ page_number }}_{{ bidx }}_{{ paridx }}_{{ lidx }}_{{ tidx }}' title='{{ token.hocr_bounding_box }}'>{{ token.text }}</span>{% endfor -%}
+</span>{% endfor -%}
 </span>{% endfor -%}
 </span>{% endfor -%}
 </div>
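
To illustrate the effect of the change, here is a minimal sketch that compares the nesting before and after. The two fragments are hand-written approximations of what the template might render for a single line with two words; they are not actual toolbox output, the `title` attributes are omitted for brevity, and `words_per_line` is a hypothetical helper, not part of the library.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments approximating the rendered hOCR for one paragraph.
# BEFORE: ocr_line is closed before the words, so words are siblings of the line.
BEFORE = (
    "<span class='ocr_par' id='par_1_0_0'>"
    "<span class='ocr_line' id='line_1_0_0_0'>Hello world</span>"
    "<span class='ocrx_word' id='word_1_0_0_0_0'>Hello</span>"
    "<span class='ocrx_word' id='word_1_0_0_0_1'>world</span>"
    "</span>"
)

# AFTER: ocr_line stays open until its words have been emitted.
AFTER = (
    "<span class='ocr_par' id='par_1_0_0'>"
    "<span class='ocr_line' id='line_1_0_0_0'>Hello world"
    "<span class='ocrx_word' id='word_1_0_0_0_0'>Hello</span>"
    "<span class='ocrx_word' id='word_1_0_0_0_1'>world</span>"
    "</span>"
    "</span>"
)


def words_per_line(fragment):
    """Map each ocr_line id to the ids of the ocrx_word spans nested directly in it."""
    root = ET.fromstring(fragment)
    result = {}
    for span in root.iter("span"):
        if span.get("class") != "ocr_line":
            continue
        result[span.get("id")] = [
            child.get("id")
            for child in span
            if child.get("class") == "ocrx_word"
        ]
    return result


print(words_per_line(BEFORE))
# {'line_1_0_0_0': []}
print(words_per_line(AFTER))
# {'line_1_0_0_0': ['word_1_0_0_0_0', 'word_1_0_0_0_1']}
```

With the old markup, a consumer that walks from each `ocr_line` to its children finds no words; with the new markup, the words are reachable from their line.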
