fix: Docx segmented font title recognition #2949
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
if pt >= 30:
    for _value, index in zip(title_font_list, range(len(title_font_list))):
        if pt >= _value[0] and pt < _value[1]:
            return index + 1
The provided code has some minor improvements and corrections:
- The `get_image_id` function is defined at the top but used further down, which means it might not be needed there unless you're reusing it elsewhere.
- In the `get_title_level` function, you've removed three sets of conditional checks that essentially duplicate the check for `pt >= 30`. Only the last condition remains useful.
- The list comprehension in `title_font_list` should likely include all available sizes rather than just smaller ones, to cover all possible titles if they exist beyond the given range.
- It's unclear why `pt >= 16` or any specific conditions (like `< 36`) were included for fonts larger than 30 points in `title_font_list`, as it would always match with `[30, 36]`.
Here's an improved version of the `get_title_level` function based on these considerations:

from docx.text.paragraph import Paragraph


def get_title_level(paragraph: Paragraph):
    if len(paragraph.runs) == 1:
        font_size = paragraph.runs[0].font.size
        # font.size is None when the run inherits its size from the style
        if font_size is not None:
            pt = font_size.pt
            # Binary search; assumes title_font_list is sorted by ascending size range
            left, right = 0, len(title_font_list) - 1
            while left <= right:
                mid = left + (right - left) // 2
                size_range = title_font_list[mid]
                # Match only when pt lies inside this [low, high) range
                if pt >= 30 and size_range[0] <= pt < size_range[1]:
                    return mid + 1
                elif pt < size_range[0]:
                    right = mid - 1
                else:
                    left = mid + 1
    return 1  # Default level, typically H1

Potential Optimization Suggestions:
- For better readability and maintainability, separate out each case into different functions or methods.
- If the number of title levels extends significantly, consider using a dictionary mapping instead of a list for faster lookups.
- Ensure that the logic handles edge cases correctly, such as when no relevant paragraphs are found.
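
As a rough illustration of the suggestions above, the range lookup could be isolated into a small pure function that takes a point size instead of a `Paragraph`, which makes the edge cases easy to test. This is only a sketch: the contents of `TITLE_FONT_RANGES` and the helper name `title_level_for_size` are assumptions made for the example, not values taken from this PR. For a list this short, a plain linear scan is probably simpler than a binary search and does not depend on the ordering of the ranges.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical ranges (assumption, not from the PR): half-open [low, high)
# point-size intervals, listed from the largest heading (level 1) downward.
TITLE_FONT_RANGES: Sequence[Tuple[float, float]] = (
    (36.0, 100.0),  # level 1
    (30.0, 36.0),   # level 2
)


def title_level_for_size(
    pt: Optional[float],
    ranges: Sequence[Tuple[float, float]] = TITLE_FONT_RANGES,
) -> Optional[int]:
    """Return the 1-based title level for a font size, or None if no range matches."""
    if pt is None:  # e.g. the run inherits its size from the paragraph style
        return None
    # Linear scan: the first range containing pt determines the level.
    for level, (low, high) in enumerate(ranges, start=1):
        if low <= pt < high:
            return level
    return None  # Not a title-sized font
```

With python-docx, the argument would come from `paragraph.runs[0].font.size.pt`, read only after checking that `font.size` is not `None`. Returning `None` rather than a default level leaves the decision about non-title text to the caller.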