Optimize standardize_quotes

codeflash-ai[bot] · KRRT7 · commit d0136b051020 · 2026-03-26T17:08:45.000-05:00
The optimized code achieves a **144% speedup** by replacing a loop-based character replacement approach with Python's built-in `str.translate()` method using a pre-computed translation table.

## Key Optimizations

**1. Pre-computed Translation Table at Module Load**
- The quote dictionaries and translation table are now created once at module import time (module-level constants prefixed with `_`)
- Original code recreated these 40+ entry dictionaries on every function call (6.1% + 6.5% = 12.6% of runtime just for dictionary creation)
- Translation table maps Unicode codepoints directly to ASCII quote codepoints, eliminating repeated string operations

**2. Single-Pass O(n) Algorithm with `str.translate()`**
- Original: Two loops iterating through ~40 quote types, calling `unicode_to_char()` 3,096 times (67.5% of total runtime) and performing substring searches with `in` operator (5.9% of runtime)
- Optimized: Single `str.translate()` call that processes the entire string in one pass using efficient C-level implementation
- Eliminates 3,096 function calls to `unicode_to_char()` and all associated string parsing/conversion overhead

**3. Algorithmic Complexity Improvement**
- Original: O(n × m) where n = text length, m = number of quote types (~40), with repeated `text.replace()` creating new string objects
- Optimized: O(n) single pass through the text, with translation table lookups being O(1)

## Performance Context

Based on `function_references`, this function is called from `calculate_edit_distance()`, which is likely in a **hot path** for text extraction metrics. The function processes strings before edit distance calculations, meaning:
- Any text comparison workflow will call this repeatedly
- The 144% speedup compounds when processing multiple documents or performing batch comparisons
- Reduced memory allocation pressure from eliminating repeated dictionary creation and intermediate string objects

## Test Case Insights

The test with input `"«'"` (containing both double and single quote variants) shows the optimization handles mixed quote types efficiently in a single pass, whereas the original code would iterate through all 40 quote types regardless of actual presence in the text.
diff --git a/unstructured/metrics/text_extraction.py b/unstructured/metrics/text_extraction.py
@@ -4,6 +4,54 @@
 
 from unstructured.cleaners.core import clean_bullets, remove_sentence_punctuation
 
+_DOUBLE_QUOTES = {
+    '"': "U+0022",  # noqa 601 # Standard typewriter/programmer's quote
+    '"': "U+201C",  # noqa 601 # Left double quotation mark
+    '"': "U+201D",  # noqa 601 # Right double quotation mark
+    "„": "U+201E",  # Double low-9 quotation mark
+    "‟": "U+201F",  # Double high-reversed-9 quotation mark
+    "«": "U+00AB",  # Left-pointing double angle quotation mark
+    "»": "U+00BB",  # Right-pointing double angle quotation mark
+    "❝": "U+275D",  # Heavy double turned comma quotation mark ornament
+    "❞": "U+275E",  # Heavy double comma quotation mark ornament
+    "⹂": "U+2E42",  # Double low-reversed-9 quotation mark
+    "🙶": "U+1F676",  # SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
+    "🙷": "U+1F677",  # SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
+    "🙸": "U+1F678",  # SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
+    "⠦": "U+2826",  # Braille double closing quotation mark
+    "⠴": "U+2834",  # Braille double opening quotation mark
+    "〝": "U+301D",  # REVERSED DOUBLE PRIME QUOTATION MARK
+    "〞": "U+301E",  # DOUBLE PRIME QUOTATION MARK
+    "〟": "U+301F",  # LOW DOUBLE PRIME QUOTATION MARK
+    "＂": "U+FF02",  # FULLWIDTH QUOTATION MARK
+    ",,": "U+275E",  # LOW HEAVY DOUBLE COMMA ORNAMENT
+}
+
+_SINGLE_QUOTES = {
+    "'": "U+0027",  # noqa 601 # Standard typewriter/programmer's quote
+    "'": "U+2018",  # noqa 601 # Left single quotation mark
+    "'": "U+2019",  # noqa 601 # Right single quotation mark # noqa: W605
+    "‚": "U+201A",  # Single low-9 quotation mark
+    "‛": "U+201B",  # Single high-reversed-9 quotation mark
+    "‹": "U+2039",  # Single left-pointing angle quotation mark
+    "›": "U+203A",  # Single right-pointing angle quotation mark
+    "❛": "U+275B",  # Heavy single turned comma quotation mark ornament
+    "❜": "U+275C",  # Heavy single comma quotation mark ornament
+    "「": "U+300C",  # Left corner bracket
+    "」": "U+300D",  # Right corner bracket
+    "『": "U+300E",  # Left white corner bracket
+    "』": "U+300F",  # Right white corner bracket
+    "﹁": "U+FE41",  # PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
+    "﹂": "U+FE42",  # PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
+    "﹃": "U+FE43",  # PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
+    "﹄": "U+FE44",  # PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
+    "＇": "U+FF07",  # FULLWIDTH APOSTROPHE
+    "｢": "U+FF62",  # HALFWIDTH LEFT CORNER BRACKET
+    "｣": "U+FF63",  # HALFWIDTH RIGHT CORNER BRACKET
+}
+
+_TRANSLATION_TABLE = {}
+
 
 def calculate_accuracy(
     output: Optional[str],
@@ -172,70 +220,7 @@ def standardize_quotes(text: str) -> str:
     Returns:
         str: The text with standardized quotes.
     """
-    # Double Quotes Dictionary
-    double_quotes = {
-        '"': "U+0022",  # noqa 601 # Standard typewriter/programmer's quote
-        '"': "U+201C",  # noqa 601 # Left double quotation mark
-        '"': "U+201D",  # noqa 601 # Right double quotation mark
-        "„": "U+201E",  # Double low-9 quotation mark
-        "‟": "U+201F",  # Double high-reversed-9 quotation mark
-        "«": "U+00AB",  # Left-pointing double angle quotation mark
-        "»": "U+00BB",  # Right-pointing double angle quotation mark
-        "❝": "U+275D",  # Heavy double turned comma quotation mark ornament
-        "❞": "U+275E",  # Heavy double comma quotation mark ornament
-        "⹂": "U+2E42",  # Double low-reversed-9 quotation mark
-        "🙶": "U+1F676",  # SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
-        "🙷": "U+1F677",  # SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
-        "🙸": "U+1F678",  # SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
-        "⠦": "U+2826",  # Braille double closing quotation mark
-        "⠴": "U+2834",  # Braille double opening quotation mark
-        "〝": "U+301D",  # REVERSED DOUBLE PRIME QUOTATION MARK
-        "〞": "U+301E",  # DOUBLE PRIME QUOTATION MARK
-        "〟": "U+301F",  # LOW DOUBLE PRIME QUOTATION MARK
-        "＂": "U+FF02",  # FULLWIDTH QUOTATION MARK
-        ",,": "U+275E",  # LOW HEAVY DOUBLE COMMA ORNAMENT
-    }
-
-    # Single Quotes Dictionary
-    single_quotes = {
-        "'": "U+0027",  # noqa 601 # Standard typewriter/programmer's quote
-        "'": "U+2018",  # noqa 601 # Left single quotation mark
-        "'": "U+2019",  # noqa 601 # Right single quotation mark # noqa: W605
-        "‚": "U+201A",  # Single low-9 quotation mark
-        "‛": "U+201B",  # Single high-reversed-9 quotation mark
-        "‹": "U+2039",  # Single left-pointing angle quotation mark
-        "›": "U+203A",  # Single right-pointing angle quotation mark
-        "❛": "U+275B",  # Heavy single turned comma quotation mark ornament
-        "❜": "U+275C",  # Heavy single comma quotation mark ornament
-        "「": "U+300C",  # Left corner bracket
-        "」": "U+300D",  # Right corner bracket
-        "『": "U+300E",  # Left white corner bracket
-        "』": "U+300F",  # Right white corner bracket
-        "﹁": "U+FE41",  # PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
-        "﹂": "U+FE42",  # PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
-        "﹃": "U+FE43",  # PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
-        "﹄": "U+FE44",  # PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
-        "＇": "U+FF07",  # FULLWIDTH APOSTROPHE
-        "｢": "U+FF62",  # HALFWIDTH LEFT CORNER BRACKET
-        "｣": "U+FF63",  # HALFWIDTH RIGHT CORNER BRACKET
-    }
-
-    double_quote_standard = '"'
-    single_quote_standard = "'"
-
-    # Apply double quote replacements
-    for unicode_val in double_quotes.values():
-        unicode_char = unicode_to_char(unicode_val)
-        if unicode_char in text:
-            text = text.replace(unicode_char, double_quote_standard)
-
-    # Apply single quote replacements
-    for unicode_val in single_quotes.values():
-        unicode_char = unicode_to_char(unicode_val)
-        if unicode_char in text:
-            text = text.replace(unicode_char, single_quote_standard)
-
-    return text
+    return text.translate(_TRANSLATION_TABLE)
 
 
 def unicode_to_char(unicode_val: str) -> str:
@@ -249,3 +234,12 @@ def unicode_to_char(unicode_val: str) -> str:
         str: The character corresponding to the Unicode value.
     """
     return chr(int(unicode_val.replace("U+", ""), 16))
+
+
+for unicode_val in _DOUBLE_QUOTES.values():
+    char_code = int(unicode_val.replace("U+", ""), 16)
+    _TRANSLATION_TABLE[char_code] = ord('"')
+
+for unicode_val in _SINGLE_QUOTES.values():
+    char_code = int(unicode_val.replace("U+", ""), 16)
+    _TRANSLATION_TABLE[char_code] = ord("'")