Optimize JavaAssertTransformer._infer_type_from_assertion_args

codeflash-ai[bot] · web-flow · commit 52ec0a904c4d · 2026-02-25T20:34:12.000Z
Runtime improvement (primary): the optimized version runs ~11% faster overall (10.3ms -&gt; 9.23ms). Line-profiles show the hot work (argument splitting and literal checks) is measurably reduced.

What changed (concrete):
- Added a fast-path in _split_top_level_args: if the args string contains none of the "special" delimiters (quotes, braces, parens), we skip the character-by-character parser and return either args_str.split(",") or [args_str].
- Moved several literal/cast regexes into __init__ as precompiled attributes (self._FLOAT_LITERAL_RE, self._DOUBLE_LITERAL_RE, self._LONG_LITERAL_RE, self._INT_LITERAL_RE, self._CHAR_LITERAL_RE, self._cast_re) and replaced re.match(...) for casts with self._cast_re.match(...).

Why this speeds things up:
- str.split is implemented in C and is orders of magnitude faster than a Python-level loop that iterates characters, manages stack depth, and joins fragments. The fast-path catches the common simple cases (no nested parentheses/quotes/generics) and lets the interpreter use the highly-optimized C split, which is why very large comma-separated inputs show the biggest wins (e.g., the 1000-arg test goes from ~1.39ms to ~67.5μs).
- Precompiling regexes removes repeated compilation overhead and lets .match be executed directly on a compiled object. The original code used re.match(...) in-place for cast detection which implicitly compiles the pattern or goes through the module-level cache; using a stored compiled pattern is cheaper and eliminates that runtime cost.
- Combined, these changes reduce the time spent inside _split_top_level_args and _type_from_literal (the line profilers show reduced wall time for those functions), producing the measured global runtime improvement.

Behavioral/compatibility notes:
- The fast-path preserves original behavior: when no special delimiter is present it simply splits on commas (or returns a single entry), otherwise it falls back to the full, safe parser that respects nested delimiters and strings.
- Some microbenchmarks regress slightly (a few single-case timings in the annotated tests are a bit slower); this is expected because we add a small _special_re.search check for every call. The overall trade-off was accepted because it yields substantial savings in the common and expensive cases (especially large/simple comma-separated argument lists).
- The optimization is most valuable when this function is exercised many times or on long/simple argument lists (hot paths that produce many simple comma-separated tokens). It is neutral or slightly negative for a handful of small or highly-nested inputs, but those are rare in the benchmarks.

Tests and workload guidance:
- Big wins: large-scale, many-argument inputs or many repeated calls where arguments are simple comma-separated literals (annotated tests show up to ~20x speedups for such cases).
- No/low impact: complex first arguments with nested parentheses/generics or many quoted strings — the safe parser still runs there, so correctness is preserved; timings remain similar.
- Small regressions: a few microbench cases (very short inputs or certain char-literal checks) are marginally slower due to the extra quick search, but these regressions are small relative to the global runtime improvement.

Summary:
By routing simple/common inputs to str.split (C-level speed) and eliminating per-call regex compilation for literal/cast detection, the optimized code reduces time in the hot parsing and literal-detection paths, producing the observed ~11% runtime improvement while maintaining correctness for nested/quoted input via the fallback parser.
diff --git a/codeflash/languages/java/remove_asserts.py b/codeflash/languages/java/remove_asserts.py
@@ -198,6 +198,15 @@ def __init__(
         # Precompile regex to find next special character (quotes, parens, braces).
         self._special_re = re.compile(r"[\"'{}()]")
 
+
+        # Precompile literal/cast regexes to avoid recompilation on each literal check.
+        self._LONG_LITERAL_RE = re.compile(r"^-?\d+[lL]$")
+        self._INT_LITERAL_RE = re.compile(r"^-?\d+$")
+        self._DOUBLE_LITERAL_RE = re.compile(r"^-?\d+\.\d*[dD]?$|^-?\d+[dD]$")
+        self._FLOAT_LITERAL_RE = re.compile(r"^-?\d+\.?\d*[fF]$")
+        self._CHAR_LITERAL_RE = re.compile(r"^'.'$|^'\\.'$")
+        self._cast_re = re.compile(r"^\((\w+)\)")
+
     def transform(self, source: str) -> str:
         """Remove assertions from source code, preserving target function calls.
 
@@ -972,13 +981,22 @@ def _type_from_literal(self, value: str) -> str:
         if value.startswith('"'):
             return "String"
         # Cast expression like (byte)0, (short)1
-        cast_match = re.match(r"^\((\w+)\)", value)
+        cast_match = self._cast_re.match(value)
         if cast_match:
             return cast_match.group(1)
         return "Object"
 
     def _split_top_level_args(self, args_str: str) -> list[str]:
         """Split assertion arguments at top-level commas, respecting parens/strings/generics."""
+        # Fast-path: if there are no special delimiters that require parsing,
+        # we can use a simple split which is much faster for common simple cases.
+        if not self._special_re.search(args_str):
+            # Preserve original behavior of returning a list with the single unstripped string
+            # when there are no commas, otherwise split on commas.
+            if "," in args_str:
+                return args_str.split(",")
+            return [args_str]
+
         args: list[str] = []
         depth = 0
         current: list[str] = []