Commit 61ae088

fix: 4 upstream bugs surfaced by real-run code review

1) cli.py cmd_writeup: dispatch by isinstance, not dict.get() with a default expression that requires ideas_data to already be a dict. Python resolves `ideas_data.get` and evaluates the call's arguments eagerly, before the default can matter, so the clever-looking `ideas_data.get("ideas", ideas_data if isinstance(...) else [])` still raised AttributeError on bare-list input. Pure vibe-sci ideate output is a dict (save_ideas wraps it), so this never bit the happy path, but any user-authored fixture that is a top-level JSON list crashed. Replaced with explicit isinstance dispatch; both shapes now parse to an idea_list with no crash.

2) pyproject.toml: add the `[review]` extras the code already promises. review.py::_extract_pdf_text raises with "pip install 'vibe-sci[review]'", but `[project.optional-dependencies]` only defined `[dev]`, so that instruction would have failed at uv sync / pip install. Defined `review = ["pypdf>=3.0", "pymupdf4llm>=0.1"]` so the error message becomes an actionable fix.

3) sanitize/unicode_math.py + data/unicode_to_latex.yaml: Unicode → LaTeX math-command pass. article.cls with default inputenc cannot compile raw Greek / relation / operator glyphs, which LLMs emit freely in prose (≥, ρ, α, x², ∞, etc.). The new pass splits the string into alternating prose / $...$ regions, then replaces each Unicode symbol with $\cmd$ in prose and bare \cmd inside existing math. Inserted at pipeline position 6 (after md_to_latex, before strip_bad_commands so \geq etc. aren't flagged). Mapping data (~90 entries covering lowercase/uppercase Greek, relations, operators, arrows, super/sub digits, misc) is YAML, so adding a symbol needs no Python change.

4) prompts/retry_system.md: teach the log-driven LaTeX-retry pass to recognise a Unicode-symbol error as "almost certainly the cause", with a concrete replacement table. Previously the retry prompt only knew about \cite/\SI/\num artefacts, so a paper whose only issue was "≥ on the prose side of math" would come back unfixed and the retry would bail with "output not LaTeX-like".

Verification:
- 9 existing tests still pass
- ruff clean
- convert_unicode_math round-trip:
    "accuracy ≥ 95%" → "accuracy $\geq$ 95%"
    "$x ≥ 0.5$" → "$x \geq 0.5$" (no double wrap)
    "ρ = 0.7" → "$\rho$ = 0.7"
    "x² + y³" → "x$^{2}$ + y$^{3}$"
- cmd_writeup parses both {"ideas":[...]} dict AND [...] list shapes

Bonus repro note: a bash-vs-Python escape subtlety the reviewer noted. r'\$\geq\$' only produced the intended LaTeX literal because bash's double quotes consumed the backslash in front of each `$` on the way in; in a .py file the same raw literal would keep those backslashes verbatim. Unrelated to vibe-sci correctness but worth capturing for the next reader.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 36a63f1 commit 61ae088

7 files changed

Lines changed: 461 additions & 6 deletions

pyproject.toml

Lines changed: 5 additions & 0 deletions
@@ -49,6 +49,11 @@ dependencies = [
 
 [project.optional-dependencies]
 dev = ["pytest>=8.2", "ruff>=0.5"]
+# review.py::_extract_pdf_text needs at least one PDF text extractor; both
+# are pure-Python, pick either. Raised as "pip install 'vibe-sci[review]'"
+# in the error message when a user tries to review a .pdf without either
+# being installed.
+review = ["pypdf>=3.0", "pymupdf4llm>=0.1"]
 
 [project.scripts]
 vibe-sci = "vibe_sci.cli:main"
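
The comment above says the review path needs at least one of the two extractors. A minimal sketch of how such a fallback could be probed, assuming nothing about the real `_extract_pdf_text`: the helper name `pick_pdf_backend`, the probing order, and the injectable `import_module` parameter are all hypothetical; only the two module names and the install hint come from the diff.

```python
import importlib
from typing import Callable

# Module names match the [review] extra; the order tried here is an assumption.
CANDIDATES = ("pymupdf4llm", "pypdf")


def pick_pdf_backend(
    import_module: Callable[[str], object] = importlib.import_module,
) -> str:
    """Return the name of the first importable extractor (hypothetical helper)."""
    for name in CANDIDATES:
        try:
            import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        return name
    # Same install hint the commit message quotes from review.py.
    raise RuntimeError(
        "No PDF text extractor available; run: pip install 'vibe-sci[review]'"
    )
```

Passing the importer in as a parameter keeps the sketch testable without either package installed.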

uv.lock

Lines changed: 221 additions & 1 deletion
Some generated files are not rendered by default.

vibe_sci/cli.py

Lines changed: 10 additions & 1 deletion
@@ -109,7 +109,16 @@ def cmd_writeup(args) -> int:
     cfg = resolve_backend(backend=args.backend, model_override=args.model)
     apply_env(cfg)
     ideas_data = json.loads(pathlib.Path(args.ideas_json).read_text(encoding="utf-8"))
-    idea_list = ideas_data.get("ideas", ideas_data if isinstance(ideas_data, list) else [])
+    # Accept both shapes: ideate's own output is {"num_ideas": N, "ideas": [...]},
+    # but user-authored fixtures often are a bare top-level list. Check the type
+    # first — calling .get() on a list raises AttributeError (the default-arg
+    # expression doesn't save us because Python eagerly evaluates both sides).
+    if isinstance(ideas_data, dict):
+        idea_list = ideas_data.get("ideas", [])
+    elif isinstance(ideas_data, list):
+        idea_list = ideas_data
+    else:
+        idea_list = []
     if args.idx >= len(idea_list):
         print(f"❌ idx={args.idx} out of range (have {len(idea_list)} ideas)", file=sys.stderr)
         return 2
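
The failure and the fix in this hunk can be reproduced in isolation. The sketch below is not the project's code: `parse_ideas_old` and `parse_ideas` are hypothetical stand-ins for the removed one-liner and the new dispatch. On a list, the `.get` attribute lookup fails before the default expression is ever consulted.

```python
import json


def parse_ideas_old(raw: str) -> list:
    """Stand-in for the removed one-liner: breaks on a top-level JSON list."""
    data = json.loads(raw)
    # `data.get` is looked up first; a list has no .get, so this raises
    # AttributeError no matter what the default expression says.
    return data.get("ideas", data if isinstance(data, list) else [])


def parse_ideas(raw: str) -> list:
    """Stand-in for the new dispatch: check the type before touching .get()."""
    data = json.loads(raw)
    if isinstance(data, dict):
        return data.get("ideas", [])
    if isinstance(data, list):
        return data
    return []  # any other JSON value (string, number, null) yields no ideas
```

Both `{"num_ideas": 1, "ideas": [...]}` and a bare `[...]` go through `parse_ideas`, while `parse_ideas_old` raises `AttributeError` on the bare list.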

vibe_sci/data/unicode_to_latex.yaml

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+# Unicode → LaTeX math-command mapping.
+#
+# Consumed by vibe_sci/sanitize/unicode_math.py. Applied per-character:
+# prose occurrences get wrapped as $\cmd$; occurrences inside an existing
+# $...$ region drop the wrap. article.cls + inputenc cannot compile raw
+# Greek / math symbols, so the LLM handing us "accuracy ≥ 95%" must become
+# "accuracy $\geq$ 95\%" before pdflatex sees it.
+#
+# Only symbols likely to appear in an ML research paper are listed.
+# Extend per-project as needed; no code change required.
+
+# ── Greek lowercase ──────────────────────────────
+α: \alpha
+β: \beta
+γ: \gamma
+δ: \delta
+ε: \varepsilon
+ϵ: \epsilon
+ζ: \zeta
+η: \eta
+θ: \theta
+ϑ: \vartheta
+ι: \iota
+κ: \kappa
+λ: \lambda
+μ: \mu
+ν: \nu
+ξ: \xi
+π: \pi
+ϖ: \varpi
+ρ: \rho
+ϱ: \varrho
+σ: \sigma
+ς: \varsigma
+τ: \tau
+υ: \upsilon
+φ: \varphi
+ϕ: \phi
+χ: \chi
+ψ: \psi
+ω: \omega
+
+# ── Greek uppercase ──────────────────────────────
+Γ: \Gamma
+Δ: \Delta
+Θ: \Theta
+Λ: \Lambda
+Ξ: \Xi
+Π: \Pi
+Σ: \Sigma
+Υ: \Upsilon
+Φ: \Phi
+Ψ: \Psi
+Ω: \Omega
+
+# ── Relations ────────────────────────────────────
+"≥": \geq
+"≤": \leq
+"≠": \neq
+"≈": \approx
+"≡": \equiv
+"∝": \propto
+"∞": \infty
+"∈": \in
+"∉": \notin
+"⊂": \subset
+"⊆": \subseteq
+"⊃": \supset
+"⊇": \supseteq
+"∀": \forall
+"∃": \exists
+
+# ── Operators ────────────────────────────────────
+"±": \pm
+"∓": \mp
+"×": \times
+"÷": \div
+"·": \cdot
+"∘": \circ
+"⊕": \oplus
+"⊗": \otimes
+"∇": \nabla
+"∂": \partial
+"∑": \sum
+"∏": \prod
+"∫": \int
+"∪": \cup
+"∩": \cap
+"∅": \emptyset
+
+# ── Arrows ───────────────────────────────────────
+"→": \to
+"←": \leftarrow
+"↔": \leftrightarrow
+"⇒": \Rightarrow
+"⇐": \Leftarrow
+"⇔": \Leftrightarrow
+"↦": \mapsto
+
+# ── Super / subscript digits + signs ─────────────
+"⁰": "^{0}"
+"¹": "^{1}"
+"²": "^{2}"
+"³": "^{3}"
+"⁴": "^{4}"
+"⁵": "^{5}"
+"⁶": "^{6}"
+"⁷": "^{7}"
+"⁸": "^{8}"
+"⁹": "^{9}"
+"⁺": "^{+}"
+"⁻": "^{-}"
+"₀": "_{0}"
+"₁": "_{1}"
+"₂": "_{2}"
+"₃": "_{3}"
+"₄": "_{4}"
+"₅": "_{5}"
+"₆": "_{6}"
+"₇": "_{7}"
+"₈": "_{8}"
+"₉": "_{9}"
+
+# ── Miscellaneous ────────────────────────────────
+"°": ^{\circ}
+"′": ^{\prime}
+"″": ^{\prime\prime}
+"√": \sqrt{}
+"…": \ldots

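Several keys in the mapping are visually near-identical (ε and ϵ, φ and ϕ), so it can help to list each key by code point when extending the table. The sketch below is an illustration only: `SAMPLE` is a hypothetical three-entry excerpt and `describe` is not part of the commit. It also rejects keys that are not a single code point, which is what a per-character replacement relies on.

```python
import unicodedata

# Hypothetical excerpt; the full table is the YAML file in this commit.
SAMPLE = {"≥": r"\geq", "∞": r"\infty", "₀": "_{0}"}


def describe(mapping: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (code point, Unicode name, LaTeX command) for each mapping entry."""
    rows = []
    for key, cmd in mapping.items():
        if len(key) != 1:
            raise ValueError(f"key {key!r} is not a single code point")
        rows.append((f"U+{ord(key):04X}", unicodedata.name(key), cmd))
    return rows


for row in describe(SAMPLE):
    print(*row)
```
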
vibe_sci/prompts/retry_system.md

Lines changed: 11 additions & 0 deletions
@@ -3,3 +3,14 @@ cleanly, based on the error list. Output ONLY the corrected LaTeX body with
 the same output rules as the original writer. Remove any `\cite{...}` with
 empty or placeholder keys (KEY, CITE_KEY, empty string, bare commas). Replace
 any `\SI{num}{unit}` or `\num{...}` with plain text ("16.3 s", "21346").
+
+Replace every non-ASCII math symbol with its LaTeX equivalent inside inline
+math. article.cls with default inputenc cannot render raw Unicode math:
+  ≥ → $\geq$   ≤ → $\leq$   ≠ → $\neq$   ≈ → $\approx$   ± → $\pm$
+  × → $\times$   ÷ → $\div$   ∞ → $\infty$   ∈ → $\in$   ∂ → $\partial$
+  α → $\alpha$   β → $\beta$   ρ → $\rho$   σ → $\sigma$   μ → $\mu$
+  Γ → $\Gamma$   Δ → $\Delta$   Σ → $\Sigma$   Ω → $\Omega$
+  x² → x$^{2}$   x³ → x$^{3}$   A₀ → A$_{0}$
+Any symbol already inside a `$...$` region drops the extra wrap — use the
+bare `\cmd`. If the errored section contains Greek letters or math operators
+in prose, this is almost certainly the cause; fixing them is sufficient.

vibe_sci/sanitize/pipeline.py

Lines changed: 9 additions & 4 deletions
@@ -7,11 +7,14 @@
 3. strip CJK leakage
 4. unwrap unsupported package commands (siunitx, etc.)
 5. markdown → LaTeX (bold / italic / heading)
-6. strip bad \\input \\include
-7. wrap lonely \\item
-8. balance inline math — drop orphan `$` from truncated LLM output
+6. Unicode → LaTeX math (≥ → $\\geq$, ρ → $\\rho$, x² → x$^{2}$).
+   Runs after markdown so md_to_latex doesn't see already-wrapped `$`s,
+   and before strip_bad_commands so `\\geq` etc. aren't flagged
+7. strip bad \\input \\include
+8. wrap lonely \\item
+9. balance inline math — drop orphan `$` from truncated LLM output
    BEFORE the escape pass, whose math-segment scanner assumes balanced `$`
-9. escape prose specials (%, &, <, >, _) — runs last so earlier passes'
+10. escape prose specials (%, &, <, >, _) — runs last so earlier passes'
    output is also escaped
 
 To add a pass: write a module with a `str → str` function, append to
@@ -31,6 +34,7 @@
 from .math_balance import balance_inline_math
 from .packages import apply_package_fallbacks
 from .reasoning import strip_reasoning
+from .unicode_math import convert_unicode_math
 
 Pass = Callable[[str], str]
 
@@ -40,6 +44,7 @@
     strip_cjk,
     apply_package_fallbacks,
     md_to_latex,
+    convert_unicode_math,
     strip_bad_commands,
     wrap_lonely_items,
     balance_inline_math,
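
The pipeline is an ordered list of `str → str` passes, so the position of a new pass determines what later passes see. A minimal sketch of that composition, assuming the runner simply folds the text through the list: `run_passes`, `to_math`, and `escape_percent` are hypothetical toy names, not the project's functions, and the real escape pass is more careful about math segments.

```python
from functools import reduce
from typing import Callable

Pass = Callable[[str], str]


def run_passes(text: str, passes: list[Pass]) -> str:
    """Apply each pass in order, feeding each one's output to the next."""
    return reduce(lambda acc, p: p(acc), passes, text)


def to_math(s: str) -> str:
    # Toy stand-in for the Unicode → LaTeX pass (one symbol only).
    return s.replace("≥", r"$\geq$")


def escape_percent(s: str) -> str:
    # Toy stand-in for the final escape pass (percent sign only).
    return s.replace("%", r"\%")


print(run_passes("accuracy ≥ 95%", [to_math, escape_percent]))
```

Running the escape step last means text produced by earlier passes is escaped as well, which is the reason the docstring gives for keeping it at the end.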

vibe_sci/sanitize/unicode_math.py

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+"""Unicode → LaTeX math-command conversion pass.
+
+article.cls with the default inputenc cannot compile raw Greek or math
+symbols that LLMs routinely emit in prose (``accuracy ≥ 95%``,
+``ρ = 0.7``, ``x²``). This pass rewrites those characters into their
+LaTeX equivalents with the right math-mode wrap:
+
+- A symbol in prose becomes ``$\\cmd$`` (inline math wrap)
+- A symbol already inside an existing ``$...$`` region becomes just
+  ``\\cmd`` (the surrounding region already provides math mode)
+
+Mapping data lives in ``vibe_sci/data/unicode_to_latex.yaml`` so adding
+a symbol doesn't require a Python change.
+"""
+from __future__ import annotations
+
+import pathlib
+import re
+
+import yaml
+
+_DATA_PATH = pathlib.Path(__file__).parent.parent / "data" / "unicode_to_latex.yaml"
+
+# Compiled once at import. Dict preserves YAML order so iteration
+# ordering is stable across runs.
+_MAPPING: dict[str, str] = {}
+
+
+def _load() -> dict[str, str]:
+    global _MAPPING
+    if _MAPPING:
+        return _MAPPING
+    if not _DATA_PATH.exists():
+        return {}
+    raw = yaml.safe_load(_DATA_PATH.read_text(encoding="utf-8")) or {}
+    # YAML file is a flat dict { "α": "\\alpha", ... }; coerce values to str.
+    _MAPPING = {str(k): str(v) for k, v in raw.items() if k and v}
+    return _MAPPING
+
+
+# Split a string into alternating (prose, math) regions where math is
+# ``$...$`` inline — greedy-but-single-line to avoid swallowing
+# display math ``$$...$$`` or paragraph breaks.
+_INLINE_MATH_SPLIT = re.compile(r"(\$[^$\n]*\$)")
+
+
+def convert_unicode_math(s: str) -> str:
+    """Replace Unicode math symbols with LaTeX commands.
+
+    Prose regions get ``$\\cmd$`` wraps; content already inside ``$...$``
+    gets bare ``\\cmd`` (no extra wrap, since surrounding ``$`` still
+    provides math mode).
+    """
+    mapping = _load()
+    if not mapping:
+        return s
+
+    parts = _INLINE_MATH_SPLIT.split(s)
+    # parts[0::2] = prose (always present, possibly empty)
+    # parts[1::2] = existing $...$ regions (including the $s themselves)
+    for i, part in enumerate(parts):
+        if not part:
+            continue
+        inside_math = i % 2 == 1  # odd index = $...$ region
+        if inside_math:
+            # Strip surrounding $ for symbol rewriting; re-wrap after
+            inner = part[1:-1]
+            for uc, cmd in mapping.items():
+                if uc in inner:
+                    inner = inner.replace(uc, cmd)
+            parts[i] = f"${inner}$"
+        else:
+            for uc, cmd in mapping.items():
+                if uc in part:
+                    parts[i] = part = part.replace(uc, f"${cmd}$")
+    return "".join(parts)
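
The split-and-replace technique can be tried without the YAML file or PyYAML. The sketch below is a condensed re-statement under assumptions, not the module itself: `MAPPING` is a hypothetical four-entry subset and `convert` is an illustrative name. It uses the same regex, so odd-indexed parts are existing `$...$` regions and even-indexed parts are prose.

```python
import re

# Hypothetical subset of the mapping, inlined so the sketch has no dependencies.
MAPPING = {"≥": r"\geq", "ρ": r"\rho", "²": "^{2}", "³": "^{3}"}

# Same pattern as the module: a capturing group makes re.split keep the
# $...$ regions, which land at the odd indices of the result.
INLINE_MATH = re.compile(r"(\$[^$\n]*\$)")


def convert(s: str) -> str:
    parts = INLINE_MATH.split(s)
    for i, part in enumerate(parts):
        in_math = i % 2 == 1
        for uc, cmd in MAPPING.items():
            # Bare command inside math, $-wrapped command in prose.
            part = part.replace(uc, cmd if in_math else f"${cmd}$")
        parts[i] = part
    return "".join(parts)


for sample in ["accuracy ≥ 95%", "$x ≥ 0.5$", "ρ = 0.7", "x² + y³"]:
    print(sample, "→", convert(sample))
```

This reproduces the four round-trips listed in the commit message. One property of wrapping each character separately is that two adjacent prose symbols, such as `ρ²`, come out as `$\rho$$^{2}$`; whether a later pass is expected to merge such pairs is not visible in this diff.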
