Commit de97fde

tbitcs and oz-agent committed
merge: phase-next → main (June 2026 release)
Conflicts resolved by taking phase-next (authoritative) for all files. Main had 2 divergent commits (sign image reharvest, Dravidianist packet) whose content is superseded by the more recent phase-next work.

Co-Authored-By: Oz <oz-agent@warp.dev>
2 parents 9dd5b1a + 41770ad commit de97fde

1,331 files changed

Lines changed: 14914 additions & 4715 deletions

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -189,6 +189,13 @@ backend/glossa_lab/data/phase16_corpora/kalyanaraman_devanagari_corpus.txt
 # Phase-18 derived: large stream txt regenerable from CSV
 backend/glossa_lab/data/phase18_corpora/rv_padapatha_stream.txt
 
+# ---- Sign image raw download cache (reconstructable via harvest_ivc2tyc_signs.py) ----
+backend/static/signs/originals/ivc2tyc_cache/
+# The manifest.json and processed M*.png files ARE committed (authoritative).
+# The originals/ folder itself is also gitignored to keep the repo lean.
+backend/static/signs/originals/
+!backend/static/signs/originals/.gitkeep
+
 # ---- ML model weights (large binaries, not for version control) ----
 *.pt
 *.pth

ATTRIBUTION.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Attribution, Data Sources & Contact
+
+**Glossa-Lab** is an open-source AI-assisted research platform for the computational
+analysis of ancient and undeciphered writing systems. This project depends on the
+work of many scholars and data providers whose contributions we are committed to
+crediting accurately.
+
+---
+
+## If a citation or credit is missing — contact us immediately
+
+If you are a researcher, data provider, or rights-holder and you believe your work
+has been used without proper attribution, or if you have any concern about how your
+material appears in this project:
+
+**Please contact Tristen Kyle Pierson directly:**
+
+> **Email:** tpierson@bitconcepts.tech
+> **Subject line:** "Attribution concern — Glossa-Lab"
+
+We treat attribution concerns as urgent. You will receive a response within 48 hours.
+If the concern is valid, we will correct the attribution, update the repository, and
+update any published outputs immediately.
+
+You may also open a GitHub issue at:
+https://github.com/BitConcepts/glossa-lab/issues
+
+---
+
+## Primary data sources
+
+All data sources used in this project are documented in detail in
+[CITATIONS.md](./CITATIONS.md). Key sources include:
+
+| Source | Authors | License | Used for |
+|--------|---------|---------|---------|
+| Holdat LLC Indus Corpus v3 | Miller 2025 | Proprietary — statistical derivatives only, no raw data redistributed | Primary inscription corpus |
+| Mahadevan 1977 (M77) | Iravatham Mahadevan | Public domain (ASI / Govt. of India) | Sign numbering (M001–M397) |
+| DEDR | Burrow & Emeneau 1984 | © Clarendon Press — reference use | Dravidian etymological evidence |
+| Parpola 1994 / 2010 | Asko Parpola | © CUP / open conference paper | Decipherment framework, phoneme map |
+| ePSD2 | Tinney et al. / Penn | CC BY-SA | Sumerian/Akkadian name corpus |
+| CDLI | Englund et al. | CC BY-NC-SA 3.0 | Bibliographic reference only (no data committed) |
+| CISI Vols 1–3 | Joshi, Shah, Parpola et al. | © Suomalainen Tiedeakatemia | Reference only (no data redistributed) |
+| Wells 2006 / 2015 | Bryan K. Wells | Open access / © Archaeopress | Sign list cross-reference |
+| Fuls 2022/2023 | Andreas Fuls | © independently published | Sign catalog cross-reference |
+| ICIT | Wells & Fuls | Restricted (TU Berlin) | API reference; no data committed |
+| Nair 2026 | Ashish Nair | CC BY (arXiv) | Independent replication study cited |
+| Laursen 2010 | Steffen Terp Laursen | © Wiley / AAE | Gulf seal catalog, fish-sign validation |
+| Crawford 2001 | Harriet Crawford | © Archaeology International | Dilmun/Saar seal reference |
+| ePSD2 names subset | Penn Babylonian Section | CC BY-SA | Meluhhan name matching (null results) |
+| Tamburini 2025 | Fabio Tamburini | CC BY (Frontiers) | SA algorithm methodology reference |
+
+For the complete bibliography with BibTeX entries, license analysis, and per-file
+attribution, see [CITATIONS.md](./CITATIONS.md) and
+[research/indus/DATA_LICENSES.md](./research/indus/DATA_LICENSES.md).
+
+---
+
+## License compliance summary
+
+- **Holdat LLC corpus (proprietary):** Not redistributed. Only statistical
+  derivatives (positional frequencies, bigram counts, candidate readings) appear
+  in outputs.
+- **ePSD2 (CC BY-SA):** Used only for Meluhhan name matching experiments that
+  produced null results. Not incorporated into released research outputs.
+  The CC BY 4.0 licence on `research/indus/` outputs is unaffected.
+- **CDLI (CC BY-NC-SA):** No CDLI tablet text committed to this repository.
+  All CDLI references are bibliographic only.
+- **Copyrighted academic sources (CISI, Parpola 1994, Mahadevan 2003):** Used
+  as structured analytical references (sign numbers, phoneme assignments, crosswalk
+  mappings). No verbatim text reproduced. Defensible as academic fair use / fair
+  dealing.
+- **PyMuPDF (AGPL):** Used only in standalone research scripts, not in the
+  deployed backend. AGPL network-use provisions do not apply.
+
+Released research outputs (`research/indus/`, anchor tables, phase reports,
+supplemental datasets) are original analysis released under **CC BY 4.0**.
+
+---
+
+## Acknowledgements
+
+This project is indebted to the following scholars and institutions
+(see [CITATIONS.md §Acknowledgements](./CITATIONS.md) for full details):
+
+Iravatham Mahadevan (1930–2018) · Asko Parpola · Bryan K. Wells ·
+Andreas Fuls · William Miller Sr (Holdat LLC) · Ashish Nair ·
+Steffen Terp Laursen · Harriet Crawford · Petteri Koskikallio ·
+Roja Muthiah Research Library (Chennai) · University of Pennsylvania Museum ·
+TIFR (Rao, Yadav, Vahia, Joglekar, Adhikari) · Tamburini (Frontiers AI)
+
+---
+
+## How to cite Glossa-Lab
+
+```bibtex
+@software{glossalab2026,
+  author = {Pierson, Tristen Kyle},
+  title = {Glossa-Lab: An agentic computational linguistics research platform
+           for statistical analysis and decipherment of ancient writing systems},
+  year = {2026},
+  url = {https://github.com/BitConcepts/glossa-lab},
+  note = {BitConcepts LLC. MIT licence (source); CC BY 4.0 (research outputs).}
+}
+```
+
+---
+
+*Last reviewed: June 2026. Contact tpierson@bitconcepts.tech for any attribution
+concern — we respond within 48 hours.*

CITATION.cff

Lines changed: 4 additions & 2 deletions
@@ -32,11 +32,13 @@ references:
   - type: article
     title: >
       A Falsifiable Computational Decipherment Hypothesis for the Indus Valley Script:
-      605 Proto-Dravidian Sign Readings Validated Across Two Independent Corpora
+      161 Candidate Proto-Dravidian Anchors and a Three-Slot Positional Grammar
     authors:
       - family-names: Pierson
         given-names: Tristen Kyle
         affiliation: "BitConcepts LLC"
+        email: tpierson@bitconcepts.tech
     year: 2026
+    doi: "10.5281/ZENODO.20414696"
     notes: "Preprint v2 — Not peer-reviewed"
-    url: "https://github.com/BitConcepts/glossa-lab/tree/main/research/indus"
+    url: "https://doi.org/10.5281/ZENODO.20414696"

CITATIONS.md

Lines changed: 1 addition & 1 deletion
@@ -1218,7 +1218,7 @@ Additional acknowledgements since the last update:
 
 ---
 
-*Last updated: 2026-05-13.*
+*Last updated: June 2026. For attribution concerns contact tpierson@bitconcepts.tech — we respond within 48 hours. See also ATTRIBUTION.md.*
 
 ---
 

backend/glossa_lab/ai_utils.py

Lines changed: 59 additions & 4 deletions
@@ -171,6 +171,51 @@ def _get_provider_prefs() -> dict[str, Any]:
     return _load_keys().get(_PROVIDERS_KEY, {})
 
 
+# ── Per-provider circuit breaker ────────────────────────────────────────────
+# Providers that consistently fail (e.g. wrong API key, unsupported model)
+# waste latency on every LLM call. After _CIRCUIT_THRESHOLD consecutive
+# failures across calls we open the circuit for _CIRCUIT_DURATION seconds.
+# The circuit resets automatically on the first successful response.
+_CIRCUIT_THRESHOLD = 5  # open after this many consecutive failures
+_CIRCUIT_DURATION = 600  # 10 minutes in the open state before retrying
+_provider_fail_counts: dict[str, int] = {}  # provider_id → consecutive fails
+_provider_circuit_until: dict[str, float] = {}  # provider_id → wall-clock deadline
+
+
+import time as _time_utils  # noqa: E402
+
+
+def _circuit_is_open(provider_id: str, provider_name: str = "") -> bool:
+    """Return True and log a skip if the provider's circuit is open."""
+    until = _provider_circuit_until.get(provider_id, 0.0)
+    if until <= _time_utils.time():
+        return False
+    remaining = until - _time_utils.time()
+    _log.debug(
+        "call_llm: provider %s circuit open — %.0fs remaining, skipping",
+        provider_name or provider_id, remaining,
+    )
+    return True
+
+
+def _circuit_record_failure(provider_id: str, provider_name: str = "") -> None:
+    count = _provider_fail_counts.get(provider_id, 0) + 1
+    _provider_fail_counts[provider_id] = count
+    if count >= _CIRCUIT_THRESHOLD:
+        _provider_circuit_until[provider_id] = _time_utils.time() + _CIRCUIT_DURATION
+        _provider_fail_counts.pop(provider_id, None)  # reset counter for next window
+        _log.warning(
+            "call_llm: provider %s CIRCUIT OPEN after %d consecutive failures — "
+            "will skip for %.0f min. Fix the API key / model name in Settings → Providers.",
+            provider_name or provider_id, count, _CIRCUIT_DURATION / 60,
+        )
+
+
+def _circuit_record_success(provider_id: str) -> None:
+    _provider_fail_counts.pop(provider_id, None)
+    _provider_circuit_until.pop(provider_id, None)
+
+
 # Models known to use chain-of-thought / thinking tokens internally.
 # These require special handling with json_mode to avoid empty responses.
 _THINKING_MODEL_PATTERNS = (
@@ -302,7 +347,7 @@ def call_llm(
             max_tokens=max_tokens, temperature=temperature,
         )
 
-    # ── 1. Bucket-based resolution (new system) ──────────────────────
+    # ── 1. Bucket-based resolution (new system) ──────────────────
     if bucket:
         _excluded = set(exclude_provider_ids) if exclude_provider_ids else set()
         # Try up to 4 slots: bucket-primary, bucket-fallback, global-primary, global-fallback
@@ -311,25 +356,35 @@ def call_llm(
             if not resolved:
                 break
             prov = resolved["_provider"]
+            prov_id = prov["id"]
             model = resolved["model"]
             params = resolved.get("params") or {}
             eff_temp = params.get("temperature", temperature)
             eff_max = params.get("max_tokens", max_tokens)
             is_fb = resolved.get("rank", 1) == 2 or resolved.get("bucket") != bucket
+
+            # Skip providers whose circuit is open (too many consecutive failures).
+            if _circuit_is_open(prov_id, prov["name"]):
+                _excluded.add(prov_id)
+                continue
+
             _log.info(
                 "call_llm → bucket=%s provider=%s model=%s%s",
                 bucket, prov["name"], model,
                 " (fallback)" if is_fb else "",
             )
             try:
-                return _dispatch_provider(
+                result = _dispatch_provider(
                     prov, model, messages,
                     json_mode=json_mode, json_schema=json_schema,
                     max_tokens=eff_max, temperature=eff_temp,
                 )
+                _circuit_record_success(prov_id)  # reset failure counter on success
+                return result
             except RuntimeError as _rt_err:
-                # Connection refused or provider down → exclude and try next
-                _excluded.add(prov["id"])
+                # Record failure; open circuit when threshold is reached.
+                _circuit_record_failure(prov_id, prov["name"])
+                _excluded.add(prov_id)
                 _log.warning(
                     "call_llm: provider %s failed (%s), trying fallback",
                     prov["name"], type(_rt_err).__name__,
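
The circuit-breaker behaviour introduced in this file can be exercised in isolation. What follows is a minimal sketch under stated assumptions, not the project's code: the `CircuitBreaker` class, its injectable `clock` parameter, and `call_with_fallback` are hypothetical names introduced here so that the open, skip, and reset transitions can be checked without real providers or real waiting. The defaults mirror the constants in the diff (open after 5 consecutive failures, stay open for 600 seconds, clear state on success).

```python
import time
from typing import Any, Callable, Iterable


class CircuitBreaker:
    """Per-key circuit breaker (hypothetical stand-in for the module-level dicts)."""

    def __init__(self, threshold: int = 5, duration: float = 600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.threshold = threshold      # consecutive failures before opening
        self.duration = duration        # seconds the circuit stays open
        self.clock = clock              # injectable so tests can control time
        self.fail_counts: dict[str, int] = {}
        self.open_until: dict[str, float] = {}

    def is_open(self, key: str) -> bool:
        # Open only while the stored deadline lies in the future.
        return self.open_until.get(key, 0.0) > self.clock()

    def record_failure(self, key: str) -> None:
        count = self.fail_counts.get(key, 0) + 1
        self.fail_counts[key] = count
        if count >= self.threshold:
            self.open_until[key] = self.clock() + self.duration
            self.fail_counts.pop(key, None)  # start a fresh count next window

    def record_success(self, key: str) -> None:
        self.fail_counts.pop(key, None)
        self.open_until.pop(key, None)


def call_with_fallback(
    breaker: CircuitBreaker,
    providers: Iterable[tuple[str, Callable[[Any], Any]]],
    request: Any,
) -> Any:
    """Try each (provider_id, callable) in order, skipping open circuits."""
    for provider_id, send in providers:
        if breaker.is_open(provider_id):
            continue
        try:
            result = send(request)
        except RuntimeError:
            breaker.record_failure(provider_id)
            continue
        breaker.record_success(provider_id)
        return result
    raise RuntimeError("all providers failed or have an open circuit")
```

Passing the clock in as a parameter is a choice made for this sketch only; the diff calls `time.time()` directly. If wall-clock adjustments are a concern, `time.monotonic` could be supplied as the clock instead.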

backend/glossa_lab/api/ai_tools.py

Lines changed: 47 additions & 9 deletions
@@ -157,6 +157,7 @@ def _build_settings_context() -> str:
 compare_results: {"type":"compare_results", "params":{"file_a":"<report.json>","file_b":"..."}, "label":"...", "description":"..."} ← diff two experiment JSON result files
 summarize_session: {"type":"summarize_session","params":{"title":"..."}, "label":"...", "description":"..."} ← save conversation as notebook
 acquire_corpus: {"type":"acquire_corpus", "params":{"source_id":"<id>","name":"...","corpus_type":"ancient","url":"<opt>"}, "label":"...", "description":"..."} ← download a corpus
+build_tooling: {"type":"build_tooling", "params":{"name":"...","description":"...","code":"<opt python>","pipeline":"<opt>","experiment_id":"<opt>"}, "label":"...", "description":"..."} ← build/configure a research tool or utility
 
 ACQUIRABLE CORPUS SOURCE IDs:
 cdli_proto_elamite, cdli_sumerian_ur3, oracc_akkadian, sigla_linear_a,
@@ -176,19 +177,21 @@ def _build_settings_context() -> str:
 
 3. Map the discovery topic to the closest REGISTERED experiments using this guide:
 Fragmentary/incomplete texts, text gaps, restoration:
-decoded_text_repetition, blocker_sign_context, reading_frequency_zipf
+indus_validation_neg_controls, indus_structural_atlas, indus_cisi_structural
 RNN / neural / ML / computational linguistics methods:
-decoded_text_repetition, compound_semantic_coherence, rare_sign_neighbor_profile
+indus_cgsa_cluster_analysis, indus_structural_atlas, indus_sign_function_dravidian
 Cross-language comparison, phonological mapping:
 → indus_dravidian_vs_sanskrit, indus_cisi_dravidian_vs_sanskrit, indus_sign_function_dravidian
 Rural distribution, ceramic economy, trade, provenance:
-blocker_sign_context, reading_frequency_zipf, indus_cisi_structural
+indus_contact_zone_v2, indus_cisi_structural, indus_structural_atlas
 Sign frequency, Zipf law, statistical patterns:
-reading_frequency_zipf, rare_sign_neighbor_profile, decoded_text_repetition
+indus_structural_atlas, indus_contact_zone_v2, indus_cisi_structural
 Structural analysis, sign position, inscription layout:
-→ indus_cisi_structural, blocker_sign_context, compound_semantic_coherence
+→ indus_cisi_structural, indus_cgsa_cluster_analysis, indus_structural_atlas
 Anchors, validation, confidence building:
-→ indus_cisi_anchored_10, indus_validation_a1_a3_holdout
+→ indus_cisi_anchored_10, indus_validation_a1_a3_holdout, indus_validation_neg_controls
+General IVC archaeology, Indus script overview:
+→ indus_contact_zone_v2, indus_structural_atlas, indus_sa_dravidian
 
 4. Always include a create_hypothesis action alongside run_experiment actions to record
 the research question the discovery raised.
@@ -199,8 +202,8 @@ def _build_settings_context() -> str:
 "Based on this paper on fragmentary text analysis, I'll run three experiments that
 probe the same question from our existing corpus angle...
 %%ACTIONS%%
-[{"type":"run_experiment","params":{"id":"decoded_text_repetition"},"label":"Decoded Text Repetition","description":"Checks if decoded readings produce expected text repetition patterns consistent with real language."},
-{"type":"run_experiment","params":{"id":"blocker_sign_context"},"label":"Blocker Sign Context","description":"Identifies signs that appear in positions suggesting they carry structural/grammatical function."},
+[{"type":"run_experiment","params":{"id":"indus_validation_neg_controls"},"label":"Negative Controls Validation","description":"Checks if decoded readings pass negative-control statistical tests consistent with real language."},
+{"type":"run_experiment","params":{"id":"indus_structural_atlas"},"label":"Structural Atlas","description":"Analyses sign position, frequency distribution and structural roles across the CISI corpus."},
 {"type":"create_hypothesis","params":{"title":"RNN restoration insight","statement":"Fragmented Indus inscriptions may be restorable using positional frequency priors similar to Babylonian RNN approach."},"label":"Record Hypothesis","description":"Save the research question raised by this paper."}]
 %%END_ACTIONS%%"
 
@@ -212,6 +215,12 @@ def _build_settings_context() -> str:
 - NEVER use experiment IDs not in REGISTERED EXPERIMENT IDs.
 - NEVER reference file paths that don't exist (path_to_file.csv, data.json, etc.).
 - NEVER claim you cannot execute — you CAN run registered experiments.
+
+=== BUILD_SA_EXPERIMENT CORPUS NAMES (use EXACTLY one of these) ===
+Valid corpus values: indus, indus_cisi, indus_m77, hebrew, geez, phoenician,
+nw_semitic, ugaritic, meroitic, proto_sinaitic, linear_b, sanskrit, dravidian
+DO NOT use natural-language phrases like "indus valley civilization" — use indus_cisi instead.
+DO NOT use "indus script" or "IVC" — use indus or indus_cisi.
 === END GLOSSA LAB ACTIONS ==="""
 
 _REPORTS = Path(__file__).resolve().parent.parent.parent.parent / "reports"
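
The corpus-name rules added to the prompt above are instructions to the model only. A server-side guard could enforce the same list; the following is a hedged sketch of that idea, not something this commit contains. `VALID_CORPORA` copies the values from the prompt, while `ALIASES` and `normalize_corpus_name` are hypothetical names. Only the one mapping the prompt states unambiguously ("indus valley civilization" to `indus_cisi`) is treated as an alias; "indus script" and "IVC" are rejected because the prompt allows two targets for them.

```python
VALID_CORPORA = frozenset({
    "indus", "indus_cisi", "indus_m77", "hebrew", "geez", "phoenician",
    "nw_semitic", "ugaritic", "meroitic", "proto_sinaitic", "linear_b",
    "sanskrit", "dravidian",
})

# Hypothetical alias table: only the mapping the prompt spells out exactly.
ALIASES = {"indus_valley_civilization": "indus_cisi"}


def normalize_corpus_name(raw: str) -> str:
    """Return a valid corpus identifier or raise ValueError."""
    # Lowercase, then treat hyphens and runs of whitespace as underscores.
    key = "_".join(raw.strip().lower().replace("-", " ").split())
    key = ALIASES.get(key, key)
    if key not in VALID_CORPORA:
        valid = ", ".join(sorted(VALID_CORPORA))
        raise ValueError(f"unknown corpus {raw!r}; expected one of: {valid}")
    return key
```

Rejecting ambiguous phrases with an explicit error, instead of guessing a corpus, keeps a wrong model output visible to the caller.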
@@ -1413,6 +1422,35 @@ def _flatten(d: Any, prefix: str = "") -> dict[str, Any]:
         ),
     }
 
+    # ── build_tooling ──────────────────────────────────────────────────────────────
+    # The LLM sometimes emits build_tooling when it wants to create an analysis
+    # tool, utility script, or build/configure a research pipeline component.
+    # We route it to the most appropriate concrete action based on params:
+    #   code → execute_script
+    #   pipeline → run_pipeline
+    #   experiment_id / id → run_experiment
+    #   otherwise → create_notebook (records the tooling intent)
+    if t == "build_tooling":
+        if p.get("code"):
+            return await _execute_action_inner("execute_script", p)
+        if p.get("pipeline"):
+            return await _execute_action_inner("run_pipeline", p)
+        exp_id = p.get("experiment_id") or p.get("id", "")
+        if exp_id:
+            return await _execute_action_inner("run_experiment", {**p, "id": exp_id})
+        # Fallback: save as notebook entry so the intent is not lost
+        return await _execute_action_inner(
+            "create_notebook",
+            {
+                "title": p.get("title", "Tooling: " + p.get("name", "AI-proposed tool")),
+                "content": (
+                    f"**Tool requested by AI:**\n\n"
+                    f"{p.get('description', p.get('label', 'No description provided.'))}\n\n"
+                    f"Params: {json.dumps(p, indent=2)}"
+                ),
+            },
+        )
+
     # ── summarize_session ──────────────────────────────────────────────────────────
     if t == "summarize_session":
         from glossa_lab.database import get_db  # noqa: PLC0415
@@ -1437,7 +1475,7 @@ def _flatten(d: Any, prefix: str = "") -> dict[str, Any]:
         "open_view", "run_experiment", "run_pipeline", "change_setting",
         "generate_report", "create_hypothesis", "create_notebook",
         "clear_jobs", "execute_script", "query_corpus", "compare_results",
-        "acquire_corpus", "summarize_session",
+        "acquire_corpus", "summarize_session", "build_tooling",
     ]
     raise HTTPException(
         400,
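
The routing precedence of the new `build_tooling` branch is easier to check when separated from the async dispatcher. The sketch below restates it as a pure function under stated assumptions: `resolve_build_tooling` is a hypothetical helper that does not exist in the repository, and it returns the `(action_type, params)` pair that the branch above would hand to `_execute_action_inner` without executing anything.

```python
import json
from typing import Any


def resolve_build_tooling(params: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Map build_tooling params to a concrete (action_type, params) pair."""
    if params.get("code"):
        return "execute_script", params
    if params.get("pipeline"):
        return "run_pipeline", params
    exp_id = params.get("experiment_id") or params.get("id", "")
    if exp_id:
        # run_experiment expects the identifier under "id".
        return "run_experiment", {**params, "id": exp_id}
    # Fallback: record the request as a notebook entry.
    name = params.get("name", "AI-proposed tool")
    description = params.get(
        "description", params.get("label", "No description provided.")
    )
    return "create_notebook", {
        "title": params.get("title", "Tooling: " + name),
        "content": (
            "**Tool requested by AI:**\n\n"
            f"{description}\n\n"
            f"Params: {json.dumps(params, indent=2)}"
        ),
    }
```

Two properties follow from the order of the checks: `code` wins when several keys are present, and an empty string counts as absent, so `{"code": ""}` falls through to the notebook fallback.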
