BitConcepts
diff --git a/.gitignore b/.gitignore
Lines changed: 7 additions & 0 deletions
diff --git a/ATTRIBUTION.md b/ATTRIBUTION.md
Lines changed: 110 additions & 0 deletions
diff --git a/CITATION.cff b/CITATION.cff
Lines changed: 4 additions & 2 deletions
diff --git a/CITATIONS.md b/CITATIONS.md
Lines changed: 1 addition & 1 deletion
diff --git a/backend/glossa_lab/ai_utils.py b/backend/glossa_lab/ai_utils.py
Lines changed: 59 additions & 4 deletions
diff --git a/backend/glossa_lab/api/ai_tools.py b/backend/glossa_lab/api/ai_tools.py
Lines changed: 47 additions & 9 deletions
@@ -189,6 +189,13 @@ backend/glossa_lab/data/phase16_corpora/kalyanaraman_devanagari_corpus.txt
 # Phase-18 derived: large stream txt regenerable from CSV
 backend/glossa_lab/data/phase18_corpora/rv_padapatha_stream.txt
 
+# ---- Sign image raw download cache (reconstructable via harvest_ivc2tyc_signs.py) ----
+backend/static/signs/originals/ivc2tyc_cache/
+# The manifest.json and processed M*.png files ARE committed (authoritative).
+# The contents of originals/ are also gitignored to keep the repo lean; the trailing /* (not the directory itself) lets the .gitkeep negation below take effect.
+backend/static/signs/originals/*
+!backend/static/signs/originals/.gitkeep
+
 # ---- ML model weights (large binaries, not for version control) ----
 *.pt
 *.pth
 
@@ -0,0 +1,110 @@
+# Attribution, Data Sources & Contact
+
+**Glossa-Lab** is an open-source AI-assisted research platform for the computational
+analysis of ancient and undeciphered writing systems. This project depends on the
+work of many scholars and data providers whose contributions we are committed to
+crediting accurately.
+
+---
+
+## If a citation or credit is missing — contact us immediately
+
+If you are a researcher, data provider, or rights-holder and you believe your work
+has been used without proper attribution, or if you have any concern about how your
+material appears in this project:
+
+**Please contact Tristen Kyle Pierson directly:**
+
+> **Email:** tpierson@bitconcepts.tech  
+> **Subject line:** "Attribution concern — Glossa-Lab"
+
+We treat attribution concerns as urgent. You will receive a response within 48 hours.
+If the concern is valid, we will correct the attribution, update the repository, and
+update any published outputs immediately.
+
+You may also open a GitHub issue at:
+https://github.com/BitConcepts/glossa-lab/issues
+
+---
+
+## Primary data sources
+
+All data sources used in this project are documented in detail in
+[CITATIONS.md](./CITATIONS.md). Key sources include:
+
+| Source | Authors | License | Used for |
+|--------|---------|---------|---------|
+| Holdat LLC Indus Corpus v3 | Miller 2025 | Proprietary — statistical derivatives only, no raw data redistributed | Primary inscription corpus |
+| Mahadevan 1977 (M77) | Iravatham Mahadevan | Public domain (ASI / Govt. of India) | Sign numbering (M001–M397) |
+| DEDR | Burrow & Emeneau 1984 | © Clarendon Press — reference use | Dravidian etymological evidence |
+| Parpola 1994 / 2010 | Asko Parpola | © CUP / open conference paper | Decipherment framework, phoneme map |
+| ePSD2 | Tinney et al. / Penn | CC BY-SA | Sumerian/Akkadian name corpus |
+| CDLI | Englund et al. | CC BY-NC-SA 3.0 | Bibliographic reference only (no data committed) |
+| CISI Vols 1–3 | Joshi, Shah, Parpola et al. | © Suomalainen Tiedeakatemia | Reference only (no data redistributed) |
+| Wells 2006 / 2015 | Bryan K. Wells | Open access / © Archaeopress | Sign list cross-reference |
+| Fuls 2022/2023 | Andreas Fuls | © independently published | Sign catalog cross-reference |
+| ICIT | Wells & Fuls | Restricted (TU Berlin) | API reference; no data committed |
+| Nair 2026 | Ashish Nair | CC BY (arXiv) | Independent replication study cited |
+| Laursen 2010 | Steffen Terp Laursen | © Wiley / AAE | Gulf seal catalog, fish-sign validation |
+| Crawford 2001 | Harriet Crawford | © Archaeology International | Dilmun/Saar seal reference |
+| ePSD2 names subset | Penn Babylonian Section | CC BY-SA | Meluhhan name matching (null results) |
+| Tamburini 2025 | Fabio Tamburini | CC BY (Frontiers) | SA algorithm methodology reference |
+
+For the complete bibliography with BibTeX entries, license analysis, and per-file
+attribution, see [CITATIONS.md](./CITATIONS.md) and
+[research/indus/DATA_LICENSES.md](./research/indus/DATA_LICENSES.md).
+
+---
+
+## License compliance summary
+
+- **Holdat LLC corpus (proprietary):** Not redistributed. Only statistical
+  derivatives (positional frequencies, bigram counts, candidate readings) appear
+  in outputs.
+- **ePSD2 (CC BY-SA):** Used only for Meluhhan name matching experiments that
+  produced null results. Not incorporated into released research outputs.
+  The CC BY 4.0 licence on `research/indus/` outputs is unaffected.
+- **CDLI (CC BY-NC-SA):** No CDLI tablet text committed to this repository.
+  All CDLI references are bibliographic only.
+- **Copyrighted academic sources (CISI, Parpola 1994, Mahadevan 2003):** Used
+  as structured analytical references (sign numbers, phoneme assignments, crosswalk
+  mappings). No verbatim text reproduced. Defensible as academic fair use / fair
+  dealing.
+- **PyMuPDF (AGPL):** Used only in standalone research scripts, not in the
+  deployed backend. AGPL network-use provisions do not apply.
+
+Released research outputs (`research/indus/`, anchor tables, phase reports,
+supplemental datasets) are original analysis released under **CC BY 4.0**.
+
+---
+
+## Acknowledgements
+
+This project is indebted to the following scholars and institutions
+(see [CITATIONS.md §Acknowledgements](./CITATIONS.md) for full details):
+
+Iravatham Mahadevan (1930–2018) · Asko Parpola · Bryan K. Wells ·
+Andreas Fuls · William Miller Sr (Holdat LLC) · Ashish Nair ·
+Steffen Terp Laursen · Harriet Crawford · Petteri Koskikallio ·
+Roja Muthiah Research Library (Chennai) · University of Pennsylvania Museum ·
+TIFR (Rao, Yadav, Vahia, Joglekar, Adhikari) · Tamburini (Frontiers AI)
+
+---
+
+## How to cite Glossa-Lab
+
+```bibtex
+@software{glossalab2026,
+  author = {Pierson, Tristen Kyle},
+  title  = {Glossa-Lab: An agentic computational linguistics research platform
+            for statistical analysis and decipherment of ancient writing systems},
+  year   = {2026},
+  url    = {https://github.com/BitConcepts/glossa-lab},
+  note   = {BitConcepts LLC. MIT licence (source); CC BY 4.0 (research outputs).}
+}
+```
+
+---
+
+*Last reviewed: June 2026. Contact tpierson@bitconcepts.tech for any attribution
+concern — we respond within 48 hours.*
@@ -32,11 +32,13 @@ references:
   - type: article
     title: >
       A Falsifiable Computational Decipherment Hypothesis for the Indus Valley Script:
-      605 Proto-Dravidian Sign Readings Validated Across Two Independent Corpora
+      161 Candidate Proto-Dravidian Anchors and a Three-Slot Positional Grammar
     authors:
       - family-names: Pierson
         given-names: Tristen Kyle
         affiliation: "BitConcepts LLC"
+        email: tpierson@bitconcepts.tech
     year: 2026
+    doi: "10.5281/ZENODO.20414696"
     notes: "Preprint v2 — Not peer-reviewed"
-    url: "https://github.com/BitConcepts/glossa-lab/tree/main/research/indus"
+    url: "https://doi.org/10.5281/ZENODO.20414696"
@@ -1218,7 +1218,7 @@ Additional acknowledgements since the last update:
 
 ---
 
-*Last updated: 2026-05-13.*
+*Last updated: June 2026. For attribution concerns contact tpierson@bitconcepts.tech — we respond within 48 hours. See also ATTRIBUTION.md.*
 
 ---
 
 
@@ -171,6 +171,51 @@ def _get_provider_prefs() -> dict[str, Any]:
     return _load_keys().get(_PROVIDERS_KEY, {})
 
 
+# ── Per-provider circuit breaker ────────────────────────────────────────────
+# Providers that consistently fail (e.g. wrong API key, unsupported model)
+# waste latency on every LLM call.  After _CIRCUIT_THRESHOLD consecutive
+# failures across calls we open the circuit for _CIRCUIT_DURATION seconds.
+# The circuit resets automatically on the first successful response.
+_CIRCUIT_THRESHOLD = 5     # open after this many consecutive failures
+_CIRCUIT_DURATION  = 600   # 10 minutes in the open state before retrying
+_provider_fail_counts: dict[str, int]   = {}  # provider_id → consecutive fails
+_provider_circuit_until: dict[str, float] = {}  # provider_id → wall-clock deadline
+
+
+import time as _time_utils  # noqa: E402
+
+
+def _circuit_is_open(provider_id: str, provider_name: str = "") -> bool:
+    """Return True and log a skip if the provider's circuit is open."""
+    until = _provider_circuit_until.get(provider_id, 0.0)
+    if until <= _time_utils.time():
+        return False
+    remaining = until - _time_utils.time()
+    _log.debug(
+        "call_llm: provider %s circuit open — %.0fs remaining, skipping",
+        provider_name or provider_id, remaining,
+    )
+    return True
+
+
+def _circuit_record_failure(provider_id: str, provider_name: str = "") -> None:
+    count = _provider_fail_counts.get(provider_id, 0) + 1
+    _provider_fail_counts[provider_id] = count
+    if count >= _CIRCUIT_THRESHOLD:
+        _provider_circuit_until[provider_id] = _time_utils.time() + _CIRCUIT_DURATION
+        _provider_fail_counts.pop(provider_id, None)  # reset counter for next window
+        _log.warning(
+            "call_llm: provider %s CIRCUIT OPEN after %d consecutive failures — "
+            "will skip for %.0f min. Fix the API key / model name in Settings → Providers.",
+            provider_name or provider_id, count, _CIRCUIT_DURATION / 60,
+        )
+
+
+def _circuit_record_success(provider_id: str) -> None:
+    _provider_fail_counts.pop(provider_id, None)
+    _provider_circuit_until.pop(provider_id, None)
+
+
 # Models known to use chain-of-thought / thinking tokens internally.
 # These require special handling with json_mode to avoid empty responses.
 _THINKING_MODEL_PATTERNS = (
@@ -302,7 +347,7 @@ def call_llm(
             max_tokens=max_tokens, temperature=temperature,
         )
 
-    # ── 1. Bucket-based resolution (new system) ──────────────────────
+    # ── 1. Bucket-based resolution (new system) ──────────────────
     if bucket:
         _excluded = set(exclude_provider_ids) if exclude_provider_ids else set()
         # Try up to 4 slots: bucket-primary, bucket-fallback, global-primary, global-fallback
@@ -311,25 +356,35 @@ def call_llm(
             if not resolved:
                 break
             prov = resolved["_provider"]
+            prov_id = prov["id"]
             model = resolved["model"]
             params = resolved.get("params") or {}
             eff_temp = params.get("temperature", temperature)
             eff_max = params.get("max_tokens", max_tokens)
             is_fb = resolved.get("rank", 1) == 2 or resolved.get("bucket") != bucket
+
+            # Skip providers whose circuit is open (too many consecutive failures).
+            if _circuit_is_open(prov_id, prov["name"]):
+                _excluded.add(prov_id)
+                continue
+
             _log.info(
                 "call_llm → bucket=%s provider=%s model=%s%s",
                 bucket, prov["name"], model,
                 " (fallback)" if is_fb else "",
             )
             try:
-                return _dispatch_provider(
+                result = _dispatch_provider(
                     prov, model, messages,
                     json_mode=json_mode, json_schema=json_schema,
                     max_tokens=eff_max, temperature=eff_temp,
                 )
+                _circuit_record_success(prov_id)  # reset failure counter on success
+                return result
             except RuntimeError as _rt_err:
-                # Connection refused or provider down → exclude and try next
-                _excluded.add(prov["id"])
+                # Record failure; open circuit when threshold is reached.
+                _circuit_record_failure(prov_id, prov["name"])
+                _excluded.add(prov_id)
                 _log.warning(
                     "call_llm: provider %s failed (%s), trying fallback",
                     prov["name"], type(_rt_err).__name__,
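
The circuit-breaker logic added in this hunk can be exercised in isolation. The following is a minimal sketch, not the module's actual code: `ProviderCircuit` and its `clock` parameter are hypothetical names introduced for illustration, and it uses `time.monotonic` by default where the diff uses `time.time()`. It mirrors the same rules: open after N consecutive failures, skip the provider until the deadline passes, and clear state on success.

```python
import time
from typing import Callable


class ProviderCircuit:
    """Consecutive-failure circuit breaker keyed by provider id (illustrative only)."""

    def __init__(
        self,
        threshold: int = 5,
        duration: float = 600.0,
        clock: Callable[[], float] = time.monotonic,  # assumption: injectable clock
    ) -> None:
        self.threshold = threshold
        self.duration = duration
        self._clock = clock
        self._fails: dict[str, int] = {}
        self._open_until: dict[str, float] = {}

    def is_open(self, provider_id: str) -> bool:
        """Return True while the provider should be skipped."""
        return self._open_until.get(provider_id, 0.0) > self._clock()

    def record_failure(self, provider_id: str) -> None:
        count = self._fails.get(provider_id, 0) + 1
        self._fails[provider_id] = count
        if count >= self.threshold:
            # Open the circuit and start a fresh failure window.
            self._open_until[provider_id] = self._clock() + self.duration
            self._fails.pop(provider_id, None)

    def record_success(self, provider_id: str) -> None:
        # A success clears both the failure counter and any stored deadline.
        self._fails.pop(provider_id, None)
        self._open_until.pop(provider_id, None)


# Usage with a fake clock so the example runs without sleeping.
now = [0.0]
circuit = ProviderCircuit(threshold=3, duration=60.0, clock=lambda: now[0])
for _ in range(3):
    circuit.record_failure("provider-a")
opened = circuit.is_open("provider-a")    # three failures reached the threshold
now[0] = 61.0
reopened = circuit.is_open("provider-a")  # deadline has passed, provider is tried again
```

Injecting the clock keeps the sketch testable; a monotonic clock also avoids deadlines shifting if the system time is adjusted, which is a design choice of this sketch rather than something the diff does.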
 
@@ -157,6 +157,7 @@ def _build_settings_context() -> str:
   compare_results:   {"type":"compare_results",  "params":{"file_a":"<report.json>","file_b":"..."},    "label":"...", "description":"..."}   ← diff two experiment JSON result files
   summarize_session: {"type":"summarize_session","params":{"title":"..."},                              "label":"...", "description":"..."}   ← save conversation as notebook
   acquire_corpus:    {"type":"acquire_corpus",  "params":{"source_id":"<id>","name":"...","corpus_type":"ancient","url":"<opt>"},  "label":"...", "description":"..."}   ← download a corpus
+  build_tooling:     {"type":"build_tooling",   "params":{"name":"...","description":"...","code":"<opt python>","pipeline":"<opt>","experiment_id":"<opt>"},  "label":"...", "description":"..."}   ← build/configure a research tool or utility
 
 ACQUIRABLE CORPUS SOURCE IDs:
   cdli_proto_elamite, cdli_sumerian_ur3, oracc_akkadian, sigla_linear_a,
@@ -176,19 +177,21 @@ def _build_settings_context() -> str:
 
 3. Map the discovery topic to the closest REGISTERED experiments using this guide:
    Fragmentary/incomplete texts, text gaps, restoration:
-     → decoded_text_repetition, blocker_sign_context, reading_frequency_zipf
+     → indus_validation_neg_controls, indus_structural_atlas, indus_cisi_structural
    RNN / neural / ML / computational linguistics methods:
-     → decoded_text_repetition, compound_semantic_coherence, rare_sign_neighbor_profile
+     → indus_cgsa_cluster_analysis, indus_structural_atlas, indus_sign_function_dravidian
    Cross-language comparison, phonological mapping:
      → indus_dravidian_vs_sanskrit, indus_cisi_dravidian_vs_sanskrit, indus_sign_function_dravidian
    Rural distribution, ceramic economy, trade, provenance:
-     → blocker_sign_context, reading_frequency_zipf, indus_cisi_structural
+     → indus_contact_zone_v2, indus_cisi_structural, indus_structural_atlas
    Sign frequency, Zipf law, statistical patterns:
-     → reading_frequency_zipf, rare_sign_neighbor_profile, decoded_text_repetition
+     → indus_structural_atlas, indus_contact_zone_v2, indus_cisi_structural
    Structural analysis, sign position, inscription layout:
-     → indus_cisi_structural, blocker_sign_context, compound_semantic_coherence
+     → indus_cisi_structural, indus_cgsa_cluster_analysis, indus_structural_atlas
    Anchors, validation, confidence building:
-     → indus_cisi_anchored_10, indus_validation_a1_a3_holdout
+     → indus_cisi_anchored_10, indus_validation_a1_a3_holdout, indus_validation_neg_controls
+   General IVC archaeology, Indus script overview:
+     → indus_contact_zone_v2, indus_structural_atlas, indus_sa_dravidian
 
 4. Always include a create_hypothesis action alongside run_experiment actions to record
    the research question the discovery raised.
@@ -199,8 +202,8 @@ def _build_settings_context() -> str:
   "Based on this paper on fragmentary text analysis, I'll run three experiments that
    probe the same question from our existing corpus angle...
    %%ACTIONS%%
-   [{"type":"run_experiment","params":{"id":"decoded_text_repetition"},"label":"Decoded Text Repetition","description":"Checks if decoded readings produce expected text repetition patterns consistent with real language."},
-    {"type":"run_experiment","params":{"id":"blocker_sign_context"},"label":"Blocker Sign Context","description":"Identifies signs that appear in positions suggesting they carry structural/grammatical function."},
+   [{"type":"run_experiment","params":{"id":"indus_validation_neg_controls"},"label":"Negative Controls Validation","description":"Checks if decoded readings pass negative-control statistical tests consistent with real language."},
+    {"type":"run_experiment","params":{"id":"indus_structural_atlas"},"label":"Structural Atlas","description":"Analyses sign position, frequency distribution and structural roles across the CISI corpus."},
     {"type":"create_hypothesis","params":{"title":"RNN restoration insight","statement":"Fragmented Indus inscriptions may be restorable using positional frequency priors similar to Babylonian RNN approach."},"label":"Record Hypothesis","description":"Save the research question raised by this paper."}]
    %%END_ACTIONS%%"
 
@@ -212,6 +215,12 @@ def _build_settings_context() -> str:
 - NEVER use experiment IDs not in REGISTERED EXPERIMENT IDs.
 - NEVER reference file paths that don't exist (path_to_file.csv, data.json, etc.).
 - NEVER claim you cannot execute — you CAN run registered experiments.
+
+=== BUILD_SA_EXPERIMENT CORPUS NAMES (use EXACTLY one of these) ===
+Valid corpus values: indus, indus_cisi, indus_m77, hebrew, geez, phoenician,
+  nw_semitic, ugaritic, meroitic, proto_sinaitic, linear_b, sanskrit, dravidian
+DO NOT use natural-language phrases like "indus valley civilization" — use indus_cisi instead.
+DO NOT use "indus script" or "IVC" — use indus or indus_cisi.
 === END GLOSSA LAB ACTIONS ==="""
 
 _REPORTS = Path(__file__).resolve().parent.parent.parent.parent / "reports"
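
The corpus-name rule stated in the prompt above could also be enforced in code before an action runs. This is a hedged sketch only: `VALID_CORPORA` and `check_corpus` are hypothetical names, not part of the repository, and the set simply copies the values listed in the prompt text.

```python
# Corpus identifiers copied from the prompt text; the helper itself is hypothetical.
VALID_CORPORA = frozenset({
    "indus", "indus_cisi", "indus_m77", "hebrew", "geez", "phoenician",
    "nw_semitic", "ugaritic", "meroitic", "proto_sinaitic", "linear_b",
    "sanskrit", "dravidian",
})


def check_corpus(value: str) -> str:
    """Return the corpus name unchanged if it is an exact match, else raise."""
    if value not in VALID_CORPORA:
        raise ValueError(
            f"unknown corpus {value!r}; expected one of {sorted(VALID_CORPORA)}"
        )
    return value


ok = check_corpus("indus_cisi")  # exact identifier passes through unchanged
try:
    check_corpus("indus valley civilization")  # natural-language phrase is rejected
    rejected = False
except ValueError:
    rejected = True
```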
@@ -1413,6 +1422,35 @@ def _flatten(d: Any, prefix: str = "") -> dict[str, Any]:
             ),
         }
 
+    # ── build_tooling ──────────────────────────────────────────────────────────────
+    # The LLM sometimes emits build_tooling when it wants to create an analysis
+    # tool, utility script, or build/configure a research pipeline component.
+    # We route it to the most appropriate concrete action based on params:
+    #   code → execute_script
+    #   pipeline → run_pipeline
+    #   experiment_id / id → run_experiment
+    #   otherwise → create_notebook (records the tooling intent)
+    if t == "build_tooling":
+        if p.get("code"):
+            return await _execute_action_inner("execute_script", p)
+        if p.get("pipeline"):
+            return await _execute_action_inner("run_pipeline", p)
+        exp_id = p.get("experiment_id") or p.get("id", "")
+        if exp_id:
+            return await _execute_action_inner("run_experiment", {**p, "id": exp_id})
+        # Fallback: save as notebook entry so the intent is not lost
+        return await _execute_action_inner(
+            "create_notebook",
+            {
+                "title": p.get("title", "Tooling: " + p.get("name", "AI-proposed tool")),
+                "content": (
+                    f"**Tool requested by AI:**\n\n"
+                    f"{p.get('description', p.get('label', 'No description provided.'))}\n\n"
+                    f"Params: {json.dumps(p, indent=2)}"
+                ),
+            },
+        )
+
     # ── summarize_session ──────────────────────────────────────────────────────────
     if t == "summarize_session":
         from glossa_lab.database import get_db  # noqa: PLC0415
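
The routing precedence described in the `build_tooling` comment (code, then pipeline, then experiment id, then notebook fallback) can be checked with a pure function. This is a sketch under assumptions: `resolve_build_tooling` is a hypothetical helper that only returns the target action name and params, whereas the handler in the diff awaits `_execute_action_inner` directly.

```python
from typing import Any


def resolve_build_tooling(params: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Pick the concrete action a build_tooling request maps to (hypothetical helper)."""
    if params.get("code"):
        return "execute_script", params
    if params.get("pipeline"):
        return "run_pipeline", params
    exp_id = params.get("experiment_id") or params.get("id", "")
    if exp_id:
        return "run_experiment", {**params, "id": exp_id}
    # Nothing actionable: record the intent as a notebook entry instead.
    title = params.get("title", "Tooling: " + params.get("name", "AI-proposed tool"))
    return "create_notebook", {"title": title}


examples = [
    {"name": "zipf", "code": "print('hi')", "pipeline": "ignored"},
    {"name": "atlas", "experiment_id": "indus_structural_atlas"},
    {"name": "idea only", "description": "sketch a tool"},
]
routed = [resolve_build_tooling(p)[0] for p in examples]
```

Keeping the decision separate from execution makes the precedence easy to unit-test; note that a request carrying both `code` and `pipeline` resolves to `execute_script`, matching the order of the checks in the diff.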
@@ -1437,7 +1475,7 @@ def _flatten(d: Any, prefix: str = "") -> dict[str, Any]:
         "open_view", "run_experiment", "run_pipeline", "change_setting",
         "generate_report", "create_hypothesis", "create_notebook",
         "clear_jobs", "execute_script", "query_corpus", "compare_results",
-        "acquire_corpus", "summarize_session",
+        "acquire_corpus", "summarize_session", "build_tooling",
     ]
     raise HTTPException(
         400,