Commit e96dc3b

eugenep-rev and claude authored
NERD-3467: Clockspeed: Document speech team projects with document-it plugin (#64)
Add CLAUDE.md orientation file. Existing README.md left untouched.

Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 39576bb commit e96dc3b

1 file changed

Lines changed: 121 additions & 0 deletions

CLAUDE.md

@@ -0,0 +1,121 @@
# fstalign

`fstalign` is a tool for creating an alignment between two sequences of tokens (hereafter referred to as "reference" and "hypothesis"). It has two key functions: computing word error rate (WER) and aligning [NLP-formatted](docs/NLP-Format.md) references with CTM hypotheses.

## Stack

- **C++** 14 — `CMakeLists.txt`
  Frameworks: CLI11
- **Docker** — `Dockerfile`

## Architecture

`fstalign` is a single-binary CLI driven by `src/main.cpp` (CLI11). Two subcommands are exposed (`wer`, `align`); both follow the same pipeline:

1. **Load** the reference and hypothesis into OpenFST acceptors. Loaders are polymorphic on input type: `NlpFstLoader` (NLP-formatted reference), `OneBestFstLoader` (plain text), `FstFileLoader` (pre-built `.fst`), `Ctm` (timed CTM hypothesis). An optional `SynonymEngine` expands the reference acceptor with allowed alternates.
2. **Compose** reference × hypothesis. The strategy is selected by `--composition-approach`: `StandardComposition` is OpenFST's stock composition; `AdaptedComposition` (default) is a lazy/streaming variant tuned for long sequences with many errors and is the source of the v2.0 speedup. Both implement `IComposition`.
3. **Traverse** the composed graph to extract the best alignment path. `Walker` drives the search using a `PathHeap`, and `AlignmentTraversor` walks the resulting path to materialise per-token decisions (match / sub / ins / del) along with class and entity tags from the NLP reference.
4. **Score & emit**. `wer.cpp` computes WER and statistics broken out by class; `fast-d.cpp` provides the lazy edit-distance kernel used during traversal. Outputs include side-by-side text (`sbs.txt`), a JSON log (schema: `docs/json_log_schema.json`), an optional NLP file with insertions, and stdout WER summaries.

```
ref (NLP / 1-best / .fst) ──┐
                            ├─► Loaders ─► [SynonymEngine] ─► IComposition ─► Walker ─► AlignmentTraversor ─► wer / json_logging ─► sbs.txt, json log, stdout
hyp (CTM / 1-best / .fst) ──┘                              (Standard|Adapted) (PathHeap, fast-d)
```

All long-running work happens in `fstaligner-common` (the library); `main.cpp` is just CLI parsing and dispatch into the top-level entry points in `fstalign.cpp`.

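
The scoring step ultimately counts matches, substitutions, insertions, and deletions along the best alignment path. As a rough illustration of that bookkeeping only, the sketch below uses a textbook dense dynamic-programming edit distance over tokens, not the FST composition and `Walker` traversal that `fstalign` uses. `EditCounts` and `AlignTokens` are hypothetical names that do not appear in the codebase, and class and entity tags are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: hypothetical types, not taken from the fstalign source.
struct EditCounts {
  std::size_t matches = 0;
  std::size_t substitutions = 0;
  std::size_t insertions = 0;  // tokens present only in the hypothesis
  std::size_t deletions = 0;   // tokens present only in the reference
  std::size_t ref_length = 0;

  // WER = (S + D + I) / N, where N is the number of reference tokens.
  double Wer() const {
    assert(ref_length > 0 && "WER is undefined for an empty reference");
    return static_cast<double>(substitutions + deletions + insertions) /
           static_cast<double>(ref_length);
  }
};

EditCounts AlignTokens(const std::vector<std::string>& ref,
                       const std::vector<std::string>& hyp) {
  const std::size_t n = ref.size();
  const std::size_t m = hyp.size();

  // cost[i][j] = minimum number of edits turning ref[0..i) into hyp[0..j).
  std::vector<std::vector<std::size_t>> cost(
      n + 1, std::vector<std::size_t>(m + 1, 0));
  for (std::size_t i = 0; i <= n; ++i) cost[i][0] = i;
  for (std::size_t j = 0; j <= m; ++j) cost[0][j] = j;
  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const std::size_t diag =
          cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      cost[i][j] = std::min({diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1});
    }
  }

  // Walk back from the bottom-right corner, preferring the diagonal move,
  // and classify each step as match, substitution, deletion, or insertion.
  EditCounts counts;
  counts.ref_length = n;
  std::size_t i = n;
  std::size_t j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)) {
      if (ref[i - 1] == hyp[j - 1]) {
        ++counts.matches;
      } else {
        ++counts.substitutions;
      }
      --i;
      --j;
    } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
      ++counts.deletions;
      --i;
    } else {
      ++counts.insertions;
      --j;
    }
  }
  return counts;
}
```

For example, aligning the reference `a b c` against the hypothesis `a c` with this sketch yields two matches and one deletion, so the WER is 1/3. A dense table like this needs memory proportional to the product of the two lengths, which is one reason a lazy strategy is attractive for long inputs.
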
## External dependencies

- **OpenFST** (library) — finite-state transducer operations; provided via `OPENFST_ROOT` (env or `-DOPENFST_ROOT=` on the cmake line)
- **ICU** (library) — Unicode and locale support (`find_package(ICU)`)
- **Threads** (library) — system threading (`find_package(Threads)`)
- **CLI11** (library) — command-line argument parsing (vendored submodule)
- **spdlog** (library) — logging (vendored submodule)
- **jsoncpp** (library) — JSON output construction (vendored submodule)
- **csv** (library) — fast CTM and NLP CSV parsing (vendored submodule)
- **inih** (library) — INI file parsing (vendored submodule)
- **strtk** (library) — string utilities (vendored submodule)
- **catch2** (library) — unit testing (vendored submodule)
- **debian** (container) — Debian Bullseye base image (Dockerfile)

## Workspace

- Path: `fstalign`

**Build**

```bash
git submodule update --init --recursive
cmake -S . -B build -DOPENFST_ROOT="$OPENFST_ROOT" -DDYNAMIC_OPENFST=ON
cmake --build build
```

**Test**

```bash
ctest --test-dir build
```

**Lint**

```bash
clang-format --dry-run --Werror src/*.cpp src/*.h
```

**Run**

```bash
./build/fstalign --help
```

## Configuration

| Variable | Source | Default | Purpose |
| --- | --- | --- | --- |
| `OPENFST_ROOT` | env | | Path to OpenFST install; required at build (CMake reads it) and embedded in Docker image at `/opt/openfst` |
| `DYNAMIC_OPENFST` | flag | `OFF` | Pass `-DDYNAMIC_OPENFST=ON` to cmake when OpenFST at `OPENFST_ROOT` is built as shared libraries |
| `--ref` | flag | | Reference file (NLP format) |
| `--hyp` | flag | | Hypothesis file (CTM) |
| `--composition-approach` | flag | `adapted` | Composition strategy: `adapted` or `standard` |
| `--use-case` | flag | `false` | Case-sensitive comparison |
| `--use-punctuation` | flag | `false` | Include punctuation in alignment |
| `--disable-strict-punctuation` | flag | `false` | Allow punctuation to align with words |
| `--disable-favored-subs` | flag | `false` | Disable preference for case-only substitutions |
| `--favored-sub-cost` | flag | `0.1` | Cost for favored (case-only) substitutions |

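
Reading the descriptions of `--disable-favored-subs` and `--favored-sub-cost` literally, a case-only substitution is charged less than an ordinary one. A minimal sketch of such a cost function follows, under loudly stated assumptions: the unit cost of 1.0 for an ordinary substitution, the ASCII-only case folding, and the names `EqualIgnoringAsciiCase` and `SubstitutionCost` are all illustrative and not taken from the `fstalign` source.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper, not part of fstalign: true when two tokens differ at
// most in ASCII letter case.
bool EqualIgnoringAsciiCase(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  return std::equal(a.begin(), a.end(), b.begin(),
                    [](unsigned char x, unsigned char y) {
                      return std::tolower(x) == std::tolower(y);
                    });
}

// Assumed cost model: 0 for an exact match, `favored_sub_cost` for a
// case-only difference while favored substitutions are enabled, and 1.0
// (an assumed unit cost) for any other substitution.
double SubstitutionCost(const std::string& ref, const std::string& hyp,
                        bool favored_subs_enabled = true,
                        double favored_sub_cost = 0.1) {
  assert(favored_sub_cost >= 0.0 && "costs are expected to be non-negative");
  if (ref == hyp) return 0.0;
  if (favored_subs_enabled && EqualIgnoringAsciiCase(ref, hyp)) {
    return favored_sub_cost;
  }
  return 1.0;
}
```

Under these assumptions, `SubstitutionCost("Hello", "hello")` returns the favored cost, while passing `false` for `favored_subs_enabled` makes the same pair cost as much as any other substitution.
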
## Project layout

- `src/` — C++ implementation
- `test/` — Catch2 unit tests + test data
- `docs/` — NLP/Synonyms format specs, Usage guide, JSON log schema
- `sample_data/` — example reference and hypothesis inputs
- `tools/` — auxiliary scripts (`gather_runtime_metrics.sh`, `generate_wer_test_data.pl`, `sbs2fst.py`)
- `third-party/` — vendored dependencies (git submodules)
- `ext/` — external source tarballs (OpenFST) used by the Docker build

## Entry points

- `src/main.cpp` (executable) — `fstalign` (declared via `add_executable` in `CMakeLists.txt:85`)
- `Dockerfile` (CMD) — the image's `PATH` is set so that `/fstalign/bin/fstalign` can be invoked as `fstalign`

## Conventions

- Third-party C++ dependencies are vendored as git submodules under `third-party/`; initialize with `git submodule update --init --recursive` before building
- Tests live in `test/` as Catch2 `*.cc` files
- OpenFST is intentionally NOT vendored as a submodule; it is pulled at build time from `ext/openfst-<version>.tar.gz` (Docker) or pointed at via `OPENFST_ROOT` (host)

## Code style

- **Guide**: Google C++ Style Guide
- **Formatter**: `clang-format -style=file` (config: `.clang-format`, `BasedOnStyle: Google`)
- **Linters**: `clang-tidy`, `cppcheck`, `shellcheck` (for `tools/*.sh`)

## Current focus

_(last touched 2026-03-12 by Todd C. Parnell, Kirill Bykov, Jp)_

## Links

- Repository: git@github-work:revdotcom/fstalign.git
- Documentation: https://github.com/revdotcom/fstalign/blob/develop/docs/Usage.md