| 1 | +# fstalign |
| 2 | + |
| 3 | +`fstalign` is a tool for creating an alignment between two sequences of tokens (hereafter referred to as the "reference" and the "hypothesis"). It has two key functions: computing word error rate (WER) and aligning [NLP-formatted](docs/NLP-Format.md) references with CTM hypotheses. |
| 4 | + |
| 5 | +## Stack |
| 6 | + |
| 7 | +- **C++** 14 — `CMakeLists.txt` |
| 8 | + Frameworks: CLI11 |
| 9 | +- **Docker** — `Dockerfile` |
| 10 | + |
| 11 | +## Architecture |
| 12 | + |
| 13 | +`fstalign` is a single-binary CLI driven by `src/main.cpp` (CLI11). Two subcommands are exposed (`wer`, `align`); both follow the same pipeline: |
| 14 | + |
| 15 | +1. **Load** the reference and hypothesis into OpenFST acceptors. Loaders are polymorphic on input type: `NlpFstLoader` (NLP-formatted reference), `OneBestFstLoader` (plain text), `FstFileLoader` (pre-built `.fst`), and `Ctm` (timed CTM hypothesis). An optional `SynonymEngine` expands the reference acceptor with allowed alternates. |
| 16 | +2. **Compose** reference × hypothesis. The strategy is selected by `--composition-approach`: `StandardComposition` is OpenFST's stock composition; `AdaptedComposition` (the default) is a lazy/streaming variant tuned for long sequences with many errors, and it is the source of the v2.0 speedup. Both implement `IComposition`. |
| 17 | +3. **Traverse** the composed graph to extract the best alignment path. `Walker` drives the search using a `PathHeap`, and `AlignmentTraversor` walks the resulting path to materialise per-token decisions (match / sub / ins / del) along with class and entity tags from the NLP reference. |
| 18 | +4. **Score & emit**. `wer.cpp` computes WER and per-class statistics; `fast-d.cpp` provides the lazy edit-distance kernel used during traversal. Outputs include side-by-side text (`sbs.txt`), a JSON log (schema: `docs/json_log_schema.json`), an optional NLP file with insertions, and WER summaries on stdout. |
| 19 | + |
| 20 | +``` |
| 21 | +ref (NLP / 1-best / .fst) ──┐ |
| 22 | + ├─► Loaders ─► [SynonymEngine] ─► IComposition ─► Walker ─► AlignmentTraversor ─► wer / json_logging ─► sbs.txt, json log, stdout |
| 23 | +hyp (CTM / 1-best / .fst) ──┘ (Standard|Adapted) (PathHeap, fast-d) |
| 24 | +``` |
| 25 | + |
| 26 | +All long-running work happens in `fstaligner-common` (the library); `main.cpp` is just CLI parsing and dispatch into the top-level entry points in `fstalign.cpp`. |
| 27 | + |
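As a rough illustration of the score step, the sketch below counts substitutions, insertions, and deletions with a plain dynamic-programming Levenshtein alignment and derives WER from them. This is not how `fstalign` itself works (the pipeline above composes FSTs and walks the result); `WerCounts` and `CountErrors` are hypothetical names introduced only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Error counts for one reference/hypothesis pair (illustrative type).
struct WerCounts {
  std::size_t substitutions = 0;
  std::size_t insertions = 0;
  std::size_t deletions = 0;
  std::size_t ref_length = 0;

  // WER = (S + I + D) / N, where N is the number of reference tokens.
  double Wer() const {
    if (ref_length == 0) return 0.0;
    return static_cast<double>(substitutions + insertions + deletions) /
           static_cast<double>(ref_length);
  }
};

// Plain Levenshtein alignment over tokens, followed by a backtrace that
// classifies each edit. A stand-in for the FST-based alignment, not a copy.
WerCounts CountErrors(const std::vector<std::string>& ref,
                      const std::vector<std::string>& hyp) {
  const std::size_t n = ref.size();
  const std::size_t m = hyp.size();

  // cost[i][j] = minimum edits turning ref[0..i) into hyp[0..j).
  std::vector<std::vector<std::size_t>> cost(
      n + 1, std::vector<std::size_t>(m + 1, 0));
  for (std::size_t i = 0; i <= n; ++i) cost[i][0] = i;
  for (std::size_t j = 0; j <= m; ++j) cost[0][j] = j;

  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const std::size_t sub =
          cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      const std::size_t del = cost[i - 1][j] + 1;
      const std::size_t ins = cost[i][j - 1] + 1;
      cost[i][j] = std::min({sub, del, ins});
    }
  }

  // Walk back from the bottom-right corner, preferring the diagonal.
  WerCounts counts;
  counts.ref_length = n;
  std::size_t i = n;
  std::size_t j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        cost[i][j] ==
            cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)) {
      if (ref[i - 1] != hyp[j - 1]) ++counts.substitutions;
      --i;
      --j;
    } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
      ++counts.deletions;
      --i;
    } else {
      ++counts.insertions;
      --j;
    }
  }
  return counts;
}
```

For example, `CountErrors({"a", "b", "c"}, {"a", "c"})` reports one deletion over three reference tokens, a WER of 1/3.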
| 28 | +## External dependencies |
| 29 | + |
| 30 | +- **OpenFST** (library) — finite-state transducer operations; provided via `OPENFST_ROOT` (env or `-DOPENFST_ROOT=` on the cmake line) |
| 31 | +- **ICU** (library) — Unicode and locale support (`find_package(ICU)`) |
| 32 | +- **Threads** (library) — system threading (`find_package(Threads)`) |
| 33 | +- **CLI11** (library) — command-line argument parsing (vendored submodule) |
| 34 | +- **spdlog** (library) — logging (vendored submodule) |
| 35 | +- **jsoncpp** (library) — JSON output construction (vendored submodule) |
| 36 | +- **csv** (library) — fast CTM and NLP CSV parsing (vendored submodule) |
| 37 | +- **inih** (library) — INI file parsing (vendored submodule) |
| 38 | +- **strtk** (library) — string utilities (vendored submodule) |
| 39 | +- **catch2** (library) — unit testing (vendored submodule) |
| 40 | +- **debian** (container) — Debian Bullseye base image (Dockerfile) |
| 41 | + |
| 42 | +## Workspace |
| 43 | + |
| 44 | +- Path: `fstalign` |
| 45 | + |
| 46 | +**Build** |
| 47 | + |
| 48 | +```bash |
| 49 | +git submodule update --init --recursive |
| 50 | +cmake -S . -B build -DOPENFST_ROOT="$OPENFST_ROOT" -DDYNAMIC_OPENFST=ON |
| 51 | +cmake --build build |
| 52 | +``` |
| 53 | + |
| 54 | +**Test** |
| 55 | + |
| 56 | +```bash |
| 57 | +ctest --test-dir build |
| 58 | +``` |
| 59 | + |
| 60 | +**Lint** |
| 61 | + |
| 62 | +```bash |
| 63 | +clang-format --dry-run --Werror src/*.cpp src/*.h |
| 64 | +``` |
| 65 | + |
| 66 | +**Run** |
| 67 | + |
| 68 | +```bash |
| 69 | +./build/fstalign --help |
| 70 | +``` |
| 71 | + |
| 72 | +## Configuration |
| 73 | + |
| 74 | +| Variable | Source | Default | Purpose | |
| 75 | +| --- | --- | --- | --- | |
| 76 | +| `OPENFST_ROOT` | env | — | Path to the OpenFST install; required at build time (CMake reads it). In the Docker image, OpenFST lives at `/opt/openfst` |
| 77 | +| `DYNAMIC_OPENFST` | flag | `OFF` | Pass `-DDYNAMIC_OPENFST=ON` to cmake when OpenFST at `OPENFST_ROOT` is built as shared libraries | |
| 78 | +| `--ref` | flag | — | Reference file (NLP format) | |
| 79 | +| `--hyp` | flag | — | Hypothesis file (CTM) | |
| 80 | +| `--composition-approach` | flag | `adapted` | Composition strategy: `adapted` or `standard` | |
| 81 | +| `--use-case` | flag | `false` | Case-sensitive comparison | |
| 82 | +| `--use-punctuation` | flag | `false` | Include punctuation in alignment | |
| 83 | +| `--disable-strict-punctuation` | flag | `false` | Allow punctuation to align with words | |
| 84 | +| `--disable-favored-subs` | flag | `false` | Disable preference for case-only substitutions | |
| 85 | +| `--favored-sub-cost` | flag | `0.1` | Cost for favored (case-only) substitutions | |
| 86 | + |
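The favored-substitution options in the table above can be pictured as a cost function over token pairs. The sketch below assumes a regular substitution costs 1.0 (a value not stated above) and that "case-only" means the tokens are equal after ASCII lowercasing. `SubCostOptions`, `ToLowerAscii`, and `SubstitutionCost` are hypothetical names, not part of `fstalign`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical options struct; field names are illustrative only.
struct SubCostOptions {
  bool favored_subs_enabled = true;  // false models --disable-favored-subs
  double favored_sub_cost = 0.1;     // models the --favored-sub-cost default
  double regular_sub_cost = 1.0;     // assumed value, not from the source
};

// ASCII-only lowercasing; real Unicode case folding would need ICU.
std::string ToLowerAscii(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Cost of aligning reference token `ref` with hypothesis token `hyp`.
double SubstitutionCost(const std::string& ref, const std::string& hyp,
                        const SubCostOptions& opts) {
  if (ref == hyp) return 0.0;  // exact match, no substitution
  if (opts.favored_subs_enabled && ToLowerAscii(ref) == ToLowerAscii(hyp)) {
    return opts.favored_sub_cost;  // case-only difference: cheap substitution
  }
  return opts.regular_sub_cost;  // any other substitution
}
```

Under these assumptions, a lower favored cost makes the aligner prefer pairing `Hello` with `hello` over treating them as an unrelated deletion and insertion.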
| 87 | +## Project layout |
| 88 | + |
| 89 | +- `src/` — C++ implementation |
| 90 | +- `test/` — Catch2 unit tests + test data |
| 91 | +- `docs/` — NLP/Synonyms format specs, Usage guide, JSON log schema |
| 92 | +- `sample_data/` — example reference and hypothesis inputs |
| 93 | +- `tools/` — auxiliary scripts (`gather_runtime_metrics.sh`, `generate_wer_test_data.pl`, `sbs2fst.py`) |
| 94 | +- `third-party/` — vendored dependencies (git submodules) |
| 95 | +- `ext/` — external source tarballs (OpenFST) used by the Docker build |
| 96 | + |
| 97 | +## Entry points |
| 98 | + |
| 99 | +- `src/main.cpp` (executable) — `fstalign` (declared via `add_executable` in `CMakeLists.txt:85`) |
| 100 | +- `Dockerfile` (CMD) — the image's `PATH` is set so that `/fstalign/bin/fstalign` can be invoked as `fstalign` |
| 101 | + |
| 102 | +## Conventions |
| 103 | + |
| 104 | +- Third-party C++ dependencies are vendored as git submodules under `third-party/`; initialize with `git submodule update --init --recursive` before building |
| 105 | +- Tests live in `test/` as Catch2 `*.cc` files |
| 106 | +- OpenFST is intentionally NOT vendored — pulled at build time from `ext/openfst-<version>.tar.gz` (Docker) or pointed at via `OPENFST_ROOT` (host) |
| 107 | + |
| 108 | +## Code style |
| 109 | + |
| 110 | +- **Guide**: Google C++ Style Guide |
| 111 | +- **Formatter**: `clang-format -style=file` (config: `.clang-format`, `BasedOnStyle: Google`) |
| 112 | +- **Linters**: `clang-tidy`, `cppcheck`, `shellcheck` (for `tools/*.sh`) |
| 113 | + |
| 114 | +## Current focus |
| 115 | + |
| 116 | +_(last touched 2026-03-12 by Todd C. Parnell, Kirill Bykov, Jp)_ |
| 117 | + |
| 118 | +## Links |
| 119 | + |
| 120 | +- Repository: git@github-work:revdotcom/fstalign.git |
| 121 | +- Documentation: https://github.com/revdotcom/fstalign/blob/develop/docs/Usage.md |