
Commit 150ee01

TimoLassmann and claude committed
Cleanup for v3.5.0: license, modes, remove probmsa and dev scripts
- Change license from GPL-3.0 to Apache-2.0
- Remove per-file license headers (COPYING at root is sufficient)
- Add unified mode interface (--fast, --precise) to CLI and Python API
- Add mode tests, CHANGELOG, update README and Python docs
- Remove probmsa (experimental, not released)
- Remove 29 one-off benchmark sweep/experiment scripts
- Keep core benchmark infrastructure and paper-figure scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 54c0939 commit 150ee01

98 files changed

Lines changed: 2541 additions & 13704 deletions


CHANGELOG.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Changelog

## [3.5.0] - Unreleased

### Headline: Three named modes

Kalign v3.5 introduces three named modes that package the best-performing
configurations into simple flags:

| Mode | CLI | Python | Description |
|------|-----|--------|-------------|
| **default** | `kalign` | `mode="default"` or omit | Consistency anchors + VSM. Best general-purpose. |
| **fast** | `kalign --fast` | `mode="fast"` | VSM only. Fastest, equivalent to kalign v3.4. |
| **precise** | `kalign --precise` | `mode="precise"` | Ensemble(3) + VSM + realign. Highest precision. |

Explicit parameters always override mode defaults.

**Breaking change:** The default behavior now includes consistency anchors
(K=5) and VSM. To get the old v3.4 behavior, use `--fast` or `mode="fast"`.
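
The rule that explicit parameters override mode defaults can be pictured as a preset lookup followed by an overlay of whatever the caller set. The following is a minimal sketch of that pattern, not Kalign's actual code: `MODE_PRESETS`, `resolve_settings`, and the key names `consistency`, `ensemble`, and `realign` are hypothetical, and the preset values are assumptions read off the mode table above (the consistency setting for `precise` is not stated there, so it is left out).

```python
# Hypothetical preset table. Key names are illustrative, not Kalign's
# real parameter names; values are assumptions based on the mode table.
MODE_PRESETS = {
    "fast": {"consistency": 0, "ensemble": 0, "realign": False},
    "default": {"consistency": 5, "ensemble": 0, "realign": False},
    # The changelog does not state a consistency value for "precise",
    # so this sketch leaves that key out rather than guess.
    "precise": {"ensemble": 3, "realign": True},
}


def resolve_settings(mode="default", **explicit):
    """Start from the mode's preset, then overlay explicitly given values."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    settings = dict(MODE_PRESETS[mode])  # copy so the preset stays untouched
    # None means "not given by the caller", so the preset value is kept.
    settings.update({k: v for k, v in explicit.items() if v is not None})
    return settings


fast = resolve_settings("fast")
tuned = resolve_settings("precise", ensemble=8)
```

Here `tuned["ensemble"]` is 8 rather than the preset 3, while `tuned["realign"]` keeps the preset value. Treating `None` as "not given" is one common way to tell an omitted argument from an explicit one.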
20+
21+
### New features
22+
23+
- **Three named modes** (`--fast`, `--precise`): Simple entry points inspired
24+
by minimap2's `-x preset` pattern. Most users only need to pick a mode.
25+
Expert parameters remain available and override mode defaults.
26+
- **Ensemble alignment** (`--ensemble N`, or `--precise`): Runs N alignments
27+
with varied gap penalties and tree noise, combines results via POAR
28+
(Pairs of Aligned Residues) consensus. Improves F1 by ~5 points on
29+
BAliBASE at ~10x time cost.
30+
- **POAR consensus save/load** (`--save-poar`, `--load-poar`): Save the
31+
consensus table after an ensemble run, then instantly re-threshold
32+
with different `--min-support` values without re-alignment.
33+
- **Per-column and per-residue confidence scores** from ensemble mode.
34+
Accessible in Python via `result.column_confidence` and
35+
`result.residue_confidence`. Writable as Stockholm PP annotations.
36+
- **Anchor consistency**: Computes K fast BPM anchor alignments and feeds
37+
consistency signals into the substitution matrix. Enabled by default
38+
(K=5) in the default mode.
39+
- **Variable Scoring Matrix (VSM)**: Adapts substitution scores based on
40+
estimated pairwise distance. Enabled by default for protein; disabled
41+
for DNA/RNA.
42+
- **Alignment-guided realignment**: Rebuilds the guide tree from pairwise
43+
identities in the aligned MSA, then re-aligns. Used by `--precise` mode.
44+
- **Sequence weight rebalancing**: Pseudocount-based profile rebalancing
45+
to reduce bias from redundant sequences. Auto-disabled in ensemble
46+
mode where POAR consensus already handles imbalance.
47+
- **Alignment refinement** (`--refine`): Post-alignment column-level
48+
refinement with modes "all", "confident", and "inline".
49+
- **PFASUM substitution matrices**: New `--type pfasum`, `pfasum43`,
50+
`pfasum60` options for PFASUM-family scoring.
51+
- **Python `align()` parity**: The in-memory `align()` function now
52+
supports `mode`, `vsm_amax`, `realign`, and `ensemble_seed` parameters
53+
(previously only available on file-based APIs).
54+
- **New Python constants**: `MODE_DEFAULT`, `MODE_FAST`, `MODE_PRECISE`,
55+
`REFINE_INLINE`, `PROTEIN_PFASUM43`, `PROTEIN_PFASUM60`,
56+
`PROTEIN_PFASUM_AUTO`.
57+
- **Container support**: Podman/Docker files for reproducible benchmarks.
58+
- **Downstream benchmark suite** (`benchmarks/downstream/`): Measures
59+
alignment quality on downstream tasks: HMMER profile search,
60+
phylogenetic tree accuracy (RF distance), and positive selection
61+
detection (HyPhy BUSTED).
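
To make the POAR idea in the ensemble bullets more concrete, the sketch below counts how many ensemble runs place each pair of residues in the same column, then filters by a support threshold. It illustrates the general concept only and is not Kalign's implementation: the function names, the representation of an alignment as a list of gapped strings, and the reading of the threshold as "number of runs that agree" are all assumptions.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    A residue is identified as (sequence index, ungapped position).
    """
    positions = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_idx, char in enumerate(column):
            if char != "-":
                residues.append((seq_idx, positions[seq_idx]))
                positions[seq_idx] += 1
        # residues is ordered by sequence index, so pairs are canonical
        pairs.update(combinations(residues, 2))
    return pairs


def poar_support(alignments):
    """Count, for every residue pair, how many runs aligned it."""
    support = Counter()
    for alignment in alignments:
        support.update(aligned_pairs(alignment))
    return support


def supported_pairs(support, min_support):
    """Keep the pairs aligned in at least min_support runs."""
    return {pair for pair, count in support.items() if count >= min_support}


# Two toy runs over the same sequences ("ACG" and "ACTG").
run_a = ["AC-G", "ACTG"]
run_b = ["A-CG", "ACTG"]
support = poar_support([run_a, run_b])
strict = supported_pairs(support, 2)  # pairs both runs agree on
loose = supported_pairs(support, 1)   # pairs seen in any run
```

Because `support` is computed once, changing the threshold only re-runs the cheap filtering step. That is the property that makes saving the table and re-thresholding later attractive.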

### Changed

- **License changed from GPL-3.0-or-later to Apache-2.0.** All
  dependencies are permissive-licensed, so no compatibility issues.
- **Default mode now uses consistency anchors (K=5) and VSM.** The
  previous default (no consistency, no VSM) is now `--fast`.
- Refine default is `none` (Python docstring previously said
  `confident` incorrectly; now fixed).
- `seq_weights` auto-disabled in ensemble mode.
- CLI `--ensemble` without a value defaults to 5 runs.
- C library API expanded with `kalign_run_seeded()`,
  `kalign_run_realign()`, `kalign_ensemble()`,
  `kalign_consensus_from_poar()`, `kalign_msa_compare_detailed()`,
  `kalign_msa_compare_with_mask()`.

### Fixed

- `kalign_arr_to_msa()` now initializes all MSA struct fields (biotype,
  alnlen, seq_distances, col_confidence, run_parallel, seq->confidence,
  seq->rank). Previously caused `free(): invalid pointer` on Linux
  where glibc does not zero malloc'd memory.
- Python `align()` (in-memory path) no longer crashes on Linux due to
  the uninitialized fields above.

### BAliBASE results (218 cases, protein)

| Mode | Recall | Precision | F1 | TC | Time |
|------|--------|-----------|-----|-----|------|
| fast (v3.4 equiv) | 0.804 | 0.656 | 0.716 | 0.466 | 10s |
| **default** | 0.810 | 0.665 | 0.724 | 0.472 | 29s |
| **precise** | 0.796 | 0.752 | 0.768 | 0.467 | 193s |
| ClustalO | 0.840 | 0.710 | 0.764 | 0.559 | -- |
| MAFFT | 0.867 | 0.715 | 0.778 | 0.590 | -- |
| MUSCLE5 | 0.870 | 0.721 | 0.783 | 0.581 | -- |

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ version: 3.4.9
 date-released: 2020-03-20
 url: "https://github.com/TimoLassmann/kalign"
 repository-code: "https://github.com/TimoLassmann/kalign"
-license: GPL-3.0-or-later
+license: Apache-2.0
 preferred-citation:
   type: article
   authors:

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -227,7 +227,7 @@ Contributors are recognized in:

 ## License

-By contributing to Kalign, you agree that your contributions will be licensed under the GNU General Public License v3.0.
+By contributing to Kalign, you agree that your contributions will be licensed under the Apache License 2.0.

 ## Questions?
