
Wrong output on assemblies with long scaffolds #18

@KirillKryukov


assembly-stats 1.0.1 produces the following wrong output on one of our assemblies:

sum = 3595598458, n = 896, ave = 4012944.71, largest = 1193142484
N50 = 844155996, n = 2
N60 = 836758171, n = 3
N70 = 836758171, n = 3
N80 = 762492267, n = 4
N90 = 762492267, n = 4
N100 = 18446744071783539901, n = 896
N_count = 178800
Gaps = 1788

Command: assembly-stats scaffolds.fa >scaffolds.stats. The machine is a rather ordinary Linux server.

Correct output from another tool:

Filepath	TotSeqs	TotLen	N50	N75	N90	I50	GC	Avg	Min	Max	AuN
scaffolds.fa	896	7890565754	1193142484	836758171	729054925	891	44.80	8806434.99	17617	2368955581	1224440811

In particular, please see that assembly-stats shows the total assembly length as 3,595,598,458, while it should be 7,890,565,754. Also note the N100 of 18446744071783539901.

Also, assembly-stats works fine on our other assemblies of similar total size, but consisting of smaller scaffolds.

Here is the test input: https://biokirr.com/Supporting-Data/assembly-stats-bug-report/scaffolds-N.fa.zstd - It's the same assembly filled with N, so it's only 682 kB compressed. (Even more compact in NAF format: https://biokirr.com/Supporting-Data/assembly-stats-bug-report/scaffolds-N.fa.naf - 125 kB). Decompressed size is 8 GB.

I guess there is some kind of integer overflow, so I hope it will be easy to fix. Please let me know if you need any other information, or the full repro script.
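A quick check of the numbers above seems consistent with that guess. The sketch below only does arithmetic on the figures quoted in this report; it does not inspect the assembly-stats source, so the 32-bit widths are an assumption, and the helper names (`wrap_u32`, `as_i32`, `as_u64`) are made up for illustration.

```python
# Figures copied from the two outputs quoted in this report.
reported_sum = 3_595_598_458                 # "sum" printed by assembly-stats
expected_sum = 7_890_565_754                 # TotLen from the other tool
reported_n100 = 18_446_744_071_783_539_901   # "N100" printed by assembly-stats
expected_max = 2_368_955_581                 # Max from the other tool


def wrap_u32(x: int) -> int:
    """Value left after storing x in an unsigned 32-bit integer."""
    return x % 2**32


def as_i32(x: int) -> int:
    """Value obtained by reinterpreting the low 32 bits of x as signed."""
    x %= 2**32
    return x - 2**32 if x >= 2**31 else x


def as_u64(x: int) -> int:
    """Value obtained by reinterpreting x as an unsigned 64-bit integer."""
    return x % 2**64


# The true total length, wrapped to 32 bits, equals the reported sum.
print(wrap_u32(expected_sum) == reported_sum)            # True

# The longest scaffold exceeds 2**31 - 1, so as a signed 32-bit value it is negative.
print(as_i32(expected_max))                              # -1926011715

# That negative value, viewed as unsigned 64-bit, equals the reported N100.
print(as_u64(as_i32(expected_max)) == reported_n100)     # True
```

If the assumption holds, the longest scaffold (2,368,955,581 bp) ends up negative and sorts as the shortest, which would also explain why "largest" is reported as 1193142484, the second-longest value.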

EDIT: Added test data.
