UUID4 name tokeniser plus improvements to mixed data sets.#134
Merged
Conversation
A candidate for replacement of samtools#132. We still need to do more work here on optimising name compression in a general manner, especially on mixed data sets, but this is resolves a simple case while having no impact on any other name formats.
We already had code for spotting this in search_trie, so improved that a little and use it in encode_name instead of having a second scan. Also improve the compression of mixed data sets. This still isn't optimal as we'd need to start separating the name classes and adding NOP tokens, but it's often a 10-20% compression improvement.
991198f to
40b8c82
Compare
| ((d[f+0] >= '0' && d[f+0] <='9') || (d[f+0] >= 'a' && d[f+0] <= 'f')) && | ||
| ((d[f+35] >= '0' && d[f+35] <='9') || (d[f+35] >= 'a' && d[f+35] <= 'f'))) { | ||
| } else if (l >= 36 | ||
| && d[f+8]=='-' && d[f+13]=='-' && d[f+18]=='-' && d[f+23]=='-' |
There was a problem hiding this comment.
d[f+14] is always 4 in uuid4. See wikipedia. So this can also be added as a check.
Member
There was a problem hiding this comment.
We tried some alternate UUID versions. This change seems to be beneficial even with UUIDs with a common prefix (like version 7), so it seems best to leave it accepting all UUID versions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
With this change on my local corpus:
Master:
For uuid4 see #132. The difference between uuid4 and uuid4+ is the latter has "-flowcell-10" appended to every line. The old code had explicit check for ONT UUIDs, but not bare ones without an extension, so it needed 37 chars and not 36. Hence master compresses the data with "-flowcell-10" better than without. Regardless of that however, both compress better still by using an array of CHAR tokens instead of separate string-char-string-char-... combinations (due to nuls).
We can demonstrate this easily with two names of foo: and bar: where is +1 in foo and +1-16 in bar.
Theoretically foo should have no information per record other than the initial meta-data describing the format.
Bar however has 4 bits of ddelta offset per record, so has 2.5Kb of information to encode. Mixing them should not increase the size as we are strictly alternating and the diff distance is a fixed -2.
We can do
cat _ | tokenise_namevsegrep foo _ | ...andegrep bar _ | ...to validate the mixed vs separate names.Master gives 13300, 86 and 2603 respectively.
This PR gives 2684, 86 and 2603 repsectively.
The change achieves this by improving the returned node rather than -1, which defaults to the previous record. It was previously always delta bar against foo and foo against bar.