Add support for SQ-AN in SAMSequenceDictionary#956
Conversation
Codecov Report
@@ Coverage Diff @@
## master #956 +/- ##
===============================================
+ Coverage 65.963% 65.976% +0.013%
- Complexity 7663 7675 +12
===============================================
Files 537 537
Lines 32553 32583 +30
Branches 5528 5533 +5
===============================================
+ Hits 21473 21497 +24
- Misses 8930 8932 +2
- Partials 2150 2154 +4
|
| * @return the contig associated to the 'originalName/altName', with the AN tag including the altName | ||
| */ | ||
| public SAMSequenceRecord addAlternativeSequenceName(final String originalName, | ||
| final String altName) { |
There was a problem hiding this comment.
According to codecov this new code isn't covered by a test.
lbergelson
left a comment
There was a problem hiding this comment.
@magicDGS Thanks for this work. I made a few comments.
I think it's better to store the AN values in the attributes map with the other attributes so it doesn't need special handling all over the place. It makes the add a single alias method gross because it has to do string manipulation now, but I think it's worth it to avoid adding additional complexity everywhere else.
|
|
||
| /** Returns unmodifiable set with alternative sequence names. */ | ||
| @XmlTransient //we use the field instead of getter/setter | ||
| public Set<String> getAlternativeSequeneNames() { |
There was a problem hiding this comment.
typo: getAlternativeSequeneNames()
| } | ||
| } | ||
| else { | ||
| // TODO: should this also check in the alternative sequences names? |
There was a problem hiding this comment.
It seems like this should match on alternative sequence names if present
There was a problem hiding this comment.
Done (although maybe not in the way you were expecting).
| if (mSequenceLength != that.mSequenceLength) return false; | ||
| if (!attributesEqual(that)) return false; | ||
| if (mSequenceName != that.mSequenceName) return false; // Compare using == since we intern() the name | ||
| // TODO: should this also compare the alternative names? |
There was a problem hiding this comment.
it seems strange to leave alternative names out of equals especially given the weaker isSameSequence() also exists
There was a problem hiding this comment.
Done by comparing the sets.
| * <code>1,chr1,chr01,01,CM000663,NC_000001.10</code> or | ||
| * <code>MT,chrM</code>. | ||
| * | ||
| * @param originalName |
There was a problem hiding this comment.
there are funny line breaks here for some reason
There was a problem hiding this comment.
oh, it matches the weird line breaks above, lets fix those too...
| { | ||
| public static final long serialVersionUID = 1L; // AbstractSAMHeaderRecord implements Serializable | ||
| private String mSequenceName = null; // Value must be interned() if it's ever set/modified | ||
| @XmlAttribute(name = "alternative_sequece_names") |
There was a problem hiding this comment.
typo: alternative_sequece_names
| fields[1] = SAMSequenceRecord.SEQUENCE_NAME_TAG + TAG_KEY_VALUE_SEPARATOR + sequenceRecord.getSequenceName(); | ||
| fields[2] = SAMSequenceRecord.SEQUENCE_LENGTH_TAG + TAG_KEY_VALUE_SEPARATOR + Integer.toString(sequenceRecord.getSequenceLength()); | ||
| encodeTags(sequenceRecord, fields, 3); | ||
| if (sequenceRecord.hasAlternativeSequenceNames()) fields[fields.length - 1] = SAMSequenceRecord.ALTERNATIVE_SEQUENCE_NAME_TAG + TAG_KEY_VALUE_SEPARATOR + String.join(",", sequenceRecord.getAlternativeSequeneNames()); |
There was a problem hiding this comment.
Why does this need special handling? it seems like the alternate contig names should be added to the attributes map and handled the same way as the other attributes.
| @@ -220,6 +223,9 @@ public boolean equals(Object o) { | |||
| * <code>1,chr1,chr01,01,CM000663,NC_000001.10</code> e.g: | |||
There was a problem hiding this comment.
I think the comment on the equals method above needs updating to say that alias's are not considered but Alternative Sequences are.
| public static final long serialVersionUID = 1L; // AbstractSAMHeaderRecord implements Serializable | ||
| private String mSequenceName = null; // Value must be interned() if it's ever set/modified | ||
| @XmlAttribute(name = "alternative_sequece_names") | ||
| private Set<String> mAlternativeSequenceName = new LinkedHashSet<>(); |
There was a problem hiding this comment.
This should be handled the same as the other attributes, stored in the attributes map. It means doing annoying string processing when you add a single alias to an existing set of aliases, but I think making it consistent is worthwhile
|
|
||
| /** Adds an alternative sequence name if it is not the same as the sequence name or it is not present already. */ | ||
| public void addAlternativeSequenceName(final String name) { | ||
| if (!mSequenceName.equals(name) && ! mAlternativeSequenceName.contains(name) ) { |
There was a problem hiding this comment.
These should change to store the names in the attributes map so that you don't need special handling for all the places that handle the tags.
|
|
||
| /** Adds an alternative sequence name if it is not the same as the sequence name or it is not present already. */ | ||
| public void addAlternativeSequenceName(final String name) { | ||
| if (!mSequenceName.equals(name) && ! mAlternativeSequenceName.contains(name) ) { |
There was a problem hiding this comment.
this and the set method should validate that the alternative names are legal according to the regex provided in the sam spec.
|
I implement your comments in two different PRs, @lbergelson. I am not sure if it is a good idea for efficiency to store the AN tag in the attributes map, but I did in the second commit as you requested, just in case we should revert that changes because you didn't mean that. Back to you @lbergelson! |
|
Friendly reminder here, @lbergelson! |
|
@magicDGS Hey, the changes you made here look good. I've been debating with myself what the right solution is though. I hadn't thought through what a large potential performance impact the changes I asked for would be. I'm torn because it's much cleaner than the alternative, but introducing a gross performance regression for minor code cleanlyness seems wrong. |
|
Thanks for the reply @lbergelson - should I revert the last commit? |
9e34333 to
ccad70a
Compare
|
Friendly ping here, @lbergelson! |
|
|
|
I can't remember - maybe it was still under discussion if they should be allowed or I just forgot to add them. Changed in next commit matching the current definition. |
|
Apologies and FYI: After further contemplation of the issue (see samtools/hts-specs#124 (comment)), I suspect we may end up allowing colons in AN after all. |
|
@jmarshall - also in the |
|
Yes, also in AN. See the draft PR samtools/hts-specs#333. I used to think as you do, but… For better or worse these things exist in the wild, so IMHO we have to give up on thinking AN might be a transition mechanism away from allowing them. Reread that comment, which contains pseudocode showing how tools should parse their interval notation even in the presence of stupid colons in names (and I plan to turn that into an appendix for the spec). |
|
FYI RNAME changes have now landed in the SAM spec: in particular the allowed characters (which include colons!) for SN and AN are now the same, so a separate |
| private static char[] WHITESPACE_CHARS = {' ', '\t', '\n', '\013', '\f', '\r'}; // \013 is vertical tab | ||
|
|
||
| // alternative sequence name regexp | ||
| private static final Pattern ALTERNATIVE_SEQUENCE_NAME_REGEXP = Pattern.compile("[0-9A-Za-z][0-9A-Za-z*+.@_|-]*"); |
There was a problem hiding this comment.
no longer needed, use SAMSequenceRecord.LEGAL_RNAME_PATTERN
|
let's resolve this and move it past the finish line.... |
|
@yfarjoun agreed. I am already using Picard and this PR to store alternative coffee tog names, and I actually use them! fulcrumgenomics/fgbio#467 |
| public void addAlternativeSequenceName(final String name) { | ||
| validateAltRegExp(name); | ||
| final Set<String> altSequences = getAlternativeSequenceNames(); | ||
| if (!mSequenceName.equals(name) && ! altSequences.contains(name) ) { |
There was a problem hiding this comment.
inconsistent whitespace
| if (!mSequenceName.equals(name) && ! altSequences.contains(name) ) { | |
| if (!mSequenceName.equals(name) && !altSequences.contains(name)) { |
| // comma-separated | ||
| {"chr1,alt"}, | ||
| // coordinate-like | ||
| {"chr1:1000"}, {"chr1:100-200"} |
There was a problem hiding this comment.
add a testcase with "{" and "}"
| @XmlAttribute(name = "alternative_sequence_names") | ||
| public Set<String> getAlternativeSequenceNames() { | ||
| final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG); | ||
| return (anTag == null) ? new LinkedHashSet<>() : new LinkedHashSet<>(Arrays.asList(anTag.split(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR))); |
There was a problem hiding this comment.
This shouldn't return a modifiable set, should it? E.g. I would use Collections.emptySet() for the null case.
|
I'm closing this PR as I don't know how to push to daniel's fork....I'll open it with a new branch...sorry for the inconvenience. |
Description
Adding the support for the new
@SQAN tag for alternative sequences (samtools/hts-specs#220).Checklist