Skip to content

Add support for SQ-AN in SAMSequenceDictionary#956

Closed
magicDGS wants to merge 6 commits into
samtools:masterfrom
bioinformagik:dgs_sq_an_tag
Closed

Add support for SQ-AN in SAMSequenceDictionary#956
magicDGS wants to merge 6 commits into
samtools:masterfrom
bioinformagik:dgs_sq_an_tag

Conversation

@magicDGS
Copy link
Copy Markdown
Member

@magicDGS magicDGS commented Jul 28, 2017

Description

Adding the support for the new @SQ AN tag for alternative sequences (samtools/hts-specs#220).

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jul 28, 2017

Codecov Report

Merging #956 into master will increase coverage by 0.013%.
The diff coverage is 78.125%.

@@               Coverage Diff               @@
##              master      #956       +/-   ##
===============================================
+ Coverage     65.963%   65.976%   +0.013%     
- Complexity      7663      7675       +12     
===============================================
  Files            537       537               
  Lines          32553     32583       +30     
  Branches        5528      5533        +5     
===============================================
+ Hits           21473     21497       +24     
- Misses          8930      8932        +2     
- Partials        2150      2154        +4
Impacted Files Coverage Δ Complexity Δ
...in/java/htsjdk/samtools/SAMSequenceDictionary.java 73.649% <100%> (+0.921%) 46 <1> (+1) ⬆️
...c/main/java/htsjdk/samtools/SAMSequenceRecord.java 74.766% <74.074%> (+0.376%) 47 <10> (+11) ⬆️
...sjdk/samtools/util/Md5CalculatingOutputStream.java 69.231% <0%> (-7.692%) 8% <0%> (-1%)
src/main/java/htsjdk/samtools/BAMFileReader.java 64.286% <0%> (+0.857%) 39% <0%> (+1%) ⬆️

* @return the contig associated to the 'originalName/altName', with the AN tag including the altName
*/
public SAMSequenceRecord addAlternativeSequenceName(final String originalName,
final String altName) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to codecov this new code isn't covered by a test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS Thanks for this work. I made a few comments.

I think it's better to store the AN values in the attributes map with the other attributes so it doesn't need special handling all over the place. It makes the add a single alias method gross because it has to do string manipulation now, but I think it's worth it to avoid adding additional complexity everywhere else.


/** Returns unmodifiable set with alternative sequence names. */
@XmlTransient //we use the field instead of getter/setter
public Set<String> getAlternativeSequeneNames() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: getAlternativeSequeneNames()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}
else {
// TODO: should this also check in the alternative sequences names?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this should match on alternative sequence names if present

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (although maybe not in the way you were expecting).

if (mSequenceLength != that.mSequenceLength) return false;
if (!attributesEqual(that)) return false;
if (mSequenceName != that.mSequenceName) return false; // Compare using == since we intern() the name
// TODO: should this also compare the alternative names?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems strange to leave alternative names out of equals especially given the weaker isSameSequence() also exists

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by comparing the sets.

* <code>1,chr1,chr01,01,CM000663,NC_000001.10</code> or
* <code>MT,chrM</code>.
*
* @param originalName
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are funny line breaks here for some reason

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it matches the weird line breaks above, lets fix those too...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
public static final long serialVersionUID = 1L; // AbstractSAMHeaderRecord implements Serializable
private String mSequenceName = null; // Value must be interned() if it's ever set/modified
@XmlAttribute(name = "alternative_sequece_names")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: alternative_sequece_names

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fields[1] = SAMSequenceRecord.SEQUENCE_NAME_TAG + TAG_KEY_VALUE_SEPARATOR + sequenceRecord.getSequenceName();
fields[2] = SAMSequenceRecord.SEQUENCE_LENGTH_TAG + TAG_KEY_VALUE_SEPARATOR + Integer.toString(sequenceRecord.getSequenceLength());
encodeTags(sequenceRecord, fields, 3);
if (sequenceRecord.hasAlternativeSequenceNames()) fields[fields.length - 1] = SAMSequenceRecord.ALTERNATIVE_SEQUENCE_NAME_TAG + TAG_KEY_VALUE_SEPARATOR + String.join(",", sequenceRecord.getAlternativeSequeneNames());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need special handling? it seems like the alternate contig names should be added to the attributes map and handled the same way as the other attributes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -220,6 +223,9 @@ public boolean equals(Object o) {
* <code>1,chr1,chr01,01,CM000663,NC_000001.10</code> e.g:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment on the equals method above needs updating to say that alias's are not considered but Alternative Sequences are.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public static final long serialVersionUID = 1L; // AbstractSAMHeaderRecord implements Serializable
private String mSequenceName = null; // Value must be interned() if it's ever set/modified
@XmlAttribute(name = "alternative_sequece_names")
private Set<String> mAlternativeSequenceName = new LinkedHashSet<>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be handled the same as the other attributes, stored in the attributes map. It means doing annoying string processing when you add a single alias to an existing set of aliases, but I think making it consistent is worthwhile

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/** Adds an alternative sequence name if it is not the same as the sequence name or it is not present already. */
public void addAlternativeSequenceName(final String name) {
if (!mSequenceName.equals(name) && ! mAlternativeSequenceName.contains(name) ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should change to store the names in the attributes map so that you don't need special handling for all the places that handle the tags.


/** Adds an alternative sequence name if it is not the same as the sequence name or it is not present already. */
public void addAlternativeSequenceName(final String name) {
if (!mSequenceName.equals(name) && ! mAlternativeSequenceName.contains(name) ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the set method should validate that the alternative names are legal according to the regex provided in the sam spec.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@magicDGS
Copy link
Copy Markdown
Member Author

I implement your comments in two different PRs, @lbergelson. I am not sure if it is a good idea for efficiency to store the AN tag in the attributes map, but I did in the second commit as you requested, just in case we should revert that changes because you didn't mean that.

Back to you @lbergelson!

@magicDGS
Copy link
Copy Markdown
Member Author

Friendly reminder here, @lbergelson!

@lbergelson
Copy link
Copy Markdown
Member

@magicDGS Hey, the changes you made here look good. I've been debating with myself what the right solution is though. I hadn't thought through what a large potential performance impact the changes I asked for would be. I'm torn because it's much cleaner than the alternative, but introducing a gross performance regression for minor code cleanlyness seems wrong.

@magicDGS
Copy link
Copy Markdown
Member Author

Thanks for the reply @lbergelson - should I revert the last commit?

@magicDGS
Copy link
Copy Markdown
Member Author

magicDGS commented Oct 9, 2017

Friendly ping here, @lbergelson!

@jmarshall
Copy link
Copy Markdown
Member

ALTERNATIVE_SEQUENCE_NAME_REGEXP is missing _ and | compared to the definition of AN in the SAM spec. Is there a reason for this?

@magicDGS
Copy link
Copy Markdown
Member Author

I can't remember - maybe it was still under discussion if they should be allowed or I just forgot to add them. Changed in next commit matching the current definition.

@jmarshall
Copy link
Copy Markdown
Member

Apologies and FYI: After further contemplation of the issue (see samtools/hts-specs#124 (comment)), I suspect we may end up allowing colons in AN after all.

@magicDGS
Copy link
Copy Markdown
Member Author

magicDGS commented Jun 6, 2018

@jmarshall - also in the AN? I think that not allowing it in the alternative-name is a good solution for unambiguously define contigs with colons: although the main contig name has colons, if the AN does not then it is safe to use to define intervals...

@jmarshall
Copy link
Copy Markdown
Member

Yes, also in AN. See the draft PR samtools/hts-specs#333.

I used to think as you do, but… For better or worse these things exist in the wild, so IMHO we have to give up on thinking AN might be a transition mechanism away from allowing them. Reread that comment, which contains pseudocode showing how tools should parse their interval notation even in the presence of stupid colons in names (and I plan to turn that into an appendix for the spec).

@jmarshall
Copy link
Copy Markdown
Member

FYI RNAME changes have now landed in the SAM spec: in particular the allowed characters (which include colons!) for SN and AN are now the same, so a separate ALTERNATIVE_SEQUENCE_NAME_REGEXP is unnecessary.

private static char[] WHITESPACE_CHARS = {' ', '\t', '\n', '\013', '\f', '\r'}; // \013 is vertical tab

// alternative sequence name regexp
private static final Pattern ALTERNATIVE_SEQUENCE_NAME_REGEXP = Pattern.compile("[0-9A-Za-z][0-9A-Za-z*+.@_|-]*");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed, use SAMSequenceRecord.LEGAL_RNAME_PATTERN

@yfarjoun
Copy link
Copy Markdown
Contributor

let's resolve this and move it past the finish line....

@nh13
Copy link
Copy Markdown
Member

nh13 commented Jun 22, 2019

@yfarjoun agreed. I am already using Picard and this PR to store alternative coffee tog names, and I actually use them! fulcrumgenomics/fgbio#467

public void addAlternativeSequenceName(final String name) {
validateAltRegExp(name);
final Set<String> altSequences = getAlternativeSequenceNames();
if (!mSequenceName.equals(name) && ! altSequences.contains(name) ) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inconsistent whitespace

Suggested change
if (!mSequenceName.equals(name) && ! altSequences.contains(name) ) {
if (!mSequenceName.equals(name) && !altSequences.contains(name)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread src/main/java/htsjdk/samtools/SAMSequenceRecord.java
// comma-separated
{"chr1,alt"},
// coordinate-like
{"chr1:1000"}, {"chr1:100-200"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a testcase with "{" and "}"

@yfarjoun yfarjoun self-assigned this Jul 8, 2019
@XmlAttribute(name = "alternative_sequence_names")
public Set<String> getAlternativeSequenceNames() {
final String anTag = getAttribute(ALTERNATIVE_SEQUENCE_NAME_TAG);
return (anTag == null) ? new LinkedHashSet<>() : new LinkedHashSet<>(Arrays.asList(anTag.split(ALTERNATIVE_SEQUENCE_NAME_SEPARATOR)));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't return a modifiable set, should it? E.g. I would use Collections.emptySet() for the null case.

@yfarjoun yfarjoun added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Sep 30, 2019
@yfarjoun
Copy link
Copy Markdown
Contributor

I'm closing this PR as I don't know how to push to daniel's fork....I'll open it with a new branch...sorry for the inconvenience.

@yfarjoun yfarjoun closed this Apr 23, 2020
yfarjoun pushed a commit that referenced this pull request Sep 2, 2020
* Add support for SQ-AN in SAMSequenceDictionary

Co-authored-by: Daniel Gómez-Sánchez <daniel.gomez.sanchez@hotmail.es>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants