We recently came across an hg38 contig name of the form 'HLA-A*01:01:01:01', which caused problems when used on the command line in the GATK: the parser understood the suffix :01 to mean "at position 1" and then complained that it could not find HLA-A*01:01:01 (which thankfully isn't in hg38; otherwise we probably would not have noticed this for a while). Thus the argument -L HLA-A*01:01:01:01 is not one that the GATK can handle.
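To make the ambiguity concrete, here is a minimal sketch of how a parser could end up misreading this name. This is not the GATK's actual code; `naive_parse_interval` and `INTERVAL_RE` are hypothetical names, and the only assumption is that a trailing ':digits' or ':digits-digits' is treated as a position.

```python
import re

# Hypothetical pattern for "contig:start" or "contig:start-end".
# This is an illustration only, not the GATK's real interval parser.
INTERVAL_RE = re.compile(r"^(?P<contig>.+):(?P<start>[0-9]+)(?:-(?P<end>[0-9]+))?$")


def naive_parse_interval(arg):
    """Return (contig, start, end), reading a trailing ':digits' as a position.

    If no position-like suffix is present, the whole string is taken as the
    contig name and start/end are None.
    """
    m = INTERVAL_RE.match(arg)
    if m is None:
        return (arg, None, None)
    start = int(m.group("start"))
    # Assumption: a single position means start == end.
    end = int(m.group("end")) if m.group("end") else start
    return (m.group("contig"), start, end)


print(naive_parse_interval("chr1:100-200"))
print(naive_parse_interval("HLA-A*01:01:01:01"))
```

With this kind of parsing, the final ':01' of the HLA name is split off as a position and the lookup is done for the contig 'HLA-A*01:01:01', which matches the behaviour described above.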
It's too late to ask for any name changes in hg38, but I think it might make sense to revisit the allowed characters and remove a few. Currently there are two characters that are not allowed as the first character of the @SN (* and =); the remaining printable characters (! through ~) are allowed there, and all printable characters are allowed in the subsequent positions. I would propose that we add a few more characters to the "not allowed" list. In particular, it would be good to avoid the various quotes (', `, ") anywhere in the @SN, and I would like to suggest that the ending should not look like a genomic position, i.e. it should **not** match the regex ':[0-9]+(-[0-9]+)?$'. The reason I say "suggest" is that, given that hg38 already includes this ending, it would be quite difficult to reverse that, and even if we only put it in a future "version" of the SAM spec it isn't clear what that would mean, since the old references will still be around.
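As a rough illustration of what such a rule could look like, here is a sketch of a name check. The function name, the messages, and the exact rule set are my assumptions, not anything in the SAM spec: I am assuming the quote characters to exclude are ', " and the backtick, and that "looks like a genomic position" means a trailing ':digits' optionally followed by '-digits'.

```python
import re

# Assumed pattern for an ending that looks like ":pos" or ":start-end".
POSITION_LIKE_SUFFIX = re.compile(r":[0-9]+(-[0-9]+)?$")

# Assumed set of quote characters to exclude anywhere in the name.
QUOTE_CHARS = set("'\"`")

BAD_FIRST_CHAR = "starts with '*' or '='"
NOT_PRINTABLE = "contains a character outside '!' through '~'"
HAS_QUOTE = "contains a quote character"
POSITION_SUFFIX = "ends like a genomic position"


def contig_name_problems(name):
    """Return a list of reasons why `name` would be rejected (empty if none)."""
    problems = []
    if name[:1] in ("*", "="):
        # Existing restriction on the first character.
        problems.append(BAD_FIRST_CHAR)
    if any(not ("!" <= c <= "~") for c in name):
        # Existing restriction to printable, non-space ASCII.
        problems.append(NOT_PRINTABLE)
    if any(c in QUOTE_CHARS for c in name):
        # Suggested new restriction on quotes.
        problems.append(HAS_QUOTE)
    if POSITION_LIKE_SUFFIX.search(name):
        # Suggested new restriction on position-like endings.
        problems.append(POSITION_SUFFIX)
    return problems


for name in ["chr1", "HLA-A*01:01:01:01", "*chr1", "chr'1"]:
    print(name, contig_name_problems(name))
```

Under these assumed rules 'chr1' passes, while 'HLA-A*01:01:01:01' is flagged only for its ending, which is exactly the case that existing hg38 references would trip over.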
This isn't a formal proposal. I mostly want to know what people think about this issue and see if we can come up with good solutions.