Modeling non-gene features as top-level type fields #8

@ifiddes

Under the recommendations for the type field, you say:

Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).

I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.

Right now, non-transcribed features can be all over the map, and as a result they become hard to parse. In the NCBI annotation of GRCh38, a wide array of top-level non-gene features are used. Additionally, I have not seen any spec define a collection of non-transcribed features (analogous to the isoforms of a gene).

In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological_region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I then also chose to define each "joined" interval of a non-transcribed feature (analogous to an exon) as a subregion.
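To make the three-level hierarchy concrete, here is a minimal Python sketch that writes GFF3 rows for one grouping of non-transcribed features. It is not BioCantor's API or its exact output: the `FeatureInterval` class, the `to_gff3_rows` function, the `example` source value, and the ID scheme are all hypothetical names chosen for illustration. It also assumes that each contiguous block of a feature_interval is emitted as a subregion, the same way exons sit under a transcript.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeatureInterval:
    """Hypothetical container: one non-transcribed feature made of
    one or more (start, end) blocks, 1-based and inclusive as in GFF3."""
    feature_id: str
    blocks: List[Tuple[int, int]]


def to_gff3_rows(seqid: str, region_id: str, strand: str,
                 intervals: List[FeatureInterval],
                 source: str = "example") -> List[str]:
    """Emit tab-separated GFF3 rows for the hierarchy
    biological_region -> feature_interval -> subregion."""
    def row(ftype: str, start: int, end: int, attrs: str) -> str:
        cols = (seqid, source, ftype, start, end, ".", strand, ".", attrs)
        return "\t".join(str(c) for c in cols)

    # The top-level grouping spans every block of every child interval.
    all_blocks = [b for iv in intervals for b in iv.blocks]
    region_start = min(s for s, _ in all_blocks)
    region_end = max(e for _, e in all_blocks)
    rows = [row("biological_region", region_start, region_end,
                f"ID={region_id}")]

    for iv in intervals:
        # Each interval is a child of the grouping (like a transcript of a gene).
        iv_start = min(s for s, _ in iv.blocks)
        iv_end = max(e for _, e in iv.blocks)
        rows.append(row("feature_interval", iv_start, iv_end,
                        f"ID={iv.feature_id};Parent={region_id}"))
        # Each block is a child of its interval (like an exon of a transcript).
        for n, (s, e) in enumerate(iv.blocks, start=1):
            rows.append(row("subregion", s, e,
                            f"ID={iv.feature_id}-{n};Parent={iv.feature_id}"))
    return rows


if __name__ == "__main__":
    demo = [
        FeatureInterval("fi1", [(100, 200), (300, 400)]),
        FeatureInterval("fi2", [(150, 250)]),
    ]
    print("\n".join(to_gff3_rows("chr1", "region1", "+", demo)))
```

With this layout a parser only has to recognize three type values in column 3 for any non-transcribed annotation, and the more specific feature class could be carried in a column 9 attribute, much as so_term_name does for genes.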
