Custom records

The GFA2 specification treats every line that starts with a non-standard record type as a custom (i.e. user- or program-specific) record. Gfapy allows these records to be retrieved and their data to be accessed through an interface similar to that of the predefined record types.

Retrieving, adding and deleting custom records

Gfa instances have the property custom_records, a list of all line instances with a non-standard record type. The property custom_record_keys lists the record types found among them. Records of a specific record type are retrieved using the method Gfa.custom_records_of_type(record_type). Lines are added and deleted using the same methods (add_line() and disconnect()) as for other line types.

>>> g.add_line("X\tcustom line") 
>>> g.add_line("Y\tcustom line") 
>>> [str(line) for line in g.custom_records] 
['X\tcustom line', 'Y\tcustom line']
>>> g.custom_record_keys
['X', 'Y']
>>> [str(line) for line in g.custom_records_of_type('X')]
['X\tcustom line']
>>> g.custom_records_of_type("X")[-1].disconnect()
>>> g.custom_records_of_type('X')
[]

Interface without extensions

If no extension (see Extensions section) has been defined to handle a custom record type, the interface has some limitations: the field content is not validated, and the field names are unknown. The generic custom record class is employed (CustomRecord).

As the names of the positional fields in a custom record are not known, the generic names field1, field2, … are used. The number of positional fields is the length of the positional_fieldnames list.

>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") 
>>> x = g.custom_records_of_type('X')[-1]
>>> len(x.positional_fieldnames)
2
>>> x.field1
'a'
>>> x.field2
'b'

Positional fields are allowed to contain any character (including non-printable characters and spacing characters), except tabs and newlines (as they are structural elements of the line). No further validation is performed.

As Gfapy cannot know how many positional fields are present when parsing custom records, it uses a heuristic to identify tags. A field resembles a tag if it starts with tn:d:, where tn is a valid tag name and d is a valid tag datatype (see the Tags chapter). The fields are parsed from the last to the first.

As soon as a field is found which does not resemble a tag, all remaining fields are considered positionals (even if another field parsed later resembles a tag). Due to this, invalid tags are sometimes wrongly taken as positional fields (this can be avoided by writing an extension).

>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") 
>>> x1 = g.custom_records_of_type("X")[-1]
>>> x1.cc
10
>>> x1.dd
100
>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100\te") 
>>> x2 = g.custom_records_of_type("X")[-1]
>>> x2.cc
>>> x2.field3
'cc:i:10'
>>> g.add_line("Z\ta\tb\tcc:i:10\tddd:i:100") 
>>> x3 = g.custom_records_of_type("Z")[-1]
>>> x3.cc
>>> x3.field3
'cc:i:10'
>>> x3.field4
'ddd:i:100'
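
The right-to-left heuristic can be pictured with a small standalone sketch. It does not use Gfapy and is not Gfapy's actual implementation: the function name split_custom_fields and the regular expression (a two-character tag name, one of the datatype letters A, i, f, Z, J, H, B, then a colon) are assumptions made for illustration only.

```python
import re

# Assumed approximation of "resembles a tag": a two-character tag name,
# a one-letter datatype, and a colon before the value.
TAG_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:")

def split_custom_fields(line):
    """Split a custom record line into (record_type, positionals, tags)."""
    record_type, *fields = line.split("\t")
    tags = []
    # Scan from the last field towards the first one.
    while fields and TAG_LIKE.match(fields[-1]):
        tags.insert(0, fields.pop())
    # The first field that is not tag-like ends the scan: everything
    # left over is positional, even if it looks like a tag.
    return record_type, fields, tags

print(split_custom_fields("X\ta\tb\tcc:i:10\tdd:i:100"))
print(split_custom_fields("X\ta\tb\tcc:i:10\tdd:i:100\te"))
print(split_custom_fields("Z\ta\tb\tcc:i:10\tddd:i:100"))
```

With the three lines from the example above, the sketch reports two tags for the first line and none for the other two. In the second line the trailing field e stops the scan immediately. In the third line ddd:i:100 is not tag-like, so cc:i:10 is never examined.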

Extensions

The support for custom records is limited, as Gfapy does not know which fields are present, how many there are, or how they should be validated. It is possible to create an extension of Gfapy which defines new record types: these record types can then be used in a similar way to the built-in types.

As an example, this section describes an extension which defines two record types: T for taxa and M for assignments of segments to taxa. For further information about a possible use case for this extension, see the Supplemental Information to the manuscript describing Gfapy.

The T records contain a single positional field, tid, a GFA2 identifier, and an optional UL string tag. The M records contain three positional fields (all three GFA2 identifiers): an optional name field mid, and two references, tid to a T line and sid to an S line. The SC integer tag is also defined. Here is an example of a GFA containing M and T lines:

S sA 1000 *
S sB 1000 *
M assignment1 t123 sA SC:i:40
M assignment2 t123 sB
M * B12c sB SC:i:20
T B12c
T t123 UL:Z:http://www.taxon123.com

By writing subclasses of the Line class, it is possible to tell Gfapy how records of the M and T types shall be handled. This only requires defining some constants and calling the class method register_extension().

The constants to define are: RECORD_TYPE, the content of the record type field (e.g. M); POSFIELDS, an ordered dict specifying the datatype of each positional field, in the order in which the fields appear in the line; TAGS_DATATYPE, a dict specifying the datatype of the predefined optional tags; and NAME_FIELD, the name of the field that contains the identifier of the line. For details on predefined and custom datatypes, see the next sections (Predefined datatypes for extensions and Custom datatypes for extensions).

To handle references, register_extension() can be supplied with a references parameter, a list of triples (fieldname, classname, backreferences). Here fieldname is the name of the field in the corresponding record that contains the reference (e.g. sid), classname is the class to which the reference points (e.g. gfapy.line.segment.GFA2), and backreferences is the name given to the collection of backreferences in the records to which the reference points (e.g. metagenomic_assignments).

import gfapy
from collections import OrderedDict

class Taxon(gfapy.Line):
  RECORD_TYPE = "T"
  POSFIELDS = OrderedDict([("tid","identifier_gfa2")])
  TAGS_DATATYPE = {"UL":"Z"}
  NAME_FIELD = "tid"

Taxon.register_extension()

class MetagenomicAssignment(gfapy.Line):
  RECORD_TYPE = "M"
  POSFIELDS = OrderedDict([("mid","optional_identifier_gfa2"),
                           ("tid","identifier_gfa2"),
                           ("sid","identifier_gfa2")])
  TAGS_DATATYPE = {"SC":"i"}
  NAME_FIELD = "mid"

MetagenomicAssignment.register_extension(references=
    [("sid", gfapy.line.segment.GFA2, "metagenomic_assignments"),
     ("tid", Taxon, "metagenomic_assignments")])
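
The effect of the references parameter can be pictured with a plain-Python sketch that does not use Gfapy. It reads the example lines shown earlier, assumes tab-separated fields, and collects for each referenced T or S line the names of the M lines pointing to it, which is roughly the role of the metagenomic_assignments collection. The function collect_backreferences and its return structure are hypothetical; they only illustrate the idea and are not how Gfapy represents references internally.

```python
GFA_TEXT = "\n".join([
    "S\tsA\t1000\t*",
    "S\tsB\t1000\t*",
    "M\tassignment1\tt123\tsA\tSC:i:40",
    "M\tassignment2\tt123\tsB",
    "M\t*\tB12c\tsB\tSC:i:20",
    "T\tB12c",
    "T\tt123\tUL:Z:http://www.taxon123.com",
])

def collect_backreferences(gfa_text):
    """Map each referenced (record type, name) to the M lines pointing to it."""
    backrefs = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "M":
            continue
        mid, tid, sid = fields[1:4]
        # The tid field refers to a T line, the sid field to an S line.
        for target in (("T", tid), ("S", sid)):
            backrefs.setdefault(target, []).append(mid)
    return backrefs

backrefs = collect_backreferences(GFA_TEXT)
print(backrefs[("T", "t123")])
print(backrefs[("S", "sB")])
```

In this sketch the taxon t123 collects assignment1 and assignment2, while the segment sB collects assignment2 and the unnamed assignment (whose mid is the placeholder *).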

Predefined datatypes for extensions

The datatype of fields is specified in Gfapy using classes, which provide functions for decoding, encoding and validating the corresponding data. Gfapy contains a number of datatypes which correspond to the descriptions of the field content in the GFA1 and GFA2 specifications.

When writing extensions only the GFA2 field datatypes are generally used (as GFA1 does not contain custom fields). They are summarized in the following table:

Name                           Example  Description
alignment_gfa2                 12M1I3M  CIGAR string, Trace alignment or Placeholder (*)
identifier_gfa2                S1       ID of a line
oriented_identifier_gfa2       S1+      ID of a line followed by + or -
optional_identifier_gfa2       *        ID of a line or Placeholder (*)
identifier_list_gfa2           S1 S2    space separated list of line IDs
oriented_identifier_list_gfa2  S1+ S2-  space separated list of line IDs plus orientations
position_gfa2                  120$     non-negative integer, optionally followed by $
sequence_gfa2                  ACGNNYR  sequence of printable chars., no whitespace
string                         a b_c;d  string, no tabs and newlines (Z tags)
char                           A        single character (A tags)
float                          1.12     float (f tags)
integer                        -12      integer (i tags)
optional_integer               *        integer or placeholder
numeric_array                  c,10,3   array of integers or floats (B tags)
byte_array                     12F1FF   hexadecimal byte string (H tags)
json                           {'b':2}  JSON string, no tabs and newlines (J tags)

Custom datatypes for extensions

For custom records, one sometimes needs datatypes that are not available in the GFA specification. For example, a custom datatype can be defined for the taxon identifier used in the tid field of the T and M records: the taxon identifier must either have the form taxon:<n>, where <n> is a positive integer, or consist only of letters, digits and underscores (without :).
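
The format rule can be tried out on its own before it is wrapped in a datatype class. The sketch below uses only the standard library and the same two patterns as the TaxonID class in this section; the helper name is_valid_taxon_id is hypothetical.

```python
import re

def is_valid_taxon_id(string):
    """Return True if string is 'taxon:<n>' or a name without colons."""
    # Note: \d+ also accepts "taxon:0"; a stricter pattern would be
    # needed to enforce a strictly positive <n>.
    return bool(re.match(r"^taxon:(\d+)$", string) or
                re.match(r"^[a-zA-Z0-9_]+$", string))

for candidate in ["taxon:123", "B12c", "t123", "taxon:abc", "my:taxon"]:
    print(candidate, is_valid_taxon_id(candidate))
```

The first three candidates are accepted. The last two are rejected, because a colon is only allowed in the taxon:<n> form.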

To define the datatype, a class is written, which contains the following functions:

  • validate_encoded(string): validates the content of the field, if this is a string (e.g., the name of the T line)
  • validate_decoded(object): validates the content of the field, if this is not a string (e.g., a reference to a T line)
  • decode(string): validates the content of the field (a string) and returns the decoded content; note that references must not be resolved (there is no access to the Gfa instance here), thus the name of the T line will be returned unchanged
  • encode(object): validates the content of the field (not in string form) and returns the string which encodes it in the GFA file (here too, references are validated but not converted into strings)

Finally, the datatype is registered by calling register_datatype(). The code for the taxon ID extension is the following:

import re
import gfapy

class TaxonID:

  def validate_encoded(string):
    if not re.match(r"^taxon:(\d+)$",string) and \
        not re.match(r"^[a-zA-Z0-9_]+$", string):
      raise gfapy.ValueError("Invalid taxon ID: {}".format(string))

  def decode(string):
    TaxonID.validate_encoded(string)
    return string

  def validate_decoded(obj):
    if isinstance(obj,Taxon):
      TaxonID.validate_encoded(obj.name)
    else:
      raise gfapy.TypeError(
        "Invalid type for taxon ID: "+"{}".format(repr(obj)))

  def encode(obj):
    TaxonID.validate_decoded(obj)
    return obj

gfapy.Field.register_datatype("taxon_id", TaxonID)

To use the new datatype in the T and M lines defined above (see Extensions), change the definitions of the two subclasses: in POSFIELDS, assign the value taxon_id to the key tid.