Gfapy documentation

Introduction

The Graphical Fragment Assembly (GFA) are formats for the representation of sequence graphs, including assembly, variation and splicing graphs. Two versions of GFA have been defined (GFA1 and GFA2) and several sequence analysis programs have been adopting the formats as an interchange format, which allow to easily combine different sequence analysis tools.

This library implements the GFA1 and GFA2 specification described at https://github.com/GFA-spec/GFA-spec/blob/master/GFA-spec.md. It allows to create a Gfa object from a file in the GFA format or from scratch, to enumerate the graph elements (segments, links, containments, paths and header lines), to traverse the graph (by traversing all links outgoing from or incoming to a segment), to search for elements (e.g. which links connect two segments) and to manipulate the graph (e.g. to eliminate a link or a segment or to duplicate a segment distributing the read counts evenly on the copies).

The GFA format can be easily extended by users by defining own custom tags and record types. In Gfapy, it is easy to write extensions modules, which allow to define custom record types and datatypes for the parsing and validation of custom fields. The custom lines can be connected, using references, to each other and to lines of the standard record types.

Requirements

Gfapy has been written for Python 3 and tested using Python version 3.7. It does not require any additional Python packages or other software.

Installation

Gfapy is distributed as a Python package and can be installed using the Python package manager pip, as well as conda (in the Bioconda channel). It is also available as a package in some Linux distributions (Debian, Ubuntu).

The following command installs the current stable version from the Python Packages index:

pip install gfapy

If you would like to install the current development version from Github, use the following command:

pip install -e git+https://github.com/ggonnella/gfapy.git#egg=gfapy

Alternatively it is possible to install gfapy using conda. Gfapy is included in the Bioconda (https://bioconda.github.io/) channel:

conda install -c bioconda gfapy

Usage

If you installed gfapy as described above, you can import it in your script using the conventional Python syntax:

>>> import gfapy

Documentation

The documentation, including this introduction to Gfapy, a user manual and the API documentation is hosted on the ReadTheDocs server, at the URL http://gfapy.readthedocs.io/en/latest/ and it can be downloaded as PDF from the URL https://github.com/ggonnella/gfapy/blob/master/manual/gfapy-manual.pdf.

References

Giorgio Gonnella and Stefan Kurtz “GfaPy: a flexible and extensible software library for handling sequence graphs in Python”, Bioinformatics (2017) btx398 https://doi.org/10.1093/bioinformatics/btx398

Changelog

== 1.2.3 ==

- make it possible to count input header lines correctly

== 1.2.2 ==

- remove to/from aliases for GFA1 containment/link fields
  since `to` potentially clashes with a tag name

== 1.2.1 ==

- fixed an issue with linear path merging (issue 21)
- GFA1 paths can contain a single segment only (issue 22)

== 1.2.0 ==

- fixed all open issues

== 1.1.0 ==

- fix: custom tags are not necessarily lower case
- additional support for rGFA subset of GFA1 by setting option dialect="rgfa"

== 1.0.0 ==

- initial release

The Gfa class

The content of a GFA file is represented in Gfapy by an instance of the class Gfa. In most cases, the Gfa instance will be constructed from the data contained in a GFA file, using the method Gfa.from_file().

Alternatively, it is possible to use the construct of the class; it takes an optional positional parameter, the content of a GFA file (as string, or as list of strings, one per line of the GFA file). If no GFA content is provided, the Gfa instance will be empty.

>>> gfa = gfapy.Gfa("H\tVN:Z:1.0\nS\tA\t*")
>>> print(len(gfa.lines))
2
>>> gfa = gfapy.Gfa(["H\tVN:Z:1.0", "S\tA\t*", "S\tB\t*"])
>>> print(len(gfa.lines))
3
>>> gfa = gfapy.Gfa()
>>> print(len(gfa.lines))
0

The string representation of the Gfa object (which can be obtained using str()) is the textual representation in GFA format. Using Gfa.to_file(filename) allows writing this representation to a GFA file (the content of the file is overwritten).

>>> g1 = gfapy.Gfa()
>>> g1.append("H\tVN:Z:1.0")
>>> g1.append("S\ta\t*")
>>> g1.to_file("my.gfa") 
>>> g2 = gfapy.Gfa.from_file("my.gfa") 
>>> str(g1)
'H\tVN:Z:1.0\nS\ta\t*'

All methods for creating a Gfa (constructor and from_file) accept a vlevel parameter, the validation level, and can assume the values 0, 1, 2 and 3. A higher value means more validations are performed. The Validation chapter explains the meaning of the different validation levels in detail. The default value is 1.

>>> gfapy.Gfa().vlevel
1
>>> gfapy.Gfa(vlevel = 0).vlevel
0

A further parameter is version. It can be set to 'gfa1', 'gfa2' or left to the default value (None). The default is to auto-detect the version of the GFA from the line content. If the version is set manually, any content not compatible to the specified version will trigger an exception. If the version is set automatically, an exception will be raised if two lines are found, with content incompatible to each other (e.g. a GFA1 segment followed by a GFA2 segment).

>>> g = gfapy.Gfa(version='gfa2')
>>> g.version
'gfa2'
>>> g.add_line("S\t1\t*")
Traceback (most recent call last):
...
gfapy.error.VersionError: Version: 1.0 (None)
...
>>> g = gfapy.Gfa()
>>> g.version
>>> g.add_line("S\t1\t*")
>>> g.version
'gfa1'
>>> g.add_line("S\t1\t100\t*")
Traceback (most recent call last):
...
gfapy.error.VersionError: Version: 1.0 (None)
...

Collections of lines

The property lines of the Gfa object is a list of all the lines in the GFA file (including the header, which is split into single-tag lines). The list itself shall not be modified by the user directly (i.e. adding and removing lines is done using a different interface, see below). However the single elements of the list can be edited.

>>> for line in gfa.lines: print(line)

For most record types, a list of the lines of the record type is available as a read-only property, which is named after the record type, in plural.

>>> [str(line) for line in gfa1.segments]
['S\t1\t*', 'S\t2\t*', 'S\t3\t*']
>>> [str(line) for line in gfa2.fragments]
[]

A particular case are edges; these are in GFA1 links and containments, while in GFA2 there is a unified edge record type, which also allows to represent internal alignments. In Gfapy, the edges property retrieves all edges (i.e. all E lines in GFA2, and all L and C lines in GFA1). The dovetails property is a list of all edges which represent dovetail overlaps (i.e. all L lines in GFA1 and a subset of the E lines in GFA2). The containments property is a list of all edges which represent containments (i.e. all C lines in GFA1 and a subset of the E lines in GFA2).

>>> gfa2.edges
[]
>>> gfa2.dovetails
[]
>>> gfa2.containments
[]

Paths are retrieved using the paths property. This list contains all P lines in GFA1 and all O lines in GFA2. Sets returns the list of all U lines in GFA2 (empty list in GFA1).

>>> gfa2.paths
[]
>>> gfa2.sets
[]

The header contain metadata in a single or multiple lines. For ease of access to the header information, all its tags are summarized in a single line instance, which is retrieved using the header property. This list The The Header chapter of this manual explains more in detail, how to work with the header object.

>>> gfa2.header.TS
100

All lines which start by the string # are comments; they are handled in the Comments chapter and are retrieved using the comments property.

>>> [str(line) for line in gfa1.comments]
['# this is a comment']

Custom lines are lines of GFA2 files which start with a non-standard record type. Gfapy provides basic built-in support for accessing the information in custom lines, and allows to define extensions for own record types for defining more advanced functionality (see the Custom records chapter).

>>> [str(line) for line in gfa2.custom_records]
['X\tcustom line', 'Y\tcustom line']
>>> gfa2.custom_record_keys
['X', 'Y']
>>> [str(line) for line in gfa2.custom_records_of_type('X')]
['X\tcustom line']

Line identifiers

Some GFA lines have a mandatory or optional identifier field: segments and paths in GFA1, segments, gaps, edges, paths and sets in GFA2. A line of this type can be retrieved by identifier, using the method Gfa.line(ID) using the identifier as argument.

>>> str(gfa1.line('1'))
'S\t1\t*'

The GFA2 specification prescribes the exact namespace for the identifier (segments, paths, sets, edges and gaps identifier share the same namespace). The content of this namespace can be retrieved using the names property. The identifiers of single line types can be retrieved using the properties segment_names, edge_names, gap_names, path_names and set_names.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t100\t*")
>>> g.add_line("S\tB\t100\t*")
>>> g.add_line("S\tC\t100\t*")
>>> g.add_line("E\tb_c\tB+\tC+\t0\t10\t90\t100$\t*")
>>> g.add_line("O\tp1\tB+ C+")
>>> g.add_line("U\ts1\tA b_c g")
>>> g.add_line("G\tg\tA+\tB-\t1000\t*")
>>> g.names
['A', 'B', 'C', 'b_c', 'g', 'p1', 's1']
>>> g.segment_names
['A', 'B', 'C']
>>> g.path_names
['p1']
>>> g.edge_names
['b_c']
>>> g.gap_names
['g']
>>> g.set_names
['s1']

The GFA1 specification does not handle the question of the namespace of identifiers explicitly. However, gfapy assumes and enforces a single namespace for segment, path names and the values of the ID tags of L and C lines. The content of this namespace can be found using names property. The identifiers of single line types can be retrieved using the properties segment_names, edge_names (ID tags of links and containments) and path_names. For GFA1, the properties gap_names, set_names contain always empty lists.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*")
>>> g.add_line("S\tB\t*")
>>> g.add_line("S\tC\t*")
>>> g.add_line("L\tB\t+\tC\t+\t*\tID:Z:b_c")
>>> g.add_line("P\tp1\tB+,C+\t*")
>>> g.names
['A', 'B', 'C', 'b_c', 'p1']
>>> g.segment_names
['A', 'B', 'C']
>>> g.path_names
['p1']
>>> g.edge_names
['b_c']
>>> g.gap_names
[]
>>> g.set_names
[]

Identifiers of external sequences

Fragments contain identifiers which refer to external sequences (not contained in the GFA file). According to the specification, the these identifiers are not part of the same namespace as the identifier of the GFA lines. They can be retrieved using the external_names property.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t100\t*")
>>> g.add_line("F\tA\tread1+\t10\t30\t0\t20$\t20M")
>>> g.external_names
['read1']

The method Gfa.fragments_for_external(external_ID) retrieves all F lines with a specified external sequence identifier.

>>> f = g.fragments_for_external('read1')
>>> len(f)
1
>>> str(f[0])
'F\tA\tread1+\t10\t30\t0\t20$\t20M'

Adding new lines

New lines can be added to a Gfa instance using the Gfa.add_line(line) method or its alias Gfa.append(line). The argument can be either a string describing a line with valid GFA syntax, or a Line instance. If a string is added, a line instance is created and then added.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.segment_names
['A']
>>> g.append("S\tB\t*") 
>>> g.segment_names
['A', 'B']

Editing the lines

Accessing the information stored in the fields of a line instance is described in the Positional fields and Tags chapters.

In Gfapy, a line instance belonging to a Gfa instance is said to be connected to the Gfa instance. Direct editing the content of a connected line is only possible, for those fields which do not contain references to other lines. For more information on how to modify the content of the fields of connected line, see the References chapter.

>>> g = gfapy.Gfa()
>>> e = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
>>> e.sid1 = "C+"
>>> g.add_line(e) 
>>> e.sid1 = "A+"
Traceback (most recent call last):
gfapy.error.RuntimeError: ...

Removing lines

Disconnecting a line from the Gfa instance is done using the Gfa.rm(line) method. The argument can be a line instance or the name of a line.

In alternative, a line instance can also be disconnected using the disconnect method on it. Disconnecting a line may trigger other operations, such as the disconnection of other lines (see the References chapter).

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.segment_names
['A']
>>> g.rm('A') 
>>> g.segment_names
[]
>>> g.append("S\tB\t*") 
>>> g.segment_names
['B']
>>> b = g.line('B')
>>> b.disconnect()
>>> g.segment_names
[]

Renaming lines

Lines with an identifier can be renamed. This is done simply by editing the corresponding field (such as name or sid for a segment). This field is not a reference to another line and can be freely edited also in line instances connected to a Gfa. All references to the line from other lines will still be up to date, as they will refer to the same instance (whose name has been changed) and their string representation will use the new name.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.add_line("L\tA\t+\tB\t-\t*") 
>>> g.segment_names
['A', 'B']
>>> g.dovetails[0].from_name
'A'
>>> g.segment('A').name = 'C'
>>> g.segment_names
['B', 'C']
>>> g.dovetails[0].from_name
'C'

Validation

Different validation levels are available. They represent different compromises between speed and warrant of validity. The validation level can be specified when the Gfa object is created, using the vlevel parameter of the constructor and of the from_file method. Four levels of validation are defined (0 = no validation, 1 = validation by reading, 2 = validation by reading and writing, 3 = continuous validation). The default validation level value is 1.

Manual validation

Independently from the validation level chosen, the user can always check the value of a field calling validate_field() on the line instance. If no exception is raised, the field content is valid.

To check if the entire content of the line is valid, the user can call validate() on the line instance. This will check all fields and perform cross-field validations, such as comparing the length of the sequence of a GFA1 segment, to the value of the LN tag (if present).

It is also possible to validate the structure of the GFA, for example to check if there are unresolved references to lines. To do this, use the validate() of the Gfa instance.

No validations

If the validation is set to 0, Gfapy will try to accept any input and never raise an exception. This is not always possible, and in some cases, an exception will still be raised, if the data is invalid.

Validation when reading

If the validation level is set to 1 or higher, basic validations will be performed, such as checking the number of positional fields, the presence of duplicated tags, the tag datatype of predefined tags. Additionally, all tags will be validated, either during parsing or on first access. Record-type cross-field validations will also be performed.

In other words, a validation of 1 means that Gfapy guarantees (as good as it can) that the GFA content read from a file is valid, and will raise an exception on accessing the data if not.

The user is supposed to call validate_field after changing a field content to something which can be potentially invalid, or validate() if potentially cross-field validations could fail.

Validation when writing

Setting the level to 2 will perform all validations described above, plus validate the fields content when their value is written to string.

In other words, a validation of 2 means that Gfapy guarantee (as good as it can) that the GFA content read from a file and written to a file is valid and will raise an exception on accessing the data or writing to file if not.

Continuous validation

If the validation level is set to 3, all validations for lower levels described above are run, plus a validation of fields contents each time a setter method is used.

A validation of 3 means that Gfapy guarantees (as good as it can) that the GFA content is always valid.

Positional fields

Most lines in GFA have positional fields (Headers are an exception). During parsing, if a line is encountered, which has too less or too many positional fields, an exception will be thrown. The correct number of positional fields is record type-specific.

Positional fields are recognized by its position in the line. Each positional field has an implicit field name and datatype associated with it.

Field names

The field names are derived from the specification. Lower case versions of the field names are used and spaces are substituted with underscores. In some cases, the field names were changed, as they represent keywords in common programming languages or clash with potential tag names (from, to, send).

The following tables shows the field names used in Gfapy, for each kind of line. Headers have no positional fields. Comments and custom records follow particular rules, see the respective chapters (Comments and Custom records).

GFA1 field names

Record Type Field 1 Field 2 Field 3 Field 4 Field 5 Field 6
Segment name sequence        
Link from_segment from_orient to_segment to_orient overlap  
Containment from_segment from_orient to_segment to_orient pos overlap
Path path_name segment_names overlaps      

GFA2 field names

Record Type Field 1 Field 2 Field 3 Field 4 Field 5 Field 6 Field 7 Field 8
Segment sid slen sequence          
Edge eid sid1 sid2 beg1 end1 beg2 end2 alignment
Fragment sid external s_beg s_end f_beg f_end alignment  
Gap gid sid1 d1 d2 sid2 disp var  
Set pid items            
Path pid items            

Datatypes

The datatype of each positional field is described in the specification and cannot be changed (differently from tags). Here is a short description of the Python classes used to represent data for different datatypes.

Placeholders

The positional fields in GFA can never be empty. However, there are some fields with optional values. If a value is not specified, a placeholder character is used instead (*). Such undefined values are represented in Gfapy by the Placeholder class, which is described more in detail in the Placeholders chapter.

Arrays

The items field in unordered and ordered groups and the segment_names and overlaps fields in paths are lists of objects and are represented by list instances.

>>> set = gfapy.Line("U\t*\t1 A 2")
>>> type(set.items)
<class 'list'>
>>> gfa2_path = gfapy.Line("O\t*\tA+ B-")
>>> type(gfa2_path.items)
<class 'list'>
>>> gfa1_path = gfapy.Line("P\tp1\tA+,B-\t10M,9M1D1M")
>>> type(gfa1_path.segment_names)
<class 'list'>
>>> type(gfa1_path.overlaps)
<class 'list'>

Orientations

Orientations are represented by strings. The gfapy.invert() method applied to an orientation string returns the other orientation.

>>> gfapy.invert("+")
'-'
>>> gfapy.invert("-")
'+'

Identifiers

The identifier of the line itself (available for S, P, E, G, U, O lines) can always be accessed in Gfapy using the name alias and is represented in Gfapy by a string. If it is optional (E, G, U, O lines) and not specified, it is represented by a Placeholder instance. The fragment identifier is also a string.

Identifiers which refer to other lines are also present in some line types (L, C, E, G, U, O, F). These are never placeholders and in stand-alone lines are represented by strings. In connected lines they are references to the Line instances to which they refer to (see the References chapter).

Oriented identifiers

Oriented identifiers (e.g. segment_names in GFA1 paths) are represented by elements of the class gfapy.OrientedLine. The segment method of the oriented segments returns the segment identifier (or segment reference in connected path lines) and the orient method returns the orientation string. The name method returns the string of the segment, even if this is a reference to a segment. A new oriented line can be created using the OL[line, orientation] method.

Calling invert returns an oriented segment, with inverted orientation. To set the two attributes the methods segment= and orient= are available.

Examples:

>>> p = gfapy.Line("P\tP1\ta+,b-\t*")
>>> p.segment_names
[gfapy.OrientedLine('a','+'), gfapy.OrientedLine('b','-')]
>>> sn0 = p.segment_names[0]
>>> sn0.line
'a'
>>> sn0.name
'a'
>>> sn0.orient
'+'
>>> sn0.invert()
>>> sn0
gfapy.OrientedLine('a','-')
>>> sn0.orient
'-'
>>> sn0.line = gfapy.Line('S\tX\t*')
>>> str(sn0)
'X-'
>>> sn0.name
'X'
>>> sn0 = gfapy.OrientedLine(gfapy.Line('S\tY\t*'), '+')

Sequences

Sequences (S field sequence) are represented by strings in Gfapy. Depending on the GFA version, the alphabet definition is more or less restrictive. The definitions are correctly applied by the validation methods.

The method rc() is provided to compute the reverse complement of a nucleotidic sequence. The extended IUPAC alphabet is understood by the method. Applied to non nucleotidic sequences, the results will be meaningless:

>>> from gfapy.sequence import rc
>>> rc("gcat")
'atgc'
>>> rc("*")
'*'
>>> rc("yatc")
'gatr'
>>> rc("gCat")
'atGc'
>>> rc("cag", rna=True)
'cug'

Integers and positions

The C lines pos field and the G lines disp and var fields are represented by integers. The var field is optional, and thus can be also a placeholder. Positions are 0-based coordinates.

The position fields of GFA2 E lines (beg1, beg2, end1, end2) and F lines (s_beg, s_end, f_beg, f_end) contain a dollar string as suffix if the position is equal to the segment length. For more information, see the Positions chapter.

Alignments

Alignments are always optional, ie they can be placeholders. If they are specified they are CIGAR alignments or, only in GFA2, trace alignments. For more details, see the Alignments chapter.

GFA1 datatypes

Datatype Record Type Fields
Identifier Segment name
  Path path_name
  Link from_segment, to_segment
  Containment from_segment, to_segment
[OrientedIdentifier] Path segment_names
Orientation Link from_orient, to_orient
  Containment from_orient, to_orient
Sequence Segment sequence
Alignment Link overlap
  Containment overlap
[Alignment] Path overlaps
Position Containment pos

GFA2 datatypes

Datatype Record Type Fields
Itentifier Segment sid
  Fragment sid
OrientedIdentifier Edge sid1, sid2
  Gap sid1, sid2
  Fragment external
OptionalIdentifier Edge eid
  Gap gid
  U Group oid
  O Group uid
[Identifier] U Group items
[OrientedIdentifier] O Group items
Sequence Segment sequence
Alignment Edge alignment
  Fragment alignment
Position Edge beg1, end1, beg2, end2
  Fragment s_beg, s_end, f_beg, f_end
Integer Gap disp, var

Reading and writing positional fields

The positional_fieldnames method returns the list of the names (as strings) of the positional fields of a line. The positional fields can be read using a method on the Gfapy line object, which is called as the field name. Setting the value is done with an equal sign version of the field name method (e.g. segment.slen = 120). In alternative, the set(fieldname, value) and get(fieldname) methods can also be used.

>>> s_gfa1 = gfapy.Line("S\t1\t*")
>>> s_gfa1.positional_fieldnames
['name', 'sequence']
>>> s_gfa1.name
'1'
>>> s_gfa1.get("name")
'1'
>>> s_gfa1.name = "segment2"
>>> s_gfa1.name
'segment2'
>>> s_gfa1.set('name',"3")
>>> s_gfa1.name
'3'

When a field is read, the value is converted into an appropriate object. The string representation of a field can be read using the field_to_s(fieldname) method.

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("S\ts1\t*")
>>> gfa.add_line("L\ts1\t+\ts2\t-\t*")
>>> link = gfa.dovetails[0]
>>> str(link.from_segment)
'S\ts1\t*'
>>> link.field_to_s('from_segment')
's1'

When setting a non-string field, the user can specify the value of a tag either as a Python non-string object, or as the string representation of the value.

>>> gfa = gfapy.Gfa(version='gfa1')
>>> gfa.add_line("C\ta\t+\tb\t-\t10\t*")
>>> c = gfa.containments[0]
>>> c.pos
10
>>> c.pos = 1
>>> c.pos
1
>>> c.pos = "2"
>>> c.pos
2
>>> c.field_to_s("pos")
'2'

Note that setting the value of reference and backreferences-related fields is generally not allowed, when a line instance is connected to a Gfa object (see the References chapter).

>>> gfa = gfapy.Gfa(version='gfa1')
>>> l = gfapy.Line("L\ts1\t+\ts2\t-\t*")
>>> l.from_name
's1'
>>> l.from_segment = "s3"
>>> l.from_name
's3'
>>> gfa.add_line(l)
>>> l.from_segment = "s4"
Traceback (most recent call last):
...
gfapy.error.RuntimeError: ...

Validation

The content of all positional fields must be a correctly formatted string according to the rules given in the GFA specifications (or a Python object whose string representation is a correctly formatted string).

Depending on the validation level, more or less checks are done automatically (see the Validation chapter). Not regarding which validation level is selected, the user can trigger a manual validation using the validate_field(fieldname) method for a single field, or using validate, which does a full validation on the whole line, including all positional fields.

>>> line = gfapy.Line("H\txx:i:1")
>>> line.validate_field("xx")
>>> line.validate()

Aliases

For some fields, aliases are defined, which can be used in all contexts where the original field name is used (i.e. as parameter of a method, and the same setter and getter methods defined for the original field name are also defined for each alias, see below).

>>> gfa1_path = gfapy.Line("P\tX\t1-,2+,3+\t*")
>>> gfa1_path.name == gfa1_path.path_name
True
>>> edge = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
>>> edge.eid == edge.name
True
>>> containment = gfapy.Line("C\tA\t+\tB\t-\t10\t*")
>>> containment.from_segment == containment.container
True
>>> segment = gfapy.Line("S\t1\t*")
>>> segment.sid == segment.name
True
>>> segment.sid
'1'
>>> segment.name = '2'
>>> segment.sid
'2'

Name

Different record types have an identifier field: segments (name in GFA1, sid in GFA2), paths (path_name), edge (eid), fragment (sid), gap (gid), groups (pid).

All these fields are aliased to name. This allows the user for example to set the identifier of a line using the name=(value) method using the same syntax for different record types (segments, edges, paths, fragments, gaps and groups).

Version-specific field names

For segments the GFA1 name and the GFA2 sid are equivalent fields. For this reason an alias sid is defined for GFA1 segments and name for GFA2 segments.

Crypical field names

The definition of from and to for containments is somewhat cryptic. Therefore following aliases have been defined for containments: container[_orient] for from[_|segment|orient]; contained[_orient] for to[_segment|orient].

Placeholders

Some positional fields may contain an undefined value S: sequence; L/C: overlap; P: overlaps; E: eid, alignment; F: alignment; G: gid, var; U/O: pid. In GFA this value is represented by a *.

In Gfapy the class Placeholder represent the undefined value.

Distinguishing placeholders

The gfapy.is_placeholder() method allows to check if a value is a placeholder; a value is a placeholder if it is a Placeholder instance, or would represent a placeholder in GFA (a string containing *), or would be represented by a placeholder in GFA (e.g. an empty array).

>>> gfapy.is_placeholder("*")
True
>>> gfapy.is_placeholder("**")
False
>>> gfapy.is_placeholder([])
True
>>> gfapy.is_placeholder(gfapy.Placeholder())
True

Note that, as a placeholder is False in boolean context, just a if not placeholder will also work, if the value is an instance of Placeholder, but not always for the other cases (in particular not for the string representation *). Therefore using gfapy.is_placeholder() is better.

>>> if "*": print('* is not a placeholder')
* is not a placeholder
>>> if gfapy.is_placeholder("*"): print('but it represents a placeholder')
but it represents a placeholder

Compatibility methods

Some methods are defined for placeholders, which allow them to respond to the same methods as defined values. This allows to write generic code.

>>> placeholder = gfapy.Placeholder()
>>> placeholder.validate() # does nothing
>>> len(placeholder)
0
>>> placeholder[1]
gfapy.Placeholder()
>>> placeholder + 1
gfapy.Placeholder()

Positions

The only position field in GFA1 is the pos field in the C lines. This represents the starting position of the contained segment in the container segment and is 0-based.

Some fields in GFA2 E lines (beg1, beg2, end1, end2) and F lines (s_beg, s_end, f_beg, f_end) are positions. According to the specification, they are 0-based and represent virtual ticks before and after each string in the sequence. Thus ranges are represented similarly to the Python range conventions: e.g. a 1-character prefix of a sequence will have begin 0 and end 1.

Last positions in GFA2

The GFA2 positions must contain an additional string ($) appended to the integer, if (and only if) they are the last position in the segment sequence. These particular positions are represented in Gfapy as instances of the class LastPos.

To create a lastpos instance, the constructor can be used with an integer, or the string representation (which must end with the dollar sign, otherwise an integer is returned):

>>> str(gfapy.LastPos(12))
'12$'
>>> gfapy.LastPos("12")
12
>>> str(gfapy.LastPos("12"))
'12'
>>> gfapy.LastPos("12$")
gfapy.LastPos(12)
>>> str(gfapy.LastPos("12$"))
'12$'

Subtracting an integer from a lastpos returns a lastpos if 0 subtracted, an integer otherwise. This allows to do some arithmetic on positions without making them invalid.

>>> gfapy.LastPos(12) - 0
gfapy.LastPos(12)
>>> gfapy.LastPos(12) - 1
11

The functions islastpos() and isfirstpos() allow to determine if a position value is 0 (first), or the last position, using the same syntax for lastpos and integer instances.

>>> gfapy.isfirstpos(0)
True
>>> gfapy.islastpos(0)
False
>>> gfapy.isfirstpos(12)
False
>>> gfapy.islastpos(12)
False
>>> gfapy.islastpos(gfapy.LastPos("12"))
False
>>> gfapy.islastpos(gfapy.LastPos("12$"))
True

Alignments

Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain alignments or lists of alignments. The alignment can be left unspecified and a placeholder symbol * used instead. In GFA1 the alignments can be given as CIGAR strings, in GFA2 also as Dazzler traces.

Gfapy uses three different classes for representing the content of alignment fields: CIGAR, Trace and AlignmentPlaceholder.

Creating an alignment

An alignment instance is usually created from its GFA string representation or from a list by using the gfapy.Alignment() constructor.

>>> from gfapy import Alignment
>>> Alignment("*")
gfapy.AlignmentPlaceholder()
>>> Alignment("10,10,10")
gfapy.Trace([10,10,10])
>>> Alignment([10,10,10])
gfapy.Trace([10,10,10])
>>> Alignment("30M2I")
gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')])

If the argument is an alignment object it will be returned, so that is always safe to call the method on a variable which can contain a string or an alignment instance:

>>> Alignment(Alignment("*"))
gfapy.AlignmentPlaceholder()
>>> Alignment(Alignment("10,10"))
gfapy.Trace([10,10])

Recognizing undefined alignments

The gfapy.is_placeholder() method allows to test if an alignment field contains an undefined value (placeholder) instead of a defined value (CIGAR string, trace). The method accepts as argument either an alignment object or a string or list representation.

>>> from gfapy import is_placeholder, Alignment
>>> is_placeholder(Alignment("30M"))
False
>>> is_placeholder(Alignment("10,10"))
False
>>> is_placeholder(Alignment("*"))
True
>>> is_placeholder("*")
True
>>> is_placeholder("30M")
False
>>> is_placeholder("10,10")
False
>>> is_placeholder([])
True
>>> is_placeholder([10,10])
False

Note that, as a placeholder is False in boolean context, just a if not aligment will also work, if alignment is an alignment object. But this of course, does not work, if it is a string representation. Therefore it is better to use the gfapy.is_placeholder() method, which works in both cases.

>>> if not Alignment("*"): print('no alignment')
no alignment
>>> if is_placeholder(Alignment("*")): print('no alignment')
no alignment
>>> if "*": print('not a placeholder...?')
not a placeholder...?
>>> if is_placeholder("*"): print('really? it is a placeholder!')
really? it is a placeholder!

Reading and editing CIGARs

CIGARs are represented by specialized lists, instances of the class CIGAR, whose elements are CIGAR operations CIGAR operations are represented by instance of the class Operation, and provide the properties length (length of the operation, an integer) and code (one-letter string which specifies the type of operation). Note that not all operations allowed in SAM files (for which CIGAR strings were first defined) are also meaningful in GFA and thus GFA2 only allows the operations M, I, D and P.

>>> cigar = gfapy.Alignment("30M")
>>> isinstance(cigar, list)
True
>>> operation = cigar[0]
>>> type(operation)
<class 'gfapy.alignment.cigar.CIGAR.Operation'>
>>> operation.code
'M'
>>> operation.code = 'D'
>>> operation.length
30
>>> len(operation)
30
>>> str(operation)
'30D'

As a CIGAR instance is a list, list methods apply to it. If the array is emptied, its string representation will be the placeholder symbol *.

>>> cigar = gfapy.Alignment("1I20M2D")
>>> cigar[0].code = "M"
>>> cigar.pop(1)
gfapy.CIGAR.Operation(20,'M')
>>> str(cigar)
'1M2D'
>>> cigar[:] = []
>>> str(cigar)
'*'

The validate CIGAR.validate() function checks if a CIGAR instance is valid. A version can be provided, as the CIGAR validation is version specific (as GFA2 forbids some CIGAR operations).

>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar.validate()
>>> cigar[1].code = "L"
>>> cigar.validate()
Traceback (most recent call last):
  ...
gfapy.error.ValueError:
>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar[1].code = "X"
>>> cigar.validate(version="gfa1")
>>> cigar.validate(version="gfa2")
Traceback (most recent call last):
  ...
gfapy.error.ValueError:

Reading and editing traces

Traces are arrays of non-negative integers. The values are interpreted using a trace spacing value. If traces are used, a trace spacing value must be defined in a TS integer tag, either in the header, or in the single lines which contain traces (which takes precedence over the header global value).

>>> print(gfa) 
H TS:i:100
E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50
...
>>> gfa.header.TS
100
>>> gfa.line("x").TS
50

Query, reference and complement

CIGARs are asymmetric, i.e.they consider one sequence as reference and another sequence as query.

The length_on_reference() and length_on_query() methods compute the length of the alignment on the two sequences. These methods are used by the library e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs are provided).

>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar.length_on_reference()
70
>>> cigar.length_on_query()
65

CIGARs are dependent on which sequence is taken as reference and which is taken as query. For each alignment, a complement CIGAR can be computed using the method complement(); it is the CIGAR obtained when the two sequences are switched.

>>> cigar = gfapy.Alignment("2M1D3M")
>>> str(cigar.complement())
'3M1I2M'

The current version of Gfapy does not provide a way to compute the alignment, thus the trace information can be accessed and edited, but not used for this purpose. Because of this there is currently no way in Gfapy to compute a complement trace (trace obtained when the sequences are switched).

>>> trace = gfapy.Alignment("1,2,3")
>>> str(trace.complement())
'*'

The complement of a placeholder is a placeholder:

>>> str(gfapy.Alignment("*").complement())
'*'

Tags

Each record in GFA can contain tags. Tags are fields which consist in a tag name, a datatype and data. The format is NN:T:DATA where NN is a two-letter tag name, T is a one-letter datatype string and DATA is a string representing the data according to the specified datatype. Tag names must be unique for each line, i.e. each line may only contain a tag once.

# Examples of GFA tags of different datatypes:
"aa:i:-12"
"bb:f:1.23"
"cc:Z:this is a string"
"dd:A:X"
"ee:B:c,12,3,2"
"ff:H:122FA0"
'gg:J:["A","B"]'

Custom tags

Some tags are explicitly defined in the specification (these are named predefined tags in Gfapy), and the user or an application can define its own custom tags. These may contain lower case letters.

Custom tags are user or program specific and may of course collide with the tags used by other users or programs. For this reasons, if you write scripts which employ custom tags, you should always check that the values are of the correct datatype and plausible.

>>> line = gfapy.Line("H\txx:i:2")
>>> if line.get_datatype("xx") != "i":
...   raise Exception("I expected the tag xx to contain an integer!")
>>> myvalue = line.xx
>>> if (myvalue > 120) or (myvalue % 2 == 1):
...   raise Exception("The value in the xx tag is not an even value <= 120")
>>> # ... do something with myvalue

Also it is good practice to allow the user of the script to change the name of the custom tags. For example, Gfapy employs the +or+ custom tag to track the original segment from which a segment in the final graph is derived. All methods which read or write the +or+ tag allow to specify an alternative tag name to use instead of +or+, for the case that this name collides with the custom tag of another program.

# E.g. a method which does something with myvalue, usually stored in tag xx
# allows the user to specify an alternative name for the tag
def mymethod(line, mytag="xx"):
  myvalue = line.get(mytag)
  # ...

Predefined tags

According to the GFA specifications, predefined tag names consist of either two upper case letters, or an upper case letter followed by a digit. The GFA1 specification predefines tags for each line type, while GFA2 only predefines tags for the header and edges.

While tags with the predefined names are allowed to be added to any line, when they are used in the lines mentiones in the specification (e.g. VN in the header) gfapy checks that the datatype is the one prescribed by the specification (e.g. VN must be of type Z). It is not forbidden to use the same tags in other contexts, but in this case, the datatype restriction is not enforced.

Tag | Type Line types GFA version
VN | Z H 1,2
TS i H,S 2
LN i S 1
RC i S,L,C 1
FC i S,L 1
KC i S,L 1
SH H S 1
UR Z S 1
MQ i L 1
NM i L,i 1
ID Z L,C 1
"VN:Z:1.0" # VN => predefined tag
"z5:Z:1.0" # z5 first char is downcase => custom tag
"XX:Z:aaa" # XX upper case, but not predefined => custom tag

# not forbidden, but not recommended:
"zZ:Z:1.0" # => mixed case, first char downcase => custom tag
"Zz:Z:1.0" # => mixed case, first char upcase => custom tag
"vn:Z:1.0" # => same name as predefined tag, but downcase => custom tag

Datatypes

The following table summarizes the datatypes available for tags:

Symbol Datatype Example Python class
Z string This is a string str
i integer -12 int
f float 1.2E-5 float
A char X str
J JSON [1,{“k1”:1,”k2”:2},”a”] list/dict
B numeric array f,1.2,13E-2,0 gfapy.NumericArray
H byte array FFAA01 gfapy.ByteArray

Validation

The tag names must consist of a letter and a digit or two letters.

"KC:i:1"  # => OK
"xx:i:1"  # => OK
"x1:i:1"  # => OK
"xxx:i:1" # => error: name is too long
"x:i:1"   # => error: name is too short
"11:i:1"  # => error: at least one letter must be present

The datatype must be one of the datatypes specified above. For predefined tags, Gfapy also checks that the datatype given in the specification is used.

"xx:X:1" # => error: datatype X is unknown
"VN:i:1" # => error: VN must be of type Z

The data must be a correctly formatted string for the specified datatype or a Python object whose string representation is a correctly formatted string.

# current value: xx:i:2
>>> line = gfapy.Line("S\tA\t*\txx:i:2")
>>> line.xx = 1
>>> line.xx
1
>>> line.xx = "3"
>>> line.xx
3
>>> line.xx = "A"
>>> line.xx
Traceback (most recent call last):
...
gfapy.error.FormatError: ...

Depending on the validation level, more or less checks are done automatically (see Validation chapter). Per default - validation level (1) - validation is performed only during parsing or accessing values the first time, therefore the user must perform a manual validation if he changes values to something which is not guaranteed to be correct. To trigger a manual validation, the user can call the method validate_field(fieldname) to validate a single tag, or validate() to validate the whole line, including all tags.

>>> line = gfapy.Line("S\tA\t*\txx:i:2", vlevel = 0)
>>> line.validate_field("xx")
>>> line.validate()
>>> line.xx = "A"
>>> line.validate_field("xx")
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
>>> line.validate()
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
>>> line.xx = "3"
>>> line.validate_field("xx")
>>> line.validate()

Reading and writing tags

Tags can be read using a property on the Gfapy line object, which is called as the tag (e.g. line.xx). A special version of the property prefixed by try_get_ raises an error if the tag was not available (e.g. line.try_get_LN), while the tag property (e.g. line.LN) would return None in this case. Setting the value is done assigning a value to it the tag name method (e.g. line.TS = 120). In alternative, the set(fieldname, value), get(fieldname) and try_get(fieldname) methods can also be used. To remove a tag from a line, use the delete(fieldname) method, or set its value to None. The tagnames property Line instances is a list of the names (as strings) of all defined tags for a line.

>>> line = gfapy.Line("S\tA\t*\txx:i:1", vlevel = 0)
>>> line.xx
1
>>> line.xy is None
True
>>> line.try_get_xx()
1
>>> line.try_get_xy()
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.get("xx")
1
>>> line.try_get("xy")
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.xx = 2
>>> line.xx
2
>>> line.xx = "a"
>>> line.tagnames
['xx']
>>> line.xy = 2
>>> line.xy
2
>>> line.set("xy", 3)
>>> line.get("xy")
3
>>> line.tagnames
['xx', 'xy']
>>> line.delete("xy")
3
>>> line.xy is None
True
>>> line.xx = None
>>> line.xx is None
True
>>> line.try_get("xx")
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.tagnames
[]

When a tag is read, the value is converted into an appropriate object (see Python classes in the datatype table above). When setting a value, the user can specify the value of a tag either as a Python object, or as the string representation of the value.

>>> line = gfapy.Line('H\txx:i:1\txy:Z:TEXT\txz:J:["a","b"]')
>>> line.xx
1
>>> isinstance(line.xx, int)
True
>>> line.xy
'TEXT'
>>> isinstance(line.xy, str)
True
>>> line.xz
['a', 'b']
>>> isinstance(line.xz, list)
True

The string representation of a tag can be read using the field_to_s(fieldname) method. The default is to only output the content of the field. By setting ``tag: true```, the entire tag is output (name, datatype, content, separated by colons). An exception is raised if the field does not exist.

>>> line = gfapy.Line("H\txx:i:1")
>>> line.xx
1
>>> line.field_to_s("xx")
'1'
>>> line.field_to_s("xx", tag=True)
'xx:i:1'

Datatype of custom tags

The datatype of an existing custom field (but not of predefined fields) can be changed using the set_datatype(fieldname, datatype) method. The current datatype specification can be read using get_datatype(fieldname).

>>> line = gfapy.Line("H\txx:i:1")
>>> line.get_datatype("xx")
'i'
>>> line.set_datatype("xx", "Z")
>>> line.get_datatype("xx")
'Z'

If a new custom tag is specified, Gfapy selects the correct datatype for it: i/f for numeric values, J/B for arrays, J for hashes and Z for strings and strings. If the user wants to specify a different datatype, he may do so by setting it with set_datatype() (this can be done also before assigning a value, which is necessary if full validation is active).

>>> line = gfapy.Line("H")
>>> line.xx = "1"
>>> line.xx
'1'
>>> line.set_datatype("xy", "i")
>>> line.xy = "1"
>>> line.xy
1

Arrays of numerical values

B and H tags represent array with particular constraints (e.g. they can only contain numeric values, and in some cases the values must be in predefined ranges). In order to represent them correctly and allow for validation, Python classes have been defined for both kind of tags: gfapy.ByteArray for H and gfapy.NumericArray for B fields.

Both are subclasses of list. Object of the two classes can be created by passing an existing list or the string representation to the class constructor.

>>> # create a byte array instance
>>> gfapy.ByteArray([12,3,14])
b'\x0c\x03\x0e'
>>> gfapy.ByteArray("A012FF")
b'\xa0\x12\xff'
>>> # create a numeric array instance
>>> gfapy.NumericArray.from_string("c,12,3,14")
[12, 3, 14]
>>> gfapy.NumericArray([12,3,14])
[12, 3, 14]

Instances of the classes behave as normal lists, except that they provide a #validate() method, which checks the constraints, and that their string representation is the GFA string representation of the field value.

>>> gfapy.NumericArray([12,1,"1x"]).validate()
Traceback (most recent call last):
...
gfapy.error.ValueError
>>> str(gfapy.NumericArray([12,3,14]))
'C,12,3,14'
>>> gfapy.ByteArray([12,1,"1x"]).validate()
Traceback (most recent call last):
...
gfapy.error.ValueError
>>> str(gfapy.ByteArray([12,3,14]))
'0C030E'

For numeric values, the compute_subtype method allows to compute the subtype which will be used for the string representation. Unsigned subtypes are used if all values are positive. The smallest possible subtype range is selected. The subtype may change when the range of the elements changes.

>>> gfapy.NumericArray([12,13,14]).compute_subtype()
'C'

Special cases: custom records, headers, comments and virtual lines.

GFA2 allows custom records, introduced by record type strings other than the predefined ones. Gfapy uses a pragmatical approach for identifying tags in custom records, and tries to interpret the rightmost fields as tags, until the first field from the right raises an error; all remaining fields are treated as positional fields.

"X a b c xx:i:12" # => xx is tag, a, b, c are positional fields
"Y a b xx:i:12 c" # => all positional fields, as c is not a valid tag

For easier access, the entire header of the GFA is summarized in a single line instance. A class (FieldArray) has been defined to handle the special case when multiple H lines define the same tag (see The Header chapter for details).

Comment lines are represented by a subclass of the same class (Line) as the records. However, they cannot contain tags: the entire line is taken as content of the comment. See the Comments chapter for more information about comments.

"# this is not a tag: xx:i:1" # => xx is not a tag, xx:i:1 is part of the comment

Virtual instances of the Line class (e.g. segment instances automatically created because of not yet resolved references found in edges) cannot be modified by the user, and tags cannot be specified for them. This includes all instances of the Unknown class. See the References chapter for more information about virtual lines.

References

Some fields in GFA lines contain identifiers or lists of identifiers (sometimes followed by orientation strings), which reference other lines of the GFA file. In Gfapy it is possible to follow these references and traverse the graph.

Connecting a line to a Gfa object

In stand-alone line instances, the identifiers which reference other lines are either strings containing the line name, pairs of strings (name and orientation) in a gfapy.OrientedLine object, or lists of lines names or gfapy.OrientedLine objects.

Using the add_line(line) (alias: append(line)) method of the gfapy.Gfa object, or the equivalent connect(gfa) method of the gfapy.Line instance, a line is added to a Gfa instance (this is done automatically when a GFA file is parsed). All strings expressing references are then changed into references to the corresponding line objects. The method is_connected() allows to determine if a line is connected to a gfapy instance. The read-only property gfa contains the gfapy.Gfa instance to which the line is connected.

>>> gfa = gfapy.Gfa(version='gfa1')
>>> link = gfapy.Line("L\tA\t-\tB\t+\t20M")
>>> link.is_connected()
False
>>> link.gfa is None
True
>>> type(link.from_segment)
<class 'str'>
>>> gfa.append(link)
>>> link.is_connected()
True
>>> link.gfa 
<gfapy.gfa.Gfa object at ...>
>>> type(link.from_segment)
<class 'gfapy.line.segment.gfa1.GFA1'>

References for each record type

The following tables describes the references contained in each record type. The notation [] represent lists.

GFA1

Record type Fields Type of reference
Link from_segment, to_segment | Segment
Containment from_segment, to_segment | Segment
Path segment_names, [OrientedLine(Segment)]
  links (1) [OrientedLine(Link)]

(1): paths contain information in the fields segment_names and overlaps, which allow to find the identify from which they depend; these links can be retrieved using links (which is not a field).

GFA2

Record type Fields Type of reference
Edge sid1, sid2 Segment
Gap sid1, sid2 Segment
Fragment sid Segment
Set items [Edge/Set/Path/Segment]
Path items [OrientedLine(Edge/Set/Segment)]

Backreferences for each record type

When a line containing a reference to another line is connected to a Gfa object, backreferences to it are created in the targeted line.

For each backreference collection a read-only property exist, which is named as the collection (e.g. dovetails_L for segments). Note that the reference list returned by these arrays are read-only and editing the references is done using other methods (see the section “Editing reference fields” below).

segment.dovetails_L # => [gfapy.line.edge.Link(...), ...]

The following tables describe the backreferences collections for each record type.

GFA1

Record type Backreferences
Segment dovetails_L
  dovetails_R
  edges_to_contained
  edges_to_containers
  paths
Link paths

GFA2

Record type Backreferences Type
Segment dovetails_L E
  dovetails_R E
  edges_to_contained E
  edges_to_containers E
  internals E
  gaps_L G
  gaps_R G
  fragments F
  paths O
  sets U
Edge paths O
  sets U
O Group paths O
  sets U
U Group sets U

Segment backreference convenience methods

For segments, additional methods are available which combine in different way the backreferences information. The dovetails_of_end and gaps_of_end methods take an argument L or R and return the dovetails overlaps (or gaps) of the left or, respectively, right end of the segment sequence (equivalent to the segment properties dovetails_L/dovetails_R and gaps_L/gaps_R).

The segment containments property is a list of both containments where the segment is the container or the contained segment. The segment edges property is a list of all edges (dovetails, containments and internals) with a reference to the segment.

Other methods directly compute list of segments from the edges lists mentioned above. The neighbours_L, neighbours_R properties and the neighbours method compute the set of segment instances which are connected by dovetails to the segment. The containers and contained properties similarly compute the set of segment instances which, respectively, contains the segment, or are contained in the segment.

>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tA\t*')
>>> s = gfa.segment('A')
>>> gfa.append('S\tB\t*')
>>> gfa.append('S\tC\t*')
>>> gfa.append('L\tA\t-\tB\t+\t*')
>>> gfa.append('C\tA\t+\tC\t+\t10\t*')
>>> [str(l) for l in s.dovetails_of_end("L")]
['L\tA\t-\tB\t+\t*']
>>> s.dovetails_L == s.dovetails_of_end("L")
True
>>> s.gaps_of_end("R")
[]
>>> [str(e) for e in s.edges]
['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*']
>>> [str(n) for n in s.neighbours_L]
['S\tB\t*']
>>> s.containers
[]
>>> [str(c) for c in s.contained]
['S\tC\t*']

Multiline group definitions

The GFA2 specification opens the possibility (experimental) to define groups on multiple lines, by using the same ID for each line defining the group. This is supported by gfapy.

This means that if multiple Ordered or Unordered instances connected to a Gfa object have the same gid, they are merged into a single instance (technically the last one getting added to the graph object). The items list are merged.

The tags of multiple line defining a group shall not contradict each other (i.e. either are the tag names on different lines defining the group all different, or, if the same tag is present on different lines, the value and datatype must be the same, in which case the multiple definition will be ignored).

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3']
>>> gfa.add_line('U\tu1\t4 5')
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3', '4', '5']

Induced set and captured path

The item list in GFA2 sets and paths may not contain elements which are implicitly involved. For example a path may contain segments, without specifying the edges connecting them, if there is only one such edge. Alternatively a path may contain edges, without explicitly indicating the segments. Similarly a set may contain edges, but not the segments referred to in them, or contain segments which are connected by edges, without the edges themselves. Furthermore groups may refer to other groups (set to sets or paths, paths to paths only), which then indirectly contain references to segments and edges.

Gfapy provides methods for the computation of the sets of segments and edges which are implied by an ordered or unordered group. Thereby all references to subgroups are resolved and implicit elements are added, as described in the specification. The computation can, therefore, only be applied to connected lines. For unordered groups, this computation is provided by the method induced_set(), which returns an array of segment and edge instances. For ordered group, the computation is provided by the method captured_path(), which returns a list of gfapy.OrientedLine instances, alternating segment and edge instances (and starting and ending in segments).

The methods induced_segments_set(), induced_edges_set(), captured_segments() and captured_edges() return, respectively, the list of only segments or edges, in ordered or unordered groups.

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("S\ts1\t100\t*")
>>> gfa.add_line("S\ts2\t100\t*")
>>> gfa.add_line("S\ts3\t100\t*")
>>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*")
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> u = gfa.sets[-1]
>>> [l.name for l in u.induced_edges_set]
['e1']
>>> [l.name for l in u.induced_segments_set ]
['s1', 's2', 's3']
>>> [l.name for l in u.induced_set ]
['s1', 's2', 's3', 'e1']

Disconnecting a line from a Gfa object

Lines can be disconnected using the rm(line) method of the gfapy.Gfa object or the disconnect() method of the line instance.

>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tsA\t*')
>>> gfa.append('S\tsB\t*')
>>> line = gfa.segment("sA")
>>> gfa.segment_names
['sA', 'sB']
>>> gfa.rm(line)
>>> gfa.segment_names
['sB']
>>> line = gfa.segment('sB')
>>> line.disconnect()
>>> gfa.segment_names
[]

Disconnecting a line affects other lines as well. Lines which are dependent on the disconnected line are disconnected as well. Any other reference to disconnected lines is removed as well. In the disconnected line, references to lines are transformed back to strings and backreferences are deleted.

The following tables show which dependent lines are disconnected if they refer to a line which is being disconnected.

GFA1

Record type Dependent lines
Segment links (+ paths), containments
Link paths

GFA2

Record type Dependent lines
Segment edges, gaps, fragments, sets, paths
Edge sets, paths
Sets sets, paths

Editing reference fields

In connected line instances, it is not allowed to directly change the content of fields containing references to other lines, as this would make the state of the Gfa object invalid.

Besides the fields containing references, some other fields are read-only in connected lines. Changing some of the fields would require moving the backreferences to other collections (position fields of edges and gaps, from_orient and to_orient of links). The overlaps field of connected links is readonly as it may be necessary to identify the link in paths.

Renaming an element

The name field of a line (e.g. segment name/sid) is not a reference and thus can be edited also in connected lines. When the name of the line is changed, no manual editing of references (e.g. from_segment/to_segment fields in links) is necessary, as all lines which refer to the line will still refer to the same instance. The references to the instance in the Gfa lines collections will be automatically updated. Also, the new name will be correctly used when converting to string, such as when the Gfa instance is written to a GFA file.

Renaming a line to a name which already exists has the same effect of adding a line with that name. That is, in most cases, gfapy.NotUniqueError is raised. An exception are GFA2 sets and paths: in this case the line will be appended to the existing line with the same name (as described in “Multiline group definitions”).

Adding and removing group elements

Elements of GFA2 groups can be added and removed from both connected and non-connected lines, using the following methods.

To add an item to or remove an item from an unordered group, use the methods add_item(item) and rm_item(item), which take as argument either a string (identifier) or a line instance.

To append or prepend an item to an ordered group, use the methods append_item(item) and prepend_item(item). To remove the first or the last item of an ordered group use the methods rm_first_item() and rm_last_item().

Editing read-only fields of connected lines

Editing the read-only information of edges, gaps, links, containments, fragments and paths is more complicated. These lines shall be disconnected before the edit and connected again to the Gfa object after it. Before disconnecting a line, you should check if there are other lines dependent on it (see tables above). If so, you will have to disconnect these lines first, eventually update their fields and reconnect them at the end of the operation.

Virtual lines

The order of the lines in GFA is not prescribed. Therefore, during parsing, or constructing a Gfa in memory, it is possible that a line is referenced to, before it is added to the Gfa instance. Whenever this happens, Gfapy creates a “virtual” line instance.

Users do not have to handle with virtual lines, if they work with complete and valid GFA files.

Virtual lines are similar to normal line instances, with some limitations (they contain only limited information and it is not allowed to add tags to them). To check if a line is a virtual line, one can use the virtual property of the line.

As soon as the parser founds the real line corresponding to a previously introduced virtual line, the virtual line is exchanged with the real line and all references are corrected to point to the real line.

>>> g = gfapy.Gfa()
>>> g.add_line("S\t1\t*")
>>> g.add_line("L\t1\t+\t2\t+\t*")
>>> l = g.dovetails[0]
>>> g.segment("1").virtual
False
>>> g.segment("2").virtual
True
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True
>>> g.add_line("S\t2\t*")
>>> g.segment("2").virtual
False
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True

The Header

GFA files may contain one or multiple header lines (record type: “H”). These lines may be present in any part of the file, not necessarily at the beginning.

Although the header may consist of multiple lines, its content refers to the whole file. Therefore in Gfapy the header is accessed using a single line instance (accessible by the header property). Header lines contain only tags. If not header line is present in the Gfa, then the header line object will be empty (i.e. contain no tags).

Note that header lines cannot be connected to the Gfa as other lines (i.e. calling connect() on them raises an exception). Instead they must be merged to the existing Gfa header, using add_line on the Gfa instance.

>>> gfa.add_line("H\tnn:f:1.0") 
>>> gfa.header.nn
1.0
>>> gfapy.Line("H\tnn:f:1.0").connect(gfa)
Traceback (most recent call last):
...
gfapy.error.RuntimeError: ...

Multiple definitions of the predefined header tags

For the predefined tags (VN and TS), the presence of multiple values in different lines is an error, unless the value is the same in each instance (in which case the repeated definitions are ignored).

>>> gfa.add_line("H\tVN:Z:1.0") 
>>> gfa.add_line("H\tVN:Z:1.0") # ignored 
>>> gfa.add_line("H\tVN:Z:2.0")
Traceback (most recent call last):
...
gfapy.error.VersionError: ...

Multiple definitions of custom header tags

If the tags are present only once in the header in its entirety, the access to the tags is the same as for any other line (see the Tags chapter).

However, the specification does not forbid custom tags to be defined with different values in different header lines (which we name “multi-definition tags”). This particular case is handled in the next sections.

Reading multi-definitions tags

Reading, validating and setting the datatype of multi-definition tags is done using the same methods as for all other lines (see the Tags chapter). However, if a tag is defined multiple times on multiple H lines, reading the tag will return a list of the values on the lines. This array is an instance of the subclass gfapy.FieldArray of list.

>>> gfa.add_line("H\txx:i:1") 
>>> gfa.add_line("H\txx:i:2") 
>>> gfa.add_line("H\txx:i:3") 
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])

Setting tags

There are two possibilities to set a tag for the header. The first is the normal tag interface (using set or the tag name property). The second is to use add. The latter supports multi-definition tags, i.e. it adds the value to the previous ones (if any), instead of overwriting them.

>>> gfa = gfapy.Gfa()
>>> gfa.header.xx
>>> gfa.header.add("xx", 1)
>>> gfa.header.xx
1
>>> gfa.header.add("xx", 2)
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2])
>>> gfa.header.set("xx", 3)
>>> gfa.header.xx
3

Modifying field array values

Field arrays can be modified directly (e.g. adding new values or removing some values). After modification, the user may check if the array values remain compatible with the datatype of the tag using the validate_field`() method.

>>> gfa = gfapy.Gfa()
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])
>>> gfa.header.validate_field("xx")
>>> gfa.header.xx.append("X")
>>> gfa.header.validate_field("xx")
Traceback (most recent call last):
...
gfapy.error.FormatError: ...

If the field array is modified using array methods which return a list or data of any other type, a field array must be constructed, setting its datatype to the value returned by calling get_datatype() on the header.

>>> gfa = gfapy.Gfa()
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])
>>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"),
... list(map(lambda x: x+1, gfa.header.xx)))
>>> gfa.header.xx
gfapy.FieldArray('i',[2, 3, 4])

String representation of the header

For consistency with other line types, the string representation of the header is a single-line string, eventually non standard-compliant, if it contains multiple instances of the tag. (and when calling field_to_s() for a tag present multiple times, the output string will contain the instances of the tag, separated by tabs).

However, when the Gfa is output to file or string, the header is split into multiple H lines with single tags, so that standard-compliant GFA is output. The split header can be retrieved using the headers property of the Gfa instance.

>>> gfa = gfapy.Gfa()
>>> gfa.header.VN = "1.0"
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2])
>>> gfa.header.field_to_s("xx")
'1\t2'
>>> gfa.header.field_to_s("xx", tag=True)
'xx:i:1\txx:i:2'
>>> str(gfa.header)
'H\tVN:Z:1.0\txx:i:1\txx:i:2'
>>> [str(h) for h in gfa.headers]
['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
>>> str(gfa)
'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2'

Count the input header lines

Due to the different way header lines are stored, the number of header elements is not equal to the number of header lines in the input. This is annoying if an application wants to count the number of input lines in a file. In order to make that possible, the number of input header lines are counted and can be retrieved using the n_input_header_lines property of the Gfa instance.

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("H\txx:i:1\tyy:Z:ABC") 
>>> gfa.add_line("H\txy:i:2") 
>>> gfa.add_line("H\tyz:i:3\tab:A:A") 
>>> len(gfa.headers)
5
>>> gfa.n_input_header_lines
3

Custom records

The GFA2 specification considers each line which starts with a non-standard record type a custom (i.e. user- or program-specific) record. Gfapy allows to retrieve these records and access their data using a similar interface to that for the predefined record types.

Retrieving, adding and deleting custom records

Gfa instances have the property custom_records(), a list of all line instances with a non-standard record type. Among these, records of a specific record type are retrieved using the method Gfa.custom_records_of_type(record_type). Lines are added and deleted using the same methods (add_line() and disconnect()) as for other line types.

>>> g.add_line("X\tcustom line") 
>>> g.add_line("Y\tcustom line") 
>>> [str(line) for line in g.custom_records] 
['X\tcustom line', 'Y\tcustom line']
>>> g.custom_record_keys) 
['X', 'Y']
>>> [str(line) for line in g.custom_records_of_type('X')]
['X\tcustom line']
>>> g.custom_records_of_type("X")[-1].disconnect()
>>> g.custom_records_of_type('X')
[]

Interface without extensions

If no extension (see Extensions section) has been defined to handle a custom record type, the interface has some limitations: the field content is not validated, and the field names are unknown. The generic custom record class is employed (CustomRecord).

As the name of the positional fields in a custom record is not known, a generic name field1, field2, … is used. The number of positional fields is found by getting the length of the positional_fieldnames list.

>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") 
>>> x = g.custom_records_of_type('X')[-1]
>>> len(x.positional_fieldnames)
2
>>> x.field1
'a'
>>> x.field2
'b'

Positional fields are allowed to contain any character (including non-printable characters and spacing characters), except tabs and newlines (as they are structural elements of the line). No further validation is performed.

As Gfapy cannot know how many positional fields are present when parsing custom records, a heuristic approach is followed, to identify tags. A field resembles a tag if it starts with tn:d: where tn is a valid tag name and d a valid tag datatype (see Tags chapter). The fields are parsed from the last to the first.

As soon as a field is found which does not resemble a tag, all remaining fields are considered positionals (even if another field parsed later resembles a tag). Due to this, invalid tags are sometimes wrongly taken as positional fields (this can be avoided by writing an extension).

>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100") 
>>> x1 = g.custom_records_of_type("X")[-1]
>>> x1.cc
10
>>> x1.dd
100
>>> g.add_line("X\ta\tb\tcc:i:10\tdd:i:100\te") 
>>> x2 = g.custom_records_of_type("X")[-1]
>>> x2.cc
>>> x2.field3
'cc:i:10'
>>> g.add_line("Z\ta\tb\tcc:i:10\tddd:i:100") 
>>> x3 = g.custom_records_of_type("Z")[-1]
>>> x3.cc
>>> x3.field3
'cc:i:10'
>>> x3.field4
'ddd:i:100'

Extensions

The support for custom fields is limited, as Gfapy does not know which and how many fields are there and how shall they be validated. It is possible to create an extension of Gfapy, which defines new record types: this will allow to use these record types in a similar way to the built-in types.

As an example, an extension will be described, which defines two record types: T for taxa and M for assignments of segments to taxa. For further information about the possible usage case for this extension, see the Supplemental Information to the manuscript describing Gfapy.

The T records will contain a single positional field, tid, a GFA2 identifier, and an optional UL string tag. The M records will contain three positional fields (all three GFA2 identifier): a name field mid (optional), and two references, tid to a T line and sid to an S line. The SC integer tag will be also defined. Here is an example of a GFA containing M and T lines:

S sA 1000 *
S sB 1000 *
M assignment1 t123 sA SC:i:40
M assignment2 t123 sB
M * B12c sB SC:i:20
T B12c
T t123 UL:Z:http://www.taxon123.com

Writing subclasses of the Line class, it is possible to communicate to Gfapy, how records of the M and T class shall be handled. This only requires to define some constants and to call the class method register_extension().

The constants to define are RECORD TYPE, which shall be the content of the record type field (e.g. M); POSFIELDS shall contain an ordered dict, specifying the datatype for each positional field, in the order these fields are found in the line; TAGS_DATATYPE is a dict, specifying the datatype of the predefined optional tags; NAME_FIELD is a field name, and specifies which field contains the identifier of the line. For details on predefined and custom datatypes, see the next sections (Predefined datatypes for extensions and Custom datatypes for extensions).

To handle references, register_extension() can be supplied with a references parameter, a list of triples (fieldname, classname, backreferences). Thereby fieldname is the name of the field in the corresponding record containing the reference (e.g. sid), classname is the name of the class to which the reference goes (e.g. gfa.line.segment.GFA2), and texttt{backreferences} is how the collection of backreferences shall be called, in the records to which reference points to (e.g. metagenomic_assignments).

from collections include OrderedDict

class Taxon(gfapy.Line):
  RECORD_TYPE = "T"
  POSFIELDS = OrderedDict([("tid","identifier_gfa2")])
  TAGS_DATATYPE = {"UL":"Z"}
  NAME_FIELD = "tid"

Taxon.register_extension()

class MetagenomicAssignment(gfapy.Line):
  RECORD_TYPE = "M"
  POSFIELDS = OrderedDict([("mid","optional_identifier_gfa2"),
                           ("tid","identifier_gfa2"),
                           ("sid","identifier_gfa2")])
  TAGS_DATATYPE = {"SC":"i"}
  NAME_FIELD = "mid"

MetagenomicAssignment.register_extension(references=
    [("sid", gfapy.line.segment.GFA2, "metagenomic_assignments"),
     ("tid", Taxon, "metagenomic_assignments")])

Predefined datatypes for extensions

The datatype of fields is specified in Gfapy using classes, which provide functions for decoding, encoding and validating the corresponding data. Gfapy contains a number of datatypes which correspond to the description of the field content in the GFA1 and GFA2 specification.

When writing extensions only the GFA2 field datatypes are generally used (as GFA1 does not contain custom fields). They are summarized in the following table:

Name Example Description
alignment_gfa2 12M1I3M CIGAR string, Trace alignment or Placeholder (*)
identifier_gfa2 S1 ID of a line
oriented_identifier_gfa2 S1+ ID of a line followed by + or -
optional_identifier_gfa2 * ID of a line or Placeholder (*)
identifier_list_gfa2 S1 S2 space separated list of line IDs
oriented_identifier_list_gfa2 S1+ S2- space separated list of line IDs plus orientations
position_gfa2 120$ non-negative integer, optionally followed by $
sequence_gfa2 ACGNNYR sequence of printable chars., no whitespace
string a b_c;d string, no tabs and newlines (Z tags)
char A single character (A tags)
float 1.12 float (f tags)
integer -12 integer (i tags)
optional_integer * integer or placeholder
numeric_array c,10,3 array of integers or floats (B tags)
byte_array 12F1FF hexadecimal byte string (H tags)
json {’b’:2} JSON string, no tabs and newlines (J tags)

Custom datatypes for extensions

For custom records, one sometimes needs datatypes not yet available in the GFA specification. For example, a custom datatype can be defined for the taxon identifier used in the tid field of the T and M records: accordingly the taxon identifier shall be only either in the form taxon:<n>, where <n> is a positive integer, or consist of letters, numbers and underscores only (without :).

To define the datatype, a class is written, which contains the following functions:

  • validate_encoded(string): validates the content of the field, if this is a string (e.g., the name of the T line)
  • validate_decoded(object): validates the content of the field, if this is not a string (e.g., a reference to a T line)
  • decode(string): validates the content of the field (a string) and returns the decoded content; note that references must not be resolved (there is no access to the Gfa instance here), thus the name of the T line will be returned unchanged
  • encode(string): validates the content of the field (not in string form) and returns the string which codes it in the GFA file (also here references are validated but not converted into strings)

Finally the datatype is registered calling register_datatype(). The code for the taxon ID extension is the following:

import re

class TaxonID:

  def validate_encoded(string):
    if not re.match(r"^taxon:(\d+)$",string) and \
        not re.match(r"^[a-zA-Z0-9_]+$", string):
      raise gfapy.ValueError("Invalid taxon ID: {}".format(string))

  def decode(string):
    TaxonID.validate_encoded(string)
    return string

  def validate_decoded(obj):
    if isinstance(obj,Taxon):
      TaxonID.validate_encoded(obj.name)
    else:
      raise gfapy.TypeError(
        "Invalid type for taxon ID: "+"{}".format(repr(obj)))

  def encode(obj):
    TaxonID.validate_decoded(obj)
    return obj

gfapy.Field.register_datatype("taxon_id", TaxonID)

To use the new datatype in the T and M lines defined above (Extensions), the definition of the two subclasses can be changed: in POSFIELDS the value taxon_id shall be assigned to the key tid.

Comments

GFA lines starting with a # symbol are considered comments. In Gfapy comments are represented by instances of the class gfapy.line.Comment. They have a similar interface to other line instances, with some differences, e.g. they do not support tags.

The comments collection

The comments of a Gfa object are accessed using the Gfa.comments property. This is a list of comment line instances. The single elements can be modified, but the list itself is read-only. To remove a comment from the Gfa, you need to find the instance in the list, and call disconnect() on it. To add a comment to a Gfa instance is done similarly to other lines, by using the Gfa.add_line(line) method.

>>> g.add_line("# this is a comment") 
>>> [str(c) for c in g.comments]
['# this is a comment']
>>> g.comments[0].disconnect()
>>> g.comments
[]

Accessing the comment content

The content of the comment line, excluding the initial # and eventual initial spacing characters, is included in the content field. The initial spacing characters can be read/changed using the spacer field. The default value is a single space.

>>> g.add_line("# this is a comment") 
>>> c = g.comments[-1]
>>> c.content
'this is a comment'
>>> c.spacer
' '
>>> c.spacer = '___'
>>> str(c)
'#___this is a comment'

Tags are not supported by comment lines. If the line contains tags, these are nor parsed, but included in the content field. Trying to set tags raises exceptions.

>>> c = gfapy.Line("# this is not a tag\txx:i:1")
>>> c.content
'this is not a tag\txx:i:1'
>>> c.xx
>>> c.xx = 1
Traceback (most recent call last):
...
gfapy.error.RuntimeError: Tags of comment lines cannot be set

Errors

The different types of errors defined in Gfapy are summarized in the following table. All exception raised in the library are subclasses of Error. Thus, except gfapy.Error can be use to catch all library errors.

Error Description Examples
VersionError An unknown or wrong version is specified or implied “GFA0”; or GFA1 in GFA2 context
ValueError The value of an object is invalid a negative position is used
TypeError The wrong type has been used or specified Z instead of i used for VN tag; Hash for an i tag
FormatError The format of an object is wrong a line does not contain the expected number of fields
NotUniqueError Something should be unique but is not duplicated tag name or line identifier
InconsistencyError Pieces of information collide with each other length of sequence and LN tag do not match
RuntimeError The user tried to do something which is not allowed editing from_segment field in connected links
ArgumentError Problem with the arguments of a method wrong number of arguments in dynamically created method
AssertionError Something unexpected happened there is a bug in the library or the library has been used in an unintended way

Graph operations

Graph operations such as linear paths merging, multiplication of segments and other are provided. These operations are implemented in analogy to those provided by the Ruby library RGFA. As RGFA only handles GFA1 graphs, only dovetail overlaps are considered as connections. A detailed description of the operation can be found in Gonnella and Kurtz (2016). More information about the single operations are found in the method documentation of the submodules of GraphOperations.

rGFA

rGFA (https://github.com/lh3/gfatools/blob/master/doc/rGFA.md) is a subset of GFA1, in which only particular line types (S and L) are allowed, and the S lines are required to contain the tags SN (of type Z), SO and SR (of type i).

When working with rGFA files, it is convenient to use the dialect="rgfa" option in the constructor Gfa() and in func:Gfa.from_file().

This ensures that additional validations are performed: GFA version must be 1, only rGFA-compatible lines (S,L) are allowed and that the required tags are required (with the correct datatype). The validations can also be executed manually using Gfa.validate_rgfa().

Furthermore, the stable_sequence_names attribute of the GFA objects returns the set of stable sequence names contained in the SN tags of the segments.

>>> g = gfapy.Gfa("S\tS1\tCTGAA\tSN:Z:chr1\tSO:i:0\tSR:i:0", dialect="rgfa")
>>> g.segment_names
['S1']
>>> g.stable_sequence_names
['chr1']
>>> g.add_line("S\tS2\tACG\tSN:Z:chr1\tSO:i:5\tSR:i:0")

Indices and tables