The Gfa class

The content of a GFA file is represented in Gfapy by an instance of the class Gfa. In most cases, the Gfa instance will be constructed from the data contained in a GFA file, using the method Gfa.from_file().

Alternatively, it is possible to use the constructor of the class; it takes an optional positional parameter, the content of a GFA file (as a string, or as a list of strings, one per line of the GFA file). If no GFA content is provided, the Gfa instance will be empty.

>>> gfa = gfapy.Gfa("H\tVN:Z:1.0\nS\tA\t*")
>>> print(len(gfa.lines))
2
>>> gfa = gfapy.Gfa(["H\tVN:Z:1.0", "S\tA\t*", "S\tB\t*"])
>>> print(len(gfa.lines))
3
>>> gfa = gfapy.Gfa()
>>> print(len(gfa.lines))
0

The string representation of the Gfa object (which can be obtained using str()) is the textual representation in GFA format. Using Gfa.to_file(filename) allows writing this representation to a GFA file (the content of the file is overwritten).

>>> g1 = gfapy.Gfa()
>>> g1.append("H\tVN:Z:1.0")
>>> g1.append("S\ta\t*")
>>> g1.to_file("my.gfa") 
>>> g2 = gfapy.Gfa.from_file("my.gfa") 
>>> str(g1)
'H\tVN:Z:1.0\nS\ta\t*'

All methods for creating a Gfa (the constructor and from_file) accept a vlevel parameter, the validation level, which can assume the values 0, 1, 2 and 3. A higher value means more validations are performed. The Validation chapter explains the meaning of the different validation levels in detail. The default value is 1.

>>> gfapy.Gfa().vlevel
1
>>> gfapy.Gfa(vlevel = 0).vlevel
0

A further parameter is version. It can be set to 'gfa1' or 'gfa2', or left at the default value (None), in which case the version of the GFA is auto-detected from the line content. If the version is set manually, any content incompatible with the specified version will trigger an exception. If the version is auto-detected, an exception will be raised if two lines are found whose content is incompatible with each other (e.g. a GFA1 segment followed by a GFA2 segment).

>>> g = gfapy.Gfa(version='gfa2')
>>> g.version
'gfa2'
>>> g.add_line("S\t1\t*")
Traceback (most recent call last):
...
gfapy.error.VersionError: Version: 1.0 (None)
...
>>> g = gfapy.Gfa()
>>> g.version
>>> g.add_line("S\t1\t*")
>>> g.version
'gfa1'
>>> g.add_line("S\t1\t100\t*")
Traceback (most recent call last):
...
gfapy.error.VersionError: Version: 1.0 (None)
...
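
The auto-detection rule described above can be pictured with a small standalone sketch. It does not use Gfapy and is not how Gfapy implements the detection: the helper names guess_segment_version and detect_version are hypothetical, and the sketch only looks at S lines, telling the two versions apart by the number of positional fields (name and sequence in GFA1; identifier, length and sequence in GFA2).

```python
import re

# Optional fields (tags) have the form NAME:TYPE:VALUE, e.g. "LN:i:100".
TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]:[AifZJHB]:")


def guess_segment_version(line):
    """Return 'gfa1' or 'gfa2' for a single S line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("only S lines are handled in this sketch")
    # Count the positional fields, i.e. everything before the first tag.
    n_positional = 0
    for field in fields[1:]:
        if TAG_RE.match(field):
            break
        n_positional += 1
    if n_positional == 2:
        return "gfa1"
    if n_positional == 3:
        return "gfa2"
    raise ValueError("unexpected number of fields: %d" % n_positional)


def detect_version(lines, version=None):
    """Check all lines against a fixed (or auto-detected) version."""
    for line in lines:
        found = guess_segment_version(line)
        if version is None:
            version = found  # auto-detection: the first line decides
        elif found != version:
            raise ValueError("%r is not a %s line" % (line, version))
    return version
```

Calling detect_version with version=None mirrors the auto-detection case, while passing 'gfa1' or 'gfa2' mirrors a manually set version; in both cases the first incompatible line raises an error.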

Collections of lines

The lines property of the Gfa object is a list of all the lines in the GFA file (including the header, which is split into single-tag lines). The list itself shall not be modified by the user directly (i.e. adding and removing lines is done using a different interface, see below). However, the individual elements of the list can be edited.

>>> for line in gfa.lines: print(line)

For most record types, a list of the lines of the record type is available as a read-only property, which is named after the record type, in plural.

>>> [str(line) for line in gfa1.segments]
['S\t1\t*', 'S\t2\t*', 'S\t3\t*']
>>> [str(line) for line in gfa2.fragments]
[]

Edges are a particular case: in GFA1 they are links and containments, while in GFA2 there is a unified edge record type, which can also represent internal alignments. In Gfapy, the edges property retrieves all edges (i.e. all E lines in GFA2, and all L and C lines in GFA1). The dovetails property is a list of all edges which represent dovetail overlaps (i.e. all L lines in GFA1 and a subset of the E lines in GFA2). The containments property is a list of all edges which represent containments (i.e. all C lines in GFA1 and a subset of the E lines in GFA2).

>>> gfa2.edges
[]
>>> gfa2.dovetails
[]
>>> gfa2.containments
[]

Paths are retrieved using the paths property. This list contains all P lines in GFA1 and all O lines in GFA2. The sets property returns the list of all U lines in GFA2 (an empty list in GFA1).

>>> gfa2.paths
[]
>>> gfa2.sets
[]

The header contains metadata in one or more lines. For ease of access to the header information, all its tags are summarized in a single line instance, which is retrieved using the header property. The Header chapter of this manual explains in more detail how to work with the header object.

>>> gfa2.header.TS
100

All lines which start with the string # are comments; they are described in the Comments chapter and are retrieved using the comments property.

>>> [str(line) for line in gfa1.comments]
['# this is a comment']

Custom lines are lines of GFA2 files which start with a non-standard record type. Gfapy provides basic built-in support for accessing the information in custom lines, and allows extensions to be defined for custom record types, which provide more advanced functionality (see the Custom records chapter).

>>> [str(line) for line in gfa2.custom_records]
['X\tcustom line', 'Y\tcustom line']
>>> gfa2.custom_record_keys
['X', 'Y']
>>> [str(line) for line in gfa2.custom_records_of_type('X')]
['X\tcustom line']

Line identifiers

Some GFA lines have a mandatory or optional identifier field: segments and paths in GFA1; segments, gaps, edges, paths and sets in GFA2. A line of this type can be retrieved using the method Gfa.line(ID), with the identifier as argument.

>>> str(gfa1.line('1'))
'S\t1\t*'

The GFA2 specification prescribes the exact namespace for the identifiers (segment, path, set, edge and gap identifiers share the same namespace). The content of this namespace can be retrieved using the names property. The identifiers of single line types can be retrieved using the properties segment_names, edge_names, gap_names, path_names and set_names.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t100\t*")
>>> g.add_line("S\tB\t100\t*")
>>> g.add_line("S\tC\t100\t*")
>>> g.add_line("E\tb_c\tB+\tC+\t0\t10\t90\t100$\t*")
>>> g.add_line("O\tp1\tB+ C+")
>>> g.add_line("U\ts1\tA b_c g")
>>> g.add_line("G\tg\tA+\tB-\t1000\t*")
>>> g.names
['A', 'B', 'C', 'b_c', 'g', 'p1', 's1']
>>> g.segment_names
['A', 'B', 'C']
>>> g.path_names
['p1']
>>> g.edge_names
['b_c']
>>> g.gap_names
['g']
>>> g.set_names
['s1']

The GFA1 specification does not handle the question of the namespace of identifiers explicitly. However, Gfapy assumes and enforces a single namespace for segment names, path names and the values of the ID tags of L and C lines. The content of this namespace can be found using the names property. The identifiers of single line types can be retrieved using the properties segment_names, edge_names (ID tags of links and containments) and path_names. For GFA1, the properties gap_names and set_names always contain empty lists.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*")
>>> g.add_line("S\tB\t*")
>>> g.add_line("S\tC\t*")
>>> g.add_line("L\tB\t+\tC\t+\t*\tID:Z:b_c")
>>> g.add_line("P\tp1\tB+,C+\t*")
>>> g.names
['A', 'B', 'C', 'b_c', 'p1']
>>> g.segment_names
['A', 'B', 'C']
>>> g.path_names
['p1']
>>> g.edge_names
['b_c']
>>> g.gap_names
[]
>>> g.set_names
[]
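
The idea of a single namespace shared by several line types can be illustrated with a toy registry. This is only a sketch under stated assumptions: the NameRegistry class, its methods and the "kind" labels are hypothetical and are not part of Gfapy, and the names are sorted here merely to give a stable result, which says nothing about the order Gfapy uses.

```python
class NameRegistry:
    """Toy registry with one namespace shared by all line types."""

    def __init__(self):
        self._kind_of = {}  # maps a name to its kind, e.g. "A" -> "segment"

    def register(self, kind, name):
        # A name may be used only once, whatever the kind of line.
        if name in self._kind_of:
            raise ValueError("name %r already used by a %s"
                             % (name, self._kind_of[name]))
        self._kind_of[name] = kind

    @property
    def names(self):
        # All names in the shared namespace.
        return sorted(self._kind_of)

    def names_of(self, kind):
        # Names of a single kind only (empty list if there are none).
        return sorted(n for n, k in self._kind_of.items() if k == kind)


registry = NameRegistry()
registry.register("segment", "A")
registry.register("segment", "B")
registry.register("edge", "b_c")
registry.register("path", "p1")
```

With this registry, an attempt to register a path named A would raise an error, because the name is already taken by a segment.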

Identifiers of external sequences

Fragments contain identifiers which refer to external sequences (not contained in the GFA file). According to the specification, these identifiers are not part of the same namespace as the identifiers of the GFA lines. They can be retrieved using the external_names property.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t100\t*")
>>> g.add_line("F\tA\tread1+\t10\t30\t0\t20$\t20M")
>>> g.external_names
['read1']

The method Gfa.fragments_for_external(external_ID) retrieves all F lines with a specified external sequence identifier.

>>> f = g.fragments_for_external('read1')
>>> len(f)
1
>>> str(f[0])
'F\tA\tread1+\t10\t30\t0\t20$\t20M'

Adding new lines

New lines can be added to a Gfa instance using the Gfa.add_line(line) method or its alias Gfa.append(line). The argument can be either a string describing a line with valid GFA syntax, or a Line instance. If a string is added, a line instance is created and then added.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.segment_names
['A']
>>> g.append("S\tB\t*") 
>>> g.segment_names
['A', 'B']

Editing the lines

Accessing the information stored in the fields of a line instance is described in the Positional fields and Tags chapters.

In Gfapy, a line instance belonging to a Gfa instance is said to be connected to the Gfa instance. Directly editing the content of a connected line is only possible for those fields which do not contain references to other lines. For more information on how to modify the content of the fields of a connected line, see the References chapter.

>>> g = gfapy.Gfa()
>>> e = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
>>> e.sid1 = "C+"
>>> g.add_line(e) 
>>> e.sid1 = "A+"
Traceback (most recent call last):
gfapy.error.RuntimeError: ...

Removing lines

Disconnecting a line from the Gfa instance is done using the Gfa.rm(line) method. The argument can be a line instance or the name of a line.

Alternatively, a line instance can be disconnected by calling its disconnect method. Disconnecting a line may trigger other operations, such as the disconnection of other lines (see the References chapter).

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.segment_names
['A']
>>> g.rm('A') 
>>> g.segment_names
[]
>>> g.append("S\tB\t*") 
>>> g.segment_names
['B']
>>> b = g.line('B')
>>> b.disconnect()
>>> g.segment_names
[]

Renaming lines

Lines with an identifier can be renamed. This is done simply by editing the corresponding field (such as name or sid for a segment). This field is not a reference to another line and can be freely edited, even in line instances connected to a Gfa. All references to the line from other lines will still be up to date, as they will refer to the same instance (whose name has been changed) and their string representation will use the new name.

>>> g = gfapy.Gfa()
>>> g.add_line("S\tA\t*") 
>>> g.add_line("L\tA\t+\tB\t-\t*") 
>>> g.segment_names
['A', 'B']
>>> g.dovetails[0].from_name
'A'
>>> g.segment('A').name = 'C'
>>> g.segment_names
['B', 'C']
>>> g.dovetails[0].from_name
'C'
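
The reason why references stay up to date after a renaming can be shown with two toy classes. They are hypothetical stand-ins written for this illustration, not Gfapy's line classes; the only point is that the link stores the segment object itself rather than a copy of its name.

```python
class Segment:
    def __init__(self, name):
        self.name = name


class Link:
    def __init__(self, from_segment, to_segment):
        # The link keeps the segment objects themselves, not their names.
        self.from_segment = from_segment
        self.to_segment = to_segment

    @property
    def from_name(self):
        # Looked up on access, so a renamed segment is seen immediately.
        return self.from_segment.name

    def __str__(self):
        return "L\t%s\t+\t%s\t-\t*" % (self.from_segment.name,
                                         self.to_segment.name)


a = Segment("A")
b = Segment("B")
link = Link(a, b)
a.name = "C"  # rename the segment in place
```

After the assignment, both link.from_name and the string representation of the link use the new name, since no copy of the old name was ever stored in the link.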