Some fields in GFA lines contain identifiers or lists of identifiers (sometimes followed by orientation strings), which reference other lines of the GFA file. In Gfapy it is possible to follow these references and traverse the graph.
Connecting a line to a Gfa object
In stand-alone line instances, the identifiers which reference other lines are either strings containing the line name, pairs of strings (name and orientation) in a gfapy.OrientedLine object, or lists of line names or gfapy.OrientedLine objects.
Using the add_line(line) (alias: append(line)) method of the gfapy.Gfa object, or the equivalent connect(gfa) method of the gfapy.Line instance, a line is added to a Gfa instance (this is done automatically when a GFA file is parsed). All strings expressing references are then changed into references to the corresponding line objects. The method is_connected() determines whether a line is connected to a gfapy.Gfa instance. The read-only property gfa contains the gfapy.Gfa instance to which the line is connected.
>>> import gfapy
>>> gfa = gfapy.Gfa(version='gfa1')
>>> link = gfapy.Line("L\tA\t-\tB\t+\t20M")
>>> link.is_connected()
False
>>> link.gfa is None
True
>>> type(link.from_segment)
<class 'str'>
>>> gfa.append(link)
>>> link.is_connected()
True
>>> link.gfa
<gfapy.gfa.Gfa object at ...>
>>> type(link.from_segment)
<class 'gfapy.line.segment.gfa1.GFA1'>
References for each record type
The following tables describe the references contained in each record type. The notation [] represents lists.
|Record type||Fields||Type of reference|
(1): paths contain information in the fields segment_names and overlaps, which makes it possible to identify the links on which they depend; these links can be retrieved using links (which is not a field).
|Record type||Fields||Type of reference|
Backreferences for each record type
When a line containing a reference to another line is connected to a Gfa object, backreferences to it are created in the targeted line.
For each backreference collection, a read-only property exists, which is named after the collection (e.g. dovetails_L for segments). Note that the lists returned by these properties are read-only; the references are edited using other methods (see the section “Editing reference fields” below).
segment.dovetails_L # => [gfapy.line.edge.Link(...), ...]
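The relationship between references and backreferences can be sketched with a small toy model. This is not Gfapy's implementation: the classes Segment, Link and Graph below are hypothetical stand-ins, and only the attribute names dovetails_L and dovetails_R are borrowed from the text above. The sketch assumes the usual GFA1 convention that a link leaves a forward (+) segment at its right end and a reversed (-) segment at its left end.

```python
# Toy model only: these classes are hypothetical and are not Gfapy's code.

class Segment:
    def __init__(self, name):
        self.name = name
        self.dovetails_L = []  # links touching the left end
        self.dovetails_R = []  # links touching the right end


class Link:
    def __init__(self, from_segment, from_orient, to_segment, to_orient):
        # In a stand-alone link the segment fields hold plain strings.
        self.from_segment = from_segment
        self.from_orient = from_orient
        self.to_segment = to_segment
        self.to_orient = to_orient


class Graph:
    def __init__(self):
        self.segments = {}
        self.links = []

    def add_segment(self, segment):
        self.segments[segment.name] = segment

    def add_link(self, link):
        # Resolve the string references into segment objects.
        link.from_segment = self.segments[link.from_segment]
        link.to_segment = self.segments[link.to_segment]
        # A link leaves the right end of a forward (+) segment and the
        # left end of a reversed (-) one; it enters at the opposite end.
        from_end = "R" if link.from_orient == "+" else "L"
        to_end = "L" if link.to_orient == "+" else "R"
        # Register the backreferences in the targeted segments.
        getattr(link.from_segment, "dovetails_" + from_end).append(link)
        getattr(link.to_segment, "dovetails_" + to_end).append(link)
        self.links.append(link)


graph = Graph()
graph.add_segment(Segment("A"))
graph.add_segment(Segment("B"))
link = Link("A", "-", "B", "+")
graph.add_link(link)

seg_a = graph.segments["A"]
seg_b = graph.segments["B"]
print(link.from_segment is seg_a)
print(seg_a.dovetails_L == [link], seg_b.dovetails_L == [link])
```

After the link is added, its from_segment field holds the segment object rather than the string "A", and both segments list the link among the dovetails of their left end.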
The following tables describe the backreference collections for each record type.
Segment backreference convenience methods
For segments, additional methods are available which combine the backreference information in different ways. The dovetails_of_end and gaps_of_end methods take an argument L or R and return the dovetail overlaps (or gaps) of the left or, respectively, right end of the segment sequence (equivalent to the segment properties dovetails_L/dovetails_R and gaps_L/gaps_R).
The segment containments property is a list of all containments in which the segment is either the container or the contained segment. The segment edges property is a list of all edges (dovetails, containments and internals) with a reference to the segment.
Other methods directly compute lists of segments from the edge lists mentioned above. The neighbours_L and neighbours_R properties and the neighbours method compute the set of segment instances which are connected by dovetails to the segment. The containers and contained properties similarly compute the set of segment instances which, respectively, contain the segment or are contained in the segment.
>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tA\t*')
>>> s = gfa.segment('A')
>>> gfa.append('S\tB\t*')
>>> gfa.append('S\tC\t*')
>>> gfa.append('L\tA\t-\tB\t+\t*')
>>> gfa.append('C\tA\t+\tC\t+\t10\t*')
>>> [str(l) for l in s.dovetails_of_end("L")]
['L\tA\t-\tB\t+\t*']
>>> s.dovetails_L == s.dovetails_of_end("L")
True
>>> s.gaps_of_end("R")
[]
>>> [str(e) for e in s.edges]
['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*']
>>> [str(n) for n in s.neighbours_L]
['S\tB\t*']
>>> s.containers
[]
>>> [str(c) for c in s.contained]
['S\tC\t*']
Multiline group definitions
The GFA2 specification opens the possibility (experimental) to define groups on multiple lines, by using the same ID for each line defining the group. This is supported by gfapy.
This means that if multiple Ordered or Unordered instances connected to a Gfa object have the same gid, they are merged into a single instance (technically the last one added to the graph object). The item lists are merged.
The tags of multiple lines defining a group must not contradict each other: either the tag names on the different lines are all different or, if the same tag is present on different lines, its value and datatype must be the same, in which case the repeated definition is ignored.
>>> gfa = gfapy.Gfa()
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3']
>>> gfa.add_line('U\tu1\t4 5')
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3', '4', '5']
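The merging rule for items and tags can be illustrated with a minimal sketch on plain dictionaries. The function merge_group_lines and the dictionary layout are assumptions made for this illustration; they are not part of Gfapy, which performs the merge internally when the lines are connected. Tags are represented here as (datatype, value) pairs.

```python
# Hypothetical helper, not part of Gfapy: merges two definitions of one group.

def merge_group_lines(first, second):
    """Merge two dicts of the form {"gid", "items", "tags"} with equal gid."""
    if first["gid"] != second["gid"]:
        raise ValueError("group IDs differ")
    merged_tags = dict(first["tags"])
    for name, value in second["tags"].items():
        # A repeated tag is tolerated only if datatype and value agree.
        if name in merged_tags and merged_tags[name] != value:
            raise ValueError("contradictory values for tag " + name)
        merged_tags[name] = value
    return {
        "gid": first["gid"],
        "items": first["items"] + second["items"],
        "tags": merged_tags,
    }


line1 = {"gid": "u1", "items": ["s1", "s2", "s3"], "tags": {"xx": ("i", 1)}}
line2 = {"gid": "u1", "items": ["4", "5"],
         "tags": {"xx": ("i", 1), "yy": ("Z", "a")}}
merged = merge_group_lines(line1, line2)
print(merged["items"])

# A second definition with a different value for the same tag is rejected.
try:
    merge_group_lines(line1, {"gid": "u1", "items": [], "tags": {"xx": ("i", 2)}})
    conflict_detected = False
except ValueError:
    conflict_detected = True
print(conflict_detected)
```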
Induced set and captured path
The item list of GFA2 sets and paths may omit elements which are implicitly involved. For example, a path may contain segments without specifying the edges connecting them, if there is only one such edge. Alternatively, a path may contain edges without explicitly indicating the segments. Similarly, a set may contain edges but not the segments referred to in them, or contain segments which are connected by edges without the edges themselves. Furthermore, groups may refer to other groups (sets to sets or paths, paths to paths only), which then indirectly contain references to segments and edges.
Gfapy provides methods for computing the sets of segments and edges which are implied by an ordered or unordered group. All references to subgroups are resolved and implicit elements are added, as described in the specification. The computation can, therefore, only be applied to connected lines. For unordered groups, this computation is provided by the method induced_set(), which returns a list of segment and edge instances. For ordered groups, the computation is provided by the method captured_path(), which returns a list of gfapy.OrientedLine instances, alternating segment and edge instances (and starting and ending in segments). The methods induced_segments_set(), induced_edges_set(), captured_segments() and captured_edges() return only the segments or only the edges, for unordered (induced) and ordered (captured) groups respectively.
>>> gfa = gfapy.Gfa()
>>> gfa.add_line("S\ts1\t100\t*")
>>> gfa.add_line("S\ts2\t100\t*")
>>> gfa.add_line("S\ts3\t100\t*")
>>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*")
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> u = gfa.sets[-1]
>>> [l.name for l in u.induced_edges_set]
['e1']
>>> [l.name for l in u.induced_segments_set]
['s1', 's2', 's3']
>>> [l.name for l in u.induced_set]
['s1', 's2', 's3', 'e1']
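The idea behind the induced set of an unordered group can be sketched on plain data. The function below is a simplified illustration under stated assumptions, not Gfapy's algorithm: it ignores nested groups, identifies lines by name only, and takes the edges as a hypothetical dictionary mapping each edge name to the two segment names it connects.

```python
# Toy computation on plain dicts; the function name is hypothetical.

def induced_set(items, edges):
    """Return (segments, edges) implied by an unordered group.

    items: names of segments and/or edges listed in the group.
    edges: dict mapping each edge name to its two segment names.
    """
    segments = [name for name in items if name not in edges]
    chosen_edges = [name for name in items if name in edges]
    # Edges listed in the group imply the segments they connect.
    for edge in list(chosen_edges):
        for seg in edges[edge]:
            if seg not in segments:
                segments.append(seg)
    # Edges whose two segments are both present are implied as well.
    for edge, (seg1, seg2) in edges.items():
        if seg1 in segments and seg2 in segments and edge not in chosen_edges:
            chosen_edges.append(edge)
    return segments, chosen_edges


all_edges = {"e1": ("s1", "s2"), "e2": ("s3", "s4")}

# A group listing only segments: the edge between s1 and s2 is implied.
segs, induced_edges = induced_set(["s1", "s2", "s3"], all_edges)
print(segs, induced_edges)

# A group listing only an edge: its two segments are implied.
segs_from_edge, edges_from_edge = induced_set(["e2"], all_edges)
print(segs_from_edge, edges_from_edge)
```

In the first call e2 is not induced, because only one of its segments (s3) belongs to the group.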
Disconnecting a line from a Gfa object
Lines can be disconnected using the rm(line) method of the gfapy.Gfa object or the disconnect() method of the line instance.
>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tsA\t*')
>>> gfa.append('S\tsB\t*')
>>> line = gfa.segment("sA")
>>> gfa.segment_names
['sA', 'sB']
>>> gfa.rm(line)
>>> gfa.segment_names
['sB']
>>> line = gfa.segment('sB')
>>> line.disconnect()
>>> gfa.segment_names
[]
Disconnecting a line also affects other lines. Lines which depend on the disconnected line are disconnected too, and any other reference to the disconnected line is removed. In the disconnected line, references to lines are transformed back to strings and backreferences are deleted.
The following tables show which dependent lines are disconnected if they refer to a line which is being disconnected, first for GFA1 and then for GFA2.
|Record type||Dependent lines|
|Segment||links (+ paths), containments|
|Record type||Dependent lines|
|Segment||edges, gaps, fragments, sets, paths|
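The cascading effect of a disconnection can be sketched as a traversal of a dependency map. The function and the line names below are illustrative assumptions, not Gfapy code; the example data follows the GFA1 case, in which links and containments depend on segments and paths depend on links.

```python
# Toy dependency model; names and structure are illustrative assumptions.

def lines_to_disconnect(start, references):
    """Return every line to disconnect when `start` is disconnected.

    references: dict mapping a line name to the names of the lines it
    depends on (for example a link depends on its two segments).
    """
    removed = [start]
    queue = [start]
    while queue:
        current = queue.pop(0)
        for line, targets in references.items():
            # A line depending on a removed line is removed in turn.
            if current in targets and line not in removed:
                removed.append(line)
                queue.append(line)
    return removed


# GFA1-style example: a path depends on a link, which depends on segments.
refs = {
    "link_AB": ["A", "B"],
    "containment_AC": ["A", "C"],
    "path_P": ["link_AB"],
}
removed_lines = lines_to_disconnect("A", refs)
print(removed_lines)
```

Disconnecting segment A removes the link and the containment that refer to it, and then the path that depends on the link.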
Editing reference fields
In connected line instances, it is not allowed to directly change the content of fields containing references to other lines, as this would make the state of the Gfa object invalid.
Besides the fields containing references, some other fields are read-only in connected lines. Changing some of these fields would require moving the backreferences to other collections (position fields of edges and gaps, from_orient and to_orient of links). The overlaps field of connected links is read-only, as it may be necessary to identify the link in paths.
Renaming an element
The name field of a line (e.g. segment name/sid) is not a reference and thus can also be edited in connected lines. When the name of the line is changed, no manual editing of references (e.g. from/to fields in links) is necessary, as all lines which refer to the line will still refer to the same instance. The references to the instance in the Gfa line collections are updated automatically. The new name is also used when converting to string, such as when the Gfa instance is written to a GFA file.
Renaming a line to a name which already exists has the same effect as adding a line with that name. That is, in most cases, gfapy.NotUniqueError is raised. The exception is GFA2 sets and paths: in this case the line is appended to the existing line with the same name (as described in “Multiline group definitions”).
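Why renaming needs no manual update of references can be seen in a small toy registry. Everything here is a hypothetical sketch: the classes are not Gfapy's, and the local NotUniqueError merely stands in for gfapy.NotUniqueError. The point is that other lines hold the object itself, so only the name-keyed collection has to be updated.

```python
# Toy registry; class and exception names here are hypothetical.

class NotUniqueError(Exception):
    pass


class Segment:
    def __init__(self, name):
        self._name = name
        self.graph = None

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, new_name):
        if new_name == self._name:
            return
        if self.graph is not None:
            # Let the graph update its name-keyed collection first.
            self.graph.rename(self, new_name)
        self._name = new_name


class Graph:
    def __init__(self):
        self.segments = {}

    def add(self, segment):
        if segment.name in self.segments:
            raise NotUniqueError(segment.name)
        self.segments[segment.name] = segment
        segment.graph = self

    def rename(self, segment, new_name):
        # Renaming to an existing name fails like adding a duplicate.
        if new_name in self.segments:
            raise NotUniqueError(new_name)
        del self.segments[segment.name]
        self.segments[new_name] = segment


graph = Graph()
seg_a, seg_b = Segment("A"), Segment("B")
graph.add(seg_a)
graph.add(seg_b)
link = {"from_segment": seg_a, "to_segment": seg_b}  # holds objects, not names

seg_a.name = "X"
print(sorted(graph.segments), link["from_segment"].name)

try:
    seg_a.name = "B"  # name already taken
    clash = False
except NotUniqueError:
    clash = True
print(clash, seg_a.name)
```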
Adding and removing group elements
Elements of GFA2 groups can be added and removed from both connected and non-connected lines, using the following methods.
To add an item to or remove an item from an unordered group, use the methods add_item(item) and rm_item(item), which take as argument either a string (identifier) or a line instance.
To append or prepend an item to an ordered group, use the methods append_item(item) and prepend_item(item). To remove the first or the last item of an ordered group, use the methods rm_first_item() and rm_last_item().
Editing read-only fields of connected lines
Editing the read-only information of edges, gaps, links, containments, fragments and paths is more complicated. These lines must be disconnected before the edit and connected again to the Gfa object after it. Before disconnecting a line, you should check whether other lines depend on it (see tables above). If so, you will have to disconnect these lines first, update their fields if necessary, and reconnect them at the end of the operation.
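The disconnect, edit, reconnect workflow can be illustrated with a toy class whose reference field refuses edits while the line is connected. ToyLink is a hypothetical stand-in written for this sketch; the exception type and the way Gfapy enforces read-only fields are assumptions and may differ in the library.

```python
# Hypothetical class illustrating the rule; Gfapy's real classes differ.

class ToyLink:
    def __init__(self, from_segment, to_segment):
        self.from_segment = from_segment
        self._to_segment = to_segment
        self.graph = None  # None means the link is not connected

    @property
    def to_segment(self):
        return self._to_segment

    @to_segment.setter
    def to_segment(self, value):
        # Reference fields may change only while the line is stand-alone.
        if self.graph is not None:
            raise RuntimeError("to_segment is read-only in a connected line")
        self._to_segment = value

    def connect(self, graph):
        self.graph = graph

    def disconnect(self):
        self.graph = None


graph = object()  # stand-in for a graph object
toy = ToyLink("A", "B")
toy.connect(graph)

try:
    toy.to_segment = "C"  # direct edit of a connected line
    edit_rejected = False
except RuntimeError:
    edit_rejected = True

# Disconnect, edit, reconnect.
toy.disconnect()
toy.to_segment = "C"
toy.connect(graph)
print(edit_rejected, toy.to_segment)
```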
Virtual lines
The order of the lines in GFA is not prescribed. Therefore, during parsing, or while constructing a Gfa in memory, a line may be referenced before it is added to the Gfa instance. Whenever this happens, Gfapy creates a “virtual” line instance.
Users do not have to deal with virtual lines if they work with complete and valid GFA files.
Virtual lines are similar to normal line instances, with some limitations (they contain only limited information and it is not allowed to add tags to them). To check if a line is a virtual line, one can use the virtual property of the line.
As soon as the parser finds the real line corresponding to a previously introduced virtual line, the virtual line is exchanged with the real line and all references are corrected to point to the real line.
>>> g = gfapy.Gfa()
>>> g.add_line("S\t1\t*")
>>> g.add_line("L\t1\t+\t2\t+\t*")
>>> l = g.dovetails[0]
>>> g.segment("1").virtual
False
>>> g.segment("2").virtual
True
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True
>>> g.add_line("S\t2\t*")
>>> g.segment("2").virtual
False
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True
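The replacement of a virtual line by the real one can be sketched with a toy graph. The classes below are hypothetical and much simpler than Gfapy's: links are plain dictionaries, only segments can be virtual, and the left and right segment ends are not distinguished.

```python
# Toy model of placeholder lines; not Gfapy's actual implementation.

class Segment:
    def __init__(self, name, virtual=False):
        self.name = name
        self.virtual = virtual
        self.dovetails = []


class Graph:
    def __init__(self):
        self.segments = {}

    def _get_or_create(self, name):
        # An unknown name gets a virtual placeholder so it can be referenced.
        if name not in self.segments:
            self.segments[name] = Segment(name, virtual=True)
        return self.segments[name]

    def add_link(self, from_name, to_name):
        link = {"from_segment": self._get_or_create(from_name),
                "to_segment": self._get_or_create(to_name)}
        link["from_segment"].dovetails.append(link)
        link["to_segment"].dovetails.append(link)
        return link

    def add_segment(self, name):
        real = Segment(name)
        old = self.segments.get(name)
        if old is not None and not old.virtual:
            raise ValueError("duplicate segment " + name)
        if old is not None:
            # Move the backreferences and repoint every reference.
            real.dovetails = old.dovetails
            for link in real.dovetails:
                for field in ("from_segment", "to_segment"):
                    if link[field] is old:
                        link[field] = real
        self.segments[name] = real
        return real


graph = Graph()
graph.add_segment("1")
link = graph.add_link("1", "2")  # segment "2" does not exist yet
was_virtual = graph.segments["2"].virtual
graph.add_segment("2")           # the real line arrives later
print(was_virtual, graph.segments["2"].virtual)
print(link["to_segment"] is graph.segments["2"])
```

After the real segment is added, the link points to it and the segment has inherited the backreference that was stored in the placeholder.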