References

Some fields in GFA lines contain identifiers or lists of identifiers (sometimes followed by orientation strings), which reference other lines of the GFA file. In Gfapy it is possible to follow these references and traverse the graph.

Connecting a line to a Gfa object

In stand-alone line instances, the identifiers which reference other lines are either strings containing the line name, pairs of strings (name and orientation) in a gfapy.OrientedLine object, or lists of lines names or gfapy.OrientedLine objects.

Using the add_line(line) (alias: append(line)) method of the gfapy.Gfa object, or the equivalent connect(gfa) method of the gfapy.Line instance, a line is added to a Gfa instance (this is done automatically when a GFA file is parsed). All strings expressing references are then changed into references to the corresponding line objects. The method is_connected() allows to determine if a line is connected to a gfapy instance. The read-only property gfa contains the gfapy.Gfa instance to which the line is connected.

>>> gfa = gfapy.Gfa(version='gfa1')
>>> link = gfapy.Line("L\tA\t-\tB\t+\t20M")
>>> link.is_connected()
False
>>> link.gfa is None
True
>>> type(link.from_segment)
<class 'str'>
>>> gfa.append(link)
>>> link.is_connected()
True
>>> link.gfa 
<gfapy.gfa.Gfa object at ...>
>>> type(link.from_segment)
<class 'gfapy.line.segment.gfa1.GFA1'>

References for each record type

The following tables describes the references contained in each record type. The notation [] represent lists.

GFA1

Record type Fields Type of reference
Link from_segment, to_segment | Segment
Containment from_segment, to_segment | Segment
Path segment_names, [OrientedLine(Segment)]
  links (1) [OrientedLine(Link)]

(1): paths contain information in the fields segment_names and overlaps, which allow to find the identify from which they depend; these links can be retrieved using links (which is not a field).

GFA2

Record type Fields Type of reference
Edge sid1, sid2 Segment
Gap sid1, sid2 Segment
Fragment sid Segment
Set items [Edge/Set/Path/Segment]
Path items [OrientedLine(Edge/Set/Segment)]

Backreferences for each record type

When a line containing a reference to another line is connected to a Gfa object, backreferences to it are created in the targeted line.

For each backreference collection a read-only property exist, which is named as the collection (e.g. dovetails_L for segments). Note that the reference list returned by these arrays are read-only and editing the references is done using other methods (see the section “Editing reference fields” below).

segment.dovetails_L # => [gfapy.line.edge.Link(...), ...]

The following tables describe the backreferences collections for each record type.

GFA1

Record type Backreferences
Segment dovetails_L
  dovetails_R
  edges_to_contained
  edges_to_containers
  paths
Link paths

GFA2

Record type Backreferences Type
Segment dovetails_L E
  dovetails_R E
  edges_to_contained E
  edges_to_containers E
  internals E
  gaps_L G
  gaps_R G
  fragments F
  paths O
  sets U
Edge paths O
  sets U
O Group paths O
  sets U
U Group sets U

Segment backreference convenience methods

For segments, additional methods are available which combine in different way the backreferences information. The dovetails_of_end and gaps_of_end methods take an argument L or R and return the dovetails overlaps (or gaps) of the left or, respectively, right end of the segment sequence (equivalent to the segment properties dovetails_L/dovetails_R and gaps_L/gaps_R).

The segment containments property is a list of both containments where the segment is the container or the contained segment. The segment edges property is a list of all edges (dovetails, containments and internals) with a reference to the segment.

Other methods directly compute list of segments from the edges lists mentioned above. The neighbours_L, neighbours_R properties and the neighbours method compute the set of segment instances which are connected by dovetails to the segment. The containers and contained properties similarly compute the set of segment instances which, respectively, contains the segment, or are contained in the segment.

>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tA\t*')
>>> s = gfa.segment('A')
>>> gfa.append('S\tB\t*')
>>> gfa.append('S\tC\t*')
>>> gfa.append('L\tA\t-\tB\t+\t*')
>>> gfa.append('C\tA\t+\tC\t+\t10\t*')
>>> [str(l) for l in s.dovetails_of_end("L")]
['L\tA\t-\tB\t+\t*']
>>> s.dovetails_L == s.dovetails_of_end("L")
True
>>> s.gaps_of_end("R")
[]
>>> [str(e) for e in s.edges]
['L\tA\t-\tB\t+\t*', 'C\tA\t+\tC\t+\t10\t*']
>>> [str(n) for n in s.neighbours_L]
['S\tB\t*']
>>> s.containers
[]
>>> [str(c) for c in s.contained]
['S\tC\t*']

Multiline group definitions

The GFA2 specification opens the possibility (experimental) to define groups on multiple lines, by using the same ID for each line defining the group. This is supported by gfapy.

This means that if multiple Ordered or Unordered instances connected to a Gfa object have the same gid, they are merged into a single instance (technically the last one getting added to the graph object). The items list are merged.

The tags of multiple line defining a group shall not contradict each other (i.e. either are the tag names on different lines defining the group all different, or, if the same tag is present on different lines, the value and datatype must be the same, in which case the multiple definition will be ignored).

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3']
>>> gfa.add_line('U\tu1\t4 5')
>>> [s.name for s in gfa.sets[-1].items]
['s1', 's2', 's3', '4', '5']

Induced set and captured path

The item list in GFA2 sets and paths may not contain elements which are implicitly involved. For example a path may contain segments, without specifying the edges connecting them, if there is only one such edge. Alternatively a path may contain edges, without explicitly indicating the segments. Similarly a set may contain edges, but not the segments referred to in them, or contain segments which are connected by edges, without the edges themselves. Furthermore groups may refer to other groups (set to sets or paths, paths to paths only), which then indirectly contain references to segments and edges.

Gfapy provides methods for the computation of the sets of segments and edges which are implied by an ordered or unordered group. Thereby all references to subgroups are resolved and implicit elements are added, as described in the specification. The computation can, therefore, only be applied to connected lines. For unordered groups, this computation is provided by the method induced_set(), which returns an array of segment and edge instances. For ordered group, the computation is provided by the method captured_path(), which returns a list of gfapy.OrientedLine instances, alternating segment and edge instances (and starting and ending in segments).

The methods induced_segments_set(), induced_edges_set(), captured_segments() and captured_edges() return, respectively, the list of only segments or edges, in ordered or unordered groups.

>>> gfa = gfapy.Gfa()
>>> gfa.add_line("S\ts1\t100\t*")
>>> gfa.add_line("S\ts2\t100\t*")
>>> gfa.add_line("S\ts3\t100\t*")
>>> gfa.add_line("E\te1\ts1+\ts2-\t0\t10\t90\t100$\t*")
>>> gfa.add_line("U\tu1\ts1 s2 s3")
>>> u = gfa.sets[-1]
>>> [l.name for l in u.induced_edges_set]
['e1']
>>> [l.name for l in u.induced_segments_set ]
['s1', 's2', 's3']
>>> [l.name for l in u.induced_set ]
['s1', 's2', 's3', 'e1']

Disconnecting a line from a Gfa object

Lines can be disconnected using the rm(line) method of the gfapy.Gfa object or the disconnect() method of the line instance.

>>> gfa = gfapy.Gfa()
>>> gfa.append('S\tsA\t*')
>>> gfa.append('S\tsB\t*')
>>> line = gfa.segment("sA")
>>> gfa.segment_names
['sA', 'sB']
>>> gfa.rm(line)
>>> gfa.segment_names
['sB']
>>> line = gfa.segment('sB')
>>> line.disconnect()
>>> gfa.segment_names
[]

Disconnecting a line affects other lines as well. Lines which are dependent on the disconnected line are disconnected as well. Any other reference to disconnected lines is removed as well. In the disconnected line, references to lines are transformed back to strings and backreferences are deleted.

The following tables show which dependent lines are disconnected if they refer to a line which is being disconnected.

GFA1

Record type Dependent lines
Segment links (+ paths), containments
Link paths

GFA2

Record type Dependent lines
Segment edges, gaps, fragments, sets, paths
Edge sets, paths
Sets sets, paths

Editing reference fields

In connected line instances, it is not allowed to directly change the content of fields containing references to other lines, as this would make the state of the Gfa object invalid.

Besides the fields containing references, some other fields are read-only in connected lines. Changing some of the fields would require moving the backreferences to other collections (position fields of edges and gaps, from_orient and to_orient of links). The overlaps field of connected links is readonly as it may be necessary to identify the link in paths.

Renaming an element

The name field of a line (e.g. segment name/sid) is not a reference and thus can be edited also in connected lines. When the name of the line is changed, no manual editing of references (e.g. from_segment/to_segment fields in links) is necessary, as all lines which refer to the line will still refer to the same instance. The references to the instance in the Gfa lines collections will be automatically updated. Also, the new name will be correctly used when converting to string, such as when the Gfa instance is written to a GFA file.

Renaming a line to a name which already exists has the same effect of adding a line with that name. That is, in most cases, gfapy.NotUniqueError is raised. An exception are GFA2 sets and paths: in this case the line will be appended to the existing line with the same name (as described in “Multiline group definitions”).

Adding and removing group elements

Elements of GFA2 groups can be added and removed from both connected and non-connected lines, using the following methods.

To add an item to or remove an item from an unordered group, use the methods add_item(item) and rm_item(item), which take as argument either a string (identifier) or a line instance.

To append or prepend an item to an ordered group, use the methods append_item(item) and prepend_item(item). To remove the first or the last item of an ordered group use the methods rm_first_item() and rm_last_item().

Editing read-only fields of connected lines

Editing the read-only information of edges, gaps, links, containments, fragments and paths is more complicated. These lines shall be disconnected before the edit and connected again to the Gfa object after it. Before disconnecting a line, you should check if there are other lines dependent on it (see tables above). If so, you will have to disconnect these lines first, eventually update their fields and reconnect them at the end of the operation.

Virtual lines

The order of the lines in GFA is not prescribed. Therefore, during parsing, or constructing a Gfa in memory, it is possible that a line is referenced to, before it is added to the Gfa instance. Whenever this happens, Gfapy creates a “virtual” line instance.

Users do not have to handle with virtual lines, if they work with complete and valid GFA files.

Virtual lines are similar to normal line instances, with some limitations (they contain only limited information and it is not allowed to add tags to them). To check if a line is a virtual line, one can use the virtual property of the line.

As soon as the parser founds the real line corresponding to a previously introduced virtual line, the virtual line is exchanged with the real line and all references are corrected to point to the real line.

>>> g = gfapy.Gfa()
>>> g.add_line("S\t1\t*")
>>> g.add_line("L\t1\t+\t2\t+\t*")
>>> l = g.dovetails[0]
>>> g.segment("1").virtual
False
>>> g.segment("2").virtual
True
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True
>>> g.add_line("S\t2\t*")
>>> g.segment("2").virtual
False
>>> l.to_segment == g.segment("2")
True
>>> g.segment("2").dovetails == [l]
True