Positional fields¶
Most lines in GFA have positional fields (headers are an exception). During parsing, if a line has too few or too many positional fields, an exception is raised. The correct number of positional fields depends on the record type.
Positional fields are recognized by their position in the line. Each positional field has an implicit field name and datatype associated with it.
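The field count check described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Gfapy's parser: the names GFA1_N_POSITIONAL, TAG_RE and split_fields are hypothetical, the counts are taken from the GFA1 table in this chapter, and the tag pattern is a simplified NAME:TYPE:VALUE check.

```python
import re

# Number of positional fields after the record type, per GFA1 record type
# (hypothetical helper table for this sketch, not part of Gfapy).
GFA1_N_POSITIONAL = {"S": 2, "L": 5, "C": 6, "P": 3}

# Simplified tag pattern NAME:TYPE:VALUE, e.g. "xx:i:1" (assumption).
TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZJHB]:.*$")


def split_fields(line):
    """Split a GFA1 line into (record_type, positional_fields, tags)."""
    record_type, *fields = line.rstrip("\n").split("\t")
    expected = GFA1_N_POSITIONAL[record_type]
    if len(fields) < expected:
        raise ValueError(
            f"{record_type} line: too few positional fields "
            f"({len(fields)} found, {expected} expected)")
    positional, rest = fields[:expected], fields[expected:]
    for field in rest:
        # Anything after the positional fields must look like a tag.
        if not TAG_RE.match(field):
            raise ValueError(
                f"{record_type} line: too many positional fields "
                f"(unexpected field {field!r})")
    return record_type, positional, rest
```

The positional fields are simply the first N fields after the record type; a trailing field that does not look like a tag is reported as a surplus positional field.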
Field names¶
The field names are derived from the specification. Lower case versions
of the field names are used and spaces are substituted with underscores.
In some cases, the field names were changed, as they represent keywords
in common programming languages or clash with potential tag names
(from, to, send).
The following tables show the field names used in Gfapy for each kind of line. Headers have no positional fields. Comments and custom records follow particular rules; see the respective chapters (Comments and Custom records).
GFA1 field names¶
Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
---|---|---|---|---|---|---|
Segment | name | sequence | | | | |
Link | from_segment | from_orient | to_segment | to_orient | overlap | |
Containment | from_segment | from_orient | to_segment | to_orient | pos | overlap |
Path | path_name | segment_names | overlaps | | | |
GFA2 field names¶
Record Type | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 |
---|---|---|---|---|---|---|---|---|
Segment | sid | slen | sequence | | | | | |
Edge | eid | sid1 | sid2 | beg1 | end1 | beg2 | end2 | alignment |
Fragment | sid | external | s_beg | s_end | f_beg | f_end | alignment | |
Gap | gid | sid1 | d1 | d2 | sid2 | disp | var | |
Set | pid | items | | | | | | |
Path | pid | items | | | | | | |
Datatypes¶
The datatype of each positional field is described in the specification and, unlike the datatype of tags, cannot be changed. Here is a short description of the Python classes used to represent data for different datatypes.
Placeholders¶
The positional fields in GFA can never be empty. However, there are some
fields with optional values. If a value is not specified, a placeholder
character is used instead (*). Such undefined values are represented
in Gfapy by the Placeholder class, which is described in more
detail in the Placeholders chapter.
Arrays¶
The items field in unordered and ordered groups and the
segment_names and overlaps fields in paths are lists of objects
and are represented by list instances.
>>> set = gfapy.Line("U\t*\t1 A 2")
>>> type(set.items)
<class 'list'>
>>> gfa2_path = gfapy.Line("O\t*\tA+ B-")
>>> type(gfa2_path.items)
<class 'list'>
>>> gfa1_path = gfapy.Line("P\tp1\tA+,B-\t10M,9M1D1M")
>>> type(gfa1_path.segment_names)
<class 'list'>
>>> type(gfa1_path.overlaps)
<class 'list'>
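How the string form of these fields maps to lists can be sketched as follows. The helpers split_items and split_path_fields are hypothetical and only mirror the separators visible in the examples above (spaces for GFA2 group items, commas for GFA1 path fields); Gfapy itself converts the elements to richer objects.

```python
def split_items(items):
    """Split the items field of a GFA2 U or O line (space-separated)."""
    return items.split(" ")


def split_path_fields(segment_names, overlaps):
    """Split the two list-valued fields of a GFA1 P line (comma-separated).

    Returns a list of (name, orientation) tuples and a list of
    overlap strings.
    """
    # The last character of each element is the orientation sign.
    names = [(item[:-1], item[-1]) for item in segment_names.split(",")]
    return names, overlaps.split(",")
```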
Orientations¶
Orientations are represented by strings. The gfapy.invert() function,
applied to an orientation string, returns the other orientation.
>>> gfapy.invert("+")
'-'
>>> gfapy.invert("-")
'+'
Identifiers¶
The identifier of the line itself (available for S, P, E, G, U, O lines)
can always be accessed in Gfapy using the name alias and is
represented in Gfapy by a string. If it is optional (E, G, U, O lines)
and not specified, it is represented by a Placeholder instance. The
fragment identifier is also a string.
Identifiers which refer to other lines are also present in some line types (L, C, E, G, U, O, F). These are never placeholders and in stand-alone lines are represented by strings. In connected lines they are references to the Line instances to which they refer (see the References chapter).
Oriented identifiers¶
Oriented identifiers (e.g. segment_names in GFA1 paths) are
represented by instances of the class gfapy.OrientedLine. The
line attribute of an oriented line returns the segment
identifier (or the segment reference, in connected path lines) and the
orient attribute returns the orientation string. The name attribute
returns the name of the segment as a string, even if line is a reference to a
segment. A new oriented line can be created using
gfapy.OrientedLine(line, orientation).
Calling invert() inverts the orientation of the oriented line in place.
The two attributes can be set by assigning to line and orient.
Examples:
>>> p = gfapy.Line("P\tP1\ta+,b-\t*")
>>> p.segment_names
[gfapy.OrientedLine('a','+'), gfapy.OrientedLine('b','-')]
>>> sn0 = p.segment_names[0]
>>> sn0.line
'a'
>>> sn0.name
'a'
>>> sn0.orient
'+'
>>> sn0.invert()
>>> sn0
gfapy.OrientedLine('a','-')
>>> sn0.orient
'-'
>>> sn0.line = gfapy.Line('S\tX\t*')
>>> str(sn0)
'X-'
>>> sn0.name
'X'
>>> sn0 = gfapy.OrientedLine(gfapy.Line('S\tY\t*'), '+')
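The behaviour shown in the examples above can be mimicked by a small stand-in class. This is a sketch under assumptions, not the implementation of gfapy.OrientedLine: the class name OrientedName is hypothetical, and the only requirement on a referenced line object is that it has a name attribute.

```python
class OrientedName:
    """Toy stand-in for an oriented identifier (not gfapy.OrientedLine)."""

    _INVERSE = {"+": "-", "-": "+"}

    def __init__(self, line, orient):
        if orient not in self._INVERSE:
            raise ValueError(f"invalid orientation: {orient!r}")
        self.line = line      # a name string, or an object with a .name
        self.orient = orient

    @property
    def name(self):
        # Return the name string whether `line` is a string or an object.
        return self.line if isinstance(self.line, str) else self.line.name

    def invert(self):
        """Flip the orientation in place."""
        self.orient = self._INVERSE[self.orient]

    def __str__(self):
        return f"{self.name}{self.orient}"

    def __repr__(self):
        return f"OrientedName({self.name!r}, {self.orient!r})"
```

Keeping name as a computed property means the string form stays correct when line is replaced by a reference to another object.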
Sequences¶
Sequences (S field sequence) are represented by strings in Gfapy. Depending on the GFA version, the alphabet definition is more or less restrictive. The definitions are correctly applied by the validation methods.
The method rc()
is provided to compute the reverse complement of a
nucleotidic sequence. The extended IUPAC alphabet is understood by the
method. Applied to non nucleotidic sequences, the results will be
meaningless:
>>> from gfapy.sequence import rc
>>> rc("gcat")
'atgc'
>>> rc("*")
'*'
>>> rc("yatc")
'gatr'
>>> rc("gCat")
'atGc'
>>> rc("cag", rna=True)
'cug'
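For readers curious how such a function can work, here is a minimal sketch of a case-preserving reverse complement over the extended IUPAC alphabet. It is not the code of gfapy.sequence.rc; the name reverse_complement and the table _PAIRS are introduced only for this illustration.

```python
# Complement table for the extended IUPAC nucleotide alphabet (lower case).
_PAIRS = {
    "a": "t", "c": "g", "g": "c", "t": "a", "u": "a",
    "r": "y", "y": "r", "s": "s", "w": "w", "k": "m", "m": "k",
    "b": "v", "v": "b", "d": "h", "h": "d", "n": "n",
}


def reverse_complement(sequence, rna=False):
    """Reverse complement preserving case; "*" is returned unchanged."""
    if sequence == "*":
        return sequence
    pairs = dict(_PAIRS)
    if rna:
        pairs["a"] = "u"  # in RNA, a pairs with u rather than t
    # Add upper-case entries so that the case of each base is preserved.
    pairs.update({k.upper(): v.upper() for k, v in list(pairs.items())})
    # Characters missing from the table are left unchanged by translate().
    return sequence[::-1].translate(str.maketrans(pairs))
```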
Integers and positions¶
The pos field of C lines and the disp and var fields of G lines
are represented by integers. The var field is optional, and thus can
also be a placeholder. Positions are 0-based coordinates.
The position fields of GFA2 E lines (beg1, beg2, end1, end2) and F
lines (s_beg, s_end, f_beg, f_end) carry a dollar sign ($) as suffix
if the position is equal to the segment length. For more information,
see the Positions chapter.
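The dollar suffix rule can be sketched as follows. The functions parse_position and check_position are hypothetical helpers written for this illustration, not part of Gfapy, and they assume the rule that the suffix is present exactly when the position equals the segment length.

```python
import re

_POSITION_RE = re.compile(r"([0-9]+)(\$?)")


def parse_position(field):
    """Return (value, is_last) for a GFA2 position string such as "100$"."""
    match = _POSITION_RE.fullmatch(field)
    if match is None:
        raise ValueError(f"invalid position: {field!r}")
    return int(match.group(1)), match.group(2) == "$"


def check_position(field, segment_length):
    """Check the "$" suffix against the segment length; return the value."""
    value, is_last = parse_position(field)
    if value > segment_length:
        raise ValueError(
            f"position {value} exceeds segment length {segment_length}")
    # The suffix must be present exactly when value == segment_length.
    if is_last != (value == segment_length):
        raise ValueError(f"inconsistent '$' suffix in {field!r}")
    return value
```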
Alignments¶
Alignments are always optional, i.e. they can be placeholders. If they are specified, they are CIGAR alignments or, only in GFA2, trace alignments. For more details, see the Alignments chapter.
GFA1 datatypes¶
Datatype | Record Type | Fields |
---|---|---|
Identifier | Segment | name |
Identifier | Path | path_name |
Identifier | Link | from_segment, to_segment |
Identifier | Containment | from_segment, to_segment |
[OrientedIdentifier] | Path | segment_names |
Orientation | Link | from_orient, to_orient |
Orientation | Containment | from_orient, to_orient |
Sequence | Segment | sequence |
Alignment | Link | overlap |
Alignment | Containment | overlap |
[Alignment] | Path | overlaps |
Position | Containment | pos |
GFA2 datatypes¶
Datatype | Record Type | Fields |
---|---|---|
Identifier | Segment | sid |
Identifier | Fragment | sid |
OrientedIdentifier | Edge | sid1, sid2 |
OrientedIdentifier | Gap | sid1, sid2 |
OrientedIdentifier | Fragment | external |
OptionalIdentifier | Edge | eid |
OptionalIdentifier | Gap | gid |
OptionalIdentifier | U Group | pid |
OptionalIdentifier | O Group | pid |
[Identifier] | U Group | items |
[OrientedIdentifier] | O Group | items |
Sequence | Segment | sequence |
Alignment | Edge | alignment |
Alignment | Fragment | alignment |
Position | Edge | beg1, end1, beg2, end2 |
Position | Fragment | s_beg, s_end, f_beg, f_end |
Integer | Gap | disp, var |
Reading and writing positional fields¶
The positional_fieldnames attribute returns the list of the names (as
strings) of the positional fields of a line. A positional field can
be read through an attribute of the Gfapy line object named after the
field, and set by assigning to that attribute
(e.g. segment.slen = 120). Alternatively, the
set(fieldname, value) and get(fieldname) methods can also be
used.
>>> s_gfa1 = gfapy.Line("S\t1\t*")
>>> s_gfa1.positional_fieldnames
['name', 'sequence']
>>> s_gfa1.name
'1'
>>> s_gfa1.get("name")
'1'
>>> s_gfa1.name = "segment2"
>>> s_gfa1.name
'segment2'
>>> s_gfa1.set('name',"3")
>>> s_gfa1.name
'3'
When a field is read, the value is converted into an appropriate object.
The string representation of a field can be read using the
field_to_s(fieldname) method.
>>> gfa = gfapy.Gfa()
>>> gfa.add_line("S\ts1\t*")
>>> gfa.add_line("L\ts1\t+\ts2\t-\t*")
>>> link = gfa.dovetails[0]
>>> str(link.from_segment)
'S\ts1\t*'
>>> link.field_to_s('from_segment')
's1'
When setting a non-string field, the user can specify the value of the field either as a Python non-string object, or as the string representation of the value.
>>> gfa = gfapy.Gfa(version='gfa1')
>>> gfa.add_line("C\ta\t+\tb\t-\t10\t*")
>>> c = gfa.containments[0]
>>> c.pos
10
>>> c.pos = 1
>>> c.pos
1
>>> c.pos = "2"
>>> c.pos
2
>>> c.field_to_s("pos")
'2'
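One way to accept both forms is a property setter that converts strings before storing the value. The class ContainmentSketch below is a hypothetical toy, written only to illustrate the idea; it is not how Gfapy stores or converts fields.

```python
class ContainmentSketch:
    """Hypothetical line object with a single integer field, pos."""

    def __init__(self, pos):
        self.pos = pos  # goes through the setter below

    @property
    def pos(self):
        return self._pos

    @pos.setter
    def pos(self, value):
        # Accept an int, or a string which represents one.
        if isinstance(value, str):
            value = int(value)
        elif not isinstance(value, int):
            raise TypeError(f"unsupported type: {type(value).__name__}")
        if value < 0:
            raise ValueError("positions cannot be negative")
        self._pos = value

    def field_to_s(self, fieldname):
        """Return the string representation of a field."""
        return str(getattr(self, fieldname))
```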
Note that setting the value of fields related to references and backreferences is generally not allowed when a line instance is connected to a Gfa object (see the References chapter).
>>> gfa = gfapy.Gfa(version='gfa1')
>>> l = gfapy.Line("L\ts1\t+\ts2\t-\t*")
>>> l.from_name
's1'
>>> l.from_segment = "s3"
>>> l.from_name
's3'
>>> gfa.add_line(l)
>>> l.from_segment = "s4"
Traceback (most recent call last):
...
gfapy.error.RuntimeError: ...
Validation¶
The content of all positional fields must be a correctly formatted string according to the rules given in the GFA specifications (or a Python object whose string representation is a correctly formatted string).
Depending on the validation level, more or fewer checks are done
automatically (see the Validation chapter). Regardless of the selected
validation level, the user can trigger a manual validation
using the validate_field(fieldname) method for a single field, or
using validate(), which does a full validation of the whole line,
including all positional fields.
>>> line = gfapy.Line("H\txx:i:1")
>>> line.validate_field("xx")
>>> line.validate()
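The idea that a value is valid when its string representation is correctly formatted can be sketched with regular expressions. The patterns below are simplified assumptions for a few GFA1 datatypes, and PATTERNS and validate_value are hypothetical names; the GFA specification remains the authoritative source for the exact rules.

```python
import re

# Simplified patterns for a few datatypes (assumed for this sketch).
PATTERNS = {
    "orientation": re.compile(r"[+-]"),
    "position": re.compile(r"[0-9]+"),
    "sequence": re.compile(r"\*|[A-Za-z=.]+"),
}


def validate_value(value, datatype):
    """Raise ValueError if str(value) does not match the datatype pattern."""
    text = str(value)
    if PATTERNS[datatype].fullmatch(text) is None:
        raise ValueError(f"{text!r} is not a valid {datatype}")
```

Because the check is applied to str(value), an integer such as 10 and the string "10" are validated in the same way.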
Aliases¶
For some fields, aliases are defined. An alias can be used in all contexts where the original field name is used: as a parameter of a method, and through the getter and setter attributes, which are defined for each alias just as for the original field name (see below).
>>> gfa1_path = gfapy.Line("P\tX\t1-,2+,3+\t*")
>>> gfa1_path.name == gfa1_path.path_name
True
>>> edge = gfapy.Line("E\t*\tA+\tB-\t0\t10\t90\t100$\t*")
>>> edge.eid == edge.name
True
>>> containment = gfapy.Line("C\tA\t+\tB\t-\t10\t*")
>>> containment.from_segment == containment.container
True
>>> segment = gfapy.Line("S\t1\t*")
>>> segment.sid == segment.name
True
>>> segment.sid
'1'
>>> segment.name = '2'
>>> segment.sid
'2'
Name¶
Different record types have an identifier field: segments (name in GFA1, sid in GFA2), paths (path_name), edges (eid), fragments (sid), gaps (gid), groups (pid).
All these fields are aliased to name. This allows the user, for
example, to set the identifier of a line by assigning to name,
using the same syntax for different record types (segments,
edges, paths, fragments, gaps and groups).
Version-specific field names¶
For segments the GFA1 name and the GFA2 sid are equivalent fields. For
this reason an alias sid is defined for GFA1 segments and name
for GFA2 segments.
Cryptic field names¶
The definition of from and to for containments is somewhat cryptic. Therefore, the following aliases have been defined for containments: container[_orient] for from_[segment|orient]; contained[_orient] for to_[segment|orient].
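Alias resolution of this kind can be sketched with a lookup table consulted on attribute access. The class AliasedFields is a hypothetical illustration using the containment aliases listed above; it does not reproduce how Gfapy defines its aliases.

```python
class AliasedFields:
    """Hypothetical container that resolves aliases to real field names."""

    ALIASES = {
        "container": "from_segment",
        "container_orient": "from_orient",
        "contained": "to_segment",
        "contained_orient": "to_orient",
    }

    def __init__(self, **fields):
        # Bypass our own __setattr__ to create the private storage dict.
        object.__setattr__(self, "_fields", dict(fields))

    def __getattr__(self, name):
        # Called only when normal lookup fails: resolve the alias first.
        fields = object.__getattribute__(self, "_fields")
        try:
            return fields[self.ALIASES.get(name, name)]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Writes through an alias end up in the real field.
        self._fields[self.ALIASES.get(name, name)] = value
```

Since both reads and writes go through the same table, a value set through an alias is visible under the original field name and vice versa.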