Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain
alignments or lists of alignments. The alignment can be left unspecified and a
* used instead. In GFA1 the alignments can be given as
CIGAR strings, in GFA2 also as Dazzler traces.
Gfapy uses three different classes for representing the content of alignment fields:
Creating an alignment¶
An alignment instance is usually created from its GFA string
representation or from a list by using the
>>> from gfapy import Alignment >>> Alignment("*") gfapy.AlignmentPlaceholder() >>> Alignment("10,10,10") gfapy.Trace([10,10,10]) >>> Alignment([10,10,10]) gfapy.Trace([10,10,10]) >>> Alignment("30M2I") gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')])
If the argument is an alignment object it will be returned, so that is always safe to call the method on a variable which can contain a string or an alignment instance:
>>> Alignment(Alignment("*")) gfapy.AlignmentPlaceholder() >>> Alignment(Alignment("10,10")) gfapy.Trace([10,10])
Recognizing undefined alignments¶
allows to test if an alignment field contains an undefined value (placeholder)
instead of a defined value (CIGAR string, trace). The method accepts as
argument either an alignment object or a string or list representation.
>>> from gfapy import is_placeholder, Alignment >>> is_placeholder(Alignment("30M")) False >>> is_placeholder(Alignment("10,10")) False >>> is_placeholder(Alignment("*")) True >>> is_placeholder("*") True >>> is_placeholder("30M") False >>> is_placeholder("10,10") False >>> is_placeholder() True >>> is_placeholder([10,10]) False
Note that, as a placeholder is
False in boolean context, just a
if not aligment will also work, if alignment is an alignment object.
But this of course, does not work, if it is a string representation.
Therefore it is better to use the
which works in both cases.
>>> if not Alignment("*"): print('no alignment') no alignment >>> if is_placeholder(Alignment("*")): print('no alignment') no alignment >>> if "*": print('not a placeholder...?') not a placeholder...? >>> if is_placeholder("*"): print('really? it is a placeholder!') really? it is a placeholder!
Reading and editing CIGARs¶
CIGARs are represented by specialized lists, instances of the class
CIGAR, whose elements are CIGAR operations
CIGAR operations are represented by instance of the class
and provide the properties
length (length of the operation, an integer)
code (one-letter string which specifies the type of operation).
Note that not all operations allowed in SAM files (for which CIGAR strings
were first defined) are also meaningful in GFA and thus GFA2 only allows
>>> cigar = gfapy.Alignment("30M") >>> isinstance(cigar, list) True >>> operation = cigar >>> type(operation) <class 'gfapy.alignment.cigar.CIGAR.Operation'> >>> operation.code 'M' >>> operation.code = 'D' >>> operation.length 30 >>> len(operation) 30 >>> str(operation) '30D'
As a CIGAR instance is a list, list methods apply to it. If the array is
emptied, its string representation will be the placeholder symbol
>>> cigar = gfapy.Alignment("1I20M2D") >>> cigar.code = "M" >>> cigar.pop(1) gfapy.CIGAR.Operation(20,'M') >>> str(cigar) '1M2D' >>> cigar[:] =  >>> str(cigar) '*'
function checks if a CIGAR instance is valid. A version can be provided, as the
CIGAR validation is version specific (as GFA2 forbids some CIGAR operations).
>>> cigar = gfapy.Alignment("30M10D20M5I10M") >>> cigar.validate() >>> cigar.code = "L" >>> cigar.validate() Traceback (most recent call last): ... gfapy.error.ValueError: >>> cigar = gfapy.Alignment("30M10D20M5I10M") >>> cigar.code = "X" >>> cigar.validate(version="gfa1") >>> cigar.validate(version="gfa2") Traceback (most recent call last): ... gfapy.error.ValueError:
Reading and editing traces¶
Traces are arrays of non-negative integers. The values are interpreted using a trace spacing value. If traces are used, a trace spacing value must be defined in a TS integer tag, either in the header, or in the single lines which contain traces (which takes precedence over the header global value).
>>> print(gfa) H TS:i:100 E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50 ... >>> gfa.header.TS 100 >>> gfa.line("x").TS 50
Query, reference and complement¶
CIGARs are asymmetric, i.e.they consider one sequence as reference and another sequence as query.
length_on_query() methods compute the length
of the alignment on the two sequences. These methods are used by the library
e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs
>>> cigar = gfapy.Alignment("30M10D20M5I10M") >>> cigar.length_on_reference() 70 >>> cigar.length_on_query() 65
CIGARs are dependent on which sequence is taken as reference and which
is taken as query. For each alignment, a complement CIGAR can be
computed using the method
complement(); it is the CIGAR obtained
when the two sequences are switched.
>>> cigar = gfapy.Alignment("2M1D3M") >>> str(cigar.complement()) '3M1I2M'
The current version of Gfapy does not provide a way to compute the alignment, thus the trace information can be accessed and edited, but not used for this purpose. Because of this there is currently no way in Gfapy to compute a complement trace (trace obtained when the sequences are switched).
>>> trace = gfapy.Alignment("1,2,3") >>> str(trace.complement()) '*'
The complement of a placeholder is a placeholder:
>>> str(gfapy.Alignment("*").complement()) '*'