Alignments

Some GFA1 (L/C overlap, P overlaps) and GFA2 (E/F alignment) fields contain alignments or lists of alignments. The alignment can be left unspecified and a placeholder symbol * used instead. In GFA1 the alignments can be given as CIGAR strings, in GFA2 also as Dazzler traces.

Gfapy uses three different classes for representing the content of alignment fields: CIGAR, Trace and AlignmentPlaceholder.

Creating an alignment

An alignment instance is usually created from its GFA string representation or from a list by using the gfapy.Alignment() constructor.

>>> from gfapy import Alignment
>>> Alignment("*")
gfapy.AlignmentPlaceholder()
>>> Alignment("10,10,10")
gfapy.Trace([10,10,10])
>>> Alignment([10,10,10])
gfapy.Trace([10,10,10])
>>> Alignment("30M2I")
gfapy.CIGAR([gfapy.CIGAR.Operation(30,'M'), gfapy.CIGAR.Operation(2,'I')])

If the argument is an alignment object it will be returned, so that is always safe to call the method on a variable which can contain a string or an alignment instance:

>>> Alignment(Alignment("*"))
gfapy.AlignmentPlaceholder()
>>> Alignment(Alignment("10,10"))
gfapy.Trace([10,10])

Recognizing undefined alignments

The gfapy.is_placeholder() method allows to test if an alignment field contains an undefined value (placeholder) instead of a defined value (CIGAR string, trace). The method accepts as argument either an alignment object or a string or list representation.

>>> from gfapy import is_placeholder, Alignment
>>> is_placeholder(Alignment("30M"))
False
>>> is_placeholder(Alignment("10,10"))
False
>>> is_placeholder(Alignment("*"))
True
>>> is_placeholder("*")
True
>>> is_placeholder("30M")
False
>>> is_placeholder("10,10")
False
>>> is_placeholder([])
True
>>> is_placeholder([10,10])
False

Note that, as a placeholder is False in boolean context, just a if not aligment will also work, if alignment is an alignment object. But this of course, does not work, if it is a string representation. Therefore it is better to use the gfapy.is_placeholder() method, which works in both cases.

>>> if not Alignment("*"): print('no alignment')
no alignment
>>> if is_placeholder(Alignment("*")): print('no alignment')
no alignment
>>> if "*": print('not a placeholder...?')
not a placeholder...?
>>> if is_placeholder("*"): print('really? it is a placeholder!')
really? it is a placeholder!

Reading and editing CIGARs

CIGARs are represented by specialized lists, instances of the class CIGAR, whose elements are CIGAR operations CIGAR operations are represented by instance of the class Operation, and provide the properties length (length of the operation, an integer) and code (one-letter string which specifies the type of operation). Note that not all operations allowed in SAM files (for which CIGAR strings were first defined) are also meaningful in GFA and thus GFA2 only allows the operations M, I, D and P.

>>> cigar = gfapy.Alignment("30M")
>>> isinstance(cigar, list)
True
>>> operation = cigar[0]
>>> type(operation)
<class 'gfapy.alignment.cigar.CIGAR.Operation'>
>>> operation.code
'M'
>>> operation.code = 'D'
>>> operation.length
30
>>> len(operation)
30
>>> str(operation)
'30D'

As a CIGAR instance is a list, list methods apply to it. If the array is emptied, its string representation will be the placeholder symbol *.

>>> cigar = gfapy.Alignment("1I20M2D")
>>> cigar[0].code = "M"
>>> cigar.pop(1)
gfapy.CIGAR.Operation(20,'M')
>>> str(cigar)
'1M2D'
>>> cigar[:] = []
>>> str(cigar)
'*'

The validate CIGAR.validate() function checks if a CIGAR instance is valid. A version can be provided, as the CIGAR validation is version specific (as GFA2 forbids some CIGAR operations).

>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar.validate()
>>> cigar[1].code = "L"
>>> cigar.validate()
Traceback (most recent call last):
  ...
gfapy.error.ValueError:
>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar[1].code = "X"
>>> cigar.validate(version="gfa1")
>>> cigar.validate(version="gfa2")
Traceback (most recent call last):
  ...
gfapy.error.ValueError:

Reading and editing traces

Traces are arrays of non-negative integers. The values are interpreted using a trace spacing value. If traces are used, a trace spacing value must be defined in a TS integer tag, either in the header, or in the single lines which contain traces (which takes precedence over the header global value).

>>> print(gfa) 
H TS:i:100
E x A+ B- 0 100$ 0 100$ 4,2 TS:i:50
...
>>> gfa.header.TS
100
>>> gfa.line("x").TS
50

Query, reference and complement

CIGARs are asymmetric, i.e.they consider one sequence as reference and another sequence as query.

The length_on_reference() and length_on_query() methods compute the length of the alignment on the two sequences. These methods are used by the library e.g. to convert GFA1 L lines to GFA2 E lines (which is only possible if CIGARs are provided).

>>> cigar = gfapy.Alignment("30M10D20M5I10M")
>>> cigar.length_on_reference()
70
>>> cigar.length_on_query()
65

CIGARs are dependent on which sequence is taken as reference and which is taken as query. For each alignment, a complement CIGAR can be computed using the method complement(); it is the CIGAR obtained when the two sequences are switched.

>>> cigar = gfapy.Alignment("2M1D3M")
>>> str(cigar.complement())
'3M1I2M'

The current version of Gfapy does not provide a way to compute the alignment, thus the trace information can be accessed and edited, but not used for this purpose. Because of this there is currently no way in Gfapy to compute a complement trace (trace obtained when the sequences are switched).

>>> trace = gfapy.Alignment("1,2,3")
>>> str(trace.complement())
'*'

The complement of a placeholder is a placeholder:

>>> str(gfapy.Alignment("*").complement())
'*'