The Header
GFA files may contain one or multiple header lines (record type: “H”). These lines may be present in any part of the file, not necessarily at the beginning.
Although the header may span multiple lines, its content refers to the
whole file. Therefore, in Gfapy the header is accessed through a single line
instance (available via the header property). Header lines contain only tags.
If no header line is present in the Gfa, the header line object is empty
(i.e. it contains no tags).
Note that header lines cannot be connected to the Gfa like other lines (i.e.
calling connect() on them raises an exception). Instead, they must be merged
into the existing Gfa header by calling add_line on the Gfa instance.
>>> gfa.add_line("H\tnn:f:1.0")
>>> gfa.header.nn
1.0
>>> gfapy.Line("H\tnn:f:1.0").connect(gfa)
Traceback (most recent call last):
...
gfapy.error.RuntimeError: ...
Multiple definitions of the predefined header tags
For the predefined tags (VN and TS), the presence of different values in
different lines is an error. If the value is the same in each instance, the
repeated definitions are ignored.
>>> gfa.add_line("H\tVN:Z:1.0")
>>> gfa.add_line("H\tVN:Z:1.0") # ignored
>>> gfa.add_line("H\tVN:Z:2.0")
Traceback (most recent call last):
...
gfapy.error.VersionError: ...
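The rule for predefined tags can be pictured with a small stand-alone sketch. It does not use Gfapy and is not how the library is implemented; merge_predefined and ConflictError are hypothetical names introduced only for illustration, and the header is modelled as a plain dict.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for the error raised on conflicting values."""


def merge_predefined(header, tag, value):
    """Store a predefined tag, ignoring exact repeats and rejecting conflicts."""
    if tag not in header:
        header[tag] = value
    elif header[tag] != value:
        raise ConflictError(
            f"{tag} is already set to {header[tag]!r}, got {value!r}")
    # Same value seen again: the repeated definition is simply ignored.
    return header


header = {}
merge_predefined(header, "VN", "1.0")
merge_predefined(header, "VN", "1.0")  # repeated definition, ignored
print(header)
```

Calling merge_predefined(header, "VN", "2.0") after this would raise ConflictError, mirroring the third add_line call in the example above.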
Multiple definitions of custom header tags
If each tag is present only once in the header as a whole, tags are accessed in the same way as for any other line (see the Tags chapter).
However, the specification does not forbid defining a custom tag with different values in different header lines (we call these “multi-definition tags”). This case is covered in the next sections.
Reading multi-definition tags
Reading, validating and setting the datatype of multi-definition tags is done
using the same methods as for all other lines (see the Tags chapter).
However, if a tag is defined multiple times on multiple H lines, reading the
tag returns a list of the values on those lines. This list is an instance of
gfapy.FieldArray, a subclass of list.
>>> gfa.add_line("H\txx:i:1")
>>> gfa.add_line("H\txx:i:2")
>>> gfa.add_line("H\txx:i:3")
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])
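To make the idea of a multi-definition tag concrete, the following stand-alone sketch gathers tags from raw H lines into a dict, turning a value into a list only when the tag occurs more than once. It is an illustration under simplifying assumptions (only the i and f datatypes are converted, and a plain list stands in for gfapy.FieldArray); collect_header_tags is a hypothetical helper, not part of Gfapy.

```python
def collect_header_tags(lines):
    """Map each tag name to its value, or to a list if defined repeatedly."""
    tags = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue  # not a header line
        for field in fields[1:]:
            # A tag has the form NAME:TYPE:VALUE; the value may contain ':'.
            name, datatype, value = field.split(":", 2)
            if datatype == "i":
                value = int(value)
            elif datatype == "f":
                value = float(value)
            if name not in tags:
                tags[name] = value
            elif isinstance(tags[name], list):
                tags[name].append(value)
            else:
                # Second definition: switch from a single value to a list.
                tags[name] = [tags[name], value]
    return tags


tags = collect_header_tags(
    ["H\tVN:Z:1.0", "H\txx:i:1", "H\txx:i:2", "H\txx:i:3"])
print(tags)
```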
Setting tags
There are two ways to set a header tag. The first is the normal tag
interface (using set or the tag name property). The second is add.
The latter supports multi-definition tags: it appends the value to the
existing ones (if any) instead of overwriting them.
>>> gfa = gfapy.Gfa()
>>> gfa.header.xx
>>> gfa.header.add("xx", 1)
>>> gfa.header.xx
1
>>> gfa.header.add("xx", 2)
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2])
>>> gfa.header.set("xx", 3)
>>> gfa.header.xx
3
Modifying field array values
Field arrays can be modified directly (e.g. by adding or removing
values). After a modification, the user can check whether the array
values are still compatible with the datatype of the tag by calling the
validate_field() method.
>>> gfa = gfapy.Gfa()
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])
>>> gfa.header.validate_field("xx")
>>> gfa.header.xx.append("X")
>>> gfa.header.validate_field("xx")
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
If the field array is modified using list methods that return a plain list
or data of any other type, a new field array must be constructed, with
its datatype set to the value returned by calling get_datatype() on the
header.
>>> gfa = gfapy.Gfa()
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
>>> gfa.header.xx
gfapy.FieldArray('i',[1, 2, 3])
>>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"),
... list(map(lambda x: x+1, gfa.header.xx)))
>>> gfa.header.xx
gfapy.FieldArray('i',[2, 3, 4])
String representation of the header
For consistency with other line types, the string representation of the header
is a single-line string, which may not be standard-compliant if a tag occurs
multiple times. (Likewise, when calling field_to_s() for a tag present multiple
times, the output string contains the instances of the tag, separated by
tabs.)
However, when the Gfa is output to a file or string, the header is split into
multiple H lines with a single tag each, so that standard-compliant GFA is
output. The split header can be retrieved using the headers property of the
Gfa instance.
>>> gfa = gfapy.Gfa()
>>> gfa.header.VN = "1.0"
>>> gfa.header.xx = gfapy.FieldArray('i',[1,2])
>>> gfa.header.field_to_s("xx")
'1\t2'
>>> gfa.header.field_to_s("xx", tag=True)
'xx:i:1\txx:i:2'
>>> str(gfa.header)
'H\tVN:Z:1.0\txx:i:1\txx:i:2'
>>> [str(h) for h in gfa.headers]
['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
>>> str(gfa)
'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2'
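The relationship between the single-line representation and the split output can be sketched without Gfapy. The helper below, split_header, is a hypothetical name and only mimics the effect described above by working on the tab-separated string; it is not the library's implementation.

```python
def split_header(merged):
    """Turn one merged H line into a list of single-tag H lines."""
    fields = merged.split("\t")
    if fields[0] != "H":
        raise ValueError("not a header line: %r" % merged)
    # Every tag after the record type gets its own H line.
    return ["H\t" + tag for tag in fields[1:]]


lines = split_header("H\tVN:Z:1.0\txx:i:1\txx:i:2")
print("\n".join(lines))
```

With this input, the resulting list matches the strings obtained from the headers property in the example above.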
Count the input header lines
Because header lines are stored differently from other lines, the number of
header elements is not equal to the number of header lines in the input. This
is inconvenient for an application that needs to count the input lines of a
file. For this reason, the number of input header lines is counted and can be
retrieved using the n_input_header_lines property of the Gfa instance.
>>> gfa = gfapy.Gfa()
>>> gfa.add_line("H\txx:i:1\tyy:Z:ABC")
>>> gfa.add_line("H\txy:i:2")
>>> gfa.add_line("H\tyz:i:3\tab:A:A")
>>> len(gfa.headers)
5
>>> gfa.n_input_header_lines
3
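The difference between the two counts can be reproduced on raw text with a short stand-alone sketch. It does not use Gfapy; count_header_lines_and_tags is a hypothetical helper, and the S line in the sample input is added only to show that non-header lines are skipped.

```python
def count_header_lines_and_tags(text):
    """Return (number of H lines, total number of tags on those lines)."""
    n_lines = 0
    n_tags = 0
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "H":
            n_lines += 1
            n_tags += len(fields) - 1  # everything after the record type
    return n_lines, n_tags


text = "H\txx:i:1\tyy:Z:ABC\nS\t1\t*\nH\txy:i:2\nH\tyz:i:3\tab:A:A\n"
n_lines, n_tags = count_header_lines_and_tags(text)
print(n_lines, n_tags)
```

Here the tag count corresponds to the number of single-tag header elements, while the line count corresponds to what the input file actually contained.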