Tags¶
Each record in GFA can contain tags. Tags are fields which consist in a
tag name, a datatype and data. The format is NN:T:DATA
where NN
is a two-letter tag name, T
is a one-letter datatype string and
DATA
is a string representing the data according to the specified
datatype. Tag names must be unique for each line, i.e. each line may
only contain a tag once.
# Examples of GFA tags of different datatypes:
"aa:i:-12"
"bb:f:1.23"
"cc:Z:this is a string"
"dd:A:X"
"ee:B:c,12,3,2"
"ff:H:122FA0"
'gg:J:["A","B"]'
Custom tags¶
Some tags are explicitly defined in the specification (these are named predefined tags in Gfapy), and the user or an application can define its own custom tags. These may contain lower case letters.
Custom tags are user or program specific and may of course collide with the tags used by other users or programs. For this reasons, if you write scripts which employ custom tags, you should always check that the values are of the correct datatype and plausible.
>>> line = gfapy.Line("H\txx:i:2")
>>> if line.get_datatype("xx") != "i":
... raise Exception("I expected the tag xx to contain an integer!")
>>> myvalue = line.xx
>>> if (myvalue > 120) or (myvalue % 2 == 1):
... raise Exception("The value in the xx tag is not an even value <= 120")
>>> # ... do something with myvalue
Also it is good practice to allow the user of the script to change the name of the custom tags. For example, Gfapy employs the +or+ custom tag to track the original segment from which a segment in the final graph is derived. All methods which read or write the +or+ tag allow to specify an alternative tag name to use instead of +or+, for the case that this name collides with the custom tag of another program.
# E.g. a method which does something with myvalue, usually stored in tag xx
# allows the user to specify an alternative name for the tag
def mymethod(line, mytag="xx"):
myvalue = line.get(mytag)
# ...
Predefined tags¶
According to the GFA specifications, predefined tag names consist of either two upper case letters, or an upper case letter followed by a digit. The GFA1 specification predefines tags for each line type, while GFA2 only predefines tags for the header and edges.
While tags with the predefined names are allowed to be added to any line,
when they are used in the lines mentiones in the specification (e.g. VN
in the header) gfapy checks that the datatype is the one prescribed by
the specification (e.g. VN
must be of type Z
). It is not forbidden
to use the same tags in other contexts, but in this case, the datatype
restriction is not enforced.
Tag | Type | Line types | GFA version | |
---|---|---|---|
VN | Z | H | 1,2 | |
TS | i | H,S | 2 |
LN | i | S | 1 |
RC | i | S,L,C | 1 |
FC | i | S,L | 1 |
KC | i | S,L | 1 |
SH | H | S | 1 |
UR | Z | S | 1 |
MQ | i | L | 1 |
NM | i | L,i | 1 |
ID | Z | L,C | 1 |
"VN:Z:1.0" # VN => predefined tag
"z5:Z:1.0" # z5 first char is downcase => custom tag
"XX:Z:aaa" # XX upper case, but not predefined => custom tag
# not forbidden, but not recommended:
"zZ:Z:1.0" # => mixed case, first char downcase => custom tag
"Zz:Z:1.0" # => mixed case, first char upcase => custom tag
"vn:Z:1.0" # => same name as predefined tag, but downcase => custom tag
Datatypes¶
The following table summarizes the datatypes available for tags:
Symbol | Datatype | Example | Python class |
---|---|---|---|
Z | string | This is a string | str |
i | integer | -12 | int |
f | float | 1.2E-5 | float |
A | char | X | str |
J | JSON | [1,{“k1”:1,”k2”:2},”a”] | list/dict |
B | numeric array | f,1.2,13E-2,0 | gfapy.NumericArray |
H | byte array | FFAA01 | gfapy.ByteArray |
Validation¶
The tag names must consist of a letter and a digit or two letters.
"KC:i:1" # => OK
"xx:i:1" # => OK
"x1:i:1" # => OK
"xxx:i:1" # => error: name is too long
"x:i:1" # => error: name is too short
"11:i:1" # => error: at least one letter must be present
The datatype must be one of the datatypes specified above. For predefined tags, Gfapy also checks that the datatype given in the specification is used.
"xx:X:1" # => error: datatype X is unknown
"VN:i:1" # => error: VN must be of type Z
The data must be a correctly formatted string for the specified datatype or a Python object whose string representation is a correctly formatted string.
# current value: xx:i:2
>>> line = gfapy.Line("S\tA\t*\txx:i:2")
>>> line.xx = 1
>>> line.xx
1
>>> line.xx = "3"
>>> line.xx
3
>>> line.xx = "A"
>>> line.xx
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
Depending on the validation level, more or less checks are done
automatically (see Validation chapter). Per default - validation level
(1) - validation is performed only during parsing or accessing values
the first time, therefore the user must perform a manual validation if
he changes values to something which is not guaranteed to be correct. To
trigger a manual validation, the user can call the method
validate_field(fieldname)
to validate a single tag, or
validate()
to validate the whole line, including all tags.
>>> line = gfapy.Line("S\tA\t*\txx:i:2", vlevel = 0)
>>> line.validate_field("xx")
>>> line.validate()
>>> line.xx = "A"
>>> line.validate_field("xx")
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
>>> line.validate()
Traceback (most recent call last):
...
gfapy.error.FormatError: ...
>>> line.xx = "3"
>>> line.validate_field("xx")
>>> line.validate()
Reading and writing tags¶
Tags can be read using a property on the Gfapy line object, which is
called as the tag (e.g. line.xx). A special version of the property
prefixed by try_get_
raises an error if the tag was not available
(e.g. line.try_get_LN
), while the tag property (e.g. line.LN
)
would return None
in this case. Setting the value is done assigning
a value to it the tag name method (e.g. line.TS = 120
). In
alternative, the set(fieldname, value)
, get(fieldname)
and
try_get(fieldname)
methods can also be used. To remove a tag from a
line, use the delete(fieldname)
method, or set its value to
None
. The tagnames
property Line instances is a list of
the names (as strings) of all defined tags for a line.
>>> line = gfapy.Line("S\tA\t*\txx:i:1", vlevel = 0)
>>> line.xx
1
>>> line.xy is None
True
>>> line.try_get_xx()
1
>>> line.try_get_xy()
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.get("xx")
1
>>> line.try_get("xy")
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.xx = 2
>>> line.xx
2
>>> line.xx = "a"
>>> line.tagnames
['xx']
>>> line.xy = 2
>>> line.xy
2
>>> line.set("xy", 3)
>>> line.get("xy")
3
>>> line.tagnames
['xx', 'xy']
>>> line.delete("xy")
3
>>> line.xy is None
True
>>> line.xx = None
>>> line.xx is None
True
>>> line.try_get("xx")
Traceback (most recent call last):
...
gfapy.error.NotFoundError: ...
>>> line.tagnames
[]
When a tag is read, the value is converted into an appropriate object (see Python classes in the datatype table above). When setting a value, the user can specify the value of a tag either as a Python object, or as the string representation of the value.
>>> line = gfapy.Line('H\txx:i:1\txy:Z:TEXT\txz:J:["a","b"]')
>>> line.xx
1
>>> isinstance(line.xx, int)
True
>>> line.xy
'TEXT'
>>> isinstance(line.xy, str)
True
>>> line.xz
['a', 'b']
>>> isinstance(line.xz, list)
True
The string representation of a tag can be read using the
field_to_s(fieldname)
method. The default is to only output the
content of the field. By setting ``tag: true```, the entire tag is
output (name, datatype, content, separated by colons). An exception is
raised if the field does not exist.
>>> line = gfapy.Line("H\txx:i:1")
>>> line.xx
1
>>> line.field_to_s("xx")
'1'
>>> line.field_to_s("xx", tag=True)
'xx:i:1'
Datatype of custom tags¶
The datatype of an existing custom field (but not of predefined fields)
can be changed using the set_datatype(fieldname, datatype)
method.
The current datatype specification can be read using
get_datatype(fieldname)
.
>>> line = gfapy.Line("H\txx:i:1")
>>> line.get_datatype("xx")
'i'
>>> line.set_datatype("xx", "Z")
>>> line.get_datatype("xx")
'Z'
If a new custom tag is specified, Gfapy selects the correct datatype for
it: i/f for numeric values, J/B for arrays, J for hashes and Z for
strings and strings. If the user wants to specify a different datatype,
he may do so by setting it with set_datatype()
(this can be done
also before assigning a value, which is necessary if full validation is
active).
>>> line = gfapy.Line("H")
>>> line.xx = "1"
>>> line.xx
'1'
>>> line.set_datatype("xy", "i")
>>> line.xy = "1"
>>> line.xy
1
Arrays of numerical values¶
B
and H
tags represent array with particular constraints (e.g.
they can only contain numeric values, and in some cases the values must
be in predefined ranges). In order to represent them correctly and allow
for validation, Python classes have been defined for both kind of tags:
gfapy.ByteArray
for H
and gfapy.NumericArray
for B
fields.
Both are subclasses of list. Object of the two classes can be created by passing an existing list or the string representation to the class constructor.
>>> # create a byte array instance
>>> gfapy.ByteArray([12,3,14])
b'\x0c\x03\x0e'
>>> gfapy.ByteArray("A012FF")
b'\xa0\x12\xff'
>>> # create a numeric array instance
>>> gfapy.NumericArray.from_string("c,12,3,14")
[12, 3, 14]
>>> gfapy.NumericArray([12,3,14])
[12, 3, 14]
Instances of the classes behave as normal lists, except that they provide a #validate() method, which checks the constraints, and that their string representation is the GFA string representation of the field value.
>>> gfapy.NumericArray([12,1,"1x"]).validate()
Traceback (most recent call last):
...
gfapy.error.ValueError
>>> str(gfapy.NumericArray([12,3,14]))
'C,12,3,14'
>>> gfapy.ByteArray([12,1,"1x"]).validate()
Traceback (most recent call last):
...
gfapy.error.ValueError
>>> str(gfapy.ByteArray([12,3,14]))
'0C030E'
For numeric values, the compute_subtype
method allows to compute
the subtype which will be used for the string representation. Unsigned
subtypes are used if all values are positive. The smallest possible
subtype range is selected. The subtype may change when the range of the
elements changes.
>>> gfapy.NumericArray([12,13,14]).compute_subtype()
'C'
Special cases: custom records, headers, comments and virtual lines.¶
GFA2 allows custom records, introduced by record type strings other than the predefined ones. Gfapy uses a pragmatical approach for identifying tags in custom records, and tries to interpret the rightmost fields as tags, until the first field from the right raises an error; all remaining fields are treated as positional fields.
"X a b c xx:i:12" # => xx is tag, a, b, c are positional fields
"Y a b xx:i:12 c" # => all positional fields, as c is not a valid tag
For easier access, the entire header of the GFA is summarized in a
single line instance. A class (FieldArray
) has been defined to
handle the special case when multiple H lines define the same tag (see
The Header chapter for details).
Comment lines are represented by a subclass of the same class
(Line
) as the records. However, they cannot contain tags: the
entire line is taken as content of the comment. See the Comments
chapter for more information about comments.
"# this is not a tag: xx:i:1" # => xx is not a tag, xx:i:1 is part of the comment
Virtual instances of the Line
class (e.g. segment instances automatically
created because of not yet resolved references found in edges) cannot be
modified by the user, and tags cannot be specified for them. This
includes all instances of the Unknown
class. See the
References chapter for more information about virtual lines.