Note about dtype
By default, almost every operation in dinopy will return results in the Python
type bytes. However, most of these operations accept the keyword argument
dtype which may be set to any of the following values:
bytes
    Return the ASCII-encoded bases as Python bytes, e.g. b'ACGT' (which is basically the same as [65, 67, 71, 84]). This is the default.
bytearray
    A mutable version of bytes.
str
    Return the result as a UTF-8 string (capital letters), e.g. "ACGT".
basenumbers
    Maps A → 0, C → 1, G → 2, T → 3, N → 4, ..., e.g. [0, 1, 2, 3] (or, more correctly, as bytes: b'\x00\x01\x02\x03').
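The relationship between these representations can be illustrated in plain Python. The following is a minimal sketch that makes no dinopy calls; the BASENUMBERS table and the to_basenumbers helper are hypothetical names introduced for illustration and cover only the five values listed above.

```python
# Illustration only: plain Python, no dinopy functions are called.
seq = b"ACGT"

# bytes behave like a sequence of small integers (the ASCII codes).
ascii_codes = list(seq)

# bytearray holds the same data, but can be modified in place.
mutable = bytearray(seq)
mutable[0] = ord("T")

# str is the decoded text form of the same bases.
text = seq.decode("ascii")

# Hypothetical lookup table covering only the values named in the text.
BASENUMBERS = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def to_basenumbers(sequence: bytes) -> bytes:
    """Map each ASCII-encoded base to its base number (assumed mapping)."""
    return bytes(BASENUMBERS[chr(code)] for code in sequence)


basenumbers = to_basenumbers(seq)
```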
Sometimes, as is the case with dinopy.processors.qgrams, you can also supply
the keyword argument encoding with one of the following values:
two_bit
    Encodes the result as a long integer. If characters other than "ACGT" are encountered, they will be randomly replaced according to the usual IUPAC mapping (see IUPAC mapping for details).
four_bit
    Basically the same as two_bit, but does not need to replace characters other than "ACGT", as each character of the IUPAC specification can be encoded using exactly 4 bits. For example, "ACGTM" translates to 0b 0001 0010 0100 1000 0011.
Note
All of these types can be accessed like dinopy.two_bit or dinopy.definitions.two_bit.
You can use the functions dinopy.conversion.encode and dinopy.conversion.decode to manually
convert qgrams to and from 2bit / 4bit encoding.
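To make the four_bit example above concrete, here is a minimal sketch of packing a q-gram into a single integer. It is not dinopy's implementation: pack_four_bit and unpack_four_bit are hypothetical helpers, and the code table contains only the five characters whose codes can be read off the "ACGTM" example.

```python
# Codes taken from the "ACGTM" example; other IUPAC characters are omitted
# because their codes are not stated in the text.
FOUR_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "M": 0b0011}


def pack_four_bit(qgram: str) -> int:
    """Pack a q-gram into one integer, first character in the highest bits."""
    value = 0
    for base in qgram:
        value = (value << 4) | FOUR_BIT[base]
    return value


def unpack_four_bit(value: int, length: int) -> str:
    """Recover a q-gram of known length from its packed integer."""
    inverse = {code: base for base, code in FOUR_BIT.items()}
    bases = []
    for _ in range(length):
        bases.append(inverse[value & 0b1111])  # lowest 4 bits = last base
        value >>= 4
    return "".join(reversed(bases))


encoded = pack_four_bit("ACGTM")
decoded = unpack_four_bit(encoded, 5)
```

In this sketch M is the bitwise OR of A and C, which matches its IUPAC meaning (A or C). The q-gram length is passed separately when unpacking.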
IUPAC mapping
DNA sequences may contain characters other than A, C, G or T, namely:
N, U, R, Y, M, K, W, S, B, D, H and V.
If such a character is encountered while encoding a sequence to a single long
integer in two-bit representation, it is replaced by one of the characters listed
for it in the following mapping, chosen uniformly at random:
Character    Replace
N            A, C, G, T
U            T
R            A, G
Y            C, T
M            A, C
K            G, T
W            A, T
S            C, G
B            C, G, T
D            A, G, T
H            A, C, T
V            A, C, G
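The replacement step can be sketched as follows. This is an illustration under stated assumptions, not dinopy's code: resolve_iupac and pack_two_bit are hypothetical helpers, and the two-bit codes (A=0, C=1, G=2, T=3) are assumed to mirror the basenumbers mapping, since the text does not state them explicitly.

```python
import random

# Replacement candidates, transcribed from the mapping table above.
IUPAC_REPLACEMENTS = {
    "N": "ACGT", "U": "T", "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "W": "AT", "S": "CG", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

# Assumed two-bit codes (mirrors the basenumbers mapping).
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def resolve_iupac(sequence: str, rng=random) -> str:
    """Replace every non-ACGT character by a uniformly chosen candidate."""
    return "".join(
        base if base in TWO_BIT else rng.choice(IUPAC_REPLACEMENTS[base])
        for base in sequence
    )


def pack_two_bit(sequence: str) -> int:
    """Pack an ACGT-only sequence into one integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | TWO_BIT[base]
    return value


rng = random.Random(0)  # seeded so the sketch is reproducible
resolved = resolve_iupac("ACGTNRU", rng)
encoded = pack_two_bit(resolved)
```

Because the choice is random, encoding the same ambiguous sequence twice can give different integers unless the random generator is seeded. Leading A bases contribute only zero bits, so the sequence length has to be tracked separately to decode the integer.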