dinopy.nameline_parser module
This module handles the parsing of name lines.
You can either use an instance of the NamelineParser class to handle picking
the correct parsing function for you or choose a specific parsing function yourself:
Let NamelineParser do the work for you:

    from dinopy import NamelineParser

    parser = NamelineParser()
    # this is a Casava 1.8+ style nameline
    line1 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is yet another Casava 1.8+ style nameline
    line2 = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`.
    # line3 = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")

Pick a specific parsing function yourself:

    from dinopy.nameline_parser import parse_casava_18_line, parse_casava_pre18_line, parse_ncbi_line

    # this is a Casava 1.8+ style nameline
    line1 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is yet another Casava 1.8+ style nameline
    line2 = parse_casava_18_line(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    # this is a Casava <1.8 style nameline and parsing this will result in a `ValueError`
    # line3 = parse_casava_18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
    # this however will work.
    line4 = parse_casava_pre18_line(b"@HWUSI-EAS100R:6:73:941:1973#0/1")

Workflow:

    from dinopy import FastqReader, NamelineParser

    nameline_parser = NamelineParser()
    with FastqReader("file.fastq") as fqr:
        for seq, name, _ in fqr.reads():
            nameline = nameline_parser.parse(name)
            print(nameline.tile)
            # do_stuff(nameline)
            # do_other_stuff(seq)

The following nameline-conventions are supported:
Casava < 1.8: @instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member
Casava ≥ 1.8: @unique_instrument_name:run_id:flowcell_id:flowcell_lane:tile:cluster_x:cluster_y pair_member:filtered:control_number:index_sequence
454: 14 character string, encoding plate|region|xy
Helicos: flowcell-channel-field-camera-position
IonTorrent: run_id chip_row chip_column
NCBI-SRA: SRA-id anyoftheabove length=n
Unknown formats: will be wrapped in a _DummyLine, which only holds the line as a bytes reference. The line can be accessed via some_dummy_line.line or some_dummy_line[0], or retrieved as a str via str(some_dummy_line).
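To make the layouts above concrete, here is a minimal pure-Python sketch that guesses which convention a nameline follows and wraps anything unrecognized in a raw-bytes holder. The regular expressions, `guess_style`, and `DummyLineSketch` are hypothetical names written from the layouts listed here; they are not dinopy's implementation and only cover the two Casava layouts and the NCBI-SRA wrapper.

```python
import re

# Hypothetical patterns derived from the documented layouts (not dinopy code).
CASAVA_18 = re.compile(
    rb"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*"
)
CASAVA_PRE18 = re.compile(rb"^@[^:\s]+:\d+:\d+:\d+:\d+#[^/\s]+/\d")
NCBI_SRA = re.compile(rb"^@\S+ .* length=\d+$")


class DummyLineSketch:
    """Illustrative stand-in for an unknown-format wrapper: raw bytes only."""

    def __init__(self, line: bytes):
        self.line = line

    def __getitem__(self, index: int) -> bytes:
        # A one-element tuple raises IndexError for every index except 0 and -1.
        return (self.line,)[index]

    def __str__(self) -> str:
        return self.line.decode("ascii", errors="replace")


def guess_style(line: bytes) -> str:
    """Return a label for the first layout whose pattern matches the line."""
    if CASAVA_18.match(line):
        return "casava_18"
    if CASAVA_PRE18.match(line):
        return "casava_pre18"
    if NCBI_SRA.match(line):
        return "ncbi_sra"
    return "unknown"


def wrap_if_unknown(line: bytes):
    """Return the style label, or a DummyLineSketch for unrecognized lines."""
    style = guess_style(line)
    return DummyLineSketch(line) if style == "unknown" else style
```

The patterns are checked from most to least specific, so a line that fits none of them falls through to the wrapper instead of raising.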
- class dinopy.nameline_parser.NamelineParser
Automatically picks the matching parsing function for Casava 1.8+, Casava <1.8, or NCBI-style namelines.
Examples
Parse Casava <1.8 style namelines:

    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
    print(line)
    # "instrument: b'HWUSI-EAS100R', flowcell_lane: 6, tile: 73, cluster_x: 941, cluster_y: 1973, index_sequence: b'TODO', pair_member: -1, additional_info: []"
    print(line.tile)  # 73

Parse Casava 1.8+ style namelines:

    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
    print(line)
    # "instrument: b'NS500639', run: 6, flowcell_id: b'H3MYMAFXX', lane: 1, tile: 11101, cluster_x: 9262, cluster_y: 1124, pair_member: 1, filtered: False, control_number: 0, index_sequence: b'TAATGC', additional_info: []"
    print(line.instrument)  # b'NS500639'

Parse NCBI style namelines (defunct):

    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
    print(line)  # TODO

Unknown nameline styles:

    from dinopy import NamelineParser

    parser = NamelineParser()
    line = parser.parse(b"@Something:Entirely Different")
    print(line)  # "@Something:Entirely Different"

- parse(self, bytes line) → _NameLine
- dinopy.nameline_parser.parse_casava_18_line(bytes line) → _NameLine
Split a Casava 1.8+ style Illumina FASTQ header. Documentation p. 50.
- Parameters:
line (bytes) – A Casava 1.8+ style header line.
- Returns:
A nameline object containing all information of the Casava line. Its fields are:
instrument (bytes)
run (int)
flowcell_id (bytes)
lane (int)
tile (int)
cluster_x (int)
cluster_y (int)
pair_member (int)
filtered (bool)
control_number (int)
index_sequence (bytes)
additional_info (list): additional information after the Casava line (empty most of the time)
- Return type:
_NameLine
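As a rough illustration of the field list above, the following pure-Python sketch splits a Casava 1.8+ nameline into a dictionary with the same kinds of values. `split_casava_18` is a hypothetical helper, not dinopy's parser. The key names follow the printed representation in the NamelineParser example, `flowcell_id` is kept as bytes because the sample value `H3MYMAFXX` is not numeric, and `N` in the filter field is mapped to `False` as in that example.

```python
def split_casava_18(line: bytes) -> dict:
    """Split a Casava 1.8+ nameline into typed fields (illustrative only)."""
    # The part before the first space holds the seven colon-separated fields.
    head, _, rest = line.lstrip(b"@").partition(b" ")
    # Unpacking raises ValueError when the field count does not fit.
    instrument, run, flowcell_id, lane, tile, x, y = head.split(b":")
    # Anything after the read description is kept as additional info.
    read_part, *additional = rest.split(b" ")
    pair_member, filtered, control_number, index_sequence = read_part.split(b":")
    if filtered not in (b"Y", b"N"):
        raise ValueError("filter flag must be Y or N")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell_id": flowcell_id,
        "lane": int(lane),
        "tile": int(tile),
        "cluster_x": int(x),
        "cluster_y": int(y),
        "pair_member": int(pair_member),
        "filtered": filtered == b"Y",
        "control_number": int(control_number),
        "index_sequence": index_sequence,
        "additional_info": additional,
    }


fields = split_casava_18(b"@NS500639:6:H3MYMAFXX:1:11101:9262:1124 1:N:0:TAATGC")
print(fields["tile"])  # 11101
```

A Casava <1.8 line has too few colon-separated fields for the first unpacking, so this sketch raises a `ValueError` for it.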
- dinopy.nameline_parser.parse_casava_pre18_line(bytes line) → _NameLine
Split a Casava <1.8 style Illumina FASTQ header. Documentation p. 50.
- Parameters:
line (bytes) – A Casava <1.8 style header line.
- Returns:
A nameline object containing all information of the Casava line. Its fields are:
instrument (bytes)
flowcell_lane (int)
tile (int)
cluster_x (int)
cluster_y (int)
index_sequence (bytes)
pair_member (int)
additional_info (list): additional information after the Casava line (empty most of the time)
- Return type:
_NameLine
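The older layout can be sketched in the same way. `split_casava_pre18` below is a hypothetical pure-Python helper written from the `@instrument:flowcell_lane:tile:cluster_x:cluster_y#index_sequence/pair_member` layout. The values it returns are simply what that layout implies and need not match what dinopy itself reports for the same line.

```python
def split_casava_pre18(line: bytes) -> dict:
    """Split a Casava <1.8 nameline into typed fields (illustrative only)."""
    # Anything after the first space is treated as additional info.
    body, *additional = line.lstrip(b"@").split(b" ")
    coords, sep, tail = body.partition(b"#")
    if not sep:
        raise ValueError("missing '#' before the index sequence")
    # Unpacking raises ValueError when the field count does not fit.
    instrument, lane, tile, x, y = coords.split(b":")
    index_sequence, sep, pair_member = tail.partition(b"/")
    if not sep:
        raise ValueError("missing '/' before the pair member")
    return {
        "instrument": instrument,
        "flowcell_lane": int(lane),
        "tile": int(tile),
        "cluster_x": int(x),
        "cluster_y": int(y),
        "index_sequence": index_sequence,
        "pair_member": int(pair_member),
        "additional_info": additional,
    }


fields = split_casava_pre18(b"@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields["tile"])  # 73
```

A Casava 1.8+ line contains no `#` in its first space-separated part, so the sketch rejects it with a `ValueError`.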
- dinopy.nameline_parser.parse_ncbi_line(bytes line) → _NameLine
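No field list is documented for this function. As a rough sketch of what the NCBI-SRA layout `SRA-id anyoftheabove length=n` contains, the hypothetical helper below separates the three parts in plain Python. The name `split_ncbi_sra` and the dictionary keys are assumptions made for illustration and say nothing about what `parse_ncbi_line` returns.

```python
def split_ncbi_sra(line: bytes) -> dict:
    """Split an NCBI-SRA nameline into id, wrapped name and length (sketch)."""
    # The SRA id is everything up to the first space.
    sra_id, _, rest = line.lstrip(b"@").partition(b" ")
    # The length annotation is the last " length=n" occurrence in the line.
    original_name, sep, length = rest.rpartition(b" length=")
    if not sep or not length.isdigit():
        raise ValueError("not an NCBI-SRA style nameline")
    return {
        "sra_id": sra_id,
        "original_name": original_name,
        "length": int(length),
    }


parts = split_ncbi_sra(b"@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
print(parts["length"])  # 36
```

The wrapped original name is left as raw bytes here, so it could be passed on to whichever format-specific splitter fits it.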