- class nxrepair.aligned_assembly(bamfile, fastafile, min_size, threshold, step, window, minmapq, maxinsert, fraction, prior)
Class to hold a set of mate-pair or paired-end reads aligned to the scaffolded genome assembly.
- breakContigs_double(outfile, breakpoints, trim)
Function to break contigs at positions identified as assembly errors and write a new fasta file containing all contigs (both altered and unaltered).
Makes a two-point break at each identified misassembly position, splitting 5 Kb upstream and downstream of the misassembly and (currently) excluding the misassembled region.
Arguments:
outfile: name of the new fasta file (including filepath)
breakpoints: dictionary of misassemblies; key = contig reference ID, value = list of misassembly positions within the contig
trim: distance, in bases, to trim from each edge of a breakpoint to remove the misassembly (integer)
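The two-point break described above can be sketched on a single sequence held as a string. This is a minimal illustration and not NxRepair's own code: the function name `break_contig` is hypothetical, positions are assumed to be 0-based, and file reading and writing are left out.

```python
def break_contig(seq, breakpoints, trim):
    """Split seq around each breakpoint, dropping `trim` bases on each side.

    Hypothetical helper for illustration only. Returns the list of
    sub-sequences that remain once every misassembled region is removed.
    """
    pieces = []
    start = 0  # start of the next piece to keep
    for bp in sorted(breakpoints):
        left_end = max(start, bp - trim)  # keep everything up to bp - trim
        if left_end > start:
            pieces.append(seq[start:left_end])
        # Resume after bp + trim, never moving backwards or past the end.
        start = max(start, min(len(seq), bp + trim))
    if start < len(seq):
        pieces.append(seq[start:])
    return pieces


# A 30-base toy contig with a misassembly at position 15 and a 5-base trim.
contig = "A" * 10 + "C" * 10 + "G" * 10
fragments = break_contig(contig, [15], 5)
```

Here the ten bases centred on the breakpoint are discarded and the two flanking pieces are kept as separate contigs. Breakpoints whose trimmed regions overlap are merged into a single excluded region.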
- get_anomalies(outfile, trim, img_name=None)
Function to determine the frequency of anomalous mate pair behaviour across the entire genome assembly and return a dictionary where: key = contig reference ID, value = list of positions within that contig where an assembly error is identified and the contig should be broken.
Calls get_size_anomalies and get_mapping_anomalies for each contig larger than aligned_assembly.min_size, and writes a .csv file listing, for each contig, the positions of identified misassemblies and their corresponding anomaly scores.
Arguments: outfile: name of file (including filepath) to store the list of contig misassemblies.
Keyword Arguments: img_name: name of file (including filepath, not including filetype) to store plots of alignment quality
Function to determine the frequency of strand mapping anomalies across the entire genome assembly.
Calls get_read_mappings for each contig larger than aligned_assembly.min_size and returns:
1) a dictionary with keys = contig reference IDs; values = list of positions and strand alignment ratios as described in get_read_mappings
2) a dictionary of anomalies with keys = contig reference IDs; values = [list of positions for which the ratio of correctly aligned strands < 0.75 (currently hard-coded), corresponding ratio of correctly aligned strands]
Function to calculate the fraction of read pairs within a contig that align correctly to opposite strands.
Returns five arrays: the positions at which strand alignment was evaluated, the fraction correctly aligned, the fraction incorrectly aligned to the same strand, the unmapped fraction, and the fraction that have some other alignment issue.
Arguments: ref: the reference ID of the contig to be evaluated
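As a rough illustration of the four fractions returned for each evaluated position, the sketch below tallies category labels for the read pairs in one window. The label strings and the function name `window_fractions` are assumptions made for this example. How NxRepair actually classifies each pair is not shown here.

```python
from collections import Counter

# Hypothetical category labels, one per read pair in a window.
CATEGORIES = ("correct", "same_strand", "unmapped", "other")


def window_fractions(labels):
    """Return the fraction of read pairs in each category for one window."""
    total = len(labels)
    if total == 0:
        # No reads in the window: report zero for every category.
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(labels)
    return {category: counts[category] / total for category in CATEGORIES}


fractions = window_fractions(["correct", "correct", "correct", "unmapped"])
```

Repeating this for each evaluated position and collecting the values per category would give arrays of the shape described above.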
Function to calculate the global insert size distribution across the whole assembly.
Returns a frequency table of insert sizes as a dictionary with key = insert size, value = frequency.
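A frequency table of this shape can be built with the standard library, as in the sketch below. The function names are hypothetical, and the list of insert sizes is assumed to have been extracted from the alignments already. The second helper shows one way to recover a mean and standard deviation from such a table; whether NxRepair summarises the distribution this way is an assumption.

```python
import math
from collections import Counter


def insert_size_table(insert_sizes):
    """Map each observed insert size to the number of times it occurs."""
    return dict(Counter(insert_sizes))


def table_mean_std(table):
    """Mean and population standard deviation of a non-empty frequency table."""
    total = sum(table.values())
    mean = sum(size * count for size, count in table.items()) / total
    variance = sum(count * (size - mean) ** 2 for size, count in table.items()) / total
    return mean, math.sqrt(variance)


table = insert_size_table([100, 300, 300, 100])
mean, std = table_mean_std(table)
```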
- get_reads(ref, start, end)
Function to fetch reads aligned to a specific part of the assembled genome and return a list of aligned reads, where each list entry is a tuple: (read start position, read end position, read name, strand alignment) and strand alignment is a boolean indicating whether the two reads of a read pair align correctly to opposite strands. Reads are fetched that align to contig “ref” between positions “start” and “end”.
Arguments:
ref: the name of the contig from which aligned reads are to be fetched
start: the position on the contig from which to start fetching aligned reads
end: the position on the contig at which to stop fetching aligned reads
Function to determine the frequency of insert size anomalies across the entire genome assembly.
Calls probability_of_readlength for each contig larger than aligned_assembly.min_size and returns:
1) a dictionary with keys = contig reference IDs; values = array of z-scores as described in probability_of_readlength
2) a dictionary of anomalies with keys = contig reference IDs; values = [list of positions for which abs(z-score) > 2 (currently hard-coded), corresponding z-score value]
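The thresholding step can be sketched as a simple filter over paired positions and z-scores. The function name `flag_size_anomalies` is hypothetical, and the z-scores are assumed to have been computed elsewhere. The cutoff of 2 mirrors the hard-coded value mentioned above.

```python
def flag_size_anomalies(positions, zscores, cutoff=2.0):
    """Return (position, z-score) pairs whose absolute z-score exceeds cutoff."""
    return [(pos, z) for pos, z in zip(positions, zscores) if abs(z) > cutoff]


flagged = flag_size_anomalies([0, 1000, 2000], [0.5, -2.5, 2.0])
```

The comparison is strict, so a z-score of exactly 2 is not flagged in this sketch.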
Function to construct an interval tree from reads aligning to a contig and return the interval tree.
The interval tree stores nodes with properties start (start position of interval), end (end position of interval) and other, which is a tuple of the mate pair name (string) and the strand alignment of the two paired reads (boolean).
Arguments: ref: reference ID of the contig for which the interval tree is to be constructed
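For readers unfamiliar with the data structure, the sketch below shows a small centered interval tree whose entries carry the same start, end and other properties. It is an illustrative stand-in rather than the interval tree implementation NxRepair uses. The class names are hypothetical, and the sketch assumes closed intervals with start <= end and a non-empty input list.

```python
from collections import namedtuple

# Each entry mirrors the node properties described above.
Interval = namedtuple("Interval", ["start", "end", "other"])


class IntervalNode:
    """Centered interval tree node (illustrative; requires a non-empty list)."""

    def __init__(self, intervals):
        # Use a median endpoint as the centre, so at least one interval
        # contains it and both subtrees are strictly smaller.
        endpoints = sorted(p for iv in intervals for p in (iv.start, iv.end))
        self.center = endpoints[len(endpoints) // 2]
        self.here = [iv for iv in intervals if iv.start <= self.center <= iv.end]
        left = [iv for iv in intervals if iv.end < self.center]
        right = [iv for iv in intervals if iv.start > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlapping(self, qstart, qend):
        """Return all stored intervals that overlap the closed range [qstart, qend]."""
        hits = [iv for iv in self.here if iv.start <= qend and iv.end >= qstart]
        # Left-subtree intervals end before the centre, so they can only
        # overlap if the query starts before it; symmetric on the right.
        if self.left is not None and qstart < self.center:
            hits += self.left.overlapping(qstart, qend)
        if self.right is not None and qend > self.center:
            hits += self.right.overlapping(qstart, qend)
        return hits


reads = [
    Interval(0, 100, ("pair1", True)),
    Interval(50, 150, ("pair2", False)),
    Interval(200, 300, ("pair3", True)),
]
tree = IntervalNode(reads)
names = sorted(iv.other[0] for iv in tree.overlapping(120, 210))
```

Querying the tree for a window returns every mate pair spanning any part of it, which avoids scanning all reads on the contig for each window.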