Tutorial: Misassembly Detection using NxRepair
NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector breaks a contig at the site of an identified misassembly and generates a new fasta file containing both the corrected contigs and the contigs that were already correct and left unchanged.
Installing NxRepair
NxRepair can be cloned from GitHub:
git clone https://github.com/rebeccaroisin/nxrepair
NxRepair depends on several other Python libraries, which you will need to install. Specifically, you will need:
- scipy
- numpy
- matplotlib
- pysam
Scipy, numpy and matplotlib can be installed using your favourite package manager or from the Python Package Index (PyPI) using:
pip install numpy
pip install scipy
pip install matplotlib
If you don’t currently have any of these libraries, we recommend installing Anaconda, a Python distribution that includes all of them, along with all major scientific and analytical Python packages.
Pysam can also be installed using “pip install”, but installation errors are more common than with the other libraries.
Assistance with installation errors for pysam can be found online. Note that the current version of pysam wraps samtools-0.1.19 and tabix-0.2.6.
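Before running NxRepair, it can help to confirm that the four libraries listed above are importable. The snippet below is a minimal sketch that is not part of NxRepair; the helper name missing_modules is hypothetical, and it uses only the Python standard library.

```python
import importlib.util

# Libraries the tutorial lists as NxRepair dependencies.
REQUIRED = ["numpy", "scipy", "matplotlib", "pysam"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All NxRepair dependencies are importable.")
```

Any library reported as missing can then be installed with pip or through Anaconda, as described above.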
Running NxRepair
NxRepair is run from the command line:
python nxrepair.py aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta
The required arguments are as follows:
- aligned_matepairs.bam: an indexed bam file of mate pair reads aligned to your assembly;
- assemblyfasta.fasta: the fasta file containing your contigs;
- outfile (error_locations.csv in the example above): the name of a csv file in which to store the Z scores generated by NxRepair;
- newfasta (new_fasta.fasta in the example above): the filename of the fasta file that will hold the new contigs following analysis.
There are also several optional arguments, which can be used to tune the NxRepair analysis. These are described in the table below.
Parameter | Default Value | Meaning |
---|---|---|
imgname | None | Prefix under which to save plots. |
maxinsert | 30000 | Maximum insert size, below which a read pair is included in calculating population statistics. |
minmapq | 40 | Minimum MapQ value, above which a read pair is included in calculating population statistics. |
minsize | 10000 | Minimum contig size to analyse. |
prior | 0.01 | Prior probability that the insert size is anomalous. |
stepsize | 1000 | Step-size in bases to traverse contigs. |
trim | 5000 | Number of bases to trim from each side of an identified misassembly. |
T | -4.0 | Threshold in Z score (standard deviations from the mean) below which a misassembly is called. |
window | 200 | Window size across which bridging mate pairs are evaluated. |
Optional arguments are passed on the command line, as shown in the example below:
python nxrepair.py aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta -minsize 20000 -trim 4000 -T -5.0
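When NxRepair is run on many assemblies, the same command can be assembled from a script. The sketch below is illustrative only: build_nxrepair_command is a hypothetical helper, not part of NxRepair, and it assumes that every optional parameter is written as "-name value", exactly as in the example above.

```python
import subprocess


def build_nxrepair_command(bam, fasta, csv_out, new_fasta, **options):
    """Assemble the argument list for nxrepair.py.

    Optional parameters are written as "-name value", mirroring the
    command-line example in this tutorial.
    """
    command = ["python", "nxrepair.py", bam, fasta, csv_out, new_fasta]
    for name, value in options.items():
        command += ["-" + name, str(value)]
    return command


command = build_nxrepair_command(
    "aligned_matepairs.bam", "assemblyfasta.fasta",
    "error_locations.csv", "new_fasta.fasta",
    minsize=20000, trim=4000, T=-5.0,
)
print(" ".join(command))

# To actually run the analysis (requires nxrepair.py and the input files):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.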
The program will parse a bam file of reads aligned to your de novo assembly. Each contig that is larger than the minsize parameter will be analysed for potential structural misassemblies. When the program completes, at least two new files will have been generated:
- A new fasta file, specified by newfasta, that contains the improved contigs of the de novo assembly.
- A csv file that identifies the exact position where altered contigs were broken.
- If the optional argument imgname was supplied, a plot is generated for each contig analysed, showing the insert size distribution and directionality across the contig, with anomalous regions highlighted. These plots are saved under the prefix specified by imgname.
The csv file and the plots can help identify further, smaller structural misassemblies, and allow detected misassemblies to be verified using IGV.
How Does it Work?
NxRepair evaluates the insert sizes of mate pairs aligned across a contig. Regions of the contig that have unusual insert sizes, where few reads are aligned, or where a large fraction of the mate pairs have incorrect orientation are flagged as potentially anomalous based on a simple probabilistic model of the mate-pair size distribution. Where there is strong evidence that a region is misassembled, the contig is broken into two pieces and, by default, 5 kb of erroneous assembly (the trim parameter) is trimmed from both sides of the break.
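The thresholding and breaking steps described above can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not NxRepair's actual implementation: the function names and the (position, Z score) input format are hypothetical, and the Z scores are supplied by hand rather than computed from a probabilistic model of insert sizes.

```python
def flag_positions(z_scores, threshold=-4.0):
    """Return positions whose Z score falls below the threshold.

    z_scores is a list of (position, z) pairs, one per step along the contig.
    The default threshold matches the default value of T in the table above.
    """
    return [pos for pos, z in z_scores if z < threshold]


def break_contig(sequence, break_pos, trim=5000):
    """Split a contig at break_pos, trimming `trim` bases from each side."""
    left = sequence[:max(break_pos - trim, 0)]
    right = sequence[break_pos + trim:]
    return left, right


# Toy example: a 20-base "contig" with one strongly negative score.
contig = "ACGTACGTACGTACGTACGT"
scores = [(5, -0.3), (10, -6.2), (15, 0.8)]

for pos in flag_positions(scores):
    # A tiny trim value is used here only because the toy contig is short.
    left, right = break_contig(contig, pos, trim=2)
    print(pos, left, right)
```

Only the position with a Z score below the threshold triggers a break, and each resulting piece ends short of the break point by the trim length, mirroring the role of the T and trim parameters.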