Tutorial: Misassembly Detection using NxRepair

NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector breaks a contig at the site of each identified misassembly and generates a new fasta file containing both the corrected contigs and the unaffected contigs.

Installing NxRepair

NxRepair can be cloned from GitHub:

git clone https://github.com/rebeccaroisin/nxrepair

NxRepair uses several other python libraries, which you will need to install. Specifically, you will need:

  • scipy
  • numpy
  • matplotlib
  • pysam

Scipy, numpy and matplotlib can be installed using your favourite package manager or from the Python Package Index (PyPI) using:

pip install numpy
pip install scipy
pip install matplotlib

If you don’t currently have any of these libraries, we recommend installing Anaconda, a Python distribution that includes all of these libraries, along with all major scientific and analytical Python packages.

Pysam can also be installed with “pip install pysam”, but installation errors are more common.

Assistance with installation errors for pysam can be found online. Note that the current version of pysam wraps samtools-0.1.19 and tabix-0.2.6.
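Before running NxRepair, it can help to confirm that all four libraries are importable. The following is a minimal sketch using only the Python standard library; the function name missing_modules is a hypothetical helper and is not part of NxRepair.

```python
import importlib.util

# Modules listed above as NxRepair dependencies.
REQUIRED_MODULES = ["scipy", "numpy", "matplotlib", "pysam"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All NxRepair dependencies are importable.")
```

Any module reported as missing can then be installed with pip or through Anaconda as described above.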

Running NxRepair

NxRepair can be run from the command line.

python nxrepair.py aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta

The required arguments are as follows:

  • aligned_matepairs.bam: an indexed bam file of mate pair reads aligned to your assembly;
  • assemblyfasta.fasta: the fasta file containing your contigs;
  • outfile (error_locations.csv in the example above): the name of a csv file in which to store the Z scores generated by NxRepair;
  • newfasta (new_fasta.fasta in the example above): the filename of the fasta file that will hold the new contigs following analysis.
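Because the bam file must be indexed, a quick check for an index file can save a failed run. The sketch below assumes the index sits next to the bam file under one of the two names commonly produced by samtools (reads.bam.bai or reads.bai); has_bam_index is a hypothetical helper, not an NxRepair function.

```python
from pathlib import Path


def has_bam_index(bam_path):
    """Return True if an index file sits next to the BAM file.

    Assumption: the index uses one of the two common names,
    `<name>.bam.bai` or `<name>.bai`.
    """
    bam = Path(bam_path)
    candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    return any(candidate.is_file() for candidate in candidates)
```

If the check returns False, the bam file can usually be indexed with samtools (samtools index aligned_matepairs.bam) after it has been sorted by coordinate.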

There are also several optional arguments, which can be used to tune the NxRepair analysis. These are described in the table below.

Parameter   Default Value   Meaning
imgname     None            Prefix under which to save plots.
maxinsert   30000           Maximum insert size, below which a read pair is included in calculating population statistics.
minmapq     40              Minimum MapQ value, above which a read pair is included in calculating population statistics.
minsize     10000           Minimum contig size to analyse.
prior       0.01            Prior probability that the insert size is anomalous.
stepsize    1000            Step size in bases to traverse contigs.
trim        5000            Number of bases to trim from each side of an identified misassembly.
T           -4.0            Threshold in Z score (standard deviations from the mean) below which a misassembly is called.
window      200             Window size across which bridging mate pairs are evaluated.
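To illustrate how maxinsert and minmapq restrict the read pairs used for population statistics, here is a minimal sketch. It is not NxRepair's implementation: the input is a hypothetical list of (insert_size, mapq) tuples rather than records read from a bam file, and treating both bounds as inclusive is an assumption.

```python
from statistics import mean, stdev


def population_stats(pairs, maxinsert=30000, minmapq=40):
    """Mean and standard deviation of insert sizes for the read pairs kept.

    `pairs` is an iterable of (insert_size, mapq) tuples, a hypothetical
    stand-in for values read from the BAM file.
    Assumption: both bounds are inclusive.
    """
    kept = [abs(size) for size, mapq in pairs
            if mapq >= minmapq and abs(size) <= maxinsert]
    if len(kept) < 2:
        raise ValueError("need at least two read pairs after filtering")
    return mean(kept), stdev(kept)
```

For example, population_stats([(3000, 60), (5000, 60), (4000, 10), (90000, 60)]) keeps only the first two pairs: the third fails the MapQ filter and the fourth exceeds the maximum insert size.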

Optional arguments are passed on the command line, as shown in the example below:

python nxrepair.py aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta -minsize 20000 -trim 4000 -T -5.0
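When NxRepair is run over many assemblies, the command can be assembled from Python. The sketch below only builds the argument list; build_nxrepair_command is a hypothetical helper, and it assumes every optional argument is written as a single dash followed by its name, as in the example above.

```python
import subprocess


def build_nxrepair_command(bam, fasta, outfile, newfasta, **options):
    """Assemble the argument list for nxrepair.py.

    Keyword options are written as `-name value`, mirroring the
    single-dash style of the example above (e.g. minsize=20000).
    """
    command = ["python", "nxrepair.py", bam, fasta, outfile, newfasta]
    for name, value in options.items():
        command += ["-" + name, str(value)]
    return command


command = build_nxrepair_command(
    "aligned_matepairs.bam", "assemblyfasta.fasta",
    "error_locations.csv", "new_fasta.fasta",
    minsize=20000, trim=4000, T=-5.0,
)
print(" ".join(command))
# To launch the run, uncomment the next line
# (requires nxrepair.py and the input files to be present):
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.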

The program will parse a bam file of reads aligned to your de novo assembly. Each contig that is larger than the minsize parameter will be analysed for potential structural misassemblies. When the program completes, at least two new files will have been generated (outputs 1 and 2 below), plus an optional set of plots (output 3):

  1. A new fasta file, specified by newfasta, that contains the improved contigs of the de novo assembly.
  2. A csv file that identifies the exact position where altered contigs were broken.
  3. If the optional argument -img_name (imgname in the table above) was included, a plot is generated for each contig analysed, showing the insert size distribution and directionality across the contig, with anomalous regions highlighted. These plots are saved under the prefix specified by img_name.

Outputs 2 and 3 can allow identification of further, smaller structural misassemblies, as well as enabling verification of detected misassemblies using the Integrative Genomics Viewer (IGV).

How Does it Work?

NxRepair evaluates the insert sizes of mate pairs aligned across a contig. Regions of the contig that have unusual insert sizes, where few reads are aligned, or where a large fraction of the mate pairs have incorrect orientation are flagged as potentially anomalous based on a simple probabilistic model of the mate-pair size distribution. Where there is strong evidence that a region is misassembled, the contig will be broken into two pieces and 5 kb (the default value of trim) of erroneous assembly will be trimmed from both sides of the break.
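The thresholding and breaking steps can be sketched as follows. This is a simplified illustration, not NxRepair's probabilistic model: the per-position support values are hypothetical inputs, and the Z score here is a plain standardisation of those values across the contig.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values: (value - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def flag_positions(support, threshold=-4.0):
    """Indices whose Z score falls below the threshold (T in the table)."""
    return [i for i, z in enumerate(z_scores(support)) if z < threshold]


def break_contig(sequence, position, trim=5000):
    """Split a contig at `position`, dropping `trim` bases on each side."""
    left = sequence[:max(position - trim, 0)]
    right = sequence[position + trim:]
    return left, right
```

With hypothetical support values of 100 at every sampled position except a single 0, flag_positions returns the index of that drop. Calling break_contig at the corresponding base position then yields the two trimmed pieces that would be written to the new fasta file.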