Dedup FAQ

NOTE: isoseq groupdedup is now the recommended deduplication tool that replaces the older, slower isoseq dedup. However some documentation figures might still refer to the old isoseq dedup tool as reference.

This FAQ explains how isoseq groupdedup identifies two reads to be from the same founder molecule.

Adjusting maximum mismatches and shifts

The following parameters control the thresholds for mismatches and shifts:

  --max-tag-mismatches      INT   Maximum number of mismatches between tags. [1]
  --max-tag-shift           INT   Tags may be shifted by at maximum of N bases. [1]

In case of unusually short designs (such as a custom UMI that is only 6bp and no cell barcodes), default parameters might lead to overclustering. In this case, please adjust parameters accordingly.

The following is an example of one founder molecule that is sequenced twice. PCR and sequencing errors are introduced, leading to a clipped base in one of the cell barcodes and a substitution in the other cell barcode.

Method

Perform all vs all comparison and cluster two reads if:

lengths are within +- 50 bp length
UMI (+cell barcode) match with at max 1 mismatch and may be shifted by at max 1 base
pairwise concordance is at least 97%
alignment starts/ends within 5 bp of the other read
no more than 5 bps are deleted or inserted in a window of 20 bp (like in isoseq cluster)
groupdedup only: these reads have the same cell barcode

Adjusting insert transcript concordance

The following parameters control the thresholds for how well the inserts (in this case, transcripts) match, even when the UMIs and BCs already match:

  --min-concordance-perc    INT   Minimum insert alignment concordance in %. [97]
  --max-insert-gaps         INT   Maximum number of insert gaps per 20 bp window. [5]

While rare, it is possible to have different transcript molecules share the same UMIs and BCs.

groupdedup only: cell barcode and real cells

If using isoseq groupdedup (which is recommended over isoseq dedup), it can use the corrected cell barcodes from the isoseq correct step for grouping reads.

However, the BAM file must first be sorted by CB tag:

samtools sort –t CB corrected.bam –o corrected.sorted.bam
isoseq groupdedup corrected.bam dedup.bam 

Additionally, isoseq groupdedup can use the rc tag from the isoseq correct step and apply to only real cells. This can be turned off with the option below (advanced, not recommended by default):

  --keep-non-real-cells           Do not skip reads with non-real cells.