isoseq groupdedup is now the recommended deduplication tool and replaces the older, slower isoseq dedup. Some documentation figures may still refer to isoseq dedup.
This FAQ explains how isoseq groupdedup decides whether two reads come from the same founder molecule.
The following parameters control the thresholds for mismatches and shifts:
--max-tag-mismatches INT  Maximum number of mismatches between tags.
--max-tag-shift INT       Tags may be shifted by at maximum of N bases.
For unusually short designs (such as a custom UMI that is only 6 bp long with no cell barcodes), the default parameters might lead to overclustering, that is, reads from distinct molecules being merged. In this case, lower the thresholds above accordingly.
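To see why short tags are prone to overclustering, the following sketch estimates what fraction of all possible tags of a given length lie within the mismatch budget of one particular tag. It is a back-of-the-envelope illustration, not part of isoseq: the function names are hypothetical, tags are assumed to be uniformly random, and shifts are ignored.

```python
from math import comb


def tags_within_mismatches(tag_length: int, max_mismatches: int = 1) -> int:
    """Number of distinct tags within max_mismatches substitutions of a given tag."""
    # Choose k positions to change, each to one of the 3 other bases.
    return sum(comb(tag_length, k) * 3**k for k in range(max_mismatches + 1))


def collision_fraction(tag_length: int, max_mismatches: int = 1) -> float:
    """Fraction of the 4^n tag space that would match a given tag (assumes uniform tags)."""
    return tags_within_mismatches(tag_length, max_mismatches) / 4**tag_length


if __name__ == "__main__":
    for n in (6, 10, 12):
        print(f"{n} bp tag: {collision_fraction(n):.2e} of all tags match")
```

Under these assumptions, a 6 bp UMI has 19 matching tags out of 4096, so unrelated molecules are far more likely to share a compatible tag than with a longer UMI.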
The following is an example of one founder molecule that is sequenced twice. PCR and sequencing errors are introduced, leading to a clipped base in one of the cell barcodes and a substitution in the other cell barcode.
Perform an all-vs-all comparison and cluster two reads if:
- read lengths are within ±50 bp of each other
- UMIs (+ cell barcodes) match with at most 1 mismatch and may be shifted by at most 1 base
- pairwise concordance is at least 97%
- each read's alignment starts and ends within 5 bp of the other read's
- no more than 5 bp are deleted or inserted in any 20 bp window (as in isoseq cluster)
- groupdedup only: both reads have the same cell barcode
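The criteria that do not require an alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual isoseq implementation: the `Read` class and the functions `tags_match` and `passes_prefilters` are hypothetical, and a shift is modeled as dropping leading bases from one tag before comparing the overlap.

```python
from dataclasses import dataclass


@dataclass
class Read:
    length: int        # insert length in bp
    umi: str           # UMI sequence
    cell_barcode: str  # corrected cell barcode


def mismatches(a: str, b: str) -> int:
    """Count mismatches over the overlapping prefix of two strings."""
    return sum(x != y for x, y in zip(a, b))


def tags_match(tag_a: str, tag_b: str, max_mismatches: int = 1, max_shift: int = 1) -> bool:
    """True if the tags agree within the mismatch budget at some allowed shift."""
    for shift in range(max_shift + 1):
        # shift == 0 compares the tags directly; otherwise one tag is offset.
        if mismatches(tag_a[shift:], tag_b) <= max_mismatches:
            return True
        if mismatches(tag_a, tag_b[shift:]) <= max_mismatches:
            return True
    return False


def passes_prefilters(a: Read, b: Read, max_length_diff: int = 50) -> bool:
    """Length, cell barcode, and UMI checks; the alignment checks are not modeled."""
    return (
        abs(a.length - b.length) <= max_length_diff
        and a.cell_barcode == b.cell_barcode
        and tags_match(a.umi, b.umi)
    )
```

For example, `tags_match("ACGTACGTAC", "ACGTTCGTAC")` accepts a single substitution, and `tags_match("CGTACGTACG", "ACGTACGTAC")` accepts a tag that lost its first base. A pair that passes these checks would still need to meet the concordance, alignment boundary, and gap criteria before being clustered.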
The following parameters control the thresholds for how well the inserts (in this case, transcripts) match, even when the UMIs and BCs already match:
--min-concordance-perc INT  Minimum insert alignment concordance in %.
--max-insert-gaps INT       Maximum number of insert gaps per 20 bp window.
While rare, it is possible to have different transcript molecules share the same UMIs and BCs.
isoseq groupdedup (recommended over isoseq dedup) can use the corrected cell barcodes from the isoseq correct step to group reads.
However, the BAM file must first be sorted by the CB tag:
samtools sort -t CB corrected.bam -o corrected.sorted.bam
isoseq groupdedup corrected.sorted.bam dedup.bam
isoseq groupdedup can use the rc tag from the isoseq correct step to deduplicate only reads from real cells. This can be turned off with the option below (advanced, not recommended by default):
--keep-non-real-cells Do not skip reads with non-real cells.