Single-cell, spatially-resolved, and other barcoded sequencing applications rely on the accuracy of the cell or group barcode, which is typically chosen from a set of known candidates, often referred to as a “whitelist”.
This contrasts with the uniformly randomly-generated molecular barcodes (a.k.a. UMIs, “Unique molecular identifiers”).
This tool uses the set of known candidates to correct sequencing errors in cell barcode identification. There are two primary benefits:
- Increased yield
- Improved accuracy in downstream deduplication.
By correcting errors in cell barcodes, the total number of usable reads is increased (typically ~5%).
And, once cell barcodes are corrected, the downstream groupdedup software tool can perform deduplication much more efficiently than standard deduplication. This is because only reads sharing a cell barcode are compared, which dramatically reduces the search space compared to exhaustive pairwise comparisons.
First, the correct tool builds a Locality-Sensitive Hashing (LSH) index over the 10x whitelist barcode subsequences. In the second step, correct uses the LSH index to map raw input barcodes to their nearest barcodes in the truth-set.
For each input HiFI read containing a 10x cell barcode:
- If the barcode is in the whitelist, it is unchanged.
- If the barcode is not found in the whitelist, the index is queried for the closest match in the whitelist.
- Edit distance is calculated between all retrieved whitelist cell barcodes and the input barcode.
- The barcode with the lowest edit distance and lowest hamming distance is output.
- By default, if the edit distance between the cell barcode and whitelist barcode is > 2, the read is marked as failing.
- If no candidates were found, the barcode is unchanged, and the read is marked as failing.
In addition, “real cells” will be marked with
rc tag after this step, which will be used by
isoseq groupdedup. For details on real cell calling, visit the cell calling page.
Run this tool on barcode-tagged BAM files before deduplication (
isoseq groupdedup). This provides substantial runtime improvements compared to
isoseq correct --barcodes barcodes.txt[.gz] input.bam output.bam
Common single-cell whitelists (e.g. 10x whitelist for 3’ kit) can be found in the MAS-Seq dataset. These are the reverse complement of the 10x single-cell whitelists.
This requires the existance of XC and XU barcode tags. The program will fail if either are missing.
We also add or update the following tags:
|Tag||Type||Short Name||Relevant Executable||Value|
|CR||string||Cell Barcode|| ||Raw (uncorrected) cell barcode|
|CB||string||Cell Barcode|| ||Corrected cell barcode|
|XC||string||Cell Barcode|| ||Raw cell barcode|
|nc||int||Number of Candidates|| ||Number of candidate barcodes.|
|oc||string||Other Choices|| ||String representation of other potential barcodes.|
|gp||int||Group Passes|| ||Flag specifying whether or not the barcode for the given read passes filters. 1 for passing, 0 for failing.|
|nb||int||Barcode Distance|| ||Edit distance from the barcode for the read to the barcode to which it was reassigned. This is 0 if the barcode matches exactly, -1 if the barcode could not be rescued, and the edit distance otherwise.|
|rc||int||Real Cell|| ||Predicted real cell. This is 1 if a read is predicted to come from a real cell and 0 if predicted to be a non-real cell.|