CLI Workflow

This page explains the low-level workflow through individual CLI calls. All necessary dependencies can be installed via bioconda.

For a toy dataset and command walkthrough, see the Example page.

Step 1 - Input

HiFi Reads

Each sequencing run is processed by ccs to generate one HiFi read per productive ZMW. After CCS is performed, you can use the hifi_reads.bam as input. The hifi_reads.bam contains only HiFi reads, with predicted accuracy ≥Q20, so no additional filtering is required. HiFi reads that have been demultiplexed can also be used.

Segmented Reads

HiFi reads that have been segmented using skera as a part of the MAS-Seq single cell application can also be used.

Step 2 - Primer removal

Removal of primers and identification of barcodes is performed using lima, which can be installed with conda install lima and offers a specialized --isoseq mode. Primer removal is performed by lima even if your sample is not barcoded.

For more information on how to name input primer sequences, see the lima Iso-Seq FAQ.

$ lima <movie>.hifi_reads.bam primers.fasta <movie>.fl.bam --isoseq

Example 1: If using the 10x 3’ kit, the primers are below:

>5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>3p
AGATCGGAAGAGCGTCGTGTAG

Example 2: If using the 10x 5’ kit, the primers are below:

>5p
CTACACGACGCTCTTCCGATCT
>3p
GTACTCTGCGTTGATACCACTGCTT
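
The primer file can be created directly from the shell. The following is a minimal sketch for the 10x 3’ kit, assuming the file name primers.fasta used in the lima command above; the sequences are copied from Example 1. For the 10x 5’ kit, substitute the sequences from Example 2.

```shell
# Write the 10x 3' kit primers (Example 1) to primers.fasta.
cat > primers.fasta <<'EOF'
>5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>3p
AGATCGGAAGAGCGTCGTGTAG
EOF

# Sanity check: count the FASTA records (one for 5p, one for 3p).
grep -c '^>' primers.fasta
```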

lima removes unwanted primer combinations and orients sequences 5’ → 3’.

Output files are named according to their primer pair. Example for single sample libraries:

<movie>.fl.5p--3p.bam

Step 3 - Tag

Tags, such as UMIs and cell barcodes, have to be clipped from the reads and associated with the reads for later deduplication.

Input The input file for tag is one full-length CCS file from the previous lima step: <movie>.fl.5p--3p.bam

Output The following output files of tag contain full-length tagged reads:

<movie>.flt.bam
<movie>.flt.transcriptset.xml

Insert your own design or pick a preset:

$ isoseq tag <movie>.fl.5p--3p.bam <movie>.flt.bam --design XXX

Refer to the UMI and BC design page for how to specify --design.

For example, the 10x 3’ (v3.1) kit has a 12bp UMI and 16bp BC on the 3’ end, so the design would be --design T-12U-16B.

In contrast, the 10x 5’ kit has a 16bp BC, 10bp UMI, and 13bp TSO, so the design would be --design 16B-10U-13X-T.
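
The two designs above can be selected in a script. This is only a sketch: KIT and DESIGN are hypothetical variable names introduced for illustration, and the command is printed rather than executed, with <movie> left as a placeholder.

```shell
# KIT is a hypothetical variable for this sketch: "3p" or "5p".
KIT="${KIT:-3p}"

case "$KIT" in
  3p) DESIGN="T-12U-16B" ;;      # 10x 3' (v3.1): 12 bp UMI and 16 bp BC on the 3' end
  5p) DESIGN="16B-10U-13X-T" ;;  # 10x 5': 16 bp BC, 10 bp UMI, 13 bp TSO
  *)  echo "unknown kit: $KIT" >&2; exit 1 ;;
esac

# Print the tag command for review; replace <movie> before running it.
echo "isoseq tag <movie>.fl.5p--3p.bam <movie>.flt.bam --design $DESIGN"
```

For any other library layout, the design string has to be written by hand as described on the UMI and BC design page.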

Step 4 - Refine

Your data now contains full-length tagged reads, but still needs to be refined by:

Trimming of poly(A) tails
Unintended concatemer identification and removal (note: if the library was constructed using the MAS-Seq method, the reads should have already gone through skera and are not expected to contain any more concatemers at this step)

Input The input files for refine are the full-length tagged reads and the primer FASTA file:

<movie>.flt.bam or <movie>.flt.transcriptset.xml
primers.fasta

Output The following output files of refine contain full-length tagged non-concatemer (FLTNC) reads:

<movie>.fltnc.bam
<movie>.fltnc.transcriptset.xml

Actual command to refine:

$ isoseq refine <movie>.flt.bam primers.fasta <movie>.fltnc.bam --require-polya

If your sample has poly(A) tails, use --require-polya.

This filters for FL reads that have a poly(A) tail of at least 20 bp. You can change the minimum poly(A) length with --min-polya-length.
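
As an illustration of changing the minimum, the sketch below assembles a refine call with a stricter threshold. MIN_POLYA and REFINE_CMD are hypothetical variable names, and the value 30 is arbitrary, chosen only for the example; the command is printed, not run.

```shell
# MIN_POLYA is a hypothetical variable; 30 is an arbitrary value for illustration.
MIN_POLYA=30

# Assemble the refine call with a stricter poly(A) minimum and print it for review.
# <movie> is a placeholder to replace before running the command.
REFINE_CMD="isoseq refine <movie>.flt.bam primers.fasta <movie>.fltnc.bam --require-polya --min-polya-length $MIN_POLYA"
echo "$REFINE_CMD"
```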

Step 4b - Merge SMRT Cells

If you used more than one SMRT Cell, merge all of your <movie>.fltnc.bam files by listing them in a file of file names (fofn):

$ ls <movie1>.fltnc.bam <movie2>.fltnc.bam ... <movieN>.fltnc.bam > fltnc.fofn
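
When there are many SMRT Cells, the list can be built with a loop instead of typing each name. A minimal sketch, assuming all FLTNC files sit in one directory and end in .fltnc.bam; make_fofn is a hypothetical helper name, not part of any PacBio tool.

```shell
# make_fofn is a hypothetical helper: list every *.fltnc.bam in DIR into OUT,
# one path per line.
make_fofn() {
  dir="$1"
  out="$2"
  : > "$out"                        # create or truncate the output file
  for bam in "$dir"/*.fltnc.bam; do
    [ -e "$bam" ] || continue       # skip the literal pattern when nothing matches
    printf '%s\n' "$bam" >> "$out"
  done
}

# List the FLTNC files of the current directory in fltnc.fofn.
make_fofn . fltnc.fofn
```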

Step 5 - Cell Barcode Correction and Real Cell Identification

This step identifies cell barcode errors and corrects them. The tool uses a cell barcode whitelist to reassign erroneous barcodes based on edit distance. Additionally, the correct tool estimates which reads are likely to originate from a real cell and labels them using the rc tag.

Common single-cell whitelists (e.g. 10x whitelist for 3’ kit) can be found in the MAS-Seq dataset. These are the reverse complement of the 10x single-cell whitelists.
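
To illustrate what reverse complementing a whitelist means, the sketch below transforms a plain-text list with one barcode per line. revcomp_whitelist is a hypothetical helper name, and the two barcodes are toy values, not real 10x barcodes. The whitelists in the MAS-Seq dataset are described above as already reverse complemented, so this is only relevant when starting from a whitelist in the original orientation.

```shell
# revcomp_whitelist is a hypothetical helper: reverse-complement one barcode per line.
# rev reverses each line; tr then swaps each base for its complement.
revcomp_whitelist() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Toy input; a real whitelist would be read from a file instead.
printf 'AACG\nTTTTGGGG\n' | revcomp_whitelist
# prints:
# CGTT
# CCCCAAAA
```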

For details on barcode correction, visit the barcode correction page.

Input The input file for correct is one FLTNC file:

<movie>.fltnc.bam

Output The following output files of correct contain reads with corrected cell barcodes:

<prefix>.bam
<prefix>.bam.pbi

Example:

$ isoseq correct --barcodes barcode_set.txt fltnc.bam fltnc.corrected.bam

Step 6 - Deduplication

This step performs PCR deduplication via clustering by UMI and cell barcodes (if available).

We provide two methods: dedup and groupdedup. It is recommended to use groupdedup for most cases.

They provide nearly identical functionality. The key difference is that groupdedup only deduplicates reads sharing a cell barcode, so it requires both barcode correction with the correct tool and sorting by cell barcode (tag “CB”). Sorting a BAM by cell barcode can be done efficiently with samtools sort -t CB.

These requirements exist because sequencing errors introduce erroneous barcodes, yielding spurious reads. dedup allows for barcode errors through pairwise barcode alignment, whereas groupdedup assumes that barcodes are correct. Performing the correction step first allows the faster groupdedup step to reasonably make this assumption while also allowing for mismatches using the index.

Using groupdedup can provide over 200x speed-ups and substantially reduce RAM requirements.

If the rc tag added by correct is present in the input, groupdedup and dedup will filter the output to only include real cells. This can be turned off using --keep-non-real-cells.

After deduplication, dedup and groupdedup generate one consensus sequence per founder molecule, using a QV-guided consensus.

For more details, visit the dedup FAQ.

NOTE: isoseq dedup is replaced by isoseq groupdedup

Input The input file for isoseq groupdedup is one FLTNC file, sorted by cell barcode (CB) tag: <movie>.fltnc.corrected.sorted.bam

Output The following output files of isoseq groupdedup contain polished isoforms:

<prefix>.bam
<prefix>.bam.pbi

Example:

$ samtools sort -t CB fltnc.corrected.bam -o fltnc.corrected.sorted.bam
$ isoseq groupdedup fltnc.corrected.sorted.bam dedup.bam
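
The commands from Steps 2 to 6 can be chained in a small wrapper script. The following is a minimal sketch for a single movie, not an official pipeline: the variables MOVIE, DESIGN, BARCODES, and DRY_RUN, the helper functions run and pipeline, and the output name ending in .dedup.bam are all hypothetical names introduced here. By default it only prints the commands so they can be reviewed before anything is executed.

```shell
# Hypothetical settings for this sketch; override them in the environment.
MOVIE="${MOVIE:-movie}"                  # file name prefix of the sequencing run
DESIGN="${DESIGN:-T-12U-16B}"            # 10x 3' (v3.1) design from Step 3
BARCODES="${BARCODES:-barcode_set.txt}"  # cell barcode whitelist for Step 5
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it and stop on failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@" || exit 1
  fi
}

pipeline() {
  # Step 2: primer removal
  run lima "$MOVIE.hifi_reads.bam" primers.fasta "$MOVIE.fl.bam" --isoseq
  # Step 3: clip UMIs and cell barcodes
  run isoseq tag "$MOVIE.fl.5p--3p.bam" "$MOVIE.flt.bam" --design "$DESIGN"
  # Step 4: trim poly(A) tails and remove concatemers
  run isoseq refine "$MOVIE.flt.bam" primers.fasta "$MOVIE.fltnc.bam" --require-polya
  # Step 5: cell barcode correction and real cell identification
  run isoseq correct --barcodes "$BARCODES" "$MOVIE.fltnc.bam" "$MOVIE.fltnc.corrected.bam"
  # Step 6: sort by cell barcode, then deduplicate
  run samtools sort -t CB "$MOVIE.fltnc.corrected.bam" -o "$MOVIE.fltnc.corrected.sorted.bam"
  run isoseq groupdedup "$MOVIE.fltnc.corrected.sorted.bam" "$MOVIE.dedup.bam"
}

pipeline
```

Setting DRY_RUN=0 executes the commands; this assumes lima, isoseq, and samtools are on the PATH and that primers.fasta and the barcode whitelist exist in the working directory.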

What to do after deduplication

After obtaining deduplicated reads, follow the rest of the recommended single-cell Iso-Seq workflow, which continues with mapping, collapse, and transcript classification.