The low-level workflow is explained below via CLI calls. All necessary dependencies can be installed via bioconda.
For each SMRT cell, a movieX.subreads.bam is needed for processing.
Each sequencing run is processed by ccs to generate one HiFi read per productive ZMW. We advise using ccs version 4.2.0 or newer. ccs can be installed with conda install pbccs.
$ ccs movieX.subreads.bam movieX.ccs.bam
You can easily parallelize ccs generation by chunking; please follow this how-to.
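As a rough illustration of chunking, the sketch below only prints the commands for a dry run; it does not execute ccs. It assumes that ccs accepts `--chunk i/N` and `-j`, and that pbmerge is used to combine the chunk outputs, as described in the ccs documentation. The function name `print_ccs_chunk_commands` and the `movieX.ccs.<i>.bam` chunk naming are hypothetical.

```shell
# Print one ccs command per chunk, followed by a merge command (dry run only).
# Assumptions: ccs supports --chunk i/N and -j; pbmerge merges the chunk BAMs.
print_ccs_chunk_commands() {
  movie="$1"          # movie prefix, e.g. movieX
  chunks="$2"         # total number of chunks, e.g. 10
  threads="${3:-8}"   # threads per ccs job (default 8 is arbitrary)
  i=1
  while [ "$i" -le "$chunks" ]; do
    echo "ccs ${movie}.subreads.bam ${movie}.ccs.${i}.bam --chunk ${i}/${chunks} -j ${threads}"
    i=$((i + 1))
  done
  # The glob matches the chunk files only, not the merged output itself.
  echo "pbmerge -o ${movie}.ccs.bam ${movie}.ccs.*.bam"
}
```

Calling `print_ccs_chunk_commands movieX 10 16` lists ten ccs commands and one merge command, which can then be submitted to a scheduler or run one by one.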
If on-instrument CCS was performed, you can use the hifi_reads.bam as input.
hifi_reads.bam contains only HiFi reads, with predicted accuracy ≥Q20. No additional filtering is required.
reads.bam contains one representative sequence per productive ZMW, irrespective of quality and passes. Do not forget to use isoseq refine --min-rq 0.99 in step 3!
Primer removal and barcode identification are performed by lima, which can be installed with conda install lima and offers a specialized --isoseq mode. Even if your sample is not barcoded, primer removal is still performed by lima. If your primer.fasta file contains more than two sequences, that is, more than one pair of 5’ and 3’ primers, please run lima with --peek-guess to remove spurious false-positive signal. For more information about how to name input primer(+barcode) sequences, see this lima Iso-Seq FAQ.
$ lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --peek-guess
Example 1: The following is the primer.fasta for the Clontech SMARTer and NEB cDNA library prep, which are the officially recommended protocols:
>NEB_5p
GCAATGAAGTCGCAGGGTTGGG
>Clontech_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>NEB_Clontech_3p
GTACTCTGCGTTGATACCACTGCTT
Example 2: The following are examples of barcoded primers using a 16 bp barcode followed by the Clontech primer:
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>brain_3p
CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT
>liver_3p
CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT
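Both examples name 5’ primers with a `_5p` suffix and 3’ primers with a `_3p` suffix. A minimal sketch of a sanity check for that pattern follows; it assumes the suffix convention seen in the examples is what lima expects (the lima Iso-Seq FAQ is the authoritative source), and the function name `check_primer_names` is hypothetical.

```shell
# Count FASTA records whose names end in _5p or _3p and flag any other name.
# Assumption: primer names follow the _5p / _3p suffix pattern shown in the examples.
check_primer_names() {
  awk '
    /^>/ {
      name = substr($1, 2)                 # header without the leading ">"
      if (name ~ /_5p$/)      n5++
      else if (name ~ /_3p$/) n3++
      else { printf "unexpected primer name: %s\n", name; bad++ }
    }
    END {
      printf "5p=%d 3p=%d\n", n5, n3
      # Fail if a name is unexpected or if either primer side is missing.
      exit (bad > 0 || n5 == 0 || n3 == 0) ? 1 : 0
    }
  ' "$1"
}
```

Running `check_primer_names primer.fasta` before lima gives a quick count of 5’ and 3’ primers and a non-zero exit status when a name does not fit the pattern.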
lima removes unwanted primer combinations and orients sequences 5’ → 3’.
Output files will be called according to their primer pair. Example for single sample libraries:
If your library contains multiple samples, execute the following workflow for each primer pair:
Tags, such as UMIs and cell barcodes, have to be clipped from the reads and associated with the reads for later deduplication.
Input The input file for tag is one demultiplexed CCS file with full-length reads:
Output The following output files of tag contain full-length tagged reads:
Insert your own design or pick a preset:
$ isoseq tag movieX.NEB_5p--NEB_Clontech_3p.fl.bam movieX.flt.bam --design XXX
Your data now contains full-length tagged reads, but still needs to be refined by:
Input The input files for refine are the full-length tagged reads and the primer FASTA file:
Output The following output files of refine contain full-length non-concatemer reads:
Actual command to refine:
$ isoseq refine movieX.flt.bam primers.fasta movieX.fltnc.bam
If your sample has poly(A) tails, use --require-polya. This keeps only FL reads that have a poly(A) tail of at least 20 bp and removes the identified tail:
$ isoseq refine movieX.flt.bam primers.fasta movieX.fltnc.bam --require-polya
Optional read quality filtering, if your initial CCS input is
$ isoseq refine movieX.flt.bam primers.fasta movieX.fltnc.bam --min-rq 0.9
If you used more than one SMRT cell, merge all of your
$ ls movie1.fltnc.bam movie2.fltnc.bam movieN.fltnc.bam > fltnc.fofn
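A plain `ls` redirect still creates the FOFN when one of the listed files is missing. The sketch below is one possible guard: it writes the list only if every input exists and is non-empty. The function name `write_fofn` is hypothetical and not part of isoseq.

```shell
# Write a FOFN (file of file names), one path per line, after checking inputs.
# Usage: write_fofn OUTPUT.fofn FILE [FILE ...]
write_fofn() {
  out="$1"
  shift
  if [ "$#" -eq 0 ]; then
    echo "no input files given" >&2
    return 1
  fi
  # Validate every input first so that no partial FOFN is left behind.
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty input: $f" >&2
      return 1
    fi
  done
  printf '%s\n' "$@" > "$out"
}
```

For example, `write_fofn fltnc.fofn movie1.fltnc.bam movie2.fltnc.bam movieN.fltnc.bam` produces the same file as the `ls` call, but stops with an error if any of the BAM files is absent.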
This step performs PCR deduplication via clustering by UMI and cell barcodes (if available). After deduplication, dedup generates one consensus sequence per founder molecule, using a QV-guided consensus approach.
Perform all vs all comparison and cluster two reads if:
- lengths are within ±50 bp of each other
- UMI (+cell barcode) match with at most 1 mismatch and may be shifted by at most 1 base
- pairwise concordance is at least 97%
- alignment starts/ends within 5 bp of the other read
- no more than 5 bp are deleted or inserted in a window of 20 bp (like in isoseq cluster)
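The criteria above can be read as a single yes/no test on a pair of reads. The sketch below only illustrates that reading and is not how dedup is implemented: it assumes the pairwise metrics have already been computed elsewhere, treats every threshold as inclusive, and uses a hypothetical function name and argument order.

```shell
# Decide whether two reads would be clustered, given precomputed pair metrics.
# Usage: should_cluster LEN_A LEN_B UMI_MISMATCHES UMI_SHIFT CONCORDANCE \
#                       START_DIFF END_DIFF MAX_INDEL_IN_20BP
# CONCORDANCE is a fraction (0.97 = 97%). Exit status 0 means "cluster".
should_cluster() {
  awk -v la="$1" -v lb="$2" -v mm="$3" -v us="$4" -v conc="$5" \
      -v sd="$6" -v ed="$7" -v indel="$8" '
    function abs(x) { return x < 0 ? -x : x }
    BEGIN {
      ok = abs(la - lb) <= 50 &&           # lengths within 50 bp
           mm <= 1 && abs(us) <= 1 &&      # UMI: max 1 mismatch, max 1 base shift
           conc >= 0.97 &&                 # pairwise concordance at least 97%
           abs(sd) <= 5 && abs(ed) <= 5 && # alignment start/end within 5 bp
           indel <= 5                      # max 5 bp indel per 20 bp window
      exit ok ? 0 : 1
    }'
}
```

For instance, `should_cluster 1500 1530 1 0 0.98 3 2 4` succeeds, while raising the length difference to 100 bp or dropping the concordance to 0.96 makes it fail.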
Input The input file for dedup is one FLTNC file:
Output The following output files of dedup contain polished isoforms:
<prefix>.hq.fasta.gz with predicted accuracy ≥ 0.99
<prefix>.lq.fasta.gz with predicted accuracy < 0.99
$ isoseq dedup fltnc.fofn dedup.bam --verbose