The low-level workflow explained via CLI calls. All necessary dependencies are installed via bioconda.
For each SMRT cell, a movieX.subreads.bam is needed for processing.
Each sequencing run is processed by ccs to generate one representative circular consensus sequence (CCS) for each ZMW. It is advised to use CCS version 4.2.0 or newer. ccs can be installed with conda install pbccs.
$ ccs movieX.subreads.bam movieX.ccs.bam --min-rq 0.9
You can easily parallelize ccs generation by chunking; please follow this how-to.
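As a rough illustration of chunking, the sketch below only prints the commands for a chunked run instead of executing them. It assumes that your ccs version supports the --chunk i/N option, that the input has a .pbi index next to it, and that pbmerge is installed for the final merge. The function name print_ccs_chunks and the chunk count of 4 are hypothetical choices for this example.

```shell
#!/usr/bin/env bash
# Dry run: print one ccs command per chunk plus the final merge command.
# Nothing is executed here; review the output before running it.
print_ccs_chunks() {
    local movie="$1"    # movie name prefix, e.g. movieX
    local chunks="$2"   # number of chunks (assumption: pick what fits your cluster)
    local i
    for i in $(seq 1 "$chunks"); do
        # Assumption: --chunk i/N is available in the installed ccs version.
        echo "ccs ${movie}.subreads.bam ${movie}.ccs.${i}.bam --chunk ${i}/${chunks} --min-rq 0.9"
    done
    # Assumption: pbmerge is installed to combine the per-chunk BAM files.
    echo "pbmerge -o ${movie}.ccs.bam ${movie}.ccs.*.bam"
}

print_ccs_chunks movieX 4
```

Each printed ccs line can be submitted as an independent job; the merge line should only run after all chunks have finished.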
If on-instrument CCS was performed, you can use the hifi_reads.bam as input.
hifi_reads.bam contains only HiFi reads, with predicted accuracy ≥Q20. No additional filtering is required.
reads.bam contains one representative sequence per productive ZMW, irrespective of quality and passes. Do not forget to use isoseq3 refine --min-rq 0.9 in step 3!
Removal of primers and identification of barcodes is performed using lima, which can be installed with conda install lima and offers a specialized --isoseq mode. Even if your sample is not barcoded, primer removal is performed by lima. If there are more than two sequences in your primer.fasta file, that is, more than one pair of 5’ and 3’ primers, please use lima with --peek-guess to remove spurious false-positive signal. More information about how to name input primer(+barcode) sequences can be found in this lima Iso-Seq FAQ.
$ lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --peek-guess
Example 1: The following is the primer.fasta for the Clontech SMARTer and NEB cDNA library preps, which are the officially recommended protocols:
>NEB_5p
GCAATGAAGTCGCAGGGTTGGG
>Clontech_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>NEB_Clontech_3p
GTACTCTGCGTTGATACCACTGCTT
Example 2: The following are examples of barcoded primers using a 16 bp barcode followed by the Clontech primer:
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>brain_3p
CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT
>liver_3p
CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT
Lima will remove unwanted combinations and orient sequences to 5’ → 3’ orientation.
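The barcoded primers of Example 2 are simply a 16 bp barcode prepended to the shared Clontech 3’ primer. The sketch below rebuilds that file and applies the rule stated above about --peek-guess by counting FASTA records. The helper barcoded_3p and the variable names are hypothetical; the file name barcoded_primers.fasta is taken from the lima call above. The lima command is printed, not executed.

```shell
#!/usr/bin/env bash
# Shared Clontech primers, copied from the examples above.
primer_5p="AAGCAGTGGTATCAACGCAGAGTACATGGGG"
primer_3p="GTACTCTGCGTTGATACCACTGCTT"

# Print one FASTA record for a barcoded 3' primer: barcode followed by primer.
barcoded_3p() {
    local sample="$1" barcode="$2"
    printf '>%s_3p\n%s%s\n' "$sample" "$barcode" "$primer_3p"
}

# Write the primer file into the current working directory.
primers="barcoded_primers.fasta"
{
    printf '>primer_5p\n%s\n' "$primer_5p"
    barcoded_3p brain CGCACTCTGATATGTG
    barcoded_3p liver CTCACAGTCTGTGTGT
} > "$primers"

# More than two sequences means more than one primer pair, so add --peek-guess.
n_seqs=$(grep -c '^>' "$primers")
lima_flags="--isoseq"
if [ "$n_seqs" -gt 2 ]; then
    lima_flags="$lima_flags --peek-guess"
fi

# Dry run: show the resulting lima call.
echo "lima movieX.ccs.bam $primers movieX.fl.bam $lima_flags"
```

Generating the file from a barcode list keeps the shared primer sequence in one place, which avoids copy-and-paste errors when many samples are pooled.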
Output files will be called according to their primer pair. Example for single sample libraries:
If your library contains multiple samples, execute the following workflow for each primer pair:
Your data now contains full-length reads, but still needs to be refined by:
The input file for refine is one demultiplexed CCS file with full-length reads and the primer fasta file:
The following output files of refine contain full-length non-concatemer reads:
Actual command to refine:
$ isoseq refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam
If your sample has poly(A) tails, use --require-polya. This filters for FL reads that have a poly(A) tail of at least 20 base pairs (--min-polya-length) and removes the identified tail:
$ isoseq refine movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam --require-polya
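For libraries with several primer pairs, refine has to be run once per demultiplexed file. The sketch below prints one refine command per input, deriving each output name from the input name. The input file names are hypothetical and only follow the movieX.<5p>--<3p>.fl.bam pattern used above; keeping the primer pair in the FLNC file name is a choice made here so that samples do not overwrite each other. The helper names flnc_name and print_refine_cmds are also hypothetical.

```shell
#!/usr/bin/env bash
# Derive the FLNC output name from a lima output name, e.g.
# movieX.NEB_5p--NEB_Clontech_3p.fl.bam -> movieX.NEB_5p--NEB_Clontech_3p.flnc.bam
flnc_name() {
    local fl_bam="$1"
    echo "${fl_bam%.fl.bam}.flnc.bam"
}

# Dry run: print one refine command per demultiplexed file.
# First argument is the primer FASTA, the remaining arguments are fl.bam files.
print_refine_cmds() {
    local primers="$1"; shift
    local fl_bam
    for fl_bam in "$@"; do
        echo "isoseq refine $fl_bam $primers $(flnc_name "$fl_bam") --require-polya"
    done
}

# Hypothetical file names for the two samples of Example 2.
print_refine_cmds primers.fasta \
    movieX.primer_5p--brain_3p.fl.bam \
    movieX.primer_5p--liver_3p.fl.bam
```

Drop --require-polya from the printed commands if your sample has no poly(A) tails.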
If your workflow input is
If you used more than one SMRT cell, list all of your <movie>.flnc.bam files in one flnc.fofn, a file of filenames:
$ ls movie*.flnc.bam > flnc.fofn
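Before clustering, it can help to confirm that the file of filenames is consistent. The following sketch checks that every listed path exists and that no path appears twice. The function check_fofn is a hypothetical helper, and the demonstration uses empty placeholder files in a scratch directory in place of real FLNC data.

```shell
#!/usr/bin/env bash
# Return 0 if every line of the fofn names an existing file and no path is
# listed twice; otherwise report the problem on stderr and return 1.
check_fofn() {
    local fofn="$1" path dups
    while IFS= read -r path; do
        # Skip empty lines.
        [ -z "$path" ] && continue
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            return 1
        fi
    done < "$fofn"
    dups=$(sort "$fofn" | uniq -d)
    if [ -n "$dups" ]; then
        echo "listed more than once: $dups" >&2
        return 1
    fi
}

# Demonstration with empty placeholder files (not real BAM files).
workdir=$(mktemp -d)
touch "$workdir/movie1.flnc.bam" "$workdir/movie2.flnc.bam"
ls "$workdir"/movie*.flnc.bam > "$workdir/flnc.fofn"
check_fofn "$workdir/flnc.fofn" && echo "flnc.fofn looks consistent"
```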
Compared to previous IsoSeq approaches, IsoSeq v3 performs a single clustering technique. Due to the nature of the algorithm, it can’t be efficiently parallelized. It is advised to give this step as many cores as possible. The individual steps of cluster are as follows:
- Clustering using hierarchical n*log(n) alignment and iterative cluster merging
- Polished POA sequence generation, using a QV guided consensus approach
Input The input file for cluster is one FLNC file:
Output The following output files of cluster contain polished isoforms:
<prefix>.hq.fasta.gz with predicted accuracy ≥ 0.99
<prefix>.lq.fasta.gz with predicted accuracy < 0.99
$ isoseq cluster flnc.fofn clustered.bam --verbose --use-qvs
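On shared machines or under a job scheduler, you may want to set the thread count explicitly. The sketch below prints the cluster command with a detected core count. It assumes that your isoseq version accepts the --num-threads option; please confirm this with isoseq cluster --help. The helpers detect_cores and print_cluster_cmd are hypothetical names, and nothing is executed besides the core detection.

```shell
#!/usr/bin/env bash
# Detect the number of available cores; fall back to 1 if detection fails.
detect_cores() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Dry run: print the cluster command with an explicit thread count.
# Assumption: --num-threads is supported by the installed isoseq version.
print_cluster_cmd() {
    local input="$1" output="$2" threads="$3"
    echo "isoseq cluster $input $output --verbose --use-qvs --num-threads $threads"
}

print_cluster_cmd flnc.fofn clustered.bam "$(detect_cores)"
```

With clustered.bam as the output name, the polished isoforms described above would be expected under the prefix clustered.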