Processing overlapping MiSeq 16S rRNA libraries

Alm lab Protocol for processing overlapping 16S rRNA reads from a MiSeq run

Experimental design

The specifics of the sequencing set-up and molecular construct will determine exactly how the data needs to be sequenced and processed. There are a few different designs that the Alm lab has set up:

1.) (Standard) Multiplexing different samples together to be sequenced in one lane of Illumina, marking each unique sample with a unique barcode on the reverse primer of step 2 that is read during the indexing read. This is common for up to 96 samples (or 105 including additional barcodes that are not in the 96-well format). The sequencing should be done, not with the standard Illumina indexing primer, but with the reverse complement of the 2nd read sequencing primer. This is a custom barcode that should be included in the sequencing set-up. See sequencing section below for protocol.

2.) Multiplexing multiple different plates of samples together using a barcode located 5' to the primer used in genome amplification (typically U515) and a reverse barcode that is read during the indexing read. This is not typical for the Alm lab MiSeq protocol, since getting only about 12-25 million reads from MiSeq is sufficient for about 100 samples, not more. However, it is a possible scenario for samples which do not require high coverage.

3.) Mixing both genome sequencing and 16S rRNA amplicon sequencing together in one lane. Adding genome library preps to 16S amplicon lanes improves the quality of the base calling by adding diversity without loosing much to phiX sequencing. The genome library constructs typically contain barcodes in the forward and reverse read sequences, and do not typcially have an indexing read associated with them. However, adding them to a lane which does have a index read is ok.

4.) An experimental set-up using both forward and reverse orientation of the 16S rRNA among different samples, and staggering the diversity region 5' of both primers used in genome amplification allows for sufficient base diversity to run 16S rRNA libraries without wasting phiX data. In this case, half of the samples begin by sequencing from the U515 primer in the forward read, and half begin by sequencing from the U786 from the forward read. Additionally, the number of bases before the primer sequence varies from 4-9 bp.

Sequencing

Sequencing our construct on MiSeq is slightly different than standard Illumina MiSeq sequencing. Load the providedsample sheet, (which arbitrarily specifies a 250 paired-end run with an 8nt barcode read) and spike in 15uL of the anti-reverse BMC index primer @ 100uM (5' AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG 3') into tube 13 of the cartridge. This should provide three reads (forward, reverse and index) at 250, 8, 250 bp each.

De-multiplexing

Depending on whether your samples contains data from other projects that you don not want to process, you can demultiplex at various stages. Typically if there are multiple un-related projects in the same run, I will pull out all of the reads that map to the specific barcodes I am interested in first, so that I don't have to process extra data. You also have the option of removing unwanted barcodes at the qiime: split_libraries_fastq.py step by providing a mapping file containing only the barcodes you want. This makes sense if you if you will eventually work with all of the data, but in sets. Otherwise, if you will never need to work with the other data in the lane, it doesn't make sense to process it at all.

Run this program to parse out sequences from the raw data according to the following order:

1.) Single barcodes that are unique in the sample. These are possibly control samples or extra barcodes that were done uniquely and are the only piece of information indicating which samples that sequence came from

2.) Next, it looks for the presence of the forward and reverse barcodes in the forward and reverse read. These are typically from genome sequences.

3.) Finally, it looks for paired data, pulling out reads that have both the forward barcode and the indexing barcodes that match input samples.

All other samples that do not match are discarded.

Program:

parse_Illumina_multiplex2.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix>

<mapping> input file should have the following fields (tab delimited, here's an example file):

Barcode construction, output name, forward barcode name, forward barcode seq, forward barcode orientation, index barcode name, index barcode seq, index barcode orientation, reverse barcode name, reverse barcode seq, reverse barcode orientation

Samples with the same output name will be in the same file. Barcode construction must be one of the following exact fields: single, double or forbar. Use single for option 1 above (single barcodes identify the samples), the 2

The output should be forward and reverse files labeled output_prefix.output_name.1 and output.2 respectively.

These can be used as the fastq files in downstream processes.

Overlapping the reads

You may have sufficient length to overlap the forward and reverse reads to create a longer sequence. This process will be time consuming, but it gains phylogenetic resolution and can be useful for many applications. We use SHE-RA, which was created to have a sophisticated calculation of quality for overlapped bases, given the quality of the overlapped bases and whether or not they match. Other software exists (and is faster), but will do multiple things at once, including trimming the sequences for quality and will not provide as good an estimate of the quality of the overlapped bases. If other programs are used, it might be necessary to use other programs to de-multiplex samples after using. With SHE-RA, we overlap, then re-generate the fastq files to use with QIIME split_libraries_fastq.py.

First, divide up your samples into about 1 million reads per file. This can typically be processed on our computers in about 10 hours.

perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_1_sequence.fastq 100 131001Alm_D13-4961_1_sequence.split

perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_2_sequence.fastq 100 131001Alm_D13-4961_2_sequence.split

Then, overlap each of the 100 files with SHERA where ${PBS_ARRAYID} is the process number for parallel processing

perl ~/bin/SHERA_code/concatReads_1.8.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fastq 131001Alm_D13-4961_2_sequence.split.${PBS_ARRAYID}.fastq --qualityScaling illumina

Filter out the bad overlaps from the fa and quala generated with SHERA:

perl /mit/spacocha/bin/SHERA_code/filterReads.pl 131001Alm_D13-4961_1_sequence.split.$

Unknown macro: {PBS_ARRAYID}

.fa 131001Alm_D13-4961_1_sequence.split.$

.quala 0.8

Use mothur to re-generate the fastq files:

mothur "#make.fastq(fasta=131001Alm_D13-4961_1_sequence.split.$

Unknown macro: {PBS_ARRAYID}

.filter_0.8.fa, qfile=131001Alm_D13-4961_1_sequence.split.$

.filter_0.8.quala)"

Now, you will either have to fix the index file to contain only the reads in your file (if the index read is a separate file):

perl ~/bin/fix_index.pl 131001Alm_D13-4961_1_sequence.split.$

Unknown macro: {PBS_ARRAYID}

.filter_0.8.fastq 131001Alm_D13-4961_3_sequence.fastq > 131001Alm_D13-4961_3_filter_0.8.fastq

Or, if you have to generate it from the header (if the index is already present in the header):

perl /mit/spacocha/bin/fastq2Qiime_barcode2.pl 131001Alm_D13-4961_1_sequence.split.$

.filter_0.8.fastq > 131001Alm_D13-4961_1_sequence.split.index.filter_0.8.fastq

This file can be used for specific header configuration where the fastq files look like this, where it pulls out the longest string of base letters (ATGCKMRYSWBVHDNX) after the #, in this case it would be TGGGACCT and creates a false quality for each base as the lower case of each barcode letter:

@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90

TGTAGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTTCGGATTTATTGGGCGTAAAGGGTCCGCAGGGGGTT
CGCTAAGTCTGATGTGAAATCCCGGAGCTCAACTCCGGAACTGCATTGGAGACTGGTGGACTAGAGTATCGGAGAGGTAA
GCGGAATTCCAGGTGTAGCGGTGGAATGCGTAGATATCTGGAAGAACACCGAAAGCGAAGGCAGCTTACTGGACGGTAAC
TGACCCTCAGGGACGAAAGCGTGGGGATCAAACAGGATTAGAAACCCCTGTAGTCC

Result is:

@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90

TGGGACCT

+

tgggacct

De-multiplex

Now the re-created fasta and index read can be used as normal with QIIME or other software of your choice:

split_libraries_fastq.py -i 131001Alm_D13-4961_1_sequence.split.$

Unknown macro: {PBS_ARRAYID}

.filter_0.8.fastq -m
mapping_file.txt -b 131001Alm_D13-4961_1_sequence.split.$

.index.filter_0.8.fastq --barcode_typ
e 8 --rev_comp_barcode --min_per_read_length .8 -q 10 --max_bad_run_length 0 -o
unique_output_$

Unknown macro: {PBS_ARRAYID}

--phred_offset 33

This will create a seqs.fna file which can be used in down stream analysis.

Blog

Processing overlapping MiSeq 16S rRNA libraries