Alm lab Protocol for processing overlapping 16S rRNA reads from a MiSeq run
Experimental design
The specifics of the sequencing set-up and molecular construct will determine exactly how the data needs to be sequenced and processed. There are a few different designs that the Alm lab has set up:
1.) (Standard) Multiplexing different samples together to be sequenced in one lane of Illumina, marking each unique sample with a unique barcode on the reverse primer of step 2 that is read during the indexing read. This is common for up to 96 samples (or 105 including additional barcodes that are not in the 96-well format). The sequencing should be done, not with the standard Illumina indexing primer, but with the reverse complement of the 2nd read sequencing primer. This is a custom barcode that should be included in the sequencing set-up. See sequencing section below for protocol.
2.) Multiplexing multiple different plates of samples together using a barcode located 5' to the primer used in genome amplification (typically U515) and a reverse barcode that is read during the indexing read. This is not typical for the Alm lab MiSeq protocol, since getting only about 12-25 million reads from MiSeq is sufficient for about 100 samples, not more. However, it is a possible scenario for samples which do not require high coverage.
3.) Mixing both genome sequencing and 16S rRNA amplicon sequencing together in one lane. Adding genome library preps to 16S amplicon lanes improves the quality of the base calling by adding diversity without loosing much to phiX sequencing. The genome library constructs typically contain barcodes in the forward and reverse read sequences, and do not typcially have an indexing read associated with them. However, adding them to a lane which does have a index read is ok.
4.) An experimental set-up using both forward and reverse orientation of the 16S rRNA among different samples, and staggering the diversity region 5' of both primers used in genome amplification allows for sufficient base diversity to run 16S rRNA libraries without wasting phiX data. In this case, half of the samples begin by sequencing from the U515 primer in the forward read, and half begin by sequencing from the U786 from the forward read. Additionally, the number of bases before the primer sequence varies from 4-9 bp.
Sequencing
Sequencing our construct on MiSeq is slightly different than standard Illumina MiSeq sequencing. Load the providedsample sheet, (which arbitrarily specifies a 250 paired-end run with an 8nt barcode read) and spike in 15uL of the anti-reverse BMC index primer @ 100uM (5' AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG 3') into tube 13 of the cartridge. This should provide three reads (forward, reverse and index) at 250, 8, 250 bp each.
(The following is curtsey of Shaprio lab):
To generate an indexing file, you have to change the setup of the MiSeq reporter, because by default MSR doesn't generate a barcodes_reads.fastq.
In order to change that:
First, turn off MSR service: Task Manager, Services Tab, Right click on MiSeq Reporter and click stop.
Used NotePad to edit MiSeqReporter.exe.config file that can be found in C:\Illumina\MiSeq Reporter
The following needs to be included in the top portion of the file (the <appSettings> section)
<add key="CreateFastqForIndexReads" value="1" />
Save then close.
You will then want to restart the service. This can be accomplished by right clicking on the tool bar in windows, selecting "Start Task Manager", select the "Services" tab, find MiSeq Reporter on the list and then select to stop and then start the service.
You can re-queue your run using the sample sheet WITH the index information on the sample. In our case, we used a very simple sample_sheet with one index like ATATATAT.
De-multiplexing
You can demultiplex at various stages. If there are multiple, un-related projects in the same run, I will pull out all of the reads that map to barcodes for only one project, so that I don't have to process extra data. You also have the option of removing unwanted barcodes at the qiime: split_libraries_fastq.py step by providing a mapping file containing only the barcodes you want, but you may waste time overlapping them if there are a lot of them. Do not use the following step if you if you will eventually work with all of the data. Only use the following if you never need to work with the other data, since it doesn't make sense to process it at all.
Run this program to parse out sequences from the raw data according to the following order:
1.) Single barcodes that are unique in the sample. These are possibly control samples or extra barcodes that were done uniquely and are the only piece of information indicating which samples that sequence came from
2.) Next, it looks for the presence of the forward and reverse barcodes in the forward and reverse read. These are typically from genome sequences.
3.) Finally, it looks for paired data, pulling out reads that have both the forward barcode and the indexing barcodes that match input samples.
All other samples that do not match are discarded.
Program:
perl parse_Illumina_multiplex2.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix>
<mapping> input file should have the following fields (tab delimited, here's an example file):
Barcode construction, output name, forward barcode name, forward barcode seq, forward barcode orientation, index barcode name, index barcode seq, index barcode orientation, reverse barcode name, reverse barcode seq, reverse barcode orientation
Samples with the same output name will be in the same file. Barcode construction must be one of the following exact fields: single, double or forbar. Use single for option 1 above (single barcodes identify the samples), the 2
The output should be forward and reverse files labeled output_prefix.output_name.1 and output.2 respectively.
These can be used as the fastq files in downstream processes.
You can also use just the mapping files that would be the input to QIIME (not in the example above, but default for QIIME), and the index read generated as previously stated, if you just want to limit the data to a set of barcodes in your mapping file. In that case run the following command:
perl parse_Illumina_multiplex_from_map_index.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix> <index read>
The fastq files will contain only those found in your mapping file and can be used in downstream analysis.
Overlapping the reads
You may have sufficient length to overlap the forward and reverse reads to create a longer sequence. This process will be time consuming, but it gains phylogenetic resolution and can be useful for many applications. We use SHE-RA, which was created to have a sophisticated calculation of quality for an overlapped base, given the quality of each overlapped base and whether or not they match. Other software exists (and is faster), but will do multiple things at once, including trimming the sequences for quality and will not provide as good an estimate of the quality of the overlapped bases. If other programs are used, it might be necessary to use other ways to de-multiplex samples after using. With SHE-RA, we overlap paired end seuqences, then re-generate the fastq files to use with QIIME split_libraries_fastq.py.
First, divide up your samples into about 1 million reads per file, forward and reverse reads separately.
perl split_fastq_qiime_1.8.pl <read> <number needed> <output prefix>
Example:
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_1_sequence.fastq 100 131001Alm_D13-4961_1_sequence.split
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_2_sequence.fastq 100 131001Alm_D13-4961_2_sequence.split
Then, overlap each of the 100 files with SHERA where ${PBS_ARRAYID} is the process number for parallel processing:
perl concatReads_1.8.pl fastq_1 fastq2 --qualityScaling illumina
perl concatReads_1.8.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fastq 131001Alm_D13-4961_2_sequence.split.${PBS_ARRAYID}.fastq --qualityScaling illumina
Filter out the bad overlaps from the fa and quala generated with SHERA:
perl filterReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fa 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.quala 0.8
Use mothur to re-generate the fastq files:
mothur "#make.fastq(fasta=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fa, qfile=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.quala)"
Now, you will either have to fix the index file to contain only the reads in your file (if the index read is a separate file):
perl fix_index.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq 131001Alm_D13-4961_3_sequence.fastq > 131001Alm_D13-4961_3_${PBS_ARRAY_ID}.filter_0.8.fastq
Or, if you have to generate it from the header (if the index is already present in the header):
perl fastq2Qiime_barcode2.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq > 131001Alm_D13-4961_1_sequence.split.index.filter_0.8.fastq
This file can be used for specific header configuration where the fastq files look like this, where it pulls out the longest string of base letters (ATGCKMRYSWBVHDNX) after the #, in this case it would be TGGGACCT and creates a false quality for each base as the lower case of each barcode letter:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGTAGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTTCGGATTTATTGGGCGTAAAGGGTCCGCAGGGGGTT
CGCTAAGTCTGATGTGAAATCCCGGAGCTCAACTCCGGAACTGCATTGGAGACTGGTGGACTAGAGTATCGGAGAGGTAA
GCGGAATTCCAGGTGTAGCGGTGGAATGCGTAGATATCTGGAAGAACACCGAAAGCGAAGGCAGCTTACTGGACGGTAAC
TGACCCTCAGGGACGAAAGCGTGGGGATCAAACAGGATTAGAAACCCCTGTAGTCC
Result is:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGGGACCT
+
tgggacct
De-multiplex
Now the re-created fasta and index read can be used as normal with QIIME or other software of your choice:
split_libraries_fastq.py -i 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq -m mapping_file.txt -b 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.index.filter_0.8.fastq --barcode_type 8 --rev_comp_barcode --min_per_read_length .8 -q 10 --max_bad_run_length 0 -o
unique_output_${PBS_ARRAYID} --phred_offset 33
This will create a seqs.fna file which can be used in down stream analysis.