Introduction - Scott's script to consolidate all of the OTU calling scripts and submit them to the cluster with one command.
(smile_train: gets you where you want to go, and you're happy)
Works on coyote and maybe the Broad, but maybe not other clusters.
You should install it in lib (bin gets crowded); each person should install their own copy.
Make a lib directory unless this already exists:
mkdir lib
Clone this into the lib folder you just made (cd into lib first):
git clone https://github.com/almlab/SmileTrain
To get some information, go to the wiki and follow the directions:
https://github.com/almlab/SmileTrain/wiki
Alter train.cfg (you also have to link to usearch etc.) so the following entries use your own names and paths (for example, I changed these):
[User]
username=spacocha
tmp_directory=/net/radiodurans/alm/spacocha/tmp
library=/net/radiodurans/alm/spacocha/lib/SmileTrain
bashrc=/net/radiodurans/alm/spacocha/lib/SmileTrain/bashrcs/coyote.sh
Then, make the tmp_directory:
mkdir /net/radiodurans/alm/spacocha/tmp
You can get a detailed description of any script by using --help.
The scripts will source the bashrc that Scott made.
Submit from the head node (where you land when you log on), and if the job is long, you're going to have to use this:
ssh coyote
screen
Now you are inside screen (screen -ls will tell you which ones you are running):
You can name the screens:
screen -S SPtest
You can detach and keep it running:
RUN COMMAND
control-A, then D
(or type man screen to get information)
and to reattach to a detached screen:
screen -R SPtest
but then you have to actually stop screens by typing exit from within the screen:
exit
PCR, real-time PCR, primers, column and SPRI clean-up of reactions
Preparation of PCR primer stocks and working solutions
stocks
spin freeze-dried stocks, 1min full speed
add sterile H2O (molecular biology grade) to a final concentration of 100µM
working solutions
485µl sterile H2O (molecular biology grade)
+ 15µl primer stock
--> 3µM
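(Sanity check with C1V1 = C2V2: 15µl x 100µM / 500µl total = 3µM.)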
DNA column purification (Qiagen PCR clean-up / Qiaquick Gel Extraction)
Loading
mix reaction + 5Vol PBI buffer in Epi
place column in collection tube
load on column
spin 1min, full speed, RT
discard flowthrough
alternatively for gel extraction of DNA bands
cut bands on blue light table (DeLong lab), or on UV Transilluminator (Thompson lab) with clean razor blade
transfer into Epi
weigh Epis (~1g) and gel slices
+3Vol/weight (300µl/100mg) GC buffer
incubate tubes at 50˚C, 10min, vortex gently every 2min
+ 1 Vol/weight (100µl/100mg) Isopropanol (improves yield especially for fragments below 500bp or above 4kb)
mix by inverting tube
place column in collection tube
load on column (if volume is too big for the column, then load, spin, discard flowthrough, and load the rest)
spin 1min, full speed, RT
discard flowthrough
Washing
+750µl PE buffer (seal bottle tight to avoid EtOH evaporation)
spin 1min, full speed, RT
discard flowthrough
place column back in empty collection tube
Drying
spin (to dry) 30sec, full speed, RT
turn column in collection tube by 180˚
spin (to dry) 1min, full speed, RT
discard flowthrough and collection tube
place column in new Epi
dry open column under laminar air flow for 2min
Elution
+35-50µl EB or sterile H2O (molecular biology grade)
incubate at least for 5min
spin (to elute DNA) 30sec, full speed, RT
turn column in Epi by 180˚
spin (to elute DNA) 1min, full speed, RT
discard column and store Epi with DNA
SPRI clean up and primer/dimer removal (Agencourt AMPure XP beads)
Preparations:
adjust PCR reaction to 50µl with EB
vortex SPRI beads 1600rpm, 10sec
aliquot 45µl of beads into one 1.5ml tube per library
Binding to beads
add 50µl PCR reaction to beads
mix by pipetting/vortex 1600rpm
incubate for 5-7min
separate on magnet for 2min
remove and discard SN while tube stays on magnet
Wash - removal of salts, enzymes and low molecular weight DNA
wash beads carefully twice with 70% EtOH while tube stays on magnet (do not disturb bead pellet)
incubate for 30sec
remove all SN
repeat
Dry
air dry on magnet for 15min
Elution
remove tube from magnet, add 20µl EB
vortex 1600rpm, 10sec
incubate at RT for 5min
separate on magnet for 2min
collect SN and transfer into new 1.5ml tube
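(Note: 45µl of beads for a 50µl reaction is a 0.9x bead-to-sample ratio; if you change the reaction volume, scale the bead volume to keep the ratio, since the bead-to-sample ratio sets the size cutoff of the fragments retained.)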
Quantitative real-time PCR
Mastermix
for a 200µl (8 x 25µl) reaction mix:
10.525µl H2O
5µl 5x Phusion Pol buffer
0.5µl dNTP mix 10mM
3.3+3.3µl primers, 3µM
2µl template (try different 10 fold dilutions)
0.125µl SYBR Green I
0.25µl Pol (Phusion)
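(The volumes above are for one 25µl reaction; for the 200µl (8x) mastermix, multiply each by 8, e.g. 8 x 10.525µl = 84.2µl H2O.)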
prepare reactions in PCR tubes (or a 96-well plate) with optical covers that fit the real-time PCR machine being used
cycle at 1) 98˚C, 20''
2) 98˚C, 15''
3) specific Annealing Temp˚C, 20''
4) 72˚C, 20'' (for fragments shorter than 1kb)
5) go back to step 2 45x
always use at least 3 replicates per sample
always include 3 replicates of a non-template (H2O) control
16S By Hand Library Preparation
Materials:
● Agencourt Ampure XP, A63881 (60mL, $300)
● 2 Roche LightCycler480 384-well plates
● 1:100 dilution of SYBR stock (Invitrogen S7563, 10,000x)
● Step 1/ Initial QPCR Primers ( PE_16s_v4U515-F OR PE_16s_v4_Bar#, PE_16s_v4_E786_R)
● Step 2 primers ( PE-III-PCR-F, PE-IV-PCR-XXX)
● Final QPCR primers (PE Seq F, PE Seq R)
● HF Phusion (NEB, M0530L)
● KAPA SYBR 2xMM for final QPCR
● Invitrogen Super magnet (16 or 8 sample capacity)
Determination of Step 1 Cycle Time and Sample Check:
Materials used:
○ Contents of MM
○ P200 multi-channel pipette
○ 96 well QPCR plate (96 well for opticon stocked in lab)
○ Clear QPCR plate covers
Initial QPCR master mix (MM)
Reagent | X1 RXN (uL) |
H2O | 12.1 |
HF Buffer | 5 |
dNTPs | 0.5 |
PE16s_V4_U515_F (3uM) | 2.5 |
PE16S_V4_E786_R (3uM) | 2.5 |
Template | 2 |
SYBR green (1/100 dilu) | 0.125 |
Phusion | 0.25 |
Run this step in duplicate or triplicate to best estimate the proper cycling time
Initial QPCR Program (Opticon):
Heat:
98°C – 30 seconds
Amplify:
98°C – 30 seconds
52°C – 30 seconds
72°C – 30 seconds
Cool:
4°C - continuous
Use Ct (bottom of curve, not mid-log) of curves to determine dilutions for step 1 amplification (google docs, Illumina Library QPCR and Multiplexing)
Breakdown of QPCR amplification math (done to normalize each sample):
○ delta Ct = Sample Ct - lowest Ct in sample set
○ fold = 1.75^(delta Ct)
○ dilution needed = fold
○ note - input is 2uL per RXN, so the sample with the lowest Ct gets 2uL undiluted
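Worked example (hypothetical Ct values; 1.75 is the assumed per-cycle amplification efficiency): if the lowest Ct in the set is 12.0 and a sample has Ct 14.0, then delta Ct = 2.0 and fold = 1.75^2.0 ≈ 3.1, so dilute that sample about 1:3 and use 2uL of the dilution; the Ct 12.0 sample gets 2uL undiluted.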
Please note – samples may fail due to too little or too much material, or a poor reaction. It is recommended that failed samples be re-run before moving forward
Library Preparation:
Step 1
Please Note: Samples are run as four 25uL reactions that are pooled at the end of cycling
1st step Master Mix 25uL RXN (MM1)
Reagent | X1 RXN (uL) |
H2O | 12.25 |
HF Buffer | 5 |
dNTP | 0.5 |
PE16S_V4_U515_F (3uM) | 2.5 |
PE16S_V4_E786_R (3uM) | 2.5 |
Template | 2 |
Phusion | 0.25 |
16s Step 1 Program:
Heat:
98°C – 30 seconds
Amplify:
98°C – 30 seconds
52°C – 30 seconds
72°C – 30 seconds
Cool:
4°C - continuous
Run the number of amplification cycles determined via QPCR (no more than 20 cycles allowed)
After cycling, pool the four reactions; you now have 1x 100uL reaction per sample
SPRI Clean Up
Materials used:
○ SPRI beads
○ 70% EtOH
○ EB
○ Invitrogen super magnet
- Vortex cold AmpureXP beads, pool DNA from PCR tubes (~100uL)
- Aliquot 85.5uL beads into Epi’s – let equilibrate to RT
- Add DNA (take 95uL) + beads (85.5uL) = 180.5 uL
- Incubate 13’ @ RT
- Separate ON magnet 2’
- While ON magnet, remove/discard SN
- Wash beads 2x with 70% EtOH, 500uL each wash
- Air dry beads for 15-20’ on magnet
- Remove from magnet, elute in 40uL H2O, vortex to resuspend
- Incubate (at least 7’)
- Separate on magnet 2’
- Collect 35-40 ul and save SN
Sample Re-Aliquoting and Step 2
Please Note: Samples are run as four 25uL reactions that are pooled at the end of cycling
2nd step Master Mix 25uL RXNs (MM2)
Reagents | X1 RXN (uL) |
H2O | 8.65 |
HF Buffer | 5 |
dNTPs | 0.5 |
PE-PCR-III-F (3uM) | 3.3 |
PE-PCR-IV-XXX (3uM) | 3.3 |
Template | 4 |
Phusion | 0.25 |
16s Step 2 Program:
Heat:
98°C – 30 seconds
Amplify:
98°C – 30 seconds
83°C – 30 seconds
72°C – 30 seconds
Cool:
4°C - continuous
Run 9 cycles of amplification
- After cycling, pool the four reactions; you now have 1x 100uL reaction per sample
SPRI Clean Up
Materials used:
○ SPRI beads
○ 70% EtOH
○ EB
○ Invitrogen super magnet
- Vortex cold AmpureXP beads, pool DNA from PCR tubes (~100uL)
- Aliquot 85.5uL beads into Epi’s – let equilibrate to RT
- Add DNA (take 95uL) + beads (85.5uL) = 180.5 uL
- Incubate 13’ @ RT
- Separate ON magnet 2’
- While ON magnet, remove/discard SN
- Wash beads 2x with 70% EtOH, 500uL each wash
- Air dry beads for 15-20’ on magnet
- Remove from magnet, elute in 40uL H2O, vortex to resuspend
- Incubate (at least 7’)
- Separate on magnet 2’
- Collect 35-40 ul and save SN
Final QPCR
Once you have a substantial fraction (or all) of your samples prepared, you can run a final QPCR to determine dilutions and volumes for multiplexing. This step also confirms that the library preparation was successful
QPCR Master Mix (QPCR MM, 20uL RXN)
Reagents | X1 RXN (uL) | X345 RXN (uL) |
H2O | 7.2 | 2,484 |
PE Seq Primer – F (10uM) | 0.4 | 138 |
PE Seq Primer – R (10uM) | 0.4 | 138 |
KAPA SYBRgreen MM | 10 | 3,450 |
Template | 2 | - |
Final QPCR Program (Opticon)
Heat:
95°C – 5 minutes
Amplify:
95°C – 10 seconds
60°C – 20 seconds
72°C – 30 seconds
Melting Curve:
95°C – 5 seconds
65°C – 1 minute
97°C - continuous
Cool:
40°C – 10 seconds
Run 35 cycles of amplification
Use mid-log phase of curves to determine volumes for multiplexing (google docs, Illumina Library QPCR and Multiplexing)
Please note – samples may fail due to too little or too much material, or a poor reaction. It is recommended that failed samples be re-run before moving forward
Breakdown of QPCR multiplexing math (done to normalize each sample):
○ delta Ct = Sample Ct - lowest Ct in sample set
○ fold = 1.75^(delta Ct)
○ ratio = 1/fold
○ volume to mix = X*ratio (X = minimum desired volume per sample)
○ how to dilute = fold
○ note - the sample with the lowest Ct gets X uL undiluted in the final multiplex; X can be raised or lowered to accommodate the needed volumes of the other samples
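Worked example (hypothetical numbers): a sample with delta Ct = 2.0 has fold = 1.75^2.0 ≈ 3.1 and ratio = 1/3.1 ≈ 0.33. With X = 10uL, add 10 x 0.33 ≈ 3.3uL of that sample undiluted (or, equivalently, 10uL of a 1:3.1 dilution); the lowest-Ct sample gets 10uL undiluted.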
Sample Multiplexing and Submission for Sequencing:
- Once samples have been multiplexed, aliquot ~20uL of the final mix and submit it to the BioMicro Center for sequencing
Introduction
This is the general outline for running AdaptML. This guide was written by Sarah Preheim (not Lawrence David), just to keep that in mind. It is meant to complement the user guide provided on the official web page, but it also includes some additional information for non-experts.
Download
Download AdaptML from the Alm Lab website and install as directed:
http://almlab.mit.edu/adaptml.html
Input tree
An unrooted tree is used to run AdaptML. Usually I make one with phyml for 2,000 or fewer sequences:
phyml_v2.4.4/exe/phyml_linux all_u.phy 0 i 1 100 GTR e e 4 e BIONJ y y
Running AdaptML
Here is an example of how to run AdaptML:
python ../../latest_adaptml/habitats/trunk/AdaptML_Analyze.py tree=./All_simple_7_ur_particle.phy_phyml_tree.txt2 hab0=16 outgroup=ECK_1 write=./ thresh=0.025
Finding Stable Habitats
Although a set of habitats was predicted, you want to determine how often those same habitats would be predicted across 100 different iterations of AdaptML.
For example:
Now try to standardize with 100 runs:
foreach f ( 0 1 2 3 4 5 6 7 8 9 )
foreach? foreach d (0 1 2 3 4 5 6 7 8 9)
foreach? mkdir ${f}${d}_dir
foreach? python ../../latest_adaptml/habitats/trunk/AdaptML_Analyze.py tree=./All_simple_7_ur_particle.phy_phyml_tree.txt2 hab0=16 outgroup=ECK_1 write=./${f}${d}_dir/ thresh=0.025
foreach? perl parse_migration3.pl ./${f}${d}_dir/habitat.matrix > ./${f}${d}_dir/habitat.matrix.tab
foreach? end
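The same loop in bash, in case you are not in tcsh (identical commands, just bash syntax):
for f in 0 1 2 3 4 5 6 7 8 9; do
for d in 0 1 2 3 4 5 6 7 8 9; do
mkdir ${f}${d}_dir
python ../../latest_adaptml/habitats/trunk/AdaptML_Analyze.py tree=./All_simple_7_ur_particle.phy_phyml_tree.txt2 hab0=16 outgroup=ECK_1 write=./${f}${d}_dir/ thresh=0.025
perl parse_migration3.pl ./${f}${d}_dir/habitat.matrix > ./${f}${d}_dir/habitat.matrix.tab
done
done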
Make a list of all of the habitat.matrix.tab files:
ls *_dir/habitat.matrix.tab > migration_list.txt2
Then use temp_dist2.pl to get the habitat number distribution.
perl ~/bin/temp_dist2.pl migration_list.txt2
(number of habitats : number of runs out of 100)
4 5
5 21
6 53
7 20
8 1
The original migration.tab file in ./040609_dir/ has 6 habitats, so just use that to see the percent matching.
Then use get_stable_habitats6.pl; you might have to change the percent-matching threshold so it's as high as it can go before the script dies with double hits.
perl get_stable_habitats6.pl 0.015 ../040609_dir/migration.tab migration_list.txt2
Results with 0.024 (habitat : percent of runs matching):
11 39
3 100
13 100
10 80
12 100
14 99
Make clusters
Once you have found a set of stable habitats you want to go with, make the clusters. This example is for the tree ./example_noroot.phy_phyml_tree.txt3 with outgroup ECK_1, using 9900.file to make the final clusters (this used python 2.5.1, and you may have to change the paths to point to the AdaptML software):
python ~/AdaptML_dir/latest_AdaptML/habitats/trunk/AdaptML_Analyze.py tree=./example_noroot.phy_phyml_tree.txt3 hab0=16 outgroup=ECK_1 write=./example_dir thresh=0.05
perl ~/bin/migration2color_codes.pl ./example_dir/migration.matrix color_template.txt > ./example_dir/color.file
mkdir ./example_dir/rand_trials
python ~/AdaptML_dir/latest_AdaptML/clusters/getstats/rand_JointML.py habitats=./example_dir/migration.matrix mu=./example_dir/mu.val tree=./example_noroot.phy_phyml_tree.txt3 outgroup=ECK_1 write=./example_dir/rand_trials/
python ~/AdaptML_dir/latest_AdaptML/clusters/getstats/GetLikelihoods.py ./example_dir/rand_trials/ ./example_dir/
python ~/AdaptML_dir/latest_AdaptML/clusters/trunk/JointML.py habitats=./example_dir/migration.matrix mu=./example_dir/mu.val color=./example_dir/color.file tree=./example_noroot.phy_phyml_tree.txt3 write=./example_dir/ outgroup=ECK_1 thresh=./example_dir/9900.file
Alm lab Protocol for processing overlapping 16S rRNA reads from a MiSeq run
Experimental design
The specifics of the sequencing set-up and molecular construct will determine exactly how the data needs to be sequenced and processed. There are a few different designs that the Alm lab has set up:
1.) (Standard) Multiplexing different samples together to be sequenced in one lane of Illumina, marking each unique sample with a unique barcode on the reverse primer of step 2 that is read during the indexing read. This is common for up to 96 samples (or 105, including additional barcodes that are not in the 96-well format). The sequencing should be done not with the standard Illumina indexing primer, but with the reverse complement of the 2nd-read sequencing primer. This is a custom primer that should be included in the sequencing set-up. See the sequencing section below for the protocol.
2.) Multiplexing multiple different plates of samples together, using a barcode located 5' of the primer used in genome amplification (typically U515) plus a reverse barcode that is read during the indexing read. This is not typical for the Alm lab MiSeq protocol, since the roughly 12-25 million reads from a MiSeq run are only sufficient for about 100 samples, not more. However, it is a possible scenario for samples which do not require high coverage.
3.) Mixing both genome sequencing and 16S rRNA amplicon sequencing together in one lane. Adding genome library preps to 16S amplicon lanes improves the quality of the base calling by adding diversity without losing much to phiX sequencing. The genome library constructs typically contain barcodes in the forward and reverse read sequences, and do not typically have an indexing read associated with them. However, adding them to a lane which does have an index read is ok.
4.) An experimental set-up using both forward and reverse orientations of the 16S rRNA among different samples, and staggering the diversity region 5' of both primers used in genome amplification, allows for sufficient base diversity to run 16S rRNA libraries without wasting phiX data. In this case, half of the samples begin by sequencing from the U515 primer in the forward read, and half begin by sequencing from the U786 primer in the forward read. Additionally, the number of bases before the primer sequence varies from 4-9 bp.
Sequencing
Sequencing our construct on MiSeq is slightly different from standard Illumina MiSeq sequencing. Load the provided sample sheet (which arbitrarily specifies a 250 paired-end run with an 8nt barcode read) and spike 15uL of the anti-reverse BMC index primer @ 100uM (5' AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG 3') into tube 13 of the cartridge. This should provide three reads (forward, index and reverse) at 250, 8 and 250 bp, respectively.
(The following is courtesy of the Shapiro lab):
To generate an indexing file, you have to change the setup of MiSeq Reporter (MSR), because by default MSR doesn't generate a barcodes_reads.fastq.
In order to change that:
First, turn off MSR service: Task Manager, Services Tab, Right click on MiSeq Reporter and click stop.
Use NotePad to edit the MiSeqReporter.exe.config file, which can be found in C:\Illumina\MiSeq Reporter
The following needs to be included in the top portion of the file (the <appSettings> section)
<add key="CreateFastqForIndexReads" value="1" />
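After the edit, that portion of the file should look something like this (other keys elided):
<appSettings>
...
<add key="CreateFastqForIndexReads" value="1" />
</appSettings>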
Save then close.
You will then want to restart the service. This can be accomplished by right-clicking on the tool bar in Windows, selecting "Start Task Manager", selecting the "Services" tab, finding MiSeq Reporter on the list, and then selecting to stop and then start the service.
You can re-queue your run using the sample sheet WITH the index information on the sample. In our case, we used a very simple sample_sheet with one index like ATATATAT.
De-multiplexing
You can demultiplex at various stages. If there are multiple unrelated projects in the same run, I will pull out all of the reads that map to barcodes for only one project, so that I don't have to process extra data. You also have the option of removing unwanted barcodes at the qiime split_libraries_fastq.py step by providing a mapping file containing only the barcodes you want, but you may waste time overlapping reads if there are a lot of them. Do not use the following step if you will eventually work with all of the data. Only use it if you never need to work with the other data, since it doesn't make sense to process it at all.
Run this program to parse out sequences from the raw data according to the following order:
1.) Single barcodes that are unique in the sample set. These are possibly control samples or extra barcodes that were done uniquely, and the single barcode is the only piece of information indicating which sample the sequence came from
2.) Next, it looks for the presence of the forward and reverse barcodes in the forward and reverse read. These are typically from genome sequences.
3.) Finally, it looks for paired data, pulling out reads that have both the forward barcode and the indexing barcodes that match input samples.
All other samples that do not match are discarded.
Program:
perl parse_Illumina_multiplex2.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix>
The <mapping> input file should have the following fields (tab-delimited):
Barcode construction, output name, forward barcode name, forward barcode seq, forward barcode orientation, index barcode name, index barcode seq, index barcode orientation, reverse barcode name, reverse barcode seq, reverse barcode orientation
Samples with the same output name will be in the same file. Barcode construction must be one of the following exact fields: single, double or forbar. Use single for option 1 above (single barcodes identify the samples), double for option 2 (forward and reverse barcodes in the reads), and forbar for option 3 (forward barcode plus index read).
The output should be forward and reverse files labeled output_prefix.output_name.1 and output_prefix.output_name.2, respectively.
These can be used as the fastq files in downstream processes.
You can also use just the mapping file that would be the input to QIIME (not in the example above, but the default for QIIME), together with the index read generated as described above, if you just want to limit the data to the set of barcodes in your mapping file. In that case, run the following command:
perl parse_Illumina_multiplex_from_map_index.pl <Solexa File1> <Solexa File2> <mapping> <output_prefix> <index read>
The fastq files will contain only the reads whose barcodes are found in your mapping file and can be used in downstream analysis.
Overlapping the reads
You may have sufficient length to overlap the forward and reverse reads to create a longer sequence. This process is time consuming, but it gains phylogenetic resolution and can be useful for many applications. We use SHE-RA, which was created to make a sophisticated calculation of the quality of each overlapped base, given the quality of the two input bases and whether or not they match. Other software exists (and is faster), but it does multiple things at once, including trimming the sequences for quality, and will not provide as good an estimate of the quality of the overlapped bases. If other programs are used, it might be necessary to de-multiplex samples in some other way afterwards. With SHE-RA, we overlap paired-end sequences, then re-generate the fastq files to use with QIIME split_libraries_fastq.py.
First, divide up your samples into about 1 million reads per file, forward and reverse reads separately (SHERA has code for parallelization, but I couldn't get it to work).
general form-
perl split_fastq_qiime_1.8.pl <read> <number needed> <output prefix>
Example-
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_1_sequence.fastq 100 131001Alm_D13-4961_1_sequence.split
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_2_sequence.fastq 100 131001Alm_D13-4961_2_sequence.split
Then overlap each of the 100 files with SHERA, where ${PBS_ARRAYID} is the process number for parallel processing. (Remember to change the lib path in the code of concatReads.pl so the code runs from any folder: in a text editor like emacs, change the second line to the directory where the .pm files are, then save.)
general form-
perl concatReads.pl fastq_1 fastq_2 --qualityScaling sanger
example of actual command-
perl concatReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fastq 131001Alm_D13-4961_2_sequence.split.${PBS_ARRAYID}.fastq --qualityScaling sanger
Filter out the bad overlaps from the fa and quala generated with SHERA:
perl filterReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fa 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.quala 0.8
Use mothur to re-generate the fastq files:
mothur "#make.fastq(fasta=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fa, qfile=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.quala)"
Now, you will either have to fix the index file to contain only the reads in your file (if the index read is a separate file):
perl fix_index.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq 131001Alm_D13-4961_3_sequence.fastq > 131001Alm_D13-4961_3_${PBS_ARRAY_ID}.filter_0.8.fastq
Or, if you have to generate it from the header (if the index is already present in the header):
perl fastq2Qiime_barcode2.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq > 131001Alm_D13-4961_1_sequence.split.index.filter_0.8.fastq
This script handles the specific header configuration where the fastq headers look like the example below: it pulls out the longest string of base letters (ATGCKMRYSWBVHDNX) after the #, in this case TGGGACCT, and creates a fake quality string for the barcode as the lowercase of each barcode letter:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGTAGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTTCGGATTTATTGGGCGTAAAGGGTCCGCAGGGGGTT
CGCTAAGTCTGATGTGAAATCCCGGAGCTCAACTCCGGAACTGCATTGGAGACTGGTGGACTAGAGTATCGGAGAGGTAA
GCGGAATTCCAGGTGTAGCGGTGGAATGCGTAGATATCTGGAAGAACACCGAAAGCGAAGGCAGCTTACTGGACGGTAAC
TGACCCTCAGGGACGAAAGCGTGGGGATCAAACAGGATTAGAAACCCCTGTAGTCC
Result is:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGGGACCT
+
tgggacct
De-multiplex
Now the re-created fastq and index read can be used as normal with QIIME or other software of your choice:
split_libraries_fastq.py -i 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq -m mapping_file.txt -b 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.index.filter_0.8.fastq --barcode_type 8 --rev_comp_barcode --min_per_read_length .8 -q 10 --max_bad_run_length 0 -o unique_output_${PBS_ARRAYID} --phred_offset 33
This will create a seqs.fna file which can be used in downstream analysis.
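Since every command above references ${PBS_ARRAYID}, one way to run the whole pipeline (a sketch; overlap.sh is a hypothetical PBS script wrapping the concatReads/filterReads/mothur/split_libraries steps above) is to submit it as a job array so that ${PBS_ARRAYID} takes the values 1-100 (adjust the range to match how your split files are numbered):
qsub -q long -t 1-100 overlap.sh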
Please fill in as appropriate, also on the google docs spreadsheet that Carrie maintains.
Listed below are all of the sequencing runs and what was sequenced:
Nov 18 2012 121114Alm
Feb 22 2013 121217Alm
Mar 14 2013 130308Alm
Apr 12 2013 130320Alm
?
May 16 2013 130423Alm
?
Jul 24 18:59 130719Alm - Sequenced by Sarah Preheim with Julie Khodor for the Brandeis high school Genesis (?) program
Sep 2 20:49 130823Alm - Sequenced by Sarah Preheim with all environmental samples
Oct 7 10:47 131001Alm - Sequenced by Sarah Preheim with all environmental samples and one William's pond sample
Oct 17 15:05 131011Alm - Sequenced by Sarah Preheim with all environmental samples, testing the staggered primers (?)
Dec 2 12:04 131114AlmA - Sequenced by Sarah Preheim with environmental samples, Spence TR samples, testing the staggered and flipped primers
Nov 25 12:23 131114Alm
?
Dec 13 23:05 131126Alm - Sequenced by Sarah Preheim with environmental samples, and mouse IGA samples with staggered and flipped primers
Materials Needed:
- BioRupter (Parsons 3rd floor, in between the Polz and Chisholm labs)
- Quick blunting and ligation kit (NEB, E0542L 100RXNs $473.60)
- 10mM dNTP mix ( NEB, N0447L $216.00)
- also need a 1:10 dilution for 1mM dNTPs
- IGA adapter A# (10uM working solution)
- IGA adapter B#-PE (10uM working solution)
- SPRI beads (Beckman Coulter, A63882, 450mL $4,100)
- BST Polymerase large fragment (NEB, M0275S 1,600U $49.60)
- IGA-PCR-PE-F primer (40uM working solution)
- IGA-PCR-PE-R primer (40uM working solution)
- Phusion, with HF buffer (NEB, M0530L 500U $329.60 )
- SybrGreen (Invitrogen S7563 500uL 10,000x $235.00 )
- Qiaquick PCR cleanup column (Qiagen, 50 columns: 28104, $98.94; 250 columns: 28106, $465.60)
- MinElute Reaction clean up column (Qiagen 50 columns: 28204, $109.61; 250 columns: 28206, $503.43)
Protocol for library whole genome construction
1. Shear DNA by sonication. Make sure your sample is in 50ul of solution. Start with 2-20ug of DNA. Fill the BioRupter with water (up to 0.5 inches from the line) and then ice up to the line. Do 6 cycles, then replace the ice. Repeat for a total of 18-20 cycles of 30 seconds on/off with the “H” setting. Average 200-400 base pairs. Use the Agilent Bioanalyzer to confirm shear size.
2. End-repair
- Blunt and 5’-phosphorylate the sheared DNA from step 1 using the Quick blunting kit.
- Mix:
sheared DNA (2μg) 45.5μl
10x Blunting Buffer 6μl
1mM dNTP Mix 6μl
Blunt enzyme mix 2.5μl
TOTAL 60μl
- Incubate at RT for 30 minutes
- Purify using a Qiagen MinElute column (these are kept in the fridge). Elute in 12μl.
3. Ligate Solexa adaptors
- Solexa adapters must be hybridized before use. Heat to 95˚C for 5 minutes, then cool slowly to room temperature.
- Ligate adaptors, using a 10x molar excess of each one, and as much DNA as possible.
- Mix:
End-repaired DNA 10μl
100μM IGA adapter A# 1.25μl
100μM IGA adapter B#-PE 1.25μl
2X Quick Ligation Reaction Buffer (NEB) 15μl
Quick T4 Ligase (NEB) 2.5μl
TOTAL 30μl
- Incubate at RT for 15 minutes.
4. Size selection and purification using SPRI beads.
- Mix DNA and beads to appropriate ratio: 0.65X SPRI beads: Add 19.5 μl of SPRI beads to 30μl reaction from step 3.
- Incubate at RT for 20 minutes.
- Place tubes on magnet for 6 minutes.
- Transfer all SN to new tube. Discard beads.
- Mix DNA and beads to appropriate ratio, 1X SPRI beads: Add 10.5 μl SPRI beads to 49.5μl reaction.
- Vortex, spin.
- Incubate at RT for 7-20 minutes.
- Place tubes on magnet for 6 minutes.
- Remove all SN, keep beads.
- Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
- Repeat: Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
- Let dry completely for 15 minutes. Remove from magnet.
- Elute in 30μl EB.
- Vortex.
- Incubate at RT for 2 minutes.
- Put on magnet for 2 minutes
- Transfer SN to new tube.
5. Nick translation
- Bst polymerase can be used for nick translation---it can be used at elevated temperatures, which is good for melting secondary structures, and it lacks both 3’→5’ and 5’→3’ exonuclease activity.
- Mix:
Purified DNA 14 μl
10X Buffer (NEB) 2μl
10mM dNTPs 0.4μl
1mg/ml BSA 2μl
Water 0.6μl
Bst polymerase (Enzymatics) 1μl
TOTAL 20μl
- Incubate at 65 degrees, 25 minutes.
6. Library Enrichment by PCR.
- Perform 2 x 25μl reactions:
- Mix:
H2O 16.6μl
5X HF Buffer 5μl
dNTPs (10mM) 0.5μl
40μM Solexa PCR-A-PE 0.25μl
40μM Solexa PCR-B-PE 0.25μl
SybrGreenI 0.125μl
Nick-translated DNA 2μl
Phusion 0.25μl
TOTAL 25μl
- Program:
- 98˚C 70sec
- 98˚C 15sec
- 65˚C 20sec
- 72˚C 30sec
- Go to step 2 34 more times.
- 72˚C 5 min
- 4˚C Forever
- These 2 reactions are to check cycle time only. Look at the melting curves---use the mid-log point to pick the ultimate cycle time.
- Prep the PCR as above, but in 2 x 100μl reactions using 8μl of sample in each, and run with the cycle number determined above.
- Mix:
H2O 66.8μl
5X HF Buffer 20μl
dNTPs(10mM) 2μl
40μM Solexa PCR-A-PE 1μl
40μM Solexa PCR-B-PE 1μl
Nick-translated DNA 8μl
Phusion 1μl
TOTAL 100μl
- Run on a QIAElute column. Elute in 50ul. (You could also do a single SPRI---check the ratios of beads to reaction volume)
- Analyze using Bioanalyzer.
Introduction
I've collected information about tricks that a newbie might not know, but which are useful for getting around computational work. I'll try to keep adding stuff as I learn it. Please add your tricks too!
Parallel computing
I'm trying to get a better sense of how to design parallel scripts. Typically, I've inherited someone's code and made it work for me. However, I have been looking for a good basic resource, and I've found at least one site that looks promising. They have a few free courses that look relevant, like "Parallel computing explained", "Introduction to MPI" and "Intermediate MPI". I found this by looking at an MIT computing course which pointed to this site.
http://www.citutor.org/index.php
Although it's got a lot of basic information, it's hard to figure out how it helps, because I'm really not sure what type of clusters I'm actually using (i.e. which parts are relevant to me). It didn't really help me do any actual coding yet, although some of the background about computers was semi-interesting.
How to find stuff out about computing clusters
I wanted to know whether there was a website where you could just find out how to run stuff on a computer cluster (i.e. beagle, aces, or coyote). Basically, Scott said that only the sys admin knows all of the specific rules associated with each cluster, and if you don't pick their brain about it, you won't really know how to use it right. I will hopefully pick brains for you and put the answers on this website in other posts about each system. That's a work in progress.
You can find out about specifics of aces queues with:
qstat -Qf | more
or
qstat -q
Which results in this on aces:
server: login
Queue Memory CPU Time Walltime Node Run Que Lm State
--------------- ---- ------ ------ -- -- -- - -----
geom - - - - 0 0 -- E R
one - - 06:00:00 1 8 319 10 E R
four-twelve - - 12:00:00 -- 8 4 10 E R
four - - 02:00:00 16 8 437 10 E R
long - - 24:00:00 16 1 0 10 E R
all - - 02:00:00 1024 0 0 4 E R
mchen - - 02:00:00 1024 0 0 4 E R
mediumlong - - 96:00:00 30 0 0 10 E R
special - - - 36 0 0 - E R
toolong - - 168:00:0 4 0 0 10 E R
---- ----
25 760
And this on coyote:
server: wiley
Queue Memory CPU Time Walltime Node Run Que Lm State
--------------- ---- ------ ------ -- -- -- - -----
speedy - - 00:30:00 - 0 0 - E R
short - - 12:00:00 - 2 -2 - E R
long - - 48:00:00 - 68 46 - E R
quick - - 03:00:00 - 0 0 - E R
be320 - - 00:30:00 - 0 0 - E R
ultra - - 336:00:0 - 2 0 - E R
---- ----
72 44
You can also use this to find out more about qsub (run it somewhere like the head node, because not all nodes have the same qsub documentation):
man qsub
You can find out more about the various flags you can use with qsub.
Queuing system on clusters
Never run anything on the head node!!! When you log into a cluster, you need to submit jobs to a queue or work interactively on a dedicated interactive node. The dedicated interactive nodes will have different names, so you just have to find them. Sometimes you can request nodes with qsub -I, or ssh to a dedicated interactive node (qubert on aces and super-genius on coyote), but these options also depend on your system.
So, on any given cluster, there might be different queues (i.e. short, long, ultra-long) that you want to submit your jobs to. To find out (if you don't know already), you can run qstat, and the queue names will be in the last column. What each queue means might be obvious from its name and the amount of time things have run in each queue (short < 12 hours, ultra-long > 2500 hours), but this is likely just something you need to find out from someone who knows about the cluster, or from the sys admin again. Then if you want to submit to a specific queue, use something like this (I think, but I actually haven't done it exactly like this):
qsub -q short ....
Shortcut for ssh'ing
Scott also told me how to set up your computer to automatically fill in ssh information so you don't have to type it each time. You have a folder ~/.ssh/ and a file ~/.ssh/config, which should be modified to contain an entry like the following for each host (note the keyword is User, not Username):
Host aces
Hostname login.acesgrid.org
User spacocha
Then each time you want to ssh just type:
ssh aces
Works for scp too (and presumably other things).
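For example, after setting up the config entry above (the filename here is just a placeholder):
scp results.tar.gz aces:~/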
Downloading directly to the clusters
You can get stuff from a website using wget, for example:
wget https://github.com/swo/lake_matlab_sens/archive/master.zip
Running something on a detached screen:
use screen. This will help you figure stuff out:
man screen
or
screen --help
This starts screen:
screen -S SPPtest
This detaches but keeps it running:
hold "control" and "A" keys then type "D"
To reattach to detached screen:
screen -R SPPtest
To get rid of the screen altogether type this from within a screen:
exit
Introduction
I'm always trying to figure out which statistical test to use to analyze data, and I wish there were some sort of summary of when to use which statistic, what the limitations are, and how to use each test correctly. I'm going to try to add material here, including when to use each test, the hypothesis you are interested in testing, and how you can implement it. This is a work in progress, and I'm going to keep adding things as I learn about them.
Background and good references
I found a pretty good book that I thought might be useful, but I haven't gotten it yet. It's not in the MIT library, so I ordered it.
Statistical Analysis in Microbiology: Statnotes, by Richard A. Armstrong and Anthony C. Hilton (book).
Contingency tables
I have run into situations where I want to test my observations against a model. The observations I have are the counts of an OTU found at a set of discrete depths. The model I want to test is whether this new OTU has the same distribution with depth as another, more abundant and closely related OTU, as applied in distribution-based clustering:
http://aem.asm.org/content/79/21/6593.full?sid=76732af5-84eb-4f2b-8465-bd1c66283323
Here's a good basic video explaining the chi square test and the use of contingency tables:
http://mv.ezproxy.com/help/genetics-and-statistics
I use R to calculate the chi-square value. These are my basic cheat-sheet notes for using R:
> alleles <- matrix(c(12, 4, 15, 17, 25, 4), nr=3,
+ dimnames=list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))
> alleles
A1 A2
A1A1 12 17
A1A2 4 25
A2A2 15 4
> chisq.test(alleles)
Pearson's Chi-squared test
data: alleles
X-squared = 20.2851, df = 2, p-value = 3.937e-05
However, new information I'm getting is that for very large counts (like Illumina reads), your model will never fit, because you have so much data that even small variations will be significant. I found this to be true, so I got around it by testing whether the information content was the same (using the square root of the JSD), which is basically a workaround. I'm also looking into the Root Mean Square Error of Approximation to get around the problems with big numbers like Illumina count data, although I haven't tried it yet.
http://www.rasch.org/rmt/rmt254d.htm
Determining a bug of importance from 16S data
Alex Sheh, a postdoc in the Fox lab, was looking at changes in the microbiome associated with cancer. He had an output from the BioMicro Center bioinformatics pipeline that indicated two significant bugs, one type of Clostridia associated with cancer and a Bacteroidetes associated with wildtype (or health, I'm not sure). In another analysis using PLS-DA in the SIMCA software package, two bugs seemed to be significantly associated with protection and with cancer. I suggested that he figure out whether the two results were similar (initially we thought they might be). I wasn't sure which test was (or could have been) applied and how to interpret the data. I suggested he use Slime to figure out which bugs were associated with disease and protection, but wasn't sure whether that used the same tests as the other two, or if it would be an additional independent confirmation of the other results (by yet another test). Below, I plan to outline which tests to do, what the caveats are, when to apply these tests and when not to, and how to figure out whether the results are worth investing more money to verify.
This is all about how to compute on coyote. There's another site with other information at:
https://wikis.mit.edu/confluence/display/ParsonsLabMSG/Coyote
Gaining access:
I'm pretty sure that [greg] still helps out with this; you would probably need to ask him for access. Also, if you are working off campus, you need to log into your on-campus athena account first and then ssh onto coyote. If you have questions about this, just ask me. Also, put yourself on the mailing list by checking out the other link above.
What is available:
You can find out with "module avail"
modules of interest (to me):
module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add numpy/1.5.1
module add scipy/0.11.0
module add biopython
module add matplotlib/1.1.0
module add matlab
(the unversioned matlab module above is the 2009 release)
module add matlab/2012b
QIIME has been installed (load both of the following modules, in order!):
module add python/2.7.6
module add qiime/1.8.0
Interactive computing:
I was able to get onto a node with:
qsub -I -l nodes=1
But when I tried to use matlab (module add matlab) it didn't work (although it did work upon ssh'ing).
To run matlab with the window, first log in with X11:
ssh -X user@coyote.mit.edu
ssh -X super-genius
module add matlab/2012b
matlab
Submitting multiple jobs:
Before running the program below, make sure to load the following modules (I tried without them and got an error loading argparse):
module add python/2.7.3
module add atlas/3.8.3
module add suitesparse/20110224
module add biopython
module add java/1.6.0_21
You can also just source csmillie's .bashrc to make sure it works (if you didn't make any changes to yours that you need).
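For example (assuming his .bashrc is in the default location):
source /home/csmillie/.bashrc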
Also, there are different timed queues, so if you get this working, make sure it submits to the right queue. If you type qstat -q, you can see a list of queues and how many running and queued items each has. At the time I checked, there were six queues: speedy, short, long, quick, be320, and ultra. These have different allowed runtimes.
From Mark and Chris-
I've been using a script Chris wrote which works pretty well: /home/csmillie/bin/ssub
What it does
It streamlines job submission. If you give it a list of commands, it will (1) create scripts for them, and (2) submit them as a job array. You can give it the list of commands as command line arguments or through a pipe.
Quick examples
1. Submit a single command to the cluster: ssub "python /path/to/script.py > /path/to/output.txt"
2. Submit multiple commands to the cluster (semicolon separator): ssub "python /path/to/script1.py; python /path/to/script2.py"
3. Submit a list of commands to the cluster (newline separator): cat /list/of/commands.txt | ssub
Detailed example: /home/csmillie/alm/mammals/aln/95/
In this directory, I have 12,352 fasta files I want to align. I can do this on 100 nodes quite easily:
1. First, I create a list of commands: for x in `ls *fst`; do y=${x%.*}; echo muscle -in $x -out $y.aln; done > commands.txt
The output looks like this:
...
muscle -in O95_9990.fst -out O95_9990.aln
muscle -in O95_9991.fst -out O95_9991.aln
muscle -in O95_9992.fst -out O95_9992.aln
muscle -in O95_9993.fst -out O95_9993.aln
...
2. Then I submit these commands as a job array: cat commands.txt | ssub
How to configure it
Copy it to your ~/bin (or wherever). Then edit the top of the script:
uname = your username
tmpdir = directory where scripts are created
max_size = number of nodes you want to use
Other things
It automatically creates random filenames for your scripts and job arrays. These files are created in the directory specified by "tmpdir". It can also submit individual scripts instead of a job array.
Coyote queue
qstat -Qf | more
This will tell you the specifics of each queue. There is also no priority allocation, so please be polite and choose the right queue for your job.
Submitting multiple files to process in the same job
Example from Spence -
I wanted to write bash files that would submit multiple files for the same analysis command on coyote. I used PBS_ARRAYID, which takes on the values that you designate with the -t option of qsub.
I got access to qiime functions by adding the following line to the bottom of my .bashrc file:
export PATH="$PATH:/srv/pkg/python/python-2.7.6/pkg/qiime/qiime-1.8.0/qiime-1.8.0-release/bin"
Then I sourced my .bashrc file in my submission script (see below). The DATA variable just abbreviates the directory where I store my data.
To run all this, I created the file below and then typed the following at the command line:
$ module add python/2.7.6
$ module add qiime/1.8.0
$ qsub -q long -t 1-10 pickRepSet.sh
(the -t option will vary my PBS_ARRAYID variable from 1 to 10, iterating through my 10 experimental files).
#!/bin/sh
#filename: pickRepSet.sh
#
# PBS script to run a job on the myrinet-3 cluster.
# The lines beginning #PBS set various queuing parameters.
#
# -N Job Name
#PBS -N pickRepSet
#
# -l resource lists that control where job goes
#PBS -l nodes=1
#
# Where to write output
#PBS -e stderr
#PBS -o stdout
#
# Export all my environment variables to the job
#PBS -V
#
source /home/sjspence/.bashrc
DATA="/net/radiodurans/alm/sjspence/data/140509_ML1/"
pick_rep_set.py -i ${DATA}fwd_otu/uclust_picked_otus${PBS_ARRAYID}/ML1-${PBS_ARRAYID}_filt_otus.txt -f ${DATA}fwd_filt/dropletBC/ML1-${PBS_ARRAYID}_filt.fna -o ${DATA}fwd_otu/repSet/ML1-${PBS_ARRAYID}_rep.fna
Installing and running Slime on coyote (also note the trick below for installing R packages):
ssh onto super-genius
Clone slime into ~/lib/:
git clone https://github.com/cssmillie/slime.git
Then add r/3.0.2
module add r/3.0.2
I wanted to install some packages in R, but I couldn't download them directly, so I did the following:
In R:
> Sys.setenv(http_proxy="http://10.0.2.1:3128")
Then it should work:
install.packages('optparse')
install.packages('randomForest')
install.packages('ggplot2')
install.packages('plyr')
install.packages('reshape')
install.packages('car')
(However, I'm still having trouble running slime)
If I exit R and try to run slime from the command line, I had some success with this (although Chris said I need to be in the slime folder, because I edited run.r to include the path to utils.r to get it to work):
Rscript ~/lib/slime/run.r -m setup -x unique.f5.final.mat.transpose2 -y enigma.meta -o output > commands.sh
http://acesgrid.org/
Overview
ACES stands for the Alliance for Computational Earth Science. I got access from [greg] by emailing him, and he provided it within 24 hours. This came with an email from the grid system itself, with a link to their website, which has lots of useful information, including the times for office hours (Tuesdays from 11:30-1:30) and a list of software.
Information
Their website has some useful information, although it's not perfect. If you have a specific question, you might find the answer there.
http://acesgrid.org/getting_started.html
Some of the information might be old. For example, there are not currently office hours, and you should just email [greg] if you need help with something specific. If you sign up for the aces-support list, you will get some emails about this (although I imagine very infrequently).
I'll just summarize a few things I am aware of.
Storage
Because we don't have our own Alm lab storage and back-up system, you are only allowed 1GB on your home drive. There is quite a lot of space available on the scratch drive, but it is for short-term storage and will be deleted automatically. So scratch might be a good option when you need a lot of computational power to finish processing something; you can then move the output off and store it elsewhere. Look at the webpage for specifics.
Interactive computing
After normal login to the head node, ssh to a dedicated compute node with:
ssh qubert
or get a node through the queue interactively:
qsub -I -l nodes=1
The Queue system
The website has some examples of how to submit a job. Try to follow their examples.
From the login (head) node, you can find out about the queue system with the command:
qstat -q
And from the login (head) node, you can find out about the qsub command-line options and what they mean with (if you do this from an interactive node you will get a different manual):
man qsub
Note: You cannot qsub when you are logged into an interactive node (qubert); it says "qsub: command not found". Instead, qsub from the head node upon logging in.
I ran this script on aces by qsubbing the file below like this:
qsub multiple_scripts1-10
multiple_scripts1-10 is:
#!/bin/csh
#filename: pbs_script
#PBS -N pbs_script
#PBS -l nodes=1
#PBS -e stderr
#PBS -o stdout
# Export all my environment variables to the job
#PBS -V
if ( -f /etc/profile.d/modules.csh ) then
source /etc/profile.d/modules.csh
endif
module load matlab
matlab -nojvm -r "run(0.2985,1.0,75.0,10000.0,600.0,25.0,0.1,'rates_0_1.csv'); exit;"
MatLAB
I was specifically looking for a new way to use Matlab, since I had been using it on beagle but that cluster went down. This is how to get matlab working on ACES:
http://acesgrid.org/matlab.html
However, if you want to run interactively, you reserve a node:
qsub -I -l nodes=1
Then, in the next terminal window, you log in as you normally would with -X (maybe not as directed on the above webpage):
ssh -X aces
Then you ssh from aces onto the reserved node:
ssh -X reserved.node.name
This should work and bring up X11 (that's what I'm using):
module add matlab
matlab
QIIME
You can also get qiime to work: use the qsub command above to get interactive computing, and then use:
module add qiime
It's QIIME version 1.6 (I think), so it's got a few quirks that are different from version 1.3 that was on beagle. You might need to change some variable names etc., but the QIIME documentation should be helpful for this.
Space
So I was able to make my own directories in /data/ and /scratch/, but they didn't exist before I made them:
mkdir /scratch/spacocha
mkdir /data/spacocha
I'm able to write to that folder, so I think it should work fine.
File transfer
Although I can make the /scratch/spacocha folder, I can't scp to it. Instead, I scp big files, like fastq files, to /data/spacocha/ just fine.
Notes
I have been having trouble getting an interactive node using qsub for a few days. It seems like the cluster fills up occasionally, making it difficult to get stuff done in a hurry.
Summary
In summary, this cluster might be good for a few specific tasks, but it's not good for long-term storage, and it's geared towards the earth-science crowd (modeling ocean circulation etc.). It might have some functional capabilities that you could use (i.e. Matlab, QIIME), but be careful not to leave data on their scratch drive, because it will be deleted.
This information is specific to the 16S Illumina libraries. Multiplexed genome libraries should follow the information for the genome barcodes.
Outline:
In order to multiplex more than 96 samples into a lane, a forward barcode is required. This is because the reverse barcodes cost a lot of money to make, and you get more bang for your buck by reusing the same reverse barcodes with a different forward barcode. The forward barcode is a 5 bp sequence before the U515 F primer sequence. The forward primer must also include homology to the second-step forward primer sequence. The entire construct is depicted in Fig S1.pdf.
Forward Primer Barcode Sequences:
The forward barcode sequences that we currently have are here: Manually_copied_forward_barcodes.xls
- From Ilana Brito of the Alm lab (posted by Sarah Preheim)
Protocol for library whole genome construction
- 1. Shear DNA by sonication. Make sure your sample is in 50ul of solution. Start with 2-20ug of DNA. Fill the BioRupter with water (up to 0.5 inches from the line) and then ice up to the line. Do 6 cycles, then replace the ice. Repeat for a total of 18-20 cycles of 30 seconds on/off with the “H” setting. Average 200-400 base pairs.
- 2. End-repair
- Blunt and 5’-phosphorylate the sheared DNA from step 1 using the Quick blunting kit.
- Mix:
sheared DNA (2μg) 45.5μl
10x Blunting Buffer 6μl
1mM dNTP Mix 6μl
Blunt enzyme mix 2.5μl
TOTAL 60μl
- Incubate at RT for 30 minutes
- Purify using a Qiagen MinElute column (these are kept in the fridge). Elute in 12μl.
- 3. Ligate Solexa adaptors
- Solexa adapters must be hybridized before use. Heat to 95˚C for 5 minutes, then cool slowly to room temperature.
- Ligate adaptors, using a 10x molar excess of each one, and as much DNA as possible.
- Mix:
End-repaired DNA 10μl (12.5 pmol)
100μM IGA adapter A# 1.25μl (125 pmol)
100μM IGA adapter B#-PE 1.25μl (125 pmol)
2X Quick Ligation Reaction Buffer (NEB) 15μl
Quick T4 Ligase (NEB) 2.5μl
TOTAL 30μl
- Incubate at RT for 15 minutes.
- 4. Size selection and purification using SPRI beads.
- Mix DNA and beads to appropriate ratio: 0.65X SPRI beads: Add 19.5 μl of SPRI beads to 30μl reaction from step 3.
- Incubate at RT for 20 minutes.
- Place tubes on magnet for 6 minutes.
- Transfer all SN to new tube. Discard beads.
- Mix DNA and beads to appropriate ratio, 1X SPRI beads: Add 10.5 μl SPRI beads to 49.5μl reaction.
- Vortex, spin.
- Incubate at RT for 7-20 minutes.
- Place tubes on magnet for 6 minutes.
- Remove all SN, keep beads.
- Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
- Repeat: Wash with 500μl 70% EtOH, incubate for 30 seconds, remove all SN.
- Let dry completely for 15 minutes. Remove from magnet.
- Elute in 30μl EB.
- Vortex.
- Incubate at RT for 2 minutes.
- Put on magnet for 2 minutes
- Transfer SN to new tube.
- 5. Nick translation
- Bst polymerase can be used for nick translation---it can be used at elevated temperatures, which is good for melting secondary structures, and it lacks both 3’→5’ and 5’→3’ exonuclease activity.
- Mix:
Purified DNA 14 μl
10X Buffer (NEB) 2μl
10mM dNTPs 0.4μl
1mg/ml BSA 2μl
Water 0.6μl
Bst polymerase (Enzymatics) 1μl
TOTAL 20μl
- Incubate at 65 degrees, 25 minutes.
- 6. Library Enrichment by PCR.
- Perform 2 x 25μl reactions:
- Mix:
H2O 19.125μl
10X Pfu Turbo buffer 2.5μl
dNTPs 10mM 0.5μl
40μM Solexa PCR-A-PE 0.25μl
40μM Solexa PCR-B-PE 0.25μl
SybrGreenI 0.125μl
Nick-translated DNA 2μl
Pfu Turbo 0.25μl
TOTAL 25μl
- Program:
- 95˚C 120sec
- 95˚C 30sec
- 60˚C 30sec
- 72˚C 60sec
- Go to step 2 34 more times.
- 72˚C 5 min
- 4˚C Forever
- These 2 reactions are to check cycle time only. Look at the melting curves---use the mid-log point to pick the ultimate cycle time.
- Prep the PCR as above, but in 2 x 100μl reactions using 8μl of sample in each, and run with the cycle number determined above.
- Mix:
H2O 77μl
10X Pfu Turbo buffer 10μl
dNTPs 10mM 2μl
40μM Solexa PCR-A-PE 1μl
40μM Solexa PCR-B-PE 1μl
Nick-translated DNA 8μl
Pfu Turbo 1μl
TOTAL 100μl
- Run on a QIAElute column. Elute in 50ul. (You could also do a single SPRI---check the ratios of beads to reaction volume)
- Analyze using Bioanalyzer.
Overview
We have designed barcodes to multiplex samples together in a single Illumina lane. Currently, only three reads are supported by Illumina: a forward read, a reverse read and a barcode read. However, we have incorporated an additional barcode into the first read as well. The current design is outlined in Fig S1.pdf.
Designing Illumina amplicon libraries
Any PCR amplicon (16S, TCR-beta, etc.) can be used with this scheme, since it was designed to be modular. The first-step primers must contain the following:
1.) The genomic DNA primer binding sites (to attach and extend the PCR product)
2.) The forward primer must contain some site diversity. This diversity is important for cluster identification; having the first read begin with conserved primer sequence will severely impact the quality of the data. In Fig_S1.pdf, this diversity region is a string of YRYR (N's cannot be used - with IDT anyway - unless you specify an equal ratio of the four bases, and that might be costly). However, you can additionally add another set of barcodes and order different step-one primers with the forward barcodes attached. The only caveat with this method is that you need at least four different forward barcodes in one lane to get enough diversity. The barcodes should be added to the lane relatively evenly, in a ratio of 1:1:1:1 of each barcode. More than four barcodes in the forward read should increase the quality of the calls.
Specs
Here are specs for the most recent reverse barcodes:
Uri Laserson_6957574_6123588.XLS
There are also 9 additional barcodes outside of the 96 in the plate above: 097-105. These can be used for multiplexing mock or control samples into your lane separately.
Name | Sequence
PE-IV-PCR-097 | CATTTCGCT
PE-IV-PCR-098 | TTGCTCGTG
PE-IV-PCR-099 | TCCGCTCAC
PE-IV-PCR-100 | CCCAACAAA
PE-IV-PCR-101 | GCAGACCAA
PE-IV-PCR-102 | TGGCGATAT
PE-IV-PCR-103 | TGGTTCTGC
PE-IV-PCR-104 | GGTACGAGT
PE-IV-PCR-105 | ACCCGTTCG
Overview
In order to make this site more useful to everyone, here are a few tips on how to use it and how to make a wiki blog post so that others will be able to find the information they need. This should make the system more user-friendly for all.
How to use this site to find information
Labels heatmap:
In order to find information that you want, look at the bottom of the home page for the specific labels containing the words that you are interested in. For example, if you want to learn about how to process raw fastq data from an Illumina library, you might start by clicking the Illumina label at the bottom of the home page.
Theme pages:
Alternatively, there are some theme pages which list all posts on a particular theme. You can use the Bioinformatics link on the left-hand side of the home page to get to that page, which lists all of the posts with the Bioinformatics label. Look through all of those posts.
Search box:
You can also search the whole Wiki from the search box at the top right.
How to add information to this site
Blog posts:
The easiest way to keep the pages self-organizing is to input your information as blog posts and use the appropriate labels. When choosing labels, consider each word as a meaningful label (if you want to put a space in a term like 16S library, use an underscore, as in 16S_library, since library by itself might not be the most useful term). Try to use labels that others have already chosen if possible, but add new labels if it seems appropriate.
Theme pages:
In addition to the list of all labels on the front page, it might be nice to make pages that gather similar themes. Some of the major themes are already there, but feel free to add a theme when it is appropriate (for example, if there are any sampling protocols, we might want to make a field work page or something).
Additional Information:
For those of you who are interested, explore all of the options available on the Wiki, which include calendars and the like. If you have a need, please use these tools to increase the utility of this site.
How to make a blog post
It's easy to make a blog post. Just go to the top right-hand corner of any page, where it says "Add", and choose "Blog Post". The great thing is that you can add attachments, links and images to the post with the insert button. There are also a lot of great macros to choose from; if you have something specific in mind, you might be able to find one for it. Most importantly, you should add labels at the bottom under "Labels:". This is an important step for the site to be self-organizing and for information to be readily accessible to others. You can add pages as well, but this might take some organization. Pages and blogs seem to me to be identical in the way they are created, so the same things apply if you want to make a page.
Thanks for sharing your expertise with the group!