Alm lab
Protocol for processing overlapping 16S rRNA reads from a MiSeq run
Experimental design
The specifics of the sequencing set-up and molecular construct will determine exactly how the data needs to be sequenced and processed. There are a few different designs that the Alm lab has set up:
1.) (Standard) Multiplexing different samples together to be sequenced in one lane of Illumina, marking each unique sample with a unique barcode on the reverse primer of step 2 that is read during the indexing read. This is common for up to 96 samples (or 105 including additional barcodes that are not in the 96-well format). The sequencing should be done not with the standard Illumina indexing primer, but with the reverse complement of the 2nd read sequencing primer. This is a custom primer that should be included in the sequencing set-up. See the Sequencing section below for the protocol.
2.) Multiplexing multiple different plates of samples together using a barcode located 5' to the primer used in genome amplification (typically U515) and a reverse barcode that is read during the indexing read. This is not typical for the Alm lab MiSeq protocol, since the roughly 12-25 million reads from a MiSeq run are sufficient for about 100 samples, not more. However, it is a possible scenario for samples that do not require high coverage.
3.) Mixing both genome sequencing and 16S rRNA amplicon sequencing together in one lane. Adding genome library preps to 16S amplicon lanes improves the quality of the base calling by adding diversity without losing much to phiX sequencing. The genome library constructs typically contain barcodes in the forward and reverse read sequences, and do not typically have an indexing read associated with them. However, adding them to a lane that does have an index read is OK.
4.) An experimental set-up using both forward and reverse orientations of the 16S rRNA among different samples, and staggering the diversity region 5' of both primers used in genome amplification. This allows for sufficient base diversity to run 16S rRNA libraries without wasting data on phiX. In this case, half of the samples begin by sequencing from the U515 primer in the forward read, and half begin by sequencing from the U786 primer in the forward read. Additionally, the number of bases before the primer sequence varies from 4 to 9 bp.
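
For design 4, the staggered bases have to be stripped before the primer can be recognized. Here is a minimal Python sketch of one way to do that. It is not the lab's actual script: the function name strip_stagger is made up for illustration, and the U515 sequence used is the commonly published 515F primer, which is an assumption here since this page does not give the primer sequence.

```python
import re

# IUPAC degenerate bases expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}

# Assumption: the commonly published 515F sequence stands in for U515.
U515 = "GTGCCAGCMGCCGCGGTAA"


def strip_stagger(read, primer=U515, min_pad=4, max_pad=9):
    """Return (stagger, remainder) if the primer starts min_pad..max_pad
    bases into the read, otherwise None. The remainder starts at the primer."""
    primer_pattern = "".join(IUPAC[base] for base in primer)
    # Lazy quantifier: try the shortest stagger first.
    pattern = r"([ACGTN]{%d,%d}?)(%s)" % (min_pad, max_pad, primer_pattern)
    match = re.match(pattern, read)
    if match is None:
        return None
    return match.group(1), read[match.end(1):]


if __name__ == "__main__":
    # First bases of the example read shown later in this protocol.
    print(strip_stagger("TGTAGTGCCAGCCGCCGCGGTAATACGTAGG"))
```

Exact matching is used here for simplicity; real reads may need some tolerance for sequencing errors in the primer.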
Sequencing
Sequencing our construct on MiSeq is slightly different from standard Illumina MiSeq sequencing. Load the provided sample sheet (which arbitrarily specifies a 250 paired-end run with an 8 nt barcode read) and spike 15 µL of the anti-reverse BMC index primer at 100 µM (5' AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG 3') into tube 13 of the cartridge. This should provide three reads (forward, index and reverse) of 250, 8 and 250 bp, respectively.
De-multiplexing
Depending on whether your run contains data from other projects that you do not want to process, you can demultiplex at various stages. Typically, if there are multiple unrelated projects in the same run, I will first pull out all of the reads that map to the specific barcodes I am interested in, so that I don't have to process extra data. You also have the option of removing unwanted barcodes at the QIIME split_libraries_fastq.py step by providing a mapping file containing only the barcodes you want. This makes sense if you will eventually work with all of the data, but in sets. Otherwise, if you will never need to work with the other data in the lane, it doesn't make sense to process it at all.
Run this program to parse out sequences from the raw data according to the following order:
1.) Single barcodes that are unique in the sample. These are possibly control samples or extra barcodes that were used only once, so the barcode is the only piece of information indicating which sample the sequence came from.
2.) Next, it looks for the presence of the forward and reverse barcodes in the forward and reverse reads. These are typically from genome sequences.
3.) Finally, it looks for paired data, pulling out reads that have both the forward barcode and the indexing barcode that match input samples.
All other reads that do not match are discarded.
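
The lookup order above can be sketched in Python as follows. This is only an illustration of the order of precedence, not the program itself: the function name assign_sample and the three lookup tables are hypothetical, barcodes are assumed to be already extracted from each read, and matching is exact. The protocol does not say which read a "single" barcode sits in, so the sketch simply checks all three.

```python
def assign_sample(fwd_bc, rev_bc, idx_bc, singles, doubles, forbars):
    """Return the output name for one read pair, or None to discard it.

    singles maps a barcode to an output name; doubles maps a
    (forward, reverse) barcode pair; forbars maps a (forward, index) pair.
    """
    # 1.) Single barcodes that identify a sample on their own.
    for barcode in (fwd_bc, rev_bc, idx_bc):
        if barcode in singles:
            return singles[barcode]
    # 2.) Forward and reverse barcodes in the forward and reverse reads
    #     (typically genome libraries).
    if (fwd_bc, rev_bc) in doubles:
        return doubles[(fwd_bc, rev_bc)]
    # 3.) Forward barcode paired with the indexing barcode.
    if (fwd_bc, idx_bc) in forbars:
        return forbars[(fwd_bc, idx_bc)]
    # Anything else does not match an input sample.
    return None


if __name__ == "__main__":
    singles = {"AAAA": "control"}
    doubles = {("CCCC", "GGGG"): "genome1"}
    forbars = {("CCCC", "TTTT"): "plate1_A1"}
    print(assign_sample("CCCC", "ACGT", "TTTT", singles, doubles, forbars))
```

Checking the tables in a fixed order means a read that could match more than one scheme is always assigned by the earliest rule.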
Program: ... <Solexa File1> <Solexa File2> <mapping> <output_prefix>
<mapping> input file should have the following fields (tab delimited, here's an ... ):
Barcode construction, output name, forward barcode name, forward barcode seq, forward barcode orientation, index barcode name, index barcode seq, index barcode orientation, reverse barcode name, reverse barcode seq, reverse barcode orientation
Samples with the same output name will be in the same file. Barcode construction must be one of the following exact fields: single, double or forbar. Use single for option 1 above (single barcodes identify the samples), the 2
The output should be forward and reverse files labeled output_prefix.output_name.1 and output.2 respectively. These can be used as the fastq files in downstream processes.
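
To make the mapping file layout concrete, here is a small Python sketch that reads the eleven tab-delimited fields and checks the barcode construction value. The names parse_mapping, MappingRow and output_files are invented for this example, and the name of the second output file is assumed to mirror the first (output_prefix.output_name.2); check against the program's real output before relying on it.

```python
from collections import namedtuple

# Column order as listed in the protocol.
FIELDS = [
    "construction", "output_name",
    "fwd_name", "fwd_seq", "fwd_orientation",
    "idx_name", "idx_seq", "idx_orientation",
    "rev_name", "rev_seq", "rev_orientation",
]
MappingRow = namedtuple("MappingRow", FIELDS)
VALID_CONSTRUCTIONS = {"single", "double", "forbar"}


def parse_mapping(lines):
    """Parse tab-delimited mapping lines into MappingRow records."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError("line %d: expected %d fields, got %d"
                             % (number, len(FIELDS), len(parts)))
        row = MappingRow(*parts)
        if row.construction not in VALID_CONSTRUCTIONS:
            raise ValueError("line %d: bad barcode construction %r"
                             % (number, row.construction))
        rows.append(row)
    return rows


def output_files(prefix, row):
    """Forward and reverse output names (second name is an assumption)."""
    base = "%s.%s" % (prefix, row.output_name)
    return base + ".1", base + ".2"
```

Rows that share an output name produce the same pair of file names, which matches the note that samples with the same output name end up in the same file.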
Overlapping the reads
You may have sufficient length to overlap the forward and reverse reads to create a longer sequence. This process will be time consuming, but it gains phylogenetic resolution and can be useful for many applications.
We use SHE-RA, which was created to provide a sophisticated calculation of quality for overlapped bases, given the quality of each overlapped base and whether or not they match. Other software exists (and is faster), but it does multiple things at once, including trimming the sequences for quality, and will not provide as good an estimate of the quality of the overlapped bases. If other programs are used, it might be necessary to use other programs to de-multiplex samples afterwards. With SHE-RA, we overlap, then re-generate the fastq files to use with QIIME split_libraries_fastq.py.
First, divide your reads into files of about 1 million reads each. A file of this size can typically be processed on our computers in about 10 hours.
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_1_sequence.fastq 100 131001Alm_D13-4961_1_sequence.split
perl ~/bin/split_fastq_qiime_1.8.pl 131001Alm_D13-4961_2_sequence.fastq 100 131001Alm_D13-4961_2_sequence.split
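
If the Perl splitter is not available, the same idea can be sketched in Python. This is not a port of split_fastq_qiime_1.8.pl, whose exact behavior is not described here. It assumes unwrapped 4-line fastq records, writes consecutive chunks of a fixed size, and names the outputs prefix.N.fastq starting at 1 (an assumption chosen to look like the file names used in the next steps). The function names are made up for this example.

```python
from itertools import islice


def read_fastq(handle):
    """Yield 4-line fastq records as tuples of lines."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record


def split_fastq(in_path, out_prefix, records_per_file=1000000):
    """Write consecutive chunks of records to out_prefix.N.fastq.

    Returns the list of files written, in order.
    """
    paths = []
    out = None
    try:
        with open(in_path) as handle:
            for count, record in enumerate(read_fastq(handle)):
                if count % records_per_file == 0:
                    # Start a new chunk file.
                    if out is not None:
                        out.close()
                    path = "%s.%d.fastq" % (out_prefix, len(paths) + 1)
                    paths.append(path)
                    out = open(path, "w")
                out.writelines(record)
    finally:
        if out is not None:
            out.close()
    return paths
```

Splitting the forward and reverse files with the same chunk size keeps read pairs in matching chunk numbers, which the overlap step depends on.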
Then, overlap each of the 100 files with SHERA, where ${PBS_ARRAYID} is the process number for parallel processing:
perl ~/bin/SHERA_code/concatReads_1.8.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fastq 131001Alm_D13-4961_2_sequence.split.${PBS_ARRAYID}.fastq --qualityScaling illumina
Filter out the bad overlaps from the fa and quala generated with SHERA:
perl /mit/spacocha/bin/SHERA_code/filterReads.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.fa 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.quala 0.8
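
As a rough picture of what a 0.8 cut-off could mean, the Python sketch below keeps only records whose header ends in a score of at least 0.8. This rests on an assumption that is not stated in this protocol: that the last underscore-separated field of an overlapped header (the 0.90 in ..._204bp_199.2_0.90) is the overlap confidence that filterReads.pl thresholds on. The function names are hypothetical, and this is not a replacement for filterReads.pl.

```python
def overlap_confidence(header):
    """Return the trailing numeric field of a header such as
    '>MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90'.

    Assumption: that field is the overlap confidence score.
    """
    return float(header.rsplit("_", 1)[1])


def filter_overlaps(records, threshold=0.8):
    """Keep (header, sequence) pairs whose confidence is >= threshold."""
    return [(header, seq) for header, seq in records
            if overlap_confidence(header) >= threshold]


if __name__ == "__main__":
    records = [
        (">MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90", "ACGT"),
        (">MISEQ:1:2106:21797:11096#TGGGACCT_12bp_3.1_0.50", "ACGT"),
    ]
    for header, _ in filter_overlaps(records):
        print(header)
```

The matching quala records would need to be filtered with the same set of headers so that the fasta and quality files stay in step.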
Use mothur to re-generate the fastq files:
mothur "#make.fastq(fasta=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fa, qfile=131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.quala)"
Now, you will either have to fix the index file so that it contains only the reads in your file (if the index read is a separate file):
perl ~/bin/fix_index.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq 131001Alm_D13-4961_3_sequence.fastq > 131001Alm_D13-4961_3_${PBS_ARRAYID}.filter_0.8.fastq
Or, if you have to generate it from the header (if the index is already present in the header):
perl /mit/spacocha/bin/fastq2Qiime_barcode2.pl 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq > 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.index.filter_0.8.fastq
This script works for a specific header configuration in which the fastq records look like the example below. It pulls out the longest string of base letters (ATGCKMRYSWBVHDNX) after the #, in this case TGGGACCT, and creates a placeholder quality string by lower-casing each barcode letter:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGTAGTGCCAGCCGCCGCGGTAATACGTAGGTGGCGAGCGTTGTTCGGATTTATTGGGCGTAAAGGGTCCGCAGGGGGTTCGCTAAGTCTGATGTGAAATCCCGGAGCTCAACTCCGGAACTGCATTGGAGACTGGTGGACTAGAGTATCGGAGAGGTAAGCGGAATTCCAGGTGTAGCGGTGGAATGCGTAGATATCTGGAAGAACACCGAAAGCGAAGGCAGCTTACTGGACGGTAACTGACCCTCAGGGACGAAAGCGTGGGGATCAAACAGGATTAGAAACCCCTGTAGTCC
Result is:
@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90
TGGGACCT
+
tgggacct
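
The header rule described above (longest run of base letters after the #, with the lower-cased barcode as a stand-in quality line) can be sketched in Python like this. It is an illustration of the rule as written here, not the source of fastq2Qiime_barcode2.pl, and the function name index_record_from_header is made up.

```python
import re

# Upper-case base letters listed in the protocol; lower case does not count.
BASE_RUN = re.compile(r"[ATGCKMRYSWBVHDNX]+")


def index_record_from_header(header):
    """Build a 4-line fastq index record from a read header.

    The barcode is the longest run of base letters after the '#';
    if two runs tie, the first one is used.
    """
    _, _, tail = header.partition("#")
    runs = BASE_RUN.findall(tail)
    if not runs:
        raise ValueError("no barcode found after '#' in %r" % header)
    barcode = max(runs, key=len)
    # Placeholder quality: the barcode itself in lower case.
    return "\n".join([header, barcode, "+", barcode.lower()]) + "\n"


if __name__ == "__main__":
    header = "@MISEQ:1:2106:21797:11095#TGGGACCT_204bp_199.2_0.90"
    print(index_record_from_header(header), end="")
```

The lower-case letters are not real quality scores, so any quality filtering on the barcode read should be turned off or set very loosely downstream.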
De-multiplex
Now the re-created fastq and index read can be used as normal with QIIME or other software of your choice:
split_libraries_fastq.py -i 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.filter_0.8.fastq -m mapping_file.txt -b 131001Alm_D13-4961_1_sequence.split.${PBS_ARRAYID}.index.filter_0.8.fastq --barcode_type 8 --rev_comp_barcode --min_per_read_length .8 -q 10 --max_bad_run_length 0 -o unique_output_${PBS_ARRAYID} --phred_offset 33
This will create a seqs.fna file which can be used in downstream analysis.
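
The --rev_comp_barcode option in the command above tells QIIME to reverse complement each barcode read before looking it up in the mapping file. If it is unclear which orientation your mapping file uses, a quick Python check like the one below can help. The function name is just for this example, and only plain A/C/G/T (in either case) is handled.

```python
# Translation table for plain DNA bases; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Barcode taken from the example header earlier in this protocol.
    print(reverse_complement("TGGGACCT"))
```

Comparing both orientations of a few index reads against the mapping file barcodes shows quickly whether the flag is needed for a given run.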