Data Normalization and Standardization
Page 1
1
Data Normalization
and Standardization
the benefits of pre-processing
microarray data
Ben Bolstad
Statistics, University of California, Berkeley
bmb@bmbolstad.com
http://bmbolstad.com

Page 2
2
Outline
• Introduction
• Pre-processing methodologies as they relate to
▪ Two channel arrays
▪ Affymetrix GeneChips (a popular single channel
array)

Page 3
3
Workflow for a typical microarray experiment
Biological Question -> Experimental Design -> Microarray Experiment -> Pre-processing (low-level analysis) -> High-level analysis -> Biological verification and interpretation
• Pre-processing (low-level analysis): Image Quantification, Background Adjustment, Normalization, Summarization, Quality Assessment
• High-level analysis: Estimation, Testing, Annotation, Clustering, Discrimination, …
• Pre-processing turns Images into Expression Values:
          Array 1   Array 2   Array 3
Gene 1    10.05      9.58      9.76
Gene 2     4.12      4.16      4.05
Gene 3     6.05      6.04      6.08

Page 4
4
Introduction to preprocessing
• Pre-processing typically constitutes the initial (and
possibly most important) step in the analysis of
data from any microarray experiment
• Often ignored or treated like a black box (but it
shouldn’t be)
• Consists of:
▪ Data exploration
▪ Background correction, normalization,
summarization
▪ Quality Assessment
• These are interlinked steps

Page 5
5
Background Correction/Signal
Adjustment
• A method which does some or all of the following:
▪ Corrects for background noise, processing effects on
the array
▪ Adjusts for cross hybridization (non-specific binding)
▪ Adjusts estimated expression values to fall across an
appropriate range

Page 6
6
Normalization
“Non-biological factors can contribute to the variability of data ...
In order to reliably compare data from multiple probe arrays,
differences of non-biological origin must be minimized.”¹
• Normalization is the process of reducing unwanted variation
either within or between arrays. It may use information from
multiple chips.
• Typical assumptions of most major normalization methods
are (one or both of the following):
▪ Only a minority of genes are expected to be differentially
expressed between conditions
▪ Any differential expression is as likely to be up-regulation as
down-regulation (i.e. about as many genes going up in
expression as are going down between conditions)
1 GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support

Page 7
7
A brief word on the term
“Normalization”
• Many use the term “normalization” to refer to
everything being discussed in this session. In other
words they treat “normalization” and “pre-
processing” as being synonymous with each other.
• I view normalization as just one of the steps in the
process (although a very important one).

Page 8
8
Summarization
• Reducing multiple measurements on the same
gene down to a single measurement by combining
in some manner.
• Most relevant to Affymetrix Arrays as we will see a
little later ….

Page 9
9
Quality Assessment
• Need to be able to differentiate between good and
bad data.
• Bad data could be caused by poor hybridization,
artifacts on the arrays, inconsistent sample
handling, …..
• An admirable goal would be to reduce systematic
differences with data analysis techniques.
• Sometimes there is no option but to completely
discard an array from further analysis. How to
decide …..

Page 10
10
Two-channel arrays

Page 11
11
Image analysis for two color
arrays
• The raw data from a cDNA microarray experiment
consist of pairs of image files, 16-bit TIFFs, one for
each of the dyes.
• Image analysis is required to extract measures of
the red and green fluorescence intensities for each
spot on the array.

Page 12
12
Image analysis
1. Addressing. Estimate location of
spot centers.
2. Segmentation. Classify pixels as
foreground (signal) or background.
3. Information extraction. For
each spot on the array and each
dye
• signal intensities;
• background intensities;
• quality measures.
R and G for each spot on the array.

Page 13
13
Red/Green overlay images
Good: low bg, lots of d.e. Bad: high bg, ghost spots, little d.e.
(bg = background, d.e. = differential expression)
Co-registration and overlay offer a quick visualization,
revealing information on colour balance, uniformity of
hybridization, spot uniformity, background, and artifacts
such as dust or scratches.

Page 14
14
Histograms
Signal/Noise = $\log_2(\text{spot intensity}/\text{background intensity})$

Page 15
15
Slide 3 of the swirl data: used in all that follows.

Page 16
16
Tools for exploring the data
R vs G
Important: Always log, always rotate
Bad

Page 17
17
Tools for exploring the data
$\log_2 R$ vs $\log_2 G$
Important: Always log, always rotate
Better

Page 18
18
Tools for exploring the data
$M = \log_2(R/G)$ vs $A = \log_2\sqrt{RG}$
Important: Always log, always rotate
Best
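The M and A values used in the plot above can be computed directly from the two channel intensities. The following is a minimal sketch, assuming background-corrected, strictly positive intensities; the function name `ma_values` and the example numbers are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities."""
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        # M is the log ratio log2(R/G).
        m_vals.append(math.log2(r / g))
        # A is the average log intensity, log2(sqrt(R*G)).
        a_vals.append(0.5 * (math.log2(r) + math.log2(g)))
    return m_vals, a_vals

# Made-up intensities for three spots (assumed already background corrected).
red = [1200.0, 800.0, 150.0]
green = [600.0, 800.0, 300.0]
M, A = ma_values(red, green)
print(M)
print(A)
```

Plotting M against A is, up to scaling, a 45 degree rotation of the plot of log2 R against log2 G, which is the "always log, always rotate" advice in practice.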

Page 19
19
MA-plot

Page 20
20
Spatial plots: background

Page 21
21
Spatial plots: log ratios (M)
No reason to constrain
yourself to red/green
when visualizing

Page 22
22
Boxplots

Page 23
23
Background correction
• Normally this is just a matter of subtracting the background
value from the foreground intensity for each spot, separately
in the Red and Green channels,
i.e. $R' = R - R_b$, $G' = G - G_b$,
where R, Rb, G, Gb are all from the output of the image
analysis stage (some people instead use models based on
these values to derive corrections)
• From here on in we will assume that background correction
has taken place.

Page 24
24
Background Correction
• Note that the image analysis program you use can
have quite an impact at this stage by drastically
increasing variability, particularly at low intensities.
Figure: same array, different image analysis and background correction (panels: GenePix, Spot).
Note: this array is not slide 3 of the swirl data.

Page 25
25
Normalization for two color
arrays
• Why?
▪ To correct for systematic differences
between samples on the same slide, or
between slides, which do not represent
true biological variation between samples.
• How do we know it is necessary?
▪ By examining self-self hybridizations,
where no true differential expression is
occurring.
▪ We find dye biases which vary with overall
spot intensity, location on the array, plate
origin, pins, scanning parameters,….

Page 26
26
Levels of Normalization
for two color arrays
• Within-slides
▪ Which genes to use?
▪ Location normalization
▪ Scale normalization
• Paired-slides (dye-swap)
▪ Self-normalization
• Between-slides

Page 27
27
False color overlay
Boxplots within Grid plots
MA-plots
Self-self hybridizations

Page 28
28
Scaling Normalization
$\log_2 R/G \rightarrow \log_2 R/G - c = \log_2 R/(kG)$
Standard practice (in most software)
c is a constant such as the mean or median log ratio, so that $k = 2^c$.
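As a concrete illustration of scaling normalization, the sketch below subtracts the median log ratio from every M value on a slide. It is a minimal example with made-up numbers; `median_normalize` is a hypothetical helper name, and the median is only one of the possible choices of c mentioned on the slide.

```python
import statistics

def median_normalize(m_values):
    """Subtract the median log ratio c from every M value on one slide."""
    c = statistics.median(m_values)
    return [m - c for m in m_values], c

# Made-up log ratios with a global shift of about one log2 unit.
M = [0.9, 1.1, 1.0, 0.4, 1.6]
M_norm, c = median_normalize(M)
k = 2 ** c  # equivalent multiplicative factor applied to the green channel
print(c, k)
print(M_norm)
```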

Page 29
29
MA-plot after scaling
Before Scaling
After Scaling

Page 30
30
Intensity dependent
adjustment
$\log_2 R/G \rightarrow \log_2 R/G - c(A) = \log_2 R/(k(A)G)$
• Compute c by robust locally weighted regression of
M on A.
• We typically use a loess curve for this purpose.
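To make the intensity-dependent adjustment concrete, here is a small pure-Python sketch. It is a simplification, not the loess used in practice: it fits a plain tricube-weighted local linear regression of M on A and omits the robustness iterations of a true robust loess. The names `local_linear_fit` and `loess_normalize`, the `frac` span parameter and the example data are all assumptions made for illustration.

```python
def local_linear_fit(a, m, frac=0.4):
    """Tricube-weighted local linear regression of m on a, evaluated at each a[i]."""
    n = len(a)
    k = min(n, max(3, round(frac * n)))  # number of points in each neighbourhood
    fitted = []
    for i in range(n):
        dists = [abs(aj - a[i]) for aj in a]
        h = sorted(dists)[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        sw = sum(w)
        mean_a = sum(wj * aj for wj, aj in zip(w, a)) / sw
        mean_m = sum(wj * mj for wj, mj in zip(w, m)) / sw
        sxx = sum(wj * (aj - mean_a) ** 2 for wj, aj in zip(w, a))
        sxy = sum(wj * (aj - mean_a) * (mj - mean_m) for wj, aj, mj in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a[i] - mean_a))
    return fitted

def loess_normalize(a, m, frac=0.4):
    """Return M - c(A), where c(A) is the fitted curve."""
    curve = local_linear_fit(a, m, frac)
    return [mj - cj for mj, cj in zip(m, curve)]

# Made-up data: a smooth, intensity-dependent dye bias and no real signal.
A = [6.0 + 0.5 * i for i in range(20)]
M = [0.05 * (x - 11.0) ** 2 - 0.3 for x in A]
M_norm = loess_normalize(A, M)
print(max(abs(v) for v in M), max(abs(v) for v in M_norm))
```

The fit here is O(n^2) and is only meant to show the idea; for real arrays a library loess implementation would be the sensible choice.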

Page 31
31
MA-plot after loess
normalization
After global loess normalization

Page 32
32
Boxplot: print-tip effects remain
after global loess normalization

Page 33
33
Within print-tip group
normalization
• In addition to intensity-dependent variation in log ratios,
spatial bias can also be a significant source of systematic
error. Most normalization methods do not correct for spatial
effects produced by hybridization artifacts or print-tip or
plate effects during the construction of the microarrays.
• It is possible to correct for both print-tip and intensity-
dependent bias by performing LOWESS fits to the data
within print-tip groups, i.e.
$\log_2 R/G \rightarrow \log_2 R/G - c_i(A) = \log_2 R/(k_i(A)G)$,
• where $c_i(A)$ is the LOWESS fit to the MA-plot for the ith grid
only.

Page 34
34
Print-tip normalized data:
MA-plot

Page 35
35
Print-tip normalized data:
boxplot

Page 36
36
Smoothed histograms of M
values
Black: unnormalized; red: global median; green: global lowess; blue: print-tip lowess

Page 37
37
MSP titration series
(Microarray Sample Pool)
Control set to aid intensity-dependent normalization:
• Pool the whole library
• Different concentrations
• Spotted evenly spread across the slide

Page 38
38
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
Orange: Schadt-Wong rank invariant set
Red line: lowess smooth
MSP normalization compared to other methods

Page 39
39
Composite normalization
Before and after composite normalization
Legend: MSP lowess curve; Global lowess curve; Composite lowess curve (other colours: control spots)
$c_i(A) = \alpha_A\, g(A) + (1 - \alpha_A)\, f_i(A)$

Page 40
40
Paired-slides: dye-swap
Slide 1: $M = \log_2(R/G) - c$
Slide 2: $M' = \log_2(R'/G') - c'$
Combine by subtracting the normalized log-ratios:
$[(\log_2(R/G) - c) - (\log_2(R'/G') - c')]/2$
$\approx [\log_2(R/G) + \log_2(G'/R')]/2$
$= [\log_2(RG'/GR')]/2$
provided c = c’.
Assumption: the normalization functions are the
same for the two slides.
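The cancellation of the dye bias in a dye-swap pair can be checked numerically. The sketch below is illustrative only: `self_normalize` is a hypothetical name, and the intensities are constructed so that the true log ratio is 1 and both slides share the same dye bias c = 0.5, which is exactly the assumption stated above.

```python
import math

def self_normalize(r1, g1, r2, g2):
    """Combine a dye-swap pair spot by spot as (M - M') / 2, without estimating c."""
    combined = []
    for ra, ga, rb, gb in zip(r1, g1, r2, g2):
        m1 = math.log2(ra / ga)   # slide 1 log ratio, includes dye bias c
        m2 = math.log2(rb / gb)   # slide 2 (dyes swapped), includes dye bias c'
        combined.append(0.5 * (m1 - m2))
    return combined

bias = 2 ** 0.5                   # dye bias of 0.5 on the log2 scale, same on both slides
r1, g1 = [2000.0 * bias], [1000.0]   # slide 1: sample in red is twice the reference
r2, g2 = [1000.0 * bias], [2000.0]   # slide 2: dyes swapped
result = self_normalize(r1, g1, r2, g2)
print(result)
```

If the two slides had different biases (c not equal to c'), half of the difference would remain in the combined value, which is why the assumption needs checking.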

Page 41
41
Checking the assumption
MA plot for slides 1 and 2: it isn’t always like this.

Page 42
42
Result of self-normalization
(M - M’)/2 vs. (A + A’)/2

Page 43
43
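One way of taking scale
into account
Assumption: All slides have the same spread in M
True log ratio is $m_{ij}$, where i represents different slides and
j represents different spots.
Observed is $M_{ij}$, where
$M_{ij} = a_i m_{ij}$
Robust estimate of $a_i$ is
$\hat{a}_i = \dfrac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$
where $\mathrm{MAD}_i = \mathrm{median}_j \{\, |y_{ij} - \mathrm{median}_j(y_{ij})| \,\}$
(with $y_{ij}$ the observed log ratios on slide i)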
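The scale estimate above, each slide's MAD divided by the geometric mean of all the MADs, can be sketched as follows. This is a minimal illustration with made-up log ratios; `mad` and `scale_normalize` are hypothetical helper names, and the code assumes every slide has a non-zero MAD.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def scale_normalize(slides):
    """Divide each slide's log ratios by a_i = MAD_i / geometric mean of the MADs."""
    mads = [mad(s) for s in slides]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    factors = [x / geo_mean for x in mads]
    scaled = [[v / a for v in s] for s, a in zip(slides, factors)]
    return scaled, factors

# Two made-up slides: the second has twice the spread of the first.
slides = [[-1.0, -0.5, 0.0, 0.5, 1.0],
          [-2.0, -1.0, 0.0, 1.0, 2.0]]
scaled, factors = scale_normalize(slides)
print(factors)
print([mad(s) for s in scaled])
```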

Page 44
44
Scale normalization: between
slides
Boxplots of log ratios from 3 replicate self-self hybridizations.
Before normalization
After location normalization After scale normalization

Page 45
45
Before normalization
After location normalization After scale normalization
Scale normalization: swirl
dataset

Page 46
46
Other between slide
normalizations
• Quantile normalization applied separately to R and
G channels (after within chip normalization)

Page 47
47
Two Channel Summary
• Background Correction
▪ Taking too much off can greatly increase variability
• Normalization
▪ Reduces systematic (not random) effects
▪ Makes it possible to compare several arrays
▪ Use logratios (M vs A-plots)
▪ Lowess normalization (dye bias)
▪ MSP titration series – composite normalization
▪ Pin-group location normalization
▪ Pin-group scale normalization
▪ Between slide scale normalization

Page 48
48
Single-channel arrays

Page 49
49
Affymetrix GeneChip
• Commercial mass-produced high
density oligonucleotide array
technology developed by Affymetrix
http://www.affymetrix.com
• Single channel microarray
Image courtesy of Affymetrix.

Page 50
50
Probes and Probesets
Typically 11 probe(pairs) in a probeset
Latest GeneChips have as many as:
54,000 probesets
1.3 Million probes
Counts are for the HG-U133 Plus 2.0 array

Page 51
51
Two Probe Types
Reference Sequence: TAGGTCTGTATGACAGACACAAAGAAGATG
PM, the Perfect Match: CAGACATAGTGTCTGTGTTTCTTCT
MM, the Mismatch: CAGACATAGTGTGTGTGTTTCTTCT
(the MM probe differs from the PM probe at the 13th base)

Page 52
52
Image Analysis

Page 53
53
Chip dat file – checkered board – close up pixel selection

Page 54
54
Chip cel file – checkered board
Courtesy: F. Colin

Page 55
55
Boxplot raw intensities
Array 1 Array 2 Array 3 Array 4

Page 56
56
Density plots

Page 57
57
Pairwise MA plots
Array 1
Array 2
Array 3
Array 4
$M = \log_2(\mathrm{array}_i/\mathrm{array}_j)$
$A = \tfrac{1}{2}\log_2(\mathrm{array}_i \cdot \mathrm{array}_j)$

Page 58
58
Boxplots comparing M
Array 1 Array 2 Array 3 Array 4
M

Page 59
59
RMA Background Approach
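• Convolution Model
Observed PM = Signal S + Noise N, with $S \sim \mathrm{Exp}(\alpha)$ and $N \sim N(\mu, \sigma^2)$
$E[S \mid PM = pm] = a + b\,\dfrac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{pm - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{pm - a}{b}\right) - 1}$
where $a = pm - \mu - \sigma^2\alpha$ and $b = \sigma$; $\phi$ and $\Phi$ are the standard normal density and distribution function.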
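The conditional expectation E[S | PM = pm] from the convolution model can be evaluated with the standard library alone. The sketch below assumes the parameters mu, sigma and alpha are already known; how RMA estimates them from the data is not shown here, and the function names and example values are made up for illustration.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rma_background_adjust(pm, mu, sigma, alpha):
    """E[S | PM = pm] for S ~ Exp(alpha), N ~ N(mu, sigma^2), PM = S + N."""
    a = pm - mu - sigma ** 2 * alpha
    b = sigma
    numerator = norm_pdf(a / b) - norm_pdf((pm - a) / b)
    denominator = norm_cdf(a / b) + norm_cdf((pm - a) / b) - 1.0
    return a + b * numerator / denominator

# Hypothetical parameter values, chosen only to exercise the formula.
mu, sigma, alpha = 100.0, 10.0, 0.01
for pm in (90.0, 120.0, 1000.0):
    print(pm, rma_background_adjust(pm, mu, sigma, alpha))
```

Two properties are worth noticing: the adjusted value stays positive even when the observed PM is below the noise mean, and for bright probes it is close to simply pm - mu - sigma^2 * alpha.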

Page 60
60
GCRMA Background
Approach
• $PM = O_{pm} + N_{pm} + S$
• $MM = O_{mm} + N_{mm}$
• O – Optical noise
• N – non-specific binding
• S – Signal
• Assume O is distributed Normal
• $\log(N_{pm})$ and $\log(N_{mm})$ are assumed bivariate
normal with correlation 0.7
• log(S) assumed exponential(1)

Page 61
61
GCRMA continued
• An experiment was carried out where yeast RNA was
hybridized to human chips, so all binding is expected to be
non-specific.
• A model fitted to predict log intensity from sequence
composition gives base and position effects.
• These effects are used to predict an affinity for any given
sequence; call this A. The means of the distributions for the
$N_{pm}$ and $N_{mm}$ terms are functions of the affinities.

Page 62
62
Non-Biological variability is a
problem for single channel
arrays
5 scanners for 6 dilution groups
Log2 PM intensity

Page 63
63
Normalization
• In case of single channel microarray data this is
carried out only across arrays.
• Could generalize methods we applied to two color
arrays, but several problems:
▪ Typically several orders of magnitude more probes on
an Affymetrix array than spots on a two channel array
▪ With single channel arrays we are dealing with
absolute intensities rather than relative intensities.
• Need something fast

Page 64
64
Quantile Normalization
• Normalize so that the quantiles of each chip are
equal. Simple and fast algorithm. Goal is to give
same distribution to each chip.
Target
Distribution
Original
Distribution
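The quantile normalization idea can be sketched in a few lines: sort each chip, average across chips at each rank to get the target distribution, then give every probe the target value for its rank. This is a minimal illustration that ignores tie handling; `quantile_normalize` is a hypothetical name and the intensities are made up.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one per chip."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Target distribution: mean of the k-th smallest value across chips.
    target = [sum(chip[k] for chip in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        # Indices of the probes ordered from smallest to largest intensity.
        order = sorted(range(n), key=lambda idx: chip[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = target[rank]
        normalized.append(out)
    return normalized, target

# Three made-up chips with three probes each (no ties).
chips = [[5.0, 2.0, 3.0],
         [4.0, 1.0, 6.0],
         [3.0, 8.0, 7.0]]
normalized, target = quantile_normalize(chips)
print(target)
print(normalized)
```

After normalization every chip contains exactly the same set of values, only assigned to different probes according to each probe's rank on its own chip.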

Page 65
65
It works!!
Unnormalized
Scaling
Quantile
Normalization

Page 66
66
It Reduces Variability
Figure: two panels (Fold change, Expression Values), each comparing Unnormalized, Scaling and Quantile normalization.
Also no serious bias effects. For more see Bolstad et al (2003)

Page 67
67
Summarization
• Problem: Calculating gene expression values.
• How do we reduce the 11-20 probe intensities for each
probeset to a single gene expression value?
• Our Approach
▪ RMA – a robust multi-chip linear model fit on the log scale
• Some Other Approaches
▪ Single chip
  - AvDiff (Affymetrix) – no longer recommended for use due to many flaws
  - MAS 5.0 (Affymetrix) – uses a 1-step Tukey biweight to combine the probe intensities on the log scale
▪ Multiple chip
  - MBEI (Li-Wong dChip) – a multiplicative model on the natural scale

Page 68
68
General Probe Level Model
$y_{kij} = f(X) + \varepsilon_{kij}$
• where f(X) is a function of factor (and possibly
covariate) variables (our interest will be in linear
functions) and $y_{kij}$ is a pre-processed probe intensity (usually log
scale)
• Assume that $E[\varepsilon_{kij}] = 0$ and $\mathrm{Var}[\varepsilon_{kij}] = \sigma_k^2$

Page 69
69
Parallel Behavior Suggests
Multi-chip Model
Figure: PM probe intensity against array for two probesets, one differentially expressing and one non-differential.

Page 70
70
Probe Pattern Suggests
Including Probe-Effect
Figure: PM probe intensity against probe number for two probesets, one differentially expressing and one non-differential.

Page 71
71
Also Want Robustness
Figure: PM probe intensity panels for differentially expressing and non-differential probesets.

Page 72
72
The RMA model
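$y_{kij} = m_k + \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$
where
$y_{kij} = \log_2 N(B(PM_{kij}))$ is the background-corrected (B), normalized (N), log2-transformed PM intensity
$\alpha_{ki}$ is a probe-effect, i = 1,…,I
$\beta_{kj}$ is chip-effect ($m_k + \beta_{kj}$ is log2 gene
expression on array j), j = 1,…,J
k = 1,…,K indexes the probesets (K is the number of probesets)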

Page 73
73
Median Polish Algorithm
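$$
\begin{pmatrix}
y_{11} & \cdots & y_{1J} & 0 \\
\vdots & \ddots & \vdots & \vdots \\
y_{I1} & \cdots & y_{IJ} & 0 \\
0 & \cdots & 0 & 0
\end{pmatrix}
\;\xrightarrow{\text{Iterate: sweep rows, sweep columns}}\;
\begin{pmatrix}
\hat{\varepsilon}_{11} & \cdots & \hat{\varepsilon}_{1J} & \hat{\alpha}_1 \\
\vdots & \ddots & \vdots & \vdots \\
\hat{\varepsilon}_{I1} & \cdots & \hat{\varepsilon}_{IJ} & \hat{\alpha}_I \\
\hat{\beta}_1 & \cdots & \hat{\beta}_J & \hat{m}
\end{pmatrix}
$$
Imposes constraints:
$\mathrm{median}_i\,\alpha_i = \mathrm{median}_j\,\beta_j = 0$
$\mathrm{median}_i\,\varepsilon_{ij} = \mathrm{median}_j\,\varepsilon_{ij} = 0$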
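The row and column sweeps can be written out in a few lines for a single probeset. The following is a sketch of Tukey-style median polish, assuming a small I x J table of pre-processed log2 PM intensities with rows as probes and columns as arrays; the function name, iteration limit and example table are made up for illustration.

```python
import statistics

def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i][j] = m + alpha[i] + beta[j] + resid[i][j] by sweeping medians."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    m, alpha, beta = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        change = 0.0
        # Sweep rows: move each row median of the residuals into the probe effect.
        for i in range(n_rows):
            d = statistics.median(resid[i])
            alpha[i] += d
            resid[i] = [v - d for v in resid[i]]
            change += abs(d)
        d = statistics.median(beta)
        m += d
        beta = [b - d for b in beta]
        # Sweep columns: move each column median into the chip effect.
        for j in range(n_cols):
            d = statistics.median(resid[i][j] for i in range(n_rows))
            beta[j] += d
            for i in range(n_rows):
                resid[i][j] -= d
            change += abs(d)
        d = statistics.median(alpha)
        m += d
        alpha = [a - d for a in alpha]
        if change < tol:
            break
    return m, alpha, beta, resid

# Made-up probeset: 3 probes (rows) by 3 arrays (columns), exactly additive.
y = [[6.5, 7.0, 7.5],
     [7.5, 8.0, 8.5],
     [8.5, 9.0, 9.5]]
m, alpha, beta, resid = median_polish(y)
expression = [m + b for b in beta]  # log2 expression value on each array
print(m, alpha, beta)
print(expression)
```

The expression value reported for array j is m + beta[j], matching the interpretation of the chip effect in the RMA model.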

Page 74
74
RMA mostly does well in
practice
Detecting Differential Expression Not noisy in low intensities
RMA
MAS 5.0

Page 75
75
One Drawback
Figure: log2 Expression Value against Concentration (panels: RMA, MAS 5.0).
Linearity across concentration. GCRMA fixes this problem

Page 76
76
GCRMA improves linearity

Page 77
77
An Alternative Method for Fitting
a PLM
• Robust regression using M-estimation
• In this talk, we will use Huber’s influence function.
The software handles many more.
• Fitting algorithm is IRLS with weights dependent on
current residuals: $w_{kij} = \psi(r_{kij}) / r_{kij}$
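To show how the IRLS weights behave, the sketch below applies Huber weights psi(r)/r to the simplest possible model, a single location parameter, rather than the full probe level model with probe and array effects. The tuning constant 1.345, the MAD-based scale estimate, the function names and the data are assumptions made for illustration.

```python
import statistics

def huber_weight(r, k=1.345):
    """psi(r)/r for Huber's psi: 1 inside [-k, k], k/|r| outside."""
    return 1.0 if abs(r) <= k else k / abs(r)

def irls_location(y, k=1.345, n_iter=20):
    """Robust location estimate by iteratively reweighted least squares."""
    est = statistics.median(y)
    for _ in range(n_iter):
        resid = [v - est for v in y]
        # Robust scale from the median absolute residual (fall back to 1 if zero).
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            scale = 1.0
        weights = [huber_weight(r / scale, k) for r in resid]
        # Weighted least squares step: for a location model, a weighted mean.
        est = sum(w * v for w, v in zip(weights, y)) / sum(weights)
    return est

# Made-up log2 probe intensities with one outlying probe.
y = [8.0, 8.1, 7.9, 8.05, 7.95, 12.0]
estimate = irls_location(y)
print(estimate, sum(y) / len(y))
```

The outlying value receives a small weight, so the estimate stays close to the bulk of the data while the ordinary mean is pulled towards the outlier.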

Page 78
78
Variance Covariance
Estimates
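• Suppose model is $Y = X\beta + \varepsilon$
• Huber (1981) gives three forms for estimating variance
covariance matrix
$\dfrac{1}{\kappa}\,\dfrac{1}{n-p}\sum_i \psi^2(r_i)\; W^{-1}(X^T X)W^{-1}$
$\kappa^2\,\dfrac{\frac{1}{n-p}\sum_i \psi^2(r_i)}{\left(\frac{1}{n}\sum_i \psi'(r_i)\right)^2}\,(X^T X)^{-1}$
$\kappa\,\dfrac{\frac{1}{n-p}\sum_i \psi^2(r_i)}{\frac{1}{n}\sum_i \psi'(r_i)}\,W^{-1}$
We will use this form
where $W = X^T \Psi' X$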

Page 79
79
We Will Focus on the
Summarization PLM
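• Array effect model
$y_{kij} = \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$
with constraint $\sum_{i=1}^{I} \alpha_{ki} = 0$
where $y_{kij}$ is the pre-processed log PM intensity, $\alpha_{ki}$ the probe effect and $\beta_{kj}$ the array effect.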

Page 80
80
Quality Assessment
• Problem: Judge quality of chip data
• Question: Can we do this with the output of the
Probe Level Modeling procedures?
• Answer: Yes. Use weights, residuals, standard
errors and expression values.

Page 81
81
Chip pseudo-images

Page 82
82
An Image Gallery
http://PLMImageGallery.bmbolstad.com
“Tricolor”
“Crop Circles”
“Ring of Fire”

Page 83
83
NUSE Plots
Normalized
Unscaled
Standard
Errors

Page 84
84
RLE Plots
Relative
Log
Expression
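The slide gives only the name, so the following rests on an assumption about the definition: RLE values are usually computed by subtracting, for each probeset, the median expression across arrays from the expression on each array, and are then shown as one boxplot per array. The sketch below computes those values for the three-gene example table from the workflow slide; `rle` is a hypothetical helper name.

```python
import statistics

def rle(expr):
    """expr[g][j]: log2 expression of probeset g on array j.
    Returns one list of relative log expression values per array."""
    n_arrays = len(expr[0])
    per_array = [[] for _ in range(n_arrays)]
    for row in expr:
        med = statistics.median(row)  # median across arrays for this probeset
        for j, value in enumerate(row):
            per_array[j].append(value - med)
    return per_array

# Expression values for Gene 1-3 on Array 1-3 (from the workflow slide).
expr = [[10.05, 9.58, 9.76],
        [4.12, 4.16, 4.05],
        [6.05, 6.04, 6.08]]
rle_by_array = rle(expr)
for j, values in enumerate(rle_by_array, start=1):
    print("Array", j, [round(v, 2) for v in values])
```

An array whose RLE values are centred away from zero, or much more spread out than the others, would be a candidate for closer inspection.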

Page 85
85
Summary of One Channel
Arrays
• Background correction
▪ RMA model
▪ GCRMA model
• Normalization
▪ Quantile normalization
• Summarization
▪ Robust multi-chip probe level modeling
• Quality Assessment

Page 86
86
Acknowledgements
• Terry Speed
• Rafael Irizarry
• Julia Brettschneider
• Francois Colin
• Jean Yang
• Zhijin (Jean) Wu
• Gordon Smyth
• James Wettenhall
• Anyone else …

Page 87
87
References
• Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.
• Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods
for image analysis on cDNA microarray data. Journal of Computational and Graphical
Statistics, 11 (1), 108-136.
• Smyth, G. K., Thorne, N. P. and Wettenhall J. (2004) limma: Linear Models for Microarray
Data User's Guide. The Walter and Eliza Hall Institute of Medical Research.
• Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P., A comparison of
normalization methods for high density oligonucleotide array data based on variance and
bias, Bioinformatics, 19, 185 (2003).
• Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, and Speed TP. Summaries of
Affymetrix GeneChip Probe Level Data. Nucleic Acids Research, 31(4):e15, 2003.
• Bolstad BM, Collin F, Brettschneider J, Simpson K, Cope L, Irizarry RA, and Speed TP.
(2005) Quality Assessment of Affymetrix GeneChip Data in Bioinformatics and
Computational Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Huber
W, Irizarry R, and Dudoit S. (Eds.), Springer
• Wu, Z., Irizarry, R., Gentleman, R., Martinez Murillo, F., and Spencer, F. A Model Based
Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American
Statistical Association 99, 909-917 (2004)