Data Normalization and Standardization

Page 1


The benefits of pre-processing microarray data

Ben Bolstad

Statistics, University of California, Berkeley

bmb@bmbolstad.com

http://bmbolstad.com

Page 2

Outline

• Introduction

• Pre-processing methodologies as they relate to

▪ Two channel arrays

▪ Affymetrix GeneChips (a popular single channel

array)

Page 3

Workflow for a typical microarray experiment

Biological Question → Experimental Design → Microarray Experiment → Pre-processing (low-level analysis) → High-level analysis → Biological verification and interpretation

• Pre-processing (low-level analysis): Image Quantification, Background Adjustment, Normalization, Summarization, Quality Assessment

• High-level analysis: Estimation, Testing, Annotation, Clustering, Discrimination, …

Images → Expression Values

         Array 1   Array 2   Array 3
Gene 1   10.05     9.58      9.76
Gene 2   4.12      4.16      4.05
Gene 3   6.05      6.04      6.08

Page 4

Introduction to preprocessing

• Pre-processing typically constitutes the initial (and

possibly most important) step in the analysis of

data from any microarray experiment

• Often ignored or treated like a black box (but it

shouldn’t be)

• Consists of:

▪ Data exploration

▪ Background correction, normalization,

summarization

▪ Quality Assessment

• These are interlinked steps

Page 5

Background Correction/Signal

Adjustment

• A method which does some or all of the following:

▪ Corrects for background noise, processing effects on

the array

▪ Adjusts for cross hybridization (non-specific binding)

▪ Adjusts estimated expression values to fall across an

appropriate range

Page 6

Normalization

“Non-biological factors can contribute to the variability of data ...

In order to reliably compare data from multiple probe arrays,

differences of non-biological origin must be minimized.”¹

• Normalization is the process of reducing unwanted variation

either within or between arrays. It may use information from

multiple chips.

• Typical assumptions of most major normalization methods

are (one or both of the following):

▪ Only a minority of genes are expected to be differentially

expressed between conditions

▪ Any differential expression is as likely to be up-regulation as

down-regulation (i.e. about as many genes going up in

expression as are going down between conditions)

¹ GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support

Page 7

A brief word on the term

“Normalization”

• Many use the term “normalization” to refer to

everything being discussed in this session. In other

words they treat “normalization” and “pre-processing”

as being synonymous with each other.

• I view normalization as just one of the steps in the

process (although a very important one).

Page 8

Summarization

• Reducing multiple measurements on the same

gene down to a single measurement by combining

in some manner.

• Most relevant to Affymetrix Arrays as we will see a

little later ….

Page 9

Quality Assessment

• Need to be able to differentiate between good and

bad data.

• Bad data could be caused by poor hybridization,

artifacts on the arrays, inconsistent sample

handling, …..

• An admirable goal would be to reduce systematic

differences with data analysis techniques.

• Sometimes there is no option but to completely

discard an array from further analysis. How to

decide …..

Page 10

Two-channel arrays

Page 11

Image analysis for two color

arrays

• The raw data from a cDNA microarray experiment

consist of pairs of image files, 16-bit TIFFs, one for

each of the dyes.

• Image analysis is required to extract measures of

the red and green fluorescence intensities for each

spot on the array.

Page 12

Image analysis

1. Addressing. Estimate location of

spot centers.

2. Segmentation. Classify pixels as

foreground (signal) or background.

3. Information extraction. For

each spot on the array and each

dye

• signal intensities;

• background intensities;

• quality measures.

R and G for each spot on the array.

Page 13

Red/Green overlay images

Good: low background, lots of differential expression (d.e.). Bad: high background, ghost spots, little d.e.

Co-registration and overlay offer a quick visualization, revealing information on colour balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches.

Page 14

Histograms

Signal/Noise = $\log_2$(spot intensity / background intensity)

Page 15

Slide 3 of the swirl data: used in all that follows.

Page 16

Tools for exploring the data

R vs G

Important: Always log, always rotate

Bad

Page 17

Tools for exploring the data

$\log_2 R$ vs $\log_2 G$

Important: Always log, always rotate

Better

Page 18

Tools for exploring the data

$M = \log_2 (R/G)$ vs $A = \log_2 \sqrt{RG}$

Important: Always log, always rotate

Best

Page 19

MA-plot
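The transformation behind the MA-plot can be sketched in a few lines. This is a minimal illustration, assuming base-2 logarithms and background-corrected, strictly positive intensities; the function name `ma_values` and the example intensities are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log ratio; A = log2(sqrt(R*G)) is the
    average log intensity. Assumes every intensity is > 0.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for two spots.
m, a = ma_values([1024.0, 256.0], [256.0, 256.0])
print(m)  # [2.0, 0.0]
print(a)  # [9.0, 8.0]
```

Plotting M against A is the "log and rotate" step: it is the log R vs log G scatterplot rotated by 45 degrees, so that a spot with no differential expression sits on the horizontal line M = 0.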

Page 20

Spatial plots: background

Page 21

Spatial plots: log ratios (M)

No reason to constrain

yourself to red/green

when visualizing

Page 22

Boxplots

Page 23

Background correction

• Normally this is just a matter of subtracting the background value in the Red channel from the foreground Red intensity, and the same for the Green channel intensities, for each spot,

i.e. $R' = R - R_b$, $G' = G - G_b$,

where $R$, $R_b$, $G$, $G_b$ are all from the output of the image analysis stage (some use models based on these to derive corrections).

• From here on in we will assume that background correction

has taken place.

Page 24

Background Correction

• Note that the image analysis program you use can

have quite an impact at this stage by drastically

increasing variability, particularly in low intensities.

Note: this is not swirl.3.

GenePix

Spot

Same array, different image analysis and background correction

Page 25

Normalization for two color

arrays

• Why?

▪ To correct for systematic differences

between samples on the same slide, or

between slides, which do not represent

true biological variation between samples.

• How do we know it is necessary?

▪ By examining self-self hybridizations,

where no true differential expression is

occurring.

▪ We find dye biases which vary with overall

spot intensity, location on the array, plate

origin, pins, scanning parameters,….

Page 26

Levels of Normalization

for two color arrays

• Within-slides

▪ Which genes to use?

▪ Location normalization

▪ Scale normalization

• Paired-slides (dye-swap)

▪ Self-normalization

• Between-slides

Page 27

False color overlay

Boxplots within Grid plots

MA-plots

Self-self hybridizations

Page 28

Scaling Normalization

$\log_2 (R/G) \rightarrow \log_2 (R/G) - c = \log_2 \left( R/(kG) \right)$

Standard practice (in most software)

$c$ is a constant such as the mean or median log ratio, so that $k = 2^c$.
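As a rough sketch of the scaling step, the snippet below centres the log ratios of one slide by their median. It assumes base-2 log ratios; the function name `median_center` and the example values are hypothetical.

```python
from statistics import median


def median_center(m_values):
    """Subtract the median log ratio c from every M value.

    Returns the centred values and the constant c that was removed.
    On the intensity scale this corresponds to rescaling G by k = 2**c.
    """
    c = median(m_values)
    return [m - c for m in m_values], c


# Hypothetical log ratios with an overall dye bias.
m_raw = [0.5, 0.7, 0.9, 0.6, 1.3]
m_norm, c = median_center(m_raw)
k = 2 ** c  # equivalent multiplicative factor applied to G
```

The median is used here rather than the mean so that a handful of strongly differentially expressed spots does not move the constant.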

Page 29

MA-plot after scaling

Before Scaling

After Scaling

Page 30

Intensity dependent

adjustment

$\log_2 (R/G) \rightarrow \log_2 (R/G) - c(A) = \log_2 \left( R/(k(A)\,G) \right)$

• Compute $c(A)$ by robust locally weighted regression of $M$ on $A$.

• We typically use a loess curve for this purpose.
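To make the intensity-dependent adjustment concrete, here is a simplified stand-in for a loess fit: a tricube-weighted local linear regression of M on A, evaluated at each spot. It is only a sketch under stated assumptions. It omits the robustness iterations of a real loess/lowess routine, costs O(n²), and the names `local_linear_fit`, `loess_normalize` and the `span` default are hypothetical.

```python
def local_linear_fit(a_vals, m_vals, span=0.4):
    """Tricube-weighted local linear fit of M on A, evaluated at every A."""
    n = len(a_vals)
    k = max(2, min(n, round(span * n)))  # points in each neighbourhood
    fitted = []
    for a0 in a_vals:
        dist = [abs(a - a0) for a in a_vals]
        h = sorted(dist)[k - 1] or 1e-12  # bandwidth: k-th smallest distance
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_vals)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_vals)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_vals))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_vals, m_vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted


def loess_normalize(m_vals, a_vals, span=0.4):
    """Return M - c(A), where c(A) is the local fit of M on A."""
    c = local_linear_fit(a_vals, m_vals, span)
    return [m - ci for m, ci in zip(m_vals, c)]


# Hypothetical slide whose log ratios drift linearly with intensity.
a = [6 + 0.5 * i for i in range(20)]
m = [0.3 * (x - 10) for x in a]
m_norm = loess_normalize(m, a)  # drift removed, values close to 0
```

For real data a library implementation with robustness weights is the better choice; the point of the sketch is only that c(A) is a smooth curve through the MA-plot that gets subtracted from M.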

Page 31

MA-plot after loess

normalization

After global loess normalization

Page 32

Boxplot: print-tip effects remain

after global loess normalization

Page 33

Within print-tip group

normalization

• In addition to intensity-dependent variation in log ratios,

spatial bias can also be a significant source of systematic

error. Most normalization methods do not correct for spatial

effects produced by hybridization artifacts or print-tip or

plate effects during the construction of the microarrays.

• It is possible to correct for both print-tip and intensity-

dependent bias by performing LOWESS fits to the data

within print-tip groups, i.e.

$\log_2 (R/G) \rightarrow \log_2 (R/G) - c_i(A) = \log_2 \left( R/(k_i(A)\,G) \right)$,

where $c_i(A)$ is the LOWESS fit to the MA-plot for the $i$th grid only.
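The within print-tip idea amounts to running the same curve fit separately on each grid. The sketch below shows only that grouping logic. The names `normalize_within_groups` and `constant_median_fit` are hypothetical, and the median-based fit is a deliberately crude placeholder for a LOWESS routine, which would be passed in as `fit_curve` in practice.

```python
from collections import defaultdict
from statistics import median


def normalize_within_groups(m_vals, a_vals, groups, fit_curve):
    """Subtract a per-group fitted curve c_i(A) from M.

    groups[p] is the print-tip (grid) label of spot p; fit_curve(a, m)
    must return one fitted value per spot of the group it is given.
    """
    members = defaultdict(list)
    for pos, label in enumerate(groups):
        members[label].append(pos)
    normalized = [0.0] * len(m_vals)
    for positions in members.values():
        fitted = fit_curve([a_vals[p] for p in positions],
                           [m_vals[p] for p in positions])
        for p, c in zip(positions, fitted):
            normalized[p] = m_vals[p] - c
    return normalized


def constant_median_fit(a_vals, m_vals):
    """Placeholder fit: a flat curve at the group's median log ratio."""
    c = median(m_vals)
    return [c] * len(m_vals)


# Hypothetical data: two print-tip groups with different offsets.
m = [1.0, 1.2, 0.8, -0.5, -0.3, -0.7]
a = [8.0, 9.0, 10.0, 8.0, 9.0, 10.0]
tips = [1, 1, 1, 2, 2, 2]
m_norm = normalize_within_groups(m, a, tips, constant_median_fit)
```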

Page 34

Print-tip normalized data:

MA-plot

Page 35

Print-tip normalized data:

boxplot

Page 36

Smoothed histograms of M

values

Black: unnormalized; red: global median; green: global lowess; blue: print-tip lowess

Page 37

MSP titration series

(Microarray Sample Pool)

Control set to aid intensity-dependent normalization

Different concentrations

Spotted evenly spread across the slide

Pool the

whole library

Page 38

Yellow: GAPDH, tubulin

Light blue: MSP pool / titration

Orange: Schadt-Wong rank invariant set

Red line: lowess smooth

MSP normalization compared to other methods

Page 39

Composite normalization

Before and after composite

normalization

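Curves shown: MSP lowess curve, global lowess curve, composite lowess curve (other colours: control spots)

$c(A) = \alpha_A \, g(A) + (1 - \alpha_A) \, f(A)$

where, following Yang et al. (2002), $g(A)$ is the fit to the MSP titration series, $f(A)$ is the lowess fit to all spots, and $\alpha_A$ is a weight between 0 and 1 that depends on $A$.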

Page 40

Paired-slides: dye-swap

• Slide 1, $M = \log_2 (R/G) - c$

• Slide 2, $M' = \log_2 (R'/G') - c'$

Combine by subtracting the normalized log-ratios:

$\tfrac{1}{2} \left[ (\log_2 (R/G) - c) - (\log_2 (R'/G') - c') \right] \approx \tfrac{1}{2} \left[ \log_2 (R/G) + \log_2 (G'/R') \right] = \tfrac{1}{2} \log_2 \left( \dfrac{R G'}{G R'} \right)$

provided $c = c'$.

Assumption: the normalization functions are the

same for the two slides.
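A small numerical sketch shows why the dye-swap pair normalizes itself when the two slides share the same bias. The function name `self_normalize` and the numbers are hypothetical; the key assumption, as stated above, is that the bias is the same on both slides.

```python
def self_normalize(m_slide1, m_slide2):
    """Combine a dye-swap pair spot by spot as (M - M') / 2."""
    return [(m1 - m2) / 2 for m1, m2 in zip(m_slide1, m_slide2)]


# Hypothetical true log ratios and a common dye bias c = c' = 0.3.
true_m = [1.0, -0.5, 0.0]
bias = 0.3
slide1 = [t + bias for t in true_m]    # un-normalized log2(R/G)
slide2 = [-t + bias for t in true_m]   # dyes swapped: sign of signal flips
combined = self_normalize(slide1, slide2)  # bias cancels, true_m is recovered
```

If the biases differ between the slides, half of their difference remains in the combined value, which is why the assumption needs checking.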

Page 41

Checking the assumption

MA plot for slides 1 and 2: it isn’t always like this.

Page 42

Result of self-normalization

(M - M’)/2 vs. (A + A’)/2

Page 43

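One way of taking scale into account

Assumption: All slides have the same spread in M.

True log ratio is $m_{ij}$, where $i$ represents different slides and $j$ represents different spots.

Observed is $M_{ij}$, where $M_{ij} = a_i \, m_{ij}$.

Robust estimate of $a_i$:

$\hat{a}_i = \dfrac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$, where $\mathrm{MAD}_i = \mathrm{median}_j \{ \, | y_{ij} - \mathrm{median}_j (y_{ij}) | \, \}$

Here $I$ is the number of slides and $y_{ij}$ denotes the log ratio for spot $j$ on slide $i$.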
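The MAD-based scale estimate can be sketched as follows: each slide's MAD is divided by the geometric mean of all the MADs, and the slide's log ratios are divided by that factor. The function names and example values are hypothetical, and the sketch assumes every slide has a non-zero MAD.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median(abs(v - med) for v in values)


def scale_factors(slides):
    """Estimate a_i = MAD_i / (geometric mean of all MAD_i)."""
    mads = [mad(s) for s in slides]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    return [x / geo_mean for x in mads]


def scale_normalize(slides):
    """Divide each slide's log ratios by its estimated scale factor."""
    return [[v / a for v in s] for s, a in zip(slides, scale_factors(slides))]


# Hypothetical log ratios: slide 2 has twice the spread of slide 1.
slides = [[-1.0, 0.0, 1.0, 2.0, -2.0],
          [-2.0, 0.0, 2.0, 4.0, -4.0]]
normalized = scale_normalize(slides)  # both slides now have the same MAD
```

Because the factors are relative to their geometric mean, their product is 1, so the overall scale of the experiment is left unchanged.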

Page 44

Scale normalization: between

slides

Boxplots of log ratios from 3 replicate self-self hybridizations.

Before normalization

After location normalization After scale normalization

Page 45

Before normalization

After location normalization After scale normalization

Scale normalization: swirl

dataset

Page 46

Other between slide

normalizations

• Quantile normalization applied separately to R and

G channels (after within chip normalization)

Page 47

Two Channel Summary

• Background Correction

▪ Taking too much off can greatly increase variability

• Normalization

▪ Reduces systematic (not random) effects

▪ Makes it possible to compare several arrays

▪ Use logratios (M vs A-plots)

▪ Lowess normalization (dye bias)

▪ MSP titration series – composite normalization

▪ Pin-group location normalization

▪ Pin-group scale normalization

▪ Between slide scale normalization

Page 48

Single-channel arrays

Page 49

Affymetrix GeneChip

• Commercial mass produced high

density oligonucleotide array

technology developed by Affymetrix

http://www.affymetrix.com

• Single channel microarray

Image courtesy of Affymetrix.

Page 50

Probes and Probesets

Typically 11 probe(pairs) in a probeset

Latest GeneChips have as many as:

54,000 probesets

1.3 Million probes

Counts for HG-U133 Plus 2.0 arrays

Page 51

Two Probe Types

Reference Sequence: TAGGTCTGTATGACAGACACAAAGAAGATG

PM (the Perfect Match): CAGACATAGTGTCTGTGTTTCTTCT

MM (the Mismatch): CAGACATAGTGTGTGTGTTTCTTCT
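The relationship between the two probe types can be checked directly on the sequences above: the MM probe is the 25-base PM probe with its middle (13th) base replaced by the complementary base. The helper name `mismatch_probe` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm, position=13):
    """Return the MM sequence: PM with one base (1-based position) complemented."""
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


pm = "CAGACATAGTGTCTGTGTTTCTTCT"
mm = mismatch_probe(pm)
print(mm)  # CAGACATAGTGTGTGTGTTTCTTCT
```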

Page 52

Image Analysis

Page 53

Chip dat file – checkered board – close up pixel selection

Page 54

Chip cel file – checkered board

Courtesy: F. Collin

Page 55

Boxplot raw intensities

Array 1 Array 2 Array 3 Array 4

Page 56

Density plots

Page 57

Pairwise MA plots

Array 1

Array 2

Array 3

Array 4

$M = \log_2 (\mathrm{array}_i / \mathrm{array}_j)$

$A = \tfrac{1}{2} \log_2 (\mathrm{array}_i \cdot \mathrm{array}_j)$

Page 58

Boxplots comparing M

Array 1 Array 2 Array 3 Array 4

Page 59

RMA Background Approach

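• Convolution Model: Observed = Signal + Noise

$PM = S + N$, with $S \sim \mathrm{Exp}(\alpha)$ and $N \sim N(\mu, \sigma^2)$

$E(S \mid PM = pm) = a + b \, \dfrac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{pm - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{pm - a}{b}\right) - 1}$

where $a = pm - \mu - \sigma^2 \alpha$, $b = \sigma$, and $\phi$ and $\Phi$ are the standard normal density and distribution function.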
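As an illustration of the convolution-model adjustment, the sketch below evaluates the conditional expectation E(S | PM = pm) = a + b[φ(a/b) − φ((pm − a)/b)] / [Φ(a/b) + Φ((pm − a)/b) − 1], with a = pm − μ − σ²α and b = σ. The parameters μ, σ and α are simply taken as given here; how they are estimated from the data is not covered, and the function names and example values are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_adjust(pm, mu, sigma, alpha):
    """Expected signal given the observed PM intensity.

    Model: PM = S + N, S ~ Exp(alpha), N ~ Normal(mu, sigma^2).
    No guard is included for extreme inputs where the denominator
    underflows to zero.
    """
    a = pm - mu - sigma ** 2 * alpha
    b = sigma
    numerator = norm_pdf(a / b) - norm_pdf((pm - a) / b)
    denominator = norm_cdf(a / b) + norm_cdf((pm - a) / b) - 1.0
    return a + b * numerator / denominator


# Hypothetical parameter values, for illustration only.
mu, sigma, alpha = 100.0, 20.0, 0.01
low = rma_background_adjust(90.0, mu, sigma, alpha)    # PM below the noise mean
high = rma_background_adjust(500.0, mu, sigma, alpha)  # PM well above the noise
```

Unlike plain subtraction, the adjusted value stays strictly positive even when the observed intensity falls below the estimated background mean, so the log transform that follows is always defined.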

Page 60

GCRMA Background

Approach

• $PM = O_{PM} + N_{PM} + S$

• $MM = O_{MM} + N_{MM}$

• O – Optical noise

• N – non-specific binding

• S – Signal

• Assume O is distributed Normal

• $\log(N_{PM})$ and $\log(N_{MM})$ are assumed bi-variate normal with correlation 0.7

• log(S) assumed exponential(1)

Page 61

GCRMA continued

• An experiment was carried out where yeast RNA was hybridized to human chips, so all binding is expected to be non-specific.

• Fitting a model to predict log intensity from sequence composition gives base and position effects.

• These effects are used to predict an affinity for any given sequence; call this A. The means of the distributions for the $N_{PM}$, $N_{MM}$ terms are functions of the affinities.

Page 62

Non-Biological variability is a

problem for single channel

arrays

5 scanners for 6 dilution groups

Log2 PM intensity

Page 63

Normalization

• In case of single channel microarray data this is

carried out only across arrays.

• Could generalize methods we applied to two color

arrays, but several problems:

▪ Typically several orders of magnitude more probes on

an Affymetrix array than spots on a two channel array

▪ With single channel arrays we are dealing with

absolute intensities rather than relative intensities.

• Need something fast

Page 64

Quantile Normalization

• Normalize so that the quantiles of each chip are

equal. Simple and fast algorithm. Goal is to give

same distribution to each chip.

Target

Distribution

Original

Distribution
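The quantile normalization algorithm is short enough to sketch in full: sort each chip, average across chips at each rank to get the target distribution, then give every probe the target value for its rank. The function name and the toy data are hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equal-length lists of intensities, one list per chip.
    Returns new lists in the original probe order.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Target distribution: mean across chips at each rank.
    target = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = target[rank]
        result.append(normalized)
    return result


# Hypothetical intensities: three chips, four probes each.
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 6.0, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

The cost is one sort per chip, which is why the method scales to arrays with very many probes.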

Page 65

It works!!

Unnormalized

Scaling

Quantile

Normalization

Page 66

It Reduces Variability

[Figure: variability of expression values and of fold change for Unnormalized, Scaling, and Quantile normalized data]

Also no serious bias effects. For more see Bolstad et al. (2003).

Page 67

Summarization

• Problem: Calculating gene expression values.

• How do we reduce the 11-20 probe intensities for each probeset to a single gene expression value?

• Our Approach

▪ RMA – a robust multi-chip linear model fit on the log scale

• Some Other Approaches

▪ Single chip: AvDiff (Affymetrix) – no longer recommended for use due to many flaws

▪ Single chip: MAS 5.0 (Affymetrix) – uses a 1-step Tukey biweight to combine the probe intensities on the log scale

▪ Multiple chip: MBEI (Li-Wong dChip) – a multiplicative model on the natural scale

Page 68

General Probe Level Model

$y_{kij} = f(X) + \varepsilon_{kij}$

• where $f(X)$ is a function of factor (and possibly covariate) variables (our interest will be in linear functions)

• $y_{kij}$ is a pre-processed probe intensity (usually log scale)

• Assume that $E[\varepsilon_{kij}] = 0$ and $\mathrm{Var}[\varepsilon_{kij}] = \sigma_k^2$

Page 69

Parallel Behavior Suggests

Multi-chip Model

[Figure: PM probe intensity by array, for a differentially expressing and a non-differential probeset]

Page 70

Probe Pattern Suggests

Including Probe-Effect

[Figure: PM probe intensity by probe number, for a differentially expressing and a non-differential probeset]

Page 71

Also Want Robustness

[Figure: PM probe intensity for differentially expressing and non-differential probesets]

Page 72

The RMA model

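$\log_2 N(B(PM_{kij})) = m_k + \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$

where

$\alpha_{ki}$ is a probe-effect, $i = 1, \ldots, I$

$\beta_{kj}$ is a chip-effect ($m_k + \beta_{kj}$ is the log2 gene expression on array $j$), $j = 1, \ldots, J$

$k = 1, \ldots, K$ indexes the probesets

$B(\cdot)$ denotes background correction and $N(\cdot)$ normalization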

Page 73

Median Polish Algorithm

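[Diagram: the probes × arrays matrix of values for one probeset, augmented with a column of row effects and a row of column effects, all initialized to 0]

Iterate:

• Sweep Rows

• Sweep Columns

Imposes constraints: $\mathrm{median}_i \, \hat{\varepsilon}_{ij} = \mathrm{median}_j \, \hat{\varepsilon}_{ij} = 0$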
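The row and column sweeps can be sketched for a single probeset as below, with probes as rows and arrays as columns. This is a minimal version of Tukey's median polish, not the code of any particular package; the function name, iteration limit, tolerance, and toy data are hypothetical.

```python
from statistics import median


def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping medians.

    Returns (overall, row_eff, col_eff, residuals).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep rows: move each row median into the row effect.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep columns: move each column median into the column effect.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Hypothetical probeset: 3 probes (rows) by 3 arrays (columns), log2 scale.
probe_effects = [-1.0, 0.0, 2.0]
array_effects = [-0.5, 0.0, 0.5]
y = [[5.0 + p + a for a in array_effects] for p in probe_effects]
overall, rows, cols, resid = median_polish(y)
expression = [overall + c for c in cols]  # one summary value per array
```

In terms of the RMA model, the per-array summary `overall + col_eff[j]` plays the role of the log2 expression value on array j, and the row effects play the role of the probe effects.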

Page 74

RMA mostly does well in

practice

Detecting Differential Expression Not noisy in low intensities

RMA

MAS 5.0

Page 75

One Drawback

RMA

MAS 5.0

Linearity across concentration. GCRMA fixes this problem

Concentration

log2 Expression Value

Page 76

GCRMA improves linearity

Page 77

An Alternative Method for Fitting

a PLM

• Robust regression using M-estimation

• In this talk, we will use Huber’s influence function.

The software handles many more.

• Fitting algorithm is IRLS with weights dependent on

current residuals $r_{kij}$
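To show the IRLS idea in its simplest form, the sketch below computes a Huber M-estimate of a single location parameter rather than fitting the full probe-level model. The Huber weight is 1 for small standardized residuals and k/|u| beyond the tuning constant k. The value k = 1.345, the MAD-based scale, and all names here are assumptions made for illustration.

```python
from statistics import median


def huber_weight(u, k=1.345):
    """Huber weight for a standardized residual u."""
    return 1.0 if abs(u) <= k else k / abs(u)


def huber_location(values, k=1.345, max_iter=50, tol=1e-10):
    """Robust location estimate by iteratively reweighted least squares."""
    estimate = median(values)
    for _ in range(max_iter):
        residuals = [v - estimate for v in values]
        scale = median(abs(r) for r in residuals) / 0.6745  # MAD-based scale
        if scale == 0:
            return estimate
        weights = [huber_weight(r / scale, k) for r in residuals]
        new_estimate = (sum(w * v for w, v in zip(weights, values))
                        / sum(weights))
        if abs(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate


# Hypothetical log intensities with one outlying probe.
values = [10.0, 10.2, 9.8, 10.1, 9.9, 25.0]
robust = huber_location(values)       # stays near 10
plain = sum(values) / len(values)     # pulled up to 12.5 by the outlier
```

The weights at convergence are a useful by-product in their own right: a probe that ends up with a small weight is one the model regards as an outlier.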

Page 78

Variance Covariance

Estimates

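• Suppose model is $Y = X\beta + \varepsilon$

• Huber (1981) gives three forms for estimating variance covariance matrix

$\kappa^2 \, \dfrac{\frac{1}{n-p} \sum_i \psi(r_i)^2}{\left[ \frac{1}{n} \sum_i \psi'(r_i) \right]^2} \, (X^T X)^{-1}$

$\kappa \, \dfrac{\frac{1}{n-p} \sum_i \psi(r_i)^2}{\frac{1}{n} \sum_i \psi'(r_i)} \, W^{-1}$

$\dfrac{1}{\kappa} \, \dfrac{1}{n-p} \sum_i \psi(r_i)^2 \; W^{-1} (X^T X) W^{-1}$

where $W = X^T \Psi' X$, $\Psi'$ is the diagonal matrix of the $\psi'(r_i)$, and $\kappa$ is Huber's correction factor.

The slide marks one of these as the form used.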

Page 79

We Will Focus on the

Summarization PLM

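• Array effect model

$y_{kij} = \alpha_{ki} + \beta_{kj} + \varepsilon_{kij}$

with constraint $\sum_{i=1}^{I} \alpha_{ki} = 0$

where $y_{kij}$ is the pre-processed log PM intensity, $\alpha_{ki}$ is the probe effect, and $\beta_{kj}$ is the array effect.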

Page 80

Quality Assessment

• Problem: Judge quality of chip data

• Question: Can we do this with the output of the

Probe Level Modeling procedures?

• Answer: Yes. Use weights, residuals, standard

errors and expression values.

Page 81

Chip pseudo-images

Page 82

An Image Gallery

http://PLMImageGallery.bmbolstad.com

“Tricolor”

“Crop Circles”

“Ring of Fire”

Page 83

NUSE Plots

Normalized Unscaled Standard Errors

Page 84

RLE Plots

Relative Log Expression
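Both quality measures can be sketched from probe-level model output, assuming the usual definitions: RLE subtracts each probeset's median expression across arrays, and NUSE divides each probeset's standard error by its median across arrays. The expression values below are the three genes from the workflow table earlier in the talk; the standard errors and the function names are hypothetical.

```python
from statistics import median


def rle(expression):
    """Relative log expression: value minus the probeset's median over arrays.

    expression[k][j] is the log2 expression of probeset k on array j.
    """
    result = []
    for row in expression:
        med = median(row)
        result.append([v - med for v in row])
    return result


def nuse(standard_errors):
    """Normalized unscaled standard errors: SE over the probeset's median SE."""
    result = []
    for row in standard_errors:
        med = median(row)
        result.append([v / med for v in row])
    return result


expression = [[10.05, 9.58, 9.76],
              [4.12, 4.16, 4.05],
              [6.05, 6.04, 6.08]]
standard_errors = [[0.10, 0.12, 0.30],   # hypothetical values
                   [0.08, 0.08, 0.20],
                   [0.05, 0.06, 0.15]]
rle_values = rle(expression)
nuse_values = nuse(standard_errors)
```

Reading down one column gives the values that go into that array's boxplot. An array whose RLE box is far from 0 or unusually wide, or whose NUSE box sits well above 1, is a candidate for closer inspection.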

Page 85

Summary of One Channel

Arrays

• Background correction

▪ RMA model

▪ GCRMA model

• Normalization

▪ Quantile normalization

• Summarization

▪ Robust multi-chip probe level modeling

• Quality Assessment

Page 86

Acknowledgements

• Terry Speed

• Rafael Irizarry

• Julia Brettschneider

• Francois Collin

• Jean Yang

• Zhijin (Jean) Wu

• Gordon Smyth

• James Wettenhall

• Anyone else …

Page 87

References

• Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.

• Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, 11 (1), 108-136.

• Smyth, G. K., Thorne, N. P. and Wettenhall J. (2004) limma: Linear Models for Microarray Data User's Guide. The Walter and Eliza Hall Institute of Medical Research.

• Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19, 185 (2003).

• Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, and Speed TP. Summaries of Affymetrix GeneChip Probe Level Data. Nucleic Acids Research, 31(4):e15, 2003.

• Bolstad BM, Collin F, Brettschneider J, Simpson K, Cope L, Irizarry RA, and Speed TP. (2005) Quality Assessment of Affymetrix GeneChip Data. In: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Huber W, Irizarry R, and Dudoit S. (Eds.), Springer.

• Wu, Z., Irizarry, R., Gentleman, R., Martinez Murillo, F., Spencer, F. A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 99, 909-917 (2004).