Linear regression. Statistical modelling. Gilles Guillot. September 17, 2013

Transcript

1 Linear regression
Statistical modelling. Gilles Guillot, September 17, 2013.

2 Example
Concentration of DDT (a toxic chemical) in 15 pike fish as a function of fish age. See file pike_data.txt in the data folder.
Gilles Guillot (gigu@dtu.dk), Linear regression, September 17, 2013

3 Example: Various questions
- Does concentration increase with age? How much? (Parameter estimation, aka inference)
- How much confidence should we place in the answer? (Testing significance, model checking)
- What is the average concentration of a 3.5 or 8 year old fish? (Prediction)

4 Describing a linear pattern: Empirical covariance and correlation I
Definition: Empirical covariance
$$\mathrm{Cov}(x, y) = \frac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$
It tends to be large and positive when $x_i$ and $y_i$ are simultaneously above (or simultaneously below) their respective means. Hence it quantifies how much x and y vary together.
The covariance is scale-dependent: $\mathrm{Cov}(ax, y) = a\,\mathrm{Cov}(x, y)$.
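
As a rough illustration of the definition above, the sketch below computes the empirical covariance with the 1/n convention and checks the scale-dependence property. The data and the helper name `cov_n` are made up for this example (they are not the pike data); note that R's built-in `cov()` divides by n - 1 rather than n.

```r
# Hypothetical toy data, NOT the pike data set
x <- c(2, 3, 4, 5, 6)
y <- c(0.2, 0.3, 0.5, 0.9, 1.6)
n <- length(x)

# Empirical covariance with the 1/n convention of the definition above
cov_n <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

cov_n(x, y)
# cov() divides by n - 1, so it is rescaled here for comparison
cov(x, y) * (n - 1) / n

# Scale dependence: Cov(a x, y) = a Cov(x, y)
a <- 10
cov_n(a * x, y)
a * cov_n(x, y)
```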

5 Describing a linear pattern: Empirical covariance and correlation II
Definition: Empirical correlation (aka Pearson's correlation coefficient)
$$\mathrm{Cor}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
The correlation is the covariance rescaled by the standard deviations.
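
The formula can be checked against R's `cor()` with a few lines. This is only a sketch on made-up data; `cor_manual` is a hypothetical helper name. The last line suggests that, unlike the covariance, the correlation is unchanged when x is shifted and rescaled by a positive factor.

```r
# Hypothetical toy data, NOT the pike data set
x <- c(2, 3, 4, 5, 6)
y <- c(0.2, 0.3, 0.5, 0.9, 1.6)

# Pearson correlation written out from the definition
cor_manual <- function(u, v) {
  du <- u - mean(u)
  dv <- v - mean(v)
  sum(du * dv) / sqrt(sum(du^2) * sum(dv^2))
}

cor_manual(x, y)
cor(x, y)                   # built-in, should agree

cor_manual(10 * x + 3, y)   # same value after shifting and rescaling x
```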

6 Describing a linear pattern: Correlation coefficient, interpretation and pitfalls
[Figure: correlation for various data patterns (reprinted from Wikipedia)]

7 Describing a linear pattern: The correlation coefficient ρ does not tell the whole story
[Figure: scatter plots of four data sets]
All four sets have identical ρ ≈ 0.816, but vary considerably when graphed. The data in the figure are synthetic data made up by F. Anscombe in 1973 to illustrate the pitfalls associated with ρ.
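
Anscombe's four data sets ship with R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`), so the claim can be checked directly. A minimal sketch:

```r
# Correlation of each of the four Anscombe pairs (x1, y1), ..., (x4, y4)
rho <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
rho   # all four values are close to 0.816

# Plotting the four panels shows how different the patterns are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```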

8 Describing a linear pattern: Structure effect (aka Simpson's paradox)
[Figure]
A quick interpretation of ρ values can lead to erroneous conclusions. See also the Wikipedia article on Simpson's paradox for further detail and examples.
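
The structure effect can be reproduced on simulated data. The sketch below is an illustration built for this purpose, not the example shown on the slide: two hypothetical groups each have a decreasing relationship between x and y, yet pooling them gives a positive correlation because the groups sit at different levels.

```r
set.seed(1)
n_per_group <- 50
group <- rep(c("A", "B"), each = n_per_group)

# Group B has larger x values and a much higher baseline for y
x <- c(rnorm(n_per_group, mean = 0), rnorm(n_per_group, mean = 5))
offset <- ifelse(group == "A", 0, 15)
y <- offset - x + rnorm(2 * n_per_group, sd = 0.5)

cor(x, y)                                # pooled data: positive
cor(x[group == "A"], y[group == "A"])    # within group A: negative
cor(x[group == "B"], y[group == "B"])    # within group B: negative
```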

9 The simple regression model
Notation: $x_1, \dots, x_n$ are the ages of the various individuals, $y_1, \dots, y_n$ the DDT concentrations.
The simple linear model:
$$Y_i = a x_i + b + \varepsilon_i$$
- the $x_i$'s are deterministic variables (a somewhat arbitrary modelling choice)
- $a$ and $b$ are unknown deterministic coefficients
- the $\varepsilon_i$'s are independent realisations of a $N(0, \sigma^2)$ variable (an assumption made for convenience; its consistency with the data has to be checked, see later)
There are three parameters in this problem: $\theta = (a, b, \sigma)$.
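
To make the model concrete, one can simulate from it. In the sketch below the values of a, b and σ and the ages x are arbitrary choices for illustration; they are not estimates from the pike data.

```r
set.seed(42)
n <- 15
x <- seq(2, 6, length.out = n)   # hypothetical deterministic ages

# Hypothetical "true" parameter values theta = (a, b, sigma)
a <- 0.3
b <- -0.5
sigma <- 0.2

eps <- rnorm(n, mean = 0, sd = sigma)   # i.i.d. N(0, sigma^2) errors
y <- a * x + b + eps                    # Y_i = a x_i + b + eps_i

plot(x, y, xlab = "age", ylab = "simulated response")
abline(a = b, b = a, lty = 2)   # true line; abline() takes intercept first, then slope
```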

10 Parameter estimation: Least square error estimation
For arbitrary values $a$ and $b$, $e_i = y_i - a x_i - b$ measures the error made by the linear model on observation $i$. A good model should yield low errors on all observations.
Definition: the Least Square estimator
The unknown parameters $(a, b)$ are estimated as the values $(\hat{a}, \hat{b})_{LS}$ that jointly minimize $\sum_i (y_i - a x_i - b)^2$. In math style:
$$(\hat{a}, \hat{b})_{LS} = \operatorname{Argmin}_{a,b} \sum_i (y_i - a x_i - b)^2$$
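
The definition can be applied literally by handing the sum of squared errors to a general-purpose optimiser. This is only a sketch on simulated data (the data, the function name `sse` and the choice of `optim` with BFGS are assumptions of this example); in practice `lm` solves the same problem exactly.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

# Sum of squared errors as a function of par = (a, b)
sse <- function(par, x, y) sum((y - par[1] * x - par[2])^2)

# Numerical minimisation starting from (a, b) = (0, 0)
fit <- optim(par = c(0, 0), fn = sse, x = x, y = y, method = "BFGS")
fit$par                      # (a, b) found numerically

# Reference: least squares fit computed by lm (intercept first, then slope)
ref <- lm(y ~ x)
coef(ref)
```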

11 Parameter estimation: Connection with lecture on Statistical Estimation (lecture 1)
- The vector $(\hat{a}, \hat{b})_{LS}$ is an estimate of $(a, b)$.
- The procedure Data $\mapsto (\hat{a}, \hat{b})_{LS}$, i.e. the generic process associating an estimate to a dataset, is an estimator.
- This procedure or function is deterministic in the sense that the same dataset will yield the same estimate.
- In the framework of this course, Data $= (x_i, y_i)_{i=1,\dots,n}$ with $Y_i = a x_i + b + \varepsilon_i$, where $\varepsilon_i$ is a random variable.
- The data are random, therefore $(\hat{a}, \hat{b})_{LS}$ should be seen as random.
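
Both points (a deterministic procedure, a random output) can be seen in a small simulation. The parameter values below are hypothetical, and `one_estimate` is a helper name introduced for this sketch: each call draws a fresh dataset from the model and returns the least squares estimate.

```r
set.seed(1)
x <- seq(2, 6, length.out = 15)   # hypothetical fixed ages
a <- 0.3; b <- -0.5; sigma <- 0.2 # hypothetical true parameters

# Draw one dataset from the model and return its LS estimate
one_estimate <- function() {
  y <- a * x + b + rnorm(length(x), mean = 0, sd = sigma)
  coef(lm(y ~ x))
}

# 1000 datasets give 1000 different estimates
est <- t(replicate(1000, one_estimate()))
colMeans(est)        # centred near (b, a)
apply(est, 2, sd)    # spread of the estimator across datasets

# Same dataset, same estimate: the procedure itself is deterministic
y_fixed <- a * x + b + rnorm(length(x), sd = sigma)
identical(coef(lm(y_fixed ~ x)), coef(lm(y_fixed ~ x)))
```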

12 Parameter estimation: Remarks and questions on the Least Square principle
- What if we attempt to minimize $\sum_i |y_i - a x_i - b|$?
- What if we swap x and y?
- What if the data points are approximately located on a circle?
- Why the squared error?

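13 Parameter estimation: Explicit expression of $(\hat{a}, \hat{b})_{LS}$
$$(\hat{a}, \hat{b})_{LS} = \operatorname{Argmin}_{a,b} \sum_i (y_i - a x_i - b)^2$$
$SSE(a, b) = \sum_i (y_i - a x_i - b)^2$ is a second order polynomial in $a$ and $b$.
Solving $\frac{\partial}{\partial a} SSE(a, b) = 0$ and $\frac{\partial}{\partial b} SSE(a, b) = 0$ yields the expression of the least square estimates:
$$\hat{a}_{LS} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \quad \text{and} \quad \hat{b}_{LS} = \bar{y} - \hat{a}_{LS}\,\bar{x}$$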
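
The closed-form expressions translate directly into R. The sketch below uses simulated data (not the pike data) and compares the result with `lm`.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

# Closed-form least squares estimates
a_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b_hat <- mean(y) - a_hat * mean(x)
c(a_hat = a_hat, b_hat = b_hat)

# lm reports the intercept first, then the slope
coef(lm(y ~ x))
```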

14 Parameter estimation: Computational detail I
Zeroing the partial derivatives in $a$ and $b$ we get
$$\frac{\partial}{\partial a} SSE(a, b) = -2 \sum_i x_i (y_i - a x_i - b) = 0 \tag{1}$$
$$\frac{\partial}{\partial b} SSE(a, b) = -2 \sum_i (y_i - a x_i - b) = 0 \tag{2}$$
Hence
$$a \sum_i x_i^2 + b \sum_i x_i = \sum_i x_i y_i \tag{3}$$
$$a \sum_i x_i + n b = \sum_i y_i \tag{4}$$
We have a linear system in $a$ and $b$.
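
Since Eq. (3) and (4) form a 2 x 2 linear system, they can also be solved numerically with `solve()`. A minimal sketch on simulated data; the names `A` and `rhs` are introduced here for illustration.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses
n <- length(x)

# Coefficient matrix and right-hand side of Eq. (3) and (4), unknowns (a, b)
A <- matrix(c(sum(x^2), sum(x),
              sum(x),   n),
            nrow = 2, byrow = TRUE)
rhs <- c(sum(x * y), sum(y))

sol <- solve(A, rhs)   # sol[1] is a, sol[2] is b
sol

coef(lm(y ~ x))        # intercept first, then slope
```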

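15 Parameter estimation: Computational detail II
A substitution yields:
$$\hat{a}_{LS} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \tag{5}$$
or equivalently, dividing numerator and denominator by $n^2$,
$$\hat{a}_{LS} = \frac{\frac{1}{n} \sum_i x_i y_i - \bar{x}\bar{y}}{\frac{1}{n} \sum_i x_i^2 - \bar{x}^2} \tag{6}$$
The intercept then follows from Eq. (4): $\hat{b}_{LS} = \bar{y} - \hat{a}_{LS}\,\bar{x}$.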

16 Parameter estimation: Estimating the variance σ² of the residuals I
The LSE does not provide an estimate of $\sigma^2$. A reasonable idea could be to define $\hat{\varepsilon}_i = y_i - \hat{a}_{LS}\, x_i - \hat{b}_{LS}$ and estimate $\sigma^2$ as $\frac{1}{n} \sum_i \hat{\varepsilon}_i^2$. This would lead to a biased estimator.

17 Parameter estimation: Estimating the variance σ² of the residuals II
Unbiased estimate of $\sigma^2$: we define
$$\hat{\sigma}^2 = \frac{1}{n - 2} \sum_i \hat{\varepsilon}_i^2 \quad \text{with} \quad \hat{\varepsilon}_i = y_i - \hat{a}_{LS}\, x_i - \hat{b}_{LS}$$
It is an unbiased estimator of $\sigma^2$, i.e. $E[\hat{\sigma}^2] = \sigma^2$.
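
The difference between the two estimators can be illustrated by simulation. The sketch below uses hypothetical parameter values and averages both estimators over many simulated datasets: the 1/n version should sit below σ² by the factor (n - 2)/n, while the 1/(n - 2) version should be close to σ². The last lines suggest that the residual standard error reported by `summary(lm(...))` corresponds to the unbiased version.

```r
set.seed(2)
n <- 15
x <- seq(2, 6, length.out = n)    # hypothetical fixed ages
a <- 0.3; b <- -0.5; sigma <- 0.2 # hypothetical true parameters

# For each simulated dataset, compute both variance estimators
sim <- replicate(2000, {
  y <- a * x + b + rnorm(n, sd = sigma)
  r <- resid(lm(y ~ x))
  c(biased = sum(r^2) / n, unbiased = sum(r^2) / (n - 2))
})

m <- rowMeans(sim)
m          # average of each estimator over the simulations
sigma^2    # target value

# On a single dataset, summary() uses the n - 2 denominator
y <- a * x + b + rnorm(n, sd = sigma)
fit <- lm(y ~ x)
summary(fit)$sigma^2
sum(resid(fit)^2) / (n - 2)
```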

18 Parameter estimation: Connection to maximum likelihood estimation I
Remember: in the linear model, the $x_i$'s are considered deterministic. Our model says:
$$Y_i = a x_i + b + \varepsilon_i \quad \text{and} \quad (\varepsilon_i)_{i=1,\dots,n} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$$
(i.i.d. stands for independent and identically distributed)
The two statements above can be rewritten equivalently as
$$(Y_i)_{i=1,\dots,n} \overset{\text{indep.}}{\sim} N(a x_i + b, \sigma^2)$$

19 Parameter estimation: Connection to maximum likelihood estimation II
The probability density of $Y_i$ is Gaussian with mean $\mu_i = a x_i + b$ and variance $\sigma^2$:
$$f_{Y_i}(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - a x_i - b}{\sigma}\right)^2\right]$$
The likelihood in this problem is
$$L(y_1, \dots, y_n; a, b, \sigma) = \prod_{i=1}^{n} f_{Y_i}(y_i) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y_i - a x_i - b}{\sigma}\right)^2\right]$$

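20 Parameter estimation: Connection to maximum likelihood estimation III
The log-likelihood (up to an additive constant) is:
$$l(a, b, \sigma) = \ln L(a, b, \sigma) = -n \ln \sigma - \frac{1}{2} \sum_i \left(\frac{y_i - a x_i - b}{\sigma}\right)^2$$
The MLE can be obtained by zeroing $\frac{\partial}{\partial a} l(a, b, \sigma)$, $\frac{\partial}{\partial b} l(a, b, \sigma)$ and $\frac{\partial}{\partial \sigma} l(a, b, \sigma)$:
$$\frac{\partial}{\partial a} l(a, b, \sigma) = \frac{1}{\sigma^2} \sum_i x_i (y_i - a x_i - b) = 0 \tag{7}$$
$$\frac{\partial}{\partial b} l(a, b, \sigma) = \frac{1}{\sigma^2} \sum_i (y_i - a x_i - b) = 0 \tag{8}$$
$$\frac{\partial}{\partial \sigma} l(a, b, \sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_i (y_i - a x_i - b)^2 = 0 \tag{9}$$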

21 Parameter estimation: Connection to maximum likelihood estimation IV
In Eq. (7-8), we recognize, up to a constant factor, the equations that define the LS estimator. Hence $\hat{a}_{ML} = \hat{a}_{LS}$ and $\hat{b}_{ML} = \hat{b}_{LS}$.
Plugging $\hat{a}_{ML}$ and $\hat{b}_{ML}$ in Eq. (9) yields:
$$\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_i (y_i - \hat{a}_{ML}\, x_i - \hat{b}_{ML})^2$$
which is biased.
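
These conclusions can be checked numerically by maximising the log-likelihood with a generic optimiser. The sketch below runs on simulated data; the function name `negloglik`, the use of `optim`, the starting values and the log-parametrisation of σ (which keeps σ positive) are choices made for this example, not part of the lecture.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses
n <- length(x)

# Negative log-likelihood (up to an additive constant), par = (a, b, log sigma)
negloglik <- function(par, x, y) {
  sigma <- exp(par[3])
  length(y) * log(sigma) + 0.5 * sum(((y - par[1] * x - par[2]) / sigma)^2)
}

start <- c(0, mean(y), log(sd(y)))
mle <- optim(start, negloglik, x = x, y = y, method = "BFGS")

a_ml <- mle$par[1]
b_ml <- mle$par[2]
sigma2_ml <- exp(mle$par[3])^2

# Compare with least squares and with the 1/n variance estimate
fit <- lm(y ~ x)
rbind(ML = c(a_ml, b_ml), LS = coef(fit)[c("x", "(Intercept)")])
c(sigma2_ml = sigma2_ml, one_over_n = sum(resid(fit)^2) / n)
```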

22 Geometric interpretation of linear regression
- $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are vectors in $\mathbb{R}^n$
- the values $a x_i + b$ can be seen as the entries of the vector $a x + b\mathbf{1}$ in $\mathbb{R}^n$
- $\sum_i (y_i - a x_i - b)^2$ is the square of the norm of $y - a x - b\mathbf{1}$
- $\hat{a} x + \hat{b}\mathbf{1}$ is the orthogonal projection of $y$ on $\mathrm{Span}(\mathbf{1}, x)$
- $\mathrm{Cor}(x, y)$ is the cosine of the angle formed by the centred vectors $x - \bar{x}\mathbf{1}$ and $y - \bar{y}\mathbf{1}$ in $\mathbb{R}^n$
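
The geometric statements can be verified numerically. In the sketch below (simulated data; the names `X` and `P` are introduced here), `P` is the orthogonal projection matrix onto Span(1, x), built with the standard formula X (X'X)^{-1} X'.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

# Orthogonal projection onto the plane spanned by the vectors 1 and x
X <- cbind(1, x)
P <- X %*% solve(t(X) %*% X) %*% t(X)
y_proj <- as.vector(P %*% y)

fit <- lm(y ~ x)
all.equal(y_proj, unname(fitted(fit)))   # projection equals fitted values

# The residual vector is orthogonal to both 1 and x
res <- y - y_proj
c(sum(res), sum(res * x))                # both numerically zero

# Correlation as the cosine of the angle between the centred vectors
xc <- x - mean(x)
yc <- y - mean(y)
sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
cor(x, y)
```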

23 Goodness of fit
The quality of the model can be assessed by the coefficient of determination.
Coefficient of determination:
$$R^2 \mathrel{\hat{=}} 1 - \frac{SS_{err}}{SS_{tot}} \quad \text{with} \quad SS_{tot} = \sum_i (y_i - \bar{y})^2 \quad \text{and} \quad SS_{err} = \sum_i (\hat{y}_i - y_i)^2$$
- $0 \le R^2 \le 1$
- Good model ⇔ low $SS_{err}$ ⇔ high $R^2$
- If the regression model includes an intercept, then $R^2 = \rho^2$
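
A short sketch on simulated data, computing R² from its definition and comparing it with the value reported by `summary()` and with the squared correlation:

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

fit <- lm(y ~ x)                            # model with an intercept
ss_err <- sum((fitted(fit) - y)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_err / ss_tot

r2
summary(fit)$r.squared   # value reported by R
cor(x, y)^2              # squared correlation coefficient
```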

24 Testing parameters: Testing $H_0: a = a_0$ I
We are often interested in assessing whether $a$ is significantly different from a particular value $a_0$. Often $H_0: a = 0$, which corresponds to the absence of dependence between x and y.
The question is: should the difference between $\hat{a}$ and $a_0$ be considered large enough to reject $H_0$?

25 Testing parameters: Testing $H_0: a = a_0$ II
The Student test
Under the assumptions given here, and if $H_0$ holds, then
$$T = \frac{\hat{a} - a_0}{\widehat{sd}(\hat{a})} \sim St_{n-2}$$
$H_0$ is rejected at level $\alpha$ if $|t|$, the absolute value of the observed $T$, is larger than the quantile with probability $1 - \alpha/2$ of a $St_{n-2}$ distribution.
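
The test can be carried out by hand and compared with the output of `summary()`. The sketch below uses simulated data and relies on the standard expression of the estimated standard deviation of the slope, sd(â) = σ̂ / sqrt(Σ (x_i - x̄)²), which is not given on the slide and is stated here as an assumption of the example.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses
n <- length(x)

fit <- lm(y ~ x)
a_hat <- unname(coef(fit)["x"])

# Unbiased variance estimate and standard error of the slope
sigma2_hat <- sum(resid(fit)^2) / (n - 2)
se_a <- sqrt(sigma2_hat / sum((x - mean(x))^2))

a0 <- 0                                     # H0: a = a0
t_obs <- (a_hat - a0) / se_a

alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)     # Student quantile
reject <- abs(t_obs) > t_crit
p_value <- 2 * pt(-abs(t_obs), df = n - 2)  # two-sided p-value

c(t_obs = t_obs, t_crit = t_crit, p_value = p_value)
reject

# Same quantities as reported by R for H0: a = 0
summary(fit)$coefficients["x", ]
```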

26 Checking model assumptions
The $\varepsilon_i$'s are assumed to be i.i.d. If $(\hat{a}, \hat{b})$ estimates $(a, b)$ correctly, the $\hat{\varepsilon}_i$'s should be close to i.i.d. A plot of $(\hat{\varepsilon}_i)_{i=1,\dots,n}$ against $(x_i)_{i=1,\dots,n}$ should not display any pattern.
- Visual check of residuals
- Visual check of standardised residuals
- plot(lm(...)) in R
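
A possible way to produce these visual checks is sketched below on simulated data; `resid`, `rstandard` and `qqnorm` are standard R functions, and the choice of which plots to draw is only a suggestion.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

fit <- lm(y ~ x)
res <- resid(fit)           # raw residuals
res_std <- rstandard(fit)   # standardised residuals

op <- par(mfrow = c(1, 3))
# Residuals against x: no visible pattern is expected
plot(x, res, xlab = "x", ylab = "residuals")
abline(h = 0, lty = 2)
# Standardised residuals: most values are expected between -2 and 2
plot(x, res_std, xlab = "x", ylab = "standardised residuals")
abline(h = c(-2, 0, 2), lty = 2)
# Normal quantile-quantile plot to check the Gaussian assumption
qqnorm(res_std)
qqline(res_std)
par(op)

# plot(fit) would produce R's default set of diagnostic plots
```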

27 Prediction: out-of-sample value under a linear regression model
What value $y_{new}$ should be expected for an extra individual with observed explanatory variable $x_{new}$?
Definition: prediction
$$\hat{y}_{new} = \hat{a}\, x_{new} + \hat{b}$$
Straightforward in R; see also the use of the generic function predict.
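
A minimal sketch of both routes, on simulated data, for new values 3.5 and 8 of the explanatory variable. Note that `predict` expects a data frame whose column name matches the variable used in the formula.

```r
set.seed(42)
x <- seq(2, 6, length.out = 15)             # hypothetical ages
y <- 0.3 * x - 0.5 + rnorm(15, sd = 0.2)    # hypothetical responses

fit <- lm(y ~ x)
x_new <- c(3.5, 8)

# By hand, from the definition
a_hat <- unname(coef(fit)["x"])
b_hat <- unname(coef(fit)["(Intercept)"])
y_new_manual <- a_hat * x_new + b_hat

# With the generic function predict
y_new <- predict(fit, newdata = data.frame(x = x_new))

y_new_manual
y_new
```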

28 Linear regression in practice with R: Parameter estimation in practice with R
Assuming data objects are named x and y in your R session:

    # fit a linear model and store the (long) output list
    res.lm = lm(formula = y ~ x)
    # extract estimated coef
    res.coef = coefficients(res.lm)

29 Back to pikes: R code to fit a linear regression on the pike data

    # the file location is missing in this transcript; see file pike_data.txt in the data folder (slide 2)
    pike = read.table("...", header = TRUE)
    ## fitting regression line
    lm.res = lm(formula = DDT ~ 1 + Age, data = pike)

30 Back to pikes: Pike data analysis output in R
(numeric values shown as ... are missing from this transcript)

    ## display the R object lm.res
    summary(lm.res)

    Call:
    lm(formula = DDT ~ 1 + Age, data = pike)

    Residuals:
        Min      1Q  Median      3Q     Max
        ...     ...     ...     ...     ...

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)      ...        ...     ...      ...
    Age              ...        ...     ...  ...e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: ... on 13 degrees of freedom
    Multiple R-squared: ..., Adjusted R-squared: ...
    F-statistic: ... on 1 and 13 DF, p-value: 2.165e-05

31 Back to pikes: R code
Link to the R script used in this lecture (and more)

32 Exercise
Bingham and Fry, exercise 1.3 p. 29. Data file here.
Hints:
- download with download.file(url="", destfile="./running.txt")
- read in R with read.table("./running.txt", header=TRUE)
- a model fit on log data can be obtained by lm(log(y) ~ log(x))
- use the R function confint for confidence intervals
Sheather, exercise 1 p. 38. Data are available on the book web site as file playbill.csv.
Hints:
- for testing β_0 = 10000, use the function test.coef available from here.

33 References
Suggested reading: Chapter "Regression and correlation", Introductory Statistics with R, P. Dalgaard, Series Statistics and Computing, Springer.