Regression and correlation

Correlation lets us know the association or absence of relationship between two variables ‘x’ and ‘y’. Regression, on the other end,predicts the value of the dependent variable based on the known value of the independent variable, assuming the existence of relationship between two or more variables.

Usage:

  • Correlation - to represent linear relationship between two variables.

  • Regression - to fit a best line and estimate one variable on the basis of another variable.

 

The term correlation is a combination of two words ‘Co’ (together) and relation (connection) between two quantities. Correlation is when, at the time of study of two variables, it is observed that a unit change in one variable is retaliated by an equivalent change in another variable. Or else, the variables are said to be uncorrelated when the movement in one variable does not amount to any movement in another variable in a specific direction.

It is a statistical technique that represents the strength of the connection between pairs of variables.

Correlation can be positive or negative. When the two variables move in the same direction, i.e. an increase in one variable will result in the corresponding increase in another variable and vice versa, then the variables are considered to be positively correlated.

On the contrary, when the two variables move in different directions, in such a way that an increase in one variable will result in a decrease in another variable and vice versa, This situation is known as a negative correlation. The measures of correlation may, for instance, be given as:

Karl Pearson’s Product-moment correlation coefficient
Spearman’s rank correlation coefficient

 

In [1]:
#cor() computes the correlation coefficient
#cor.test() test for association/correlation between paired samples;
#It returns both the correlation coefficient and the significance level(or p-value) of the correlation .

#The simplified formats are:

cor(x, y, method = c('pearson', 'kendall', 'spearman'))
cor.test(x, y, method=c('pearson', 'kendall', 'spearman'))
 
 

A brief comparison of Pearson, Kendall and Spearman can be found at https://www.phdata.io/blog/data-science-stats-review/

In [1]:
#install.packages('ggpubr')
library('ggpubr')
 
Loading required package: ggplot2
Loading required package: magrittr
In [2]:
my_data <- mtcars
head(my_data, 6)
 
  mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In [3]:
res <- cor.test(my_data$wt, my_data$mpg, 
                    method = 'pearson')
res
 
	Pearson's product-moment correlation

data:  my_data$wt and my_data$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor 
-0.8676594 
 

The p-value of the test is less than the significance level alpha = 0.05. We can conclude that wt and mpg are significantly correlated with a correlation coefficient of -0.88 and p-value of 1,488e-11.

In [4]:
res <- cor.test(my_data$wt, my_data$mpg, 
                    method = 'spearman')
res
 
Warning message in cor.test.default(my_data$wt, my_data$mpg, method = 'spearman'):
“Cannot compute exact p-value with ties”

Note: Since the spearman correlation coefficient considers the rank of values, the correlation test ignores the same ranks to find the p-values as a result we get the warning “Cannot compute exact p-value with ties”. This can be avoided by using exact = FALSE inside the cor.test function.

 
	Spearman's rank correlation rho

data:  my_data$wt and my_data$mpg
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
-0.886422 
In [6]:
library('ggpubr')
ggscatter(my_data, x = 'mpg', y = 'wt', 
          add = 'reg.line', conf.int = TRUE, 
          cor.coef = TRUE, cor.method = 'pearson',
          xlab = 'Miles/(US) gallon', ylab = 'Weight (1000 lbs)')
 
 
 

Linear Regression

Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that, we can use this formula to estimate the value of the response Y, when only the predictors (Xs) values are known.

 

The mathematical equation can be generalized as follows:

Y = β1 + β2X + ϵ

where, β1 is the intercept and β2 is the slope. Collectively, they are called regression coefficients. ϵ is the error term, the part of Y the regression model is unable to explain

 

Example

For this analysis, we will use the cars dataset that comes with R by default. cars is a standard built-in dataset, that makes it convenient to demonstrate linear regression in a simple and easy to understand fashion. You can access this dataset simply by typing in cars in your R console. You will find that it consists of 50 observations(rows) and 2 variables (columns) – dist and speed. Lets print out the first six observations here..

In [3]:
head(cars)
length(cars$speed)
 
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
 
50
 

Before we begin building the regression model, it is a good practice to analyze and understand the variables. The graphical analysis and correlation study below will help with this.

 

Graphical Analysis

The aim of this exercise is to build a simple regression model that we can use to predict Distance (dist) by establishing a statistically significant linear relationship with Speed (speed). But before jumping in to the syntax, lets try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn to visualize the following behavior:

Scatter plot: Visualize the linear relationship between the predictor and response

Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.

Density plot: To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred. Let us see how to make each one of them.
 

Scatter Plot

Scatter plots can help visualize any linear relationships between the dependent (response) variable and independent (predictor) variables. Ideally, if you are having multiple predictor variables, a scatter plot is drawn for each one of them against the response, along with the line of best as seen below.

In [3]:
scatter.smooth(x=cars$speed, y=cars$dist, main='Dist ~ Speed')  # scatterplot
 
 

The scatter plot along with the smoothing line above suggests a linearly increasing relationship between the ‘dist’ and ‘speed’ variables. This is a good thing, because one of the underlying assumptions in linear regression is that the relationship between the response and predictor variables is linear.

 

BoxPlot – Check for outliers

Generally, any datapoint that lies outside the 1.5 interquartile-range (1.5  IQR) is considered an outlier, where IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.

http://rhesus.amu.edu.pl/databank/projects/project_files/94a1d561c0/boxplot.png

In [5]:
par(mfrow=c(1, 2))  # divide graph area into 2 columns
boxplot(cars$speed, main='Speed', sub=paste('Outlier rows: ', boxplot.stats(cars$speed)$out))  # box plot for 'speed'
boxplot(cars$dist, main='Distance', sub=paste('Outlier rows: ', boxplot.stats(cars$dist)$out))  # box plot for 'distance'
 
 

Density plot

Check if the response variable is close to normality

In [8]:
#install.packages('e1071')
# e1071: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shor# test path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbour ...
library(e1071)

par(mfrow=c(1, 2))  # divide graph area in 2 columns
plot(density(cars$speed), main='Density Plot: Speed', ylab='Frequency', sub=paste('Skewness:', round(e1071::skewness(cars$speed), 2)))  # density plot for 'speed'
polygon(density(cars$speed), col='red')
plot(density(cars$dist), main='Density Plot: Distance', ylab='Frequency', sub=paste('Skewness:', round(e1071::skewness(cars$dist), 2)))  # density plot for 'dist'
polygon(density(cars$dist), col='red')
 
 

Correlation

Correlation is a statistical measure that suggests the level of linear dependence between two variables, that occur in pair – just like what we have here in speed and dist. Correlation can take values between -1 to +1. If we observe for every instance where speed increases, the distance also increases along with it, then there is a high positive correlation between them and therefore the correlation between them will be closer to 1. The opposite is true for an inverse relationship, in which case, the correlation between the variables will be close to -1.

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of variation of the response variable (Y) is unexplained by the predictor (X), in which case, we should probably look for better explanatory variables.

In [9]:
cor(cars$speed, cars$dist)
 
0,80689490068921
 

Build a Linear Model

Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, lets see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes in two main arguments, namely: 1. Formula 2. Data. The data is typically a data.frame and the formula is a object of class formula. But the most common convention is to write out the formula directly in place of the argument as written below.

In [7]:
linearMod <- lm(dist ~ speed, data=cars)  # build linear regression model on full data
print(linearMod)
 
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17,579        3,932  

 

Interpretation

Now that we have built the linear model, we also have established the relationship between the predictor and response in the form of a mathematical formula for Distance (dist) as a function for speed. For the above output, you can notice the ‘Coefficients’ part having two components: Intercept: -17.579, speed: 3.932 These are also called the beta coefficients. In other words,

dist = Intercept + (β ∗ speed)

=> dist = −17.579 + 3.932∗speed

 

Linear Regression Diagnostics

Now the linear model is built and we have a formula that we can use to predict the dist value if a corresponding speed is known. Is this enough to actually use this model? NO! Before using a regression model, you have to make sure it is statistically significant. Lets begin by printing the summary statistics for linearMod.

In [11]:
summary(linearMod)  # model summary
 
 
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29,069  -9,525  -2,272   9,215  43,201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17,5791     6,7584  -2,601   0,0123 *  
speed         3,9324     0,4155   9,464 1,49e-12 ***
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 15,38 on 48 degrees of freedom
Multiple R-squared:  0,6511,	Adjusted R-squared:  0,6438 
F-statistic: 89,57 on 1 and 48 DF,  p-value: 1,49e-12
 

The p Value: Checking for statistical significance

The summary statistics above tells us a number of things. One of them is the model p-Value (bottom last line) and the p-Value of individual predictor variables (extreme right column under ‘Coefficients’). The p-Values are very important, because we can consider a linear model to be statistically significant only when both of these p-Values are less than the pre-determined statistical significance level, which is typically 0.05.

Null and alternate hypothesis

When there is a p-value, there is a null and alternative hypothesis associated with it. In Linear Regression, the Null Hypothesis is that the coefficients associated with the variables is equal to zero. The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).


t-value

We can interpret the t-value something like this. A larger t-value indicates that it is less likely that the coefficient is not equal to zero purely by chance. So the higher t-value, the better.

Pr(>|t|) or p-value is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis (the β coefficient is equal to zero or that there is no relationship) is true. So if the Pr(>|t|) is low, the coefficients are significant (significantly different from zero). If the Pr(>|t|) is high, the coefficients are not significant.

What this means to us? when p Value is less than significance level (< 0.05), we can safely reject the null hypothesthat the co-efficient β of the predictor is zero. In our case, both these p-Values are well below the 0.05 threshold, so we can conclude our model is indeed statistically significant.

It is absolutely important for the model to be statistically significant before we can go ahead and use it to predict (or estimate) the dependent variable.
 

How to calculate the t Statistic and p-Values?

When the model co-efficients and standard error are known, the formula for calculating t Statistic and p-Value is as follows:

 

t−Statistic = β−coefficient / Std.Error

In [25]:
modelSummary <- summary(linearMod)  # capture model summary as an object
modelCoeffs <- modelSummary$coefficients  # model coefficients
beta.estimate <- modelCoeffs['speed', 'Estimate']  # get beta estimate for speed
std.error <- modelCoeffs['speed', 'Std. Error']  # get std.error for speed
t_value <- beta.estimate/std.error  # calc t statistic
p_value <- 2*pt(-abs(t_value), df=nrow(cars)-ncol(cars))  # calc p Value
f_statistic <- linearMod$fstatistic[1]  # fstatistic
f <- summary(linearMod)$fstatistic  # parameters for model p-value calc
In [28]:
# t-value
t_value
# p-value
p_value
#F-statistics
f[1]
 
9,46398999029837
 
1,4898364962951e-12
 
value: 89,5671065364677
In [30]:
summary(linearMod)
 
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29,069  -9,525  -2,272   9,215  43,201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17,5791     6,7584  -2,601   0,0123 *  
speed         3,9324     0,4155   9,464 1,49e-12 ***
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 15,38 on 48 degrees of freedom
Multiple R-squared:  0,6511,	Adjusted R-squared:  0,6438 
F-statistic: 89,57 on 1 and 48 DF,  p-value: 1,49e-12
 

R-Squared

The actual information in a data is the total variation it contains. What R-Squared tells us is the proportion of variation in the dependent (response) variable that has been explained by this model.

R2 = (1−SSE) /SST

where, SSE is the sum of squared errors divided bythe sum of squared total.

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.

We don’t necessarily discard a model based on a low R-Squared value. Its a better practice to look at the AIC and prediction accuracy on validation sample when deciding on the efficacy of a model.

In [34]:
rsq <- function (x, y) cor(x, y) ^ 2
R2 <- rsq(cars$speed, cars$dist)
R2
 
0,651079380758251
 

Standard Error and F-Statistic

Both standard errors and F-statistic are measures of goodness of fit.

In [36]:
std.error
 
0,415512776657122
In [37]:
f[1]
 
value: 89,5671065364677
 

AIC and BIC

The Akaike’s information criterion - AIC (Akaike, 1974) and the Bayesian information criterion - BIC (Schwarz, 1978) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.

The AIC is defined as:

AIC = (−2) × ln(L) + (2×k)

where, k is the number of model parameters and the BIC is defined as:

BIC = (−2) × ln(L) + k × ln(n)

where n is the sample size.

For model comparison, the model with the lowest AIC and BIC score is preferred.

In [39]:
AIC(linearMod)
BIC(linearMod)
 
419,156863027353
 
424,892932043638
 

How to know if the model is best fit for your data?

The most common metrics to look at while selecting the model are:

     STATISTIC             CRITERION
    R-Squared              Higher the better (> 0.70)
Adj R-Squared              Higher the better
  F-Statistic              Higher the better
   Std. Error              Closer to zero the better
  t-statistic              Should be greater 1.96 for p-value to be less than 0.05
          AIC              Lower the better
          BIC              Lower the better
   Mallows cp              Should be close to the number of predictors in model
         MAPE              Lower the better
          MSE              Lower the better
Min_Max Accuracy          Higher the better


MSE - (Mean squared error)

MAPE - (Mean absolute percentage error)

Min_Max Accuracy => mean(min(actual, predicted)/max(actual, predicted))

 

Predictions

So far we have seen how to build a linear regression model using the whole dataset. If we build it that way, there is no way to tell how the model will perform with new data. So the preferred practice is to split your dataset into a 80:20 sample (training:test), then build the model on the 80% sample and then use the model to predict the dependent variable on test data.

Doing it this way, we will have the model predicted values for the 20% data (test) as well as the actuals (from the original dataset). By calculating accuracy measures (like min_max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, lets see how to actually do this.

 

Our model: dist = −17.579 + 3.932∗speed

 

Step 1:

Create the training (development) and test (validation) data samples from the original data.

In [8]:
# Create Training and Test data -
set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(cars), 0.8*nrow(cars))  # row indices for training data
trainingData <- cars[trainingRowIndex, ]  # model training data
testData  <- cars[-trainingRowIndex, ]   # test data
 

Step 2:

Develop the model on the training data and use it to predict the distance on test data

In [9]:
# Build the model on training data -
lmMod <- lm(dist ~ speed, data=trainingData)  # build the model
distPred <- predict(lmMod, testData)  # predict distance
 

Step 3:

Review diagnostic measures.

In [10]:
summary (lmMod)
 
 
Call:
lm(formula = dist ~ speed, data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-23,350 -10,771  -2,137   9,255  42,231 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -22,657      7,999  -2,833  0,00735 ** 
speed          4,316      0,487   8,863 8,73e-11 ***
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 15,84 on 38 degrees of freedom
Multiple R-squared:  0,674,	Adjusted R-squared:  0,6654 
F-statistic: 78,56 on 1 and 38 DF,  p-value: 8,734e-11
 

From the model summary, the model p value and predictor’s p value are less than the significance level, so we know we have a statistically significant model.

Step 4:

Calculate prediction accuracy and error rates

A simple correlation between the actuals and predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actuals and predicted values have similar directional movement, i.e. when the actuals values increase the predicteds also increase and vice-versa.

In [11]:
actuals_preds <- data.frame(cbind(actuals=testData$dist, predicteds=distPred))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds) 
correlation_accuracy
head(actuals_preds)
 
  actuals predicteds
actuals 1,0000000 0,8277535
predicteds 0,8277535 1,0000000
 
  actuals predicteds
1 2 -5,392776
4 22 7,555787
8 26 20,504349
20 26 37,769100
26 54 42,085287
31 50 50,717663
 

Now lets calculate the Min Max accuracy and MAPE:

In [51]:
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy 
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  
mape
 
0,380048922445362
 
0,699503247384078
 

At home: k- Fold Cross validation

Suppose, the model predicts satisfactorily on the 20% split (test data), is that enough to believe that your model will perform equally well all the time? It is important to rigorously test the model’s performance as much as possible. One way is to ensure that the model equation you have will perform well, when it is ‘built’ on a different subset of training data and predicted on the remaining data.

How to do this is? Split your data into ‘k’ mutually exclusive random sample portions. Keeping each portion as test data, we build the model on the remaining (k-1 portion) data and calculate the mean squared error of the predictions. This is done for each of the ‘k’ random sample portions. Then finally, the average of these mean squared errors (for ‘k’ portions) is computed. We can use this metric to compare different linear models.

By doing this, we need to check two things:

If the model’s prediction accuracy isn’t varying too much for any one particular sample, and
If the lines of best fit don’t vary too much with respect the the slope and level.

In other words, they should be parallel and as close to each other as possible. You can find a more detailed explanation for interpreting the cross validation charts when you learn about advanced linear model building.

In [12]:
#install.packages('DAAG')
library(DAAG)
library(ggplot2)
cvResults <- suppressWarnings(CVlm(cars, form.lm = dist ~ speed, m=5, dots=FALSE, seed=29, legend.pos='topleft',  printit=FALSE, main='Small symbols are predicted values while bigger ones are actuals.'));  # performs the CV
attr(cvResults, 'ms')  # => 251.2783 mean squared error
 
Loading required package: lattice
 
251,278295313108
 
 

Assumptions of Linear Regression

Building a linear regression model is only half of the work. In order to actually be usable in practice, the model should conform to the assumptions of linear regression.

In [77]:
#install.packages('gvlma')
library('gvlma')
par(mfrow=c(2,2))  # draw 4 plots in same window

mod <- lm(dist ~ speed, data=cars)
gvlma::gvlma(mod)

plot(mod)
 
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17,579        3,932  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0,05 

Call:
 gvlma::gvlma(x = mod) 

                    Value  p-value                   Decision
Global Stat        15,801 0,003298 Assumptions NOT satisfied!
Skewness            6,528 0,010621 Assumptions NOT satisfied!
Kurtosis            1,661 0,197449    Assumptions acceptable.
Link Function       2,329 0,126998    Assumptions acceptable.
Heteroscedasticity  5,283 0,021530 Assumptions NOT satisfied!
 
 

From the above plot the data points: 23, 35 and 49 are marked as outliers. Lets remove them from the data and re-build the model.

In [13]:
par(mfrow=c(2,2)) 
mod <- lm(dist ~ speed, data=cars[-c(23, 35, 49), ])
gvlma::gvlma(mod)
 
Call:
lm(formula = dist ~ speed, data = cars[-c(23, 35, 49), ])

Coefficients:
(Intercept)        speed  
    -15,137        3,608  
In [14]:
#install.packages('DAAG')
library(DAAG)
library(ggplot2)

new_data <- (cars[-c(23, 35, 49), ])

cvResults <- suppressWarnings(CVlm(new_data, form.lm = dist ~ speed, m=5, dots=FALSE, seed=29, legend.pos='topleft',  printit=FALSE, main='Small symbols are predicted values while bigger ones are actuals.'));  # performs the CV
attr(cvResults, 'ms')
 
149,924266304934
 
In [15]:
summary (mod)
 
Warning message in printCoefmat(coefs, digits = digits, signif.stars = signif.stars, :
“NAs introduced by coercion”
 
Call:
lm(formula = dist ~ speed, data = cars[-c(23, 35, 49), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-25,032  -7,686  -1,032   6,576  26,185 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15,1371     5,3053  -2,853  0,00652 ** 
speed         3,6085     0,3302  10,928    3e-14 ***
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 11,84 on 45 degrees of freedom
Multiple R-squared:  0,7263,	Adjusted R-squared:  0,7202 
F-statistic: 119,4 on 1 and 45 DF,  p-value: 3,003e-14
 

R-squard and Adjusted R-squared improved MSE decreased twice

 

Full model from scratch

In [76]:
# Create Training and Test data -
set.seed(100)  # setting seed to reproduce results of random sampling

new_data <- (cars[-c(23, 35, 49), ])


trainingRowIndex <- sample(1:nrow(new_data), 0.8*nrow(new_data))  # row indices for training data
trainingData <- cars[trainingRowIndex, ]  # model training data
testData  <- cars[-trainingRowIndex, ]   # test data

# Build the model on training data -
lmMod <- lm(dist ~ speed, data=trainingData)  # build the model

distPred <- predict(lmMod, testData)  # predict distance

actuals_preds <- data.frame(cbind(actuals=testData$dist, predicteds=distPred))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds) 
correlation_accuracy

min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy 
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  
mape
 
  actuals predicteds
actuals 1,0000000 0,8934843
predicteds 0,8934843 1,0000000
 
0,826323503374534
 
0,216315142018462
 

Practice

Multiple (Linear) Regression

R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity.

Fitting the Model

In [ ]:
# Multiple Linear Regression Example 
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
In [ ]:
#Optional: Other useful functions (may be usefull) 
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters 
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table 
vcov(fit) # covariance matrix for model parameters 
influence(fit) # regression diagnostics
 

Diagnostic Plots

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observerations.

In [ ]:
# diagnostic plots 
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(fit)
 

Comparing Models

You can compare nested models with the anova( ) function. The following code provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.

In [ ]:
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2)
anova(fit1, fit2)
 

Cross Validation

You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.

In [ ]:
# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
 

Task: Build multiple regression model

Examples of data

We’ll use the marketing data set [datarium package], which contains the impact of the amount of money spent on three advertising medias (youtube, facebook and newspaper) on sales.

We want to build a model for estimating sales based on the advertising budget invested in youtube, facebook and newspaper, as follow:

sales = b0 + b1 youtube + b2 facebook + b3 * newspaper

In [17]:
library(DAAG)
#getwd()
marketing <- read.table('~/Desktop/Statistics_R/data/marketing.tsv', sep='\t', header=TRUE)
head(marketing, 4)
 
youtube facebook newspaper sales
276,12 45,36 83,04 26,52
53,40 47,16 54,12 12,48
20,64 55,08 83,16 11,16
181,80 49,56 70,20 22,20
In [18]:
fit <- lm(sales ~ youtube + facebook + newspaper, data=marketing)
summary(fit)
 
Warning message in printCoefmat(coefs, digits = digits, signif.stars = signif.stars, :
“NAs introduced by coercion”
 
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10,5932  -1,0690   0,2902   1,4272   3,3951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3,526667   0,374290   9,422   <2e-16 ***
youtube      0,045765   0,001395  32,809   <2e-16 ***
facebook     0,188530   0,008611  21,893   <2e-16 ***
newspaper   -0,001037   0,005871  -0,177     0,86    
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 2,023 on 196 degrees of freedom
Multiple R-squared:  0,8972,	Adjusted R-squared:  0,8956 
F-statistic: 570,3 on 3 and 196 DF,  p-value: < 2,2e-16
In [20]:
fit <- lm(sales ~ youtube + facebook, data=marketing)
summary(fit)
 
Warning message in printCoefmat(coefs, digits = digits, signif.stars = signif.stars, :
“NAs introduced by coercion”
 
Call:
lm(formula = sales ~ youtube + facebook, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10,5572  -1,0502   0,2906   1,4049   3,3994 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3,50532    0,35339   9,919   <2e-16 ***
youtube      0,04575    0,00139  32,909   <2e-16 ***
facebook     0,18799    0,00804  23,382   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0,001 ‘**’ 0,01 ‘*’ 0,05 ‘.’ 0,1 ‘ ’ 1

Residual standard error: 2,018 on 197 degrees of freedom
Multiple R-squared:  0,8972,	Adjusted R-squared:  0,8962 
F-statistic: 859,6 on 2 and 197 DF,  p-value: < 2,2e-16
In [ ]:
 
 

For example, for a fixed amount of youtube and newspaper advertising budget, spending an additional 1 000 dollars on facebook advertising leads to an increase in sales by approximately 0.1885*1000 = 189 sale units, on average.