Regression Modeling in R (Part 1)

A guide to Simple Linear Regression

Namitha Deshpande
The Startup

--

Though straightforward comparative tests of individual statistics are useful in their own right, you’ll often want to learn more from your data.

In this story, you’ll look at linear regression models: a suite of methods used to evaluate precisely how variables relate to each other.

Regression analysis

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’, ‘covariates’, or ‘features’). — Wikipedia

To understand regression analysis fully, it’s essential to understand the following terms:

  • Dependent Variable: This is the main factor that you’re trying to understand or predict.
  • Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.
  • Outlier: An observation that differs significantly from the other observations. Outliers should be identified and handled carefully, since they can distort the results of the regression.

While there are many types of regression analysis, in this story, the goal is to explore the Linear Regression method alone.

When is Linear Regression Appropriate?

The sensible use of linear regression on a data set requires that four assumptions about that data set be true:

  • The relationship between the variables is linear.
  • Your data needs to show homoscedasticity, meaning that the variance of the residuals (the differences between the observed and predicted values) is roughly constant across all values of the predictor.
  • There should be no significant outliers.
  • Finally, the residuals (errors) of the regression line should be approximately normally distributed. One quick way to check these assumptions in R is sketched below.
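
Once you have fitted a model with lm() (as you will later in this story), R's built-in diagnostic plots give a quick visual check of the linearity, constant-variance, and normality assumptions. This is a minimal sketch, assuming the fitted object is called model:

# Residuals vs fitted values: the points should scatter evenly around zero,
# with no obvious curve (non-linearity) or funnel shape (heteroscedasticity)
plot(model, which = 1)

# Normal Q-Q plot: the points should lie close to the reference line if the
# residuals are approximately normally distributed
plot(model, which = 2)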

Simple Linear Regression

It is the simplest form of regression. The relationship between the dependent variable and a single independent variable is assumed to be linear.

In statistics, linear regression is a linear approach to modeling the relationship between an outcome (or dependent variable) and one or more predictors (or independent variables). — Wikipedia

The Dataset

As an example to start with, let's look at the student survey data (the survey data frame in the package MASS) a little more closely. Install and load the MASS package as shown below:

install.packages("MASS")
library(MASS)

Turn your attention to the data frame survey, located in the MASS package. These data record particular characteristics of 237 first-year undergraduate statistics students collected from a class at the University of Adelaide, South Australia.

Use ?survey to explore the different variables present in the data frame.
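
Before plotting, you can take a quick numerical look at the two variables used in this story. This is an optional sketch; calling summary() on the relevant columns also reveals how many values are missing (NA):

# Five-number summaries (plus NA counts) for writing handspan and height
summary(survey[, c("Wr.Hnd", "Height")])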

Graphical Analysis

The aim here is to build a simple regression model that we can use to predict Height by establishing a statistically significant linear relationship with Handspan. But before jumping into the model, let's try to understand these variables graphically.

Firstly, plot the student heights on the y-axis and their handspans (of their writing hand) on the x-axis using the plot function.

plot(survey$Height ~ survey$Wr.Hnd, xlab = "Writing handspan (cm)", ylab = "Height (cm)")
A scatterplot of height against writing handspan for a sample of first-year statistics students

As one might expect, there’s a positive association between a student’s handspan and their height. That relationship appears to be linear in nature.

Correlation

Correlation is a statistical measure of the level of linear dependence between two variables that occur in pairs, just like the height and handspan variables we have here.

Correlation can take values between -1 and +1. If, for every instance where handspan increases, height also increases along with it, then there is a strong positive correlation between the two, and the correlation coefficient will be close to +1.

cor(survey$Wr.Hnd, survey$Height, use = "complete.obs")
[1] 0.6009909

Defining a model

The purpose of a linear regression model is to come up with a function that estimates the mean of one variable given a particular value of another variable.

In terms of the student survey example, you might ask something like “What’s the expected height of a student if their handspan is 14.5 cm?”

Assume you're looking to determine the value of a response variable Y given the value of an explanatory variable X. The simple linear regression model states that the value of the response is expressed as:

Y = ß0 + ß1X + ε

Parameters

  • Y is the output or response variable.
  • X is the input or predictor variable.
  • ε is the random error term.
  • ß0 is the intercept, interpreted as the expected value of the output variable when the input is zero.
  • ß1 is the slope, interpreted as the change in the mean response for each one-unit increase in the predictor.
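
To make these symbols concrete, here is a purely illustrative sketch: the true values 110 and 3, and the error standard deviation of 7, are made up for the example and are not taken from the survey data. It simulates data from a known straight-line model plus noise and then recovers the parameters with lm(), the function introduced in the next section:

# Simulate 100 observations from Y = ß0 + ß1*X + ε with ß0 = 110 and ß1 = 3
set.seed(1)
x <- runif(100, min = 10, max = 25)     # made-up predictor values
y <- 110 + 3 * x + rnorm(100, sd = 7)   # add random error ε ~ N(0, 7^2)

coef(lm(y ~ x))                         # estimates should land close to 110 and 3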

Fitting the model with lm

A linear regression model can be calculated in R with the command lm. It takes many arguments as shown below:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

For example, the following line of code creates a fitted linear model object of mean student height by handspan and stores it in your global environment as the variable model:

model <- lm(Height ~ Wr.Hnd, data = survey)

The first argument is the now-familiar response ~ predictor formula, which specifies the desired model. The second argument, data, tells the function which data frame to look in for the variables named in that formula.

With the command summary(model) you can see detailed information on the model’s performance and coefficients.

summary(model)

Call:
lm(formula = Height ~ Wr.Hnd, data = survey)

Residuals:
     Min       1Q   Median       3Q      Max
-19.7276  -5.0706  -0.8269   4.9473  25.8704

Coefficients:
            Estimate
(Intercept) 113.9536
Wr.Hnd        3.1166

Coefficients

In the summary output above, you can see the estimated values of the intercept and the slope for the writing handspan variable. This means the fitted linear model for this scenario is estimated as:

Expected Height = 113.9536 + 3.1166 × Wr.Hnd

If you evaluate the mathematical function for y at a range of different values for x, you end up with a straight line when you plot the results.

Considering the definition of the intercept given earlier, the expected value of the response variable when the predictor is zero, this would imply that the mean height of a student with a handspan of 0 cm is 113.9536 cm. On its own that is clearly not a sensible prediction, since a handspan of 0 cm lies far outside the range of the observed data; here the intercept mainly serves to anchor the fitted line.

The slope is 3.1166. This states that, on average, for every 1 cm increase in handspan, a student’s height is estimated to increase by 3.1166 cm.
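
As a quick check (a minimal sketch using base R's coef() and predict() on the fitted object), you can use the model to answer the question posed earlier about a 14.5 cm handspan:

# Extract the estimated intercept and slope
coef(model)

# Expected mean height for a student with a 14.5 cm handspan
predict(model, newdata = data.frame(Wr.Hnd = 14.5))
# equivalent to 113.9536 + 3.1166 * 14.5, i.e. roughly 159.1 cm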

With all this knowledge, let's add the fitted regression line to the scatterplot using the abline function, which adds one or more straight lines to the current plot:

abline(model, lwd = 2)
The simple linear regression line (solid, bold) fitted to the observed data.

Residuals

A good way to test the quality of the fit of the model is to look at the residuals or the differences between the real values and the predicted values.

When the parameters are estimated as shown above, the fitted line is referred to as the least-squares regression line, because it is the line that minimizes the sum (equivalently, the average) of the squared vertical differences between the observed data points and the line itself.

For a least-squares line that includes an intercept, the residuals sum to (essentially) zero. Individually, however, the residuals will rarely be zero: real data almost never fall exactly on a straight line, so some scatter around the fitted line is expected.
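
As an optional sketch, you can pull the residuals out of the fitted object with residuals() and confirm these properties directly:

# Residuals = observed heights minus fitted heights (rows with NAs are dropped)
res <- residuals(model)

sum(res)        # numerically indistinguishable from zero for a least-squares fit
quantile(res)   # reproduces the Min / 1Q / Median / 3Q / Max reported by summary()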

In the R summary of the fitted lm object, you can see descriptive statistics about the residuals of the model; the following are the values for our fitted model:

Residuals:
     Min       1Q   Median       3Q      Max
-19.7276  -5.0706  -0.8269   4.9473  25.8704

Coefficient of Determination

One widely used measure of how well your model fits the data is the coefficient of determination, or R². It is defined as the proportion of the total variability in the response that is explained by the regression model.

The output of summary provides you with the values of Multiple R-squared and Adjusted R-squared, which are particularly interesting. Both of these are referred to as the coefficient of determination.

Multiple R-squared:  0.3612, Adjusted R-squared:  0.3581

For simple linear regression, the Multiple R-squared value is simply the square of the estimated correlation coefficient. Here it tells you that about 36.1 per cent of the variation in student heights can be attributed to handspan.
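
You can verify this relationship yourself. As a small sketch, squaring the correlation computed earlier reproduces the Multiple R-squared value, and both R-squared figures can also be extracted directly from the summary object:

# Square of the correlation coefficient = Multiple R-squared (simple regression only)
cor(survey$Wr.Hnd, survey$Height, use = "complete.obs")^2

# The same values stored in the summary object
summary(model)$r.squared
summary(model)$adj.r.squared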

The adjusted measure is an alternative estimate that takes into account the number of parameters that must be estimated. It matters mainly when you use the coefficient of determination to assess the overall "quality" of the fitted model as a balance between goodness of fit and complexity.

This can seem a little bit complicated, but in general, for models that fit the data well, R² is near 1.

Conclusion

You made it to the end! Yes, this was a long one. Linear regression is a big topic, so I have decided to make a series of articles on Regression Modeling in R. Stay tuned!

Thanks for reading :)
