Can you use this data to predict someone's grade based on the number of hours studied?
Using linear regression, it is actually possible to make a reasonable estimate based on past data. This article will show you how to find the Least Squares Linear Regression line in order to make predictions based on data already collected.
Least Squares Linear Regression explanation
When analysing bivariate data, you have two variables: the dependent or response variable, usually denoted by \(y\), and the independent or explanatory variable, usually denoted by \(x\).
When \(y\) is the dependent variable and \(x\) is the independent variable, you can say '\(y\) depends on \(x\)'.
Suppose you have collected data on two variables \(y\) and \(x\) where the result of \(y\) depends on \(x\). There also appears to be a linear relationship between the variables. How would you go about predicting a value of \(y\) for a given value of \(x\)?
At GCSE, you may have had to draw a line of best fit where you would use your own judgement to determine in which "direction" the data was going. The least squares regression line does this mathematically.
A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
Residuals
If you've seen any bivariate data you'll know that very rarely do the data points fall exactly along a straight line, even if there is a confirmed linear 'relationship' between variables.
There could be a number of reasons for these inaccuracies (e.g. other factors affecting the dependent variable, or inaccurate readings when collecting the data). There are so many possible causes of these inaccuracies that you can assume they are entirely random.
In the image below, you can see a 'line of best fit' for the data points \((x_1,y_1)\), \((x_2,y_2)\), \((x_3,y_3)\) and \((x_4,y_4)\). Note that the line does not touch any of these points.
The vertical difference between these points and the line of best fit is labelled with \(\epsilon _1\), \(\epsilon _2\), \(\epsilon _3\) and \(\epsilon _4\). These are the residuals associated with each data point.
The difference between the observed value of the dependent variable (\(y_i\)) and the value predicted by the line (\(\hat{y}_i\)) is called the residual (\(\epsilon _i\)).
Although these residuals mean that the prediction is not 100% accurate, they are in fact crucial to how you find the least squares regression line: by minimising the squares of these residuals. Hence the name "least squares regression".
The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals,
$$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$
where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
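The criterion above can be expressed as a short function. In the sketch below, the helper name `sum_squared_residuals` and the data points are illustrative, not from the article; it shows that a line lying closer to the data produces a smaller sum of squared residuals.

```python
# Sketch: sum of squared residuals for a candidate line y = a*x + b.
# (Helper name and data are illustrative, not from the article.)
def sum_squared_residuals(xs, ys, a, b):
    """Return e1^2 + e2^2 + ... where e_i = y_i - (a*x_i + b)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data that is roughly linear with gradient 2.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# The line y = 2x fits closely; y = x does not.
close_fit = sum_squared_residuals(xs, ys, 2, 0)
poor_fit = sum_squared_residuals(xs, ys, 1, 0)
print(close_fit < poor_fit)  # True: the better line has the smaller sum
```

The least squares regression line is, by definition, the line whose gradient and intercept make this sum as small as possible.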
Least Squares Linear Regression method
The least squares linear regression method is used to find the regression line. The objective is to minimise the sum of the squares of the residuals of the data points in a data set.
Deriving the Least Squares Linear Regression line
Although this may sound complicated, actually finding the regression line is pretty straightforward.
As with finding any straight line in mathematics, you need two things: a \(y\)-intercept and a gradient. Luckily, there is a straightforward formula for finding these.
Least Squares Linear Regression formula
The regression line of \(y\) on \(x\) is
$$y=ax+b$$
where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\), with
$$S_{xy}=\sum x_iy_i - \dfrac{\sum x_i \sum y_i}{n}$$ $$S_{xx}=\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$$ $$S_{yy}=\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$$
The summary statistics \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) may be given to you in an exam, or you may need to compute them from the raw data using a calculator.
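The formulae above translate directly into code. The sketch below (the helper name `regression_line` is an assumption for illustration, not a standard library function) computes the gradient \(a\) and intercept \(b\) from raw data.

```python
# Sketch of the least squares formulae above.
# (Function name is illustrative, not from the article.)
def regression_line(xs, ys):
    """Return (a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xy = sum(x*y) - (sum(x) * sum(y)) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # S_xx = sum(x^2) - (sum(x))^2 / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    a = s_xy / s_xx                   # gradient
    b = sum_y / n - a * sum_x / n     # intercept: b = y-bar - a * x-bar
    return a, b
```

For example, `regression_line([1, 2, 3, 4, 5], [49, 81, 71, 83, 99])` returns approximately \((10.2,\ 46)\), matching the worked example later in this article.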
Least Squares Linear Regression solved example
You are now ready to apply this method to a possible exam question.
The number of hours students studied and their exam results are recorded in the table below.
Time studied in hours | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) |
Exam result | \(49\) | \(81\) | \(71\) | \(83\) | \(99\) |
a. Calculate \(S_{xy}\) and \(S_{xx}\).
b. Find the regression line of \(y\) on \(x\).
c. Plot the data points and the regression line on the same graph.
d. Interpret the meaning of \(a=10.2\) and \(b=46\) in the context of the question.
e. Predict the grade for a student who studies for
i) \(2.5\) hours
ii) \(8\) hours.
f. Comment on your answers for part e).
Solution
a. Using your calculator, you can easily find the following results,
\(\sum x=15\), \(\sum x^2=55\), \(\bar{x}=3\), \(\sum xy=1{,}251\), \(\sum y=383\), \(\sum y^2=30{,}693\), \(\bar{y}=76.6\).
Simply plug these results into the formulae detailed above to get the summary statistics.
\( \begin{align} S_{xx} &=\sum x^2 - \dfrac{(\sum x)^2}{n} \\&= 55 - \dfrac{15^2}{5} \\&= 10. \end{align}\)
\( \begin{align} S_{xy} &= \sum xy - \dfrac{\sum x \sum y}{n}\\&= 1251 - \dfrac{15 \times 383}{5} \\&= 102. \end{align}\)
b. Starting with \(a\), the gradient of the line,
\[a=\dfrac{S_{xy}}{S_{xx}}=\frac{102}{10}=10.2.\]
Then, the \(y\)-intercept is
\(b=\bar{y}-a\bar{x}=76.6-10.2 \times 3=46\).
Therefore, the regression line is \(y=10.2x+46\).
c. This is a great question for double-checking your working - it'll be pretty obvious if you've made any serious calculation errors!
d. Since \(a=10.2\), for every additional hour of study, the predicted exam result increases by \(10.2\) marks.
Since \(b=46\), if a student weren't to study at all, they would still (according to the regression line) receive 46 marks.
e. Substitute each value of \(x\) into the regression equation.
i) If \(x=2.5\), \(y=10.2\times 2.5+46=71.5\).
ii) If \(x=8\), \(y=10.2\times 8+46=127.6\).
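The worked example can be checked numerically. This sketch reproduces parts a, b and e step by step (variable names are illustrative).

```python
# Data from the worked example: hours studied vs exam result.
xs = [1, 2, 3, 4, 5]
ys = [49, 81, 71, 83, 99]
n = len(xs)

# Part a: summary statistics.
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n   # = 10.0
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # = 102.0

# Part b: gradient and intercept.
a = s_xy / s_xx                    # ~ 10.2
b = sum(ys) / n - a * sum(xs) / n  # ~ 46.0

# Part e: predictions (x = 8 is an extrapolation beyond the data).
print(a * 2.5 + b)   # ~ 71.5
print(a * 8 + b)     # ~ 127.6
```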
f. There is a fundamental problem for part ii): since the exams are graded in percentages, the grade \(127.6\) doesn't exist! The truth is, for any amount of time longer than 5 hours, the data doesn't have any information on what happens to the grades of the students.
While you could deduce that for any length of time above 5 hours, 100% would be a good prediction, this is beyond the scope of the data and the linear regression model.
You should keep in mind that a regression line should only be used to predict values that fall within the range of the data from which it was derived, i.e. interpolation.
If you attempt to make predictions outside of this range, it would be called extrapolation and is less reliable since the data may behave differently.
The most difficult thing in this topic is making sure you enter the correct numbers into your calculator! Make sure you double-check your calculations in the exam so you don't lose easy marks.
Least Squares Linear Regression - Key takeaways
- A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
- The difference between the observed dependent variable (\(y_i\)) and the predicted dependent variable is called the residual (\(\epsilon _i\)).
- The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals:
$$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$
where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
The regression line of \(y\) on \(x\) is
$$y=ax+b$$
where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\).
- The summary statistics are:
\(S_{xy}=\sum xy - \dfrac{\sum x \sum y}{n}\)
\(S_{xx}=\sum x^2 - \dfrac{(\sum x)^2}{n}\)
\(S_{yy}=\sum y^2 - \dfrac{(\sum y)^2}{n}\)
Frequently Asked Questions about Least Squares Linear Regression
How do you find the least squares regression line?
You can find the least squares regression line either from the raw data or from summary statistics.
Is least squares the same as linear regression?
Not exactly: linear regression is the model, and least squares is the most common criterion used to fit it.
What is SSE and SST?
The SSE is the sum of squares error and the SST is the sum of squares total. You do not need to know these at A-level.
What is least squares regression for?
Least squares regression is used for predicting a dependent variable given an independent variable using data you have collected.
What is ordinary least squares regression analysis?
A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
What is the least squares criterion for linear regression equations?
The least squares regression line is that which minimises the sum of the squares of the residuals.