Can you use this data to predict someone's grade based on the number of hours studied?
Using linear regression, it is actually possible to make a reasonable estimate based on past data. This article will show you how to find the Least Squares Linear Regression line in order to make predictions based on data already collected.
Least Squares Linear Regression explanation
When analysing bivariate data, you have two variables: the dependent or response variable, usually denoted by \(y\), and the independent or explanatory variable, usually denoted by \(x\).
When \(y\) is the dependent variable and \(x\) is the independent variable, you can say '\(y\) depends on \(x\)'.
Suppose you have collected data on two variables \(y\) and \(x\) where the result of \(y\) depends on \(x\). There also appears to be a linear relationship between the variables. How would you go about predicting a value of \(y\) for a given value of \(x\)?
At GCSE, you may have had to draw a line of best fit where you would use your own judgement to determine in which "direction" the data was going. The least squares regression line does this mathematically.
A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
Residuals
If you've seen any bivariate data you'll know that very rarely do the data points fall exactly along a straight line, even if there is a confirmed linear 'relationship' between variables.
There could be a number of reasons for these inaccuracies (e.g. other factors affecting the dependent variable, or inaccurate readings when collecting the data). There are so many possible causes of these inaccuracies that you can assume they are entirely random.
In the image below, you can see a 'line of best fit' for the data points \((x_1,y_1)\), \((x_2,y_2)\), \((x_3,y_3)\) and \((x_4,y_4)\). Note that the line does not touch any of these points.
The vertical difference between these points and the line of best fit is labelled with \(\epsilon _1\), \(\epsilon _2\), \(\epsilon _3\) and \(\epsilon _4\). These are the residuals associated with each data point.
The difference between the observed value of the dependent variable (\(y_i\)) and the value predicted by the line (\(\hat{y}_i\)) is called the residual (\(\epsilon _i\)).
Although these residuals mean that the prediction is not 100% accurate, they are in fact crucial to how you find the least squares regression line: by minimising the squares of these residuals. Hence the name "least squares regression".
The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals,
$$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$
where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
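The criterion above can be expressed as a short function. In the sketch below, the helper name `sum_squared_residuals` and the data points are illustrative, not from the article; it shows that a line lying closer to the data produces a smaller sum of squared residuals.

```python
# Sketch: sum of squared residuals for a candidate line y = a*x + b.
# (Helper name and data are illustrative, not from the article.)
def sum_squared_residuals(xs, ys, a, b):
    """Return e1^2 + e2^2 + ... where e_i = y_i - (a*x_i + b)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data that is roughly linear with gradient 2.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# The line y = 2x fits closely; y = x does not.
close_fit = sum_squared_residuals(xs, ys, 2, 0)
poor_fit = sum_squared_residuals(xs, ys, 1, 0)
print(close_fit < poor_fit)  # True: the better line has the smaller sum
```

The least squares regression line is, by definition, the line whose gradient and intercept make this sum as small as possible.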
Least Squares Linear Regression method
The least squares linear regression method is used to find the regression line. The objective is to minimise the sum of the squares of the residuals of the data points in a data set.
Deriving the Least Squares Linear Regression line
Although this may sound complicated, actually finding the regression line is pretty straightforward.
As with finding any straight line in mathematics, you need two things: a \(y\)-intercept and a gradient. Luckily, there is a straightforward formula for finding these.
Least Squares Linear Regression formula
The regression line of \(y\) on \(x\) is
$$y=ax+b$$
where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\), with
$$S_{xy}=\sum x_iy_i - \dfrac{\sum x_i \sum y_i}{n}$$ $$S_{xx}=\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$$ $$S_{yy}=\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$$
The summary statistics \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) may be given to you in an exam, or you may need to compute them from the raw data using a calculator.
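The formulae above translate directly into code. The sketch below (the helper name `regression_line` is an assumption for illustration, not a standard library function) computes the gradient \(a\) and intercept \(b\) from raw data.

```python
# Sketch of the least squares formulae above.
# (Function name is illustrative, not from the article.)
def regression_line(xs, ys):
    """Return (a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xy = sum(x*y) - (sum(x) * sum(y)) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # S_xx = sum(x^2) - (sum(x))^2 / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    a = s_xy / s_xx                   # gradient
    b = sum_y / n - a * sum_x / n     # intercept: b = y-bar - a * x-bar
    return a, b
```

For example, `regression_line([1, 2, 3, 4, 5], [49, 81, 71, 83, 99])` returns approximately \((10.2,\ 46)\), matching the worked example later in this article.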
Least Squares Linear Regression solved example
You are now ready to apply this method to a possible exam question.
The number of hours students studied and their exam results are recorded in the table below.
Time studied in hours | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) |
Exam result | \(49\) | \(81\) | \(71\) | \(83\) | \(99\) |
a. Calculate \(S_{xy}\) and \(S_{xx}\).
b. Find the regression line of \(y\) on \(x\).
c. Plot the data points and the regression line on the same graph.
d. Interpret the meaning of \(a=10.2\) and \(b=46\) in the context of the question.
e. Predict the grade for a student who studies for
i) \(2.5\) hours
ii) \(8\) hours.
f. Comment on your answers for part e).
Solution
a. Using your calculator, you can easily find the following results,
\(\sum x=15\), \(\sum x^2=55\), \(\bar{x}=3\), \(\sum xy=1{,}251\), \(\sum y=383\), \(\sum y^2=30{,}693\), \(\bar{y}=76.6\).
Simply plug these results into the formulae detailed above to get the summary statistics.
\( \begin{align} S_{xx} &=\sum x^2 - \dfrac{(\sum x)^2}{n} \\&= 55 - \dfrac{15^2}{5} \\&= 10. \end{align}\)
\( \begin{align} S_{xy} &= \sum xy - \dfrac{\sum x \sum y}{n}\\&= 1251 - \dfrac{15 \times 383}{5} \\&= 102. \end{align}\)
b. Starting with \(a\), the gradient of the line,
\[a=\dfrac{S_{xy}}{S_{xx}}=\frac{102}{10}=10.2.\]
Then, the \(y\)-intercept is
\(b=\bar{y}-a\bar{x}=76.6-10.2 \times 3=46\).
Therefore, the regression line is \(y=10.2x+46\).
c. This is a great question for double-checking your working - it'll be pretty obvious if you've made any serious calculation errors!
d. Since \(a=10.2\), for every additional hour of study, the predicted exam result increases by \(10.2\) marks.
Since \(b=46\), if a student weren't to study at all, they would still (according to the regression line) receive 46 marks.
e. Substitute each value of \(x\) into the regression equation.
i) If \(x=2.5\), \(y=10.2\times 2.5+46=71.5\).
ii) If \(x=8\), \(y=10.2\times 8+46=127.6\).
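The worked example can be checked numerically. This sketch reproduces parts a, b and e step by step (variable names are illustrative).

```python
# Data from the worked example: hours studied vs exam result.
xs = [1, 2, 3, 4, 5]
ys = [49, 81, 71, 83, 99]
n = len(xs)

# Part a: summary statistics.
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n   # = 10.0
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # = 102.0

# Part b: gradient and intercept.
a = s_xy / s_xx                    # ~ 10.2
b = sum(ys) / n - a * sum(xs) / n  # ~ 46.0

# Part e: predictions (x = 8 is an extrapolation beyond the data).
print(a * 2.5 + b)   # ~ 71.5
print(a * 8 + b)     # ~ 127.6
```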
f. There is a fundamental problem for part ii): since the exams are graded in percentages, the grade \(127.6\) doesn't exist! The truth is, for any amount of time longer than 5 hours, the data doesn't have any information on what happens to the grades of the students.
While you could deduce that for any length of time above 5 hours, 100% would be a good prediction, this is beyond the scope of the data and the linear regression model.
You should keep in mind that a regression line should only be used to predict values that fall within the range of the data from which it was derived, i.e. interpolation.
If you attempt to make predictions outside of this range, it would be called extrapolation and is less reliable since the data may behave differently.
The most difficult thing in this topic is making sure you enter the correct numbers into your calculator! Make sure you double-check your calculations in the exam so you don't lose easy marks.
Least Squares Linear Regression - Key takeaways
- A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
- The difference between the observed dependent variable (\(y_i\)) and the predicted dependent variable is called the residual (\(\epsilon _i\)).
- The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals:
$$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$
where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
The regression line of \(y\) on \(x\) is
$$y=ax+b$$
where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\).
- The summary statistics are:
\(S_{xy}=\sum xy - \dfrac{\sum x \sum y}{n}\)
\(S_{xx}=\sum x^2 - \dfrac{(\sum x)^2}{n}\)
\(S_{yy}=\sum y^2 - \dfrac{(\sum y)^2}{n}\)
Frequently Asked Questions about Least Squares Linear Regression
How do you find the least squares regression line?
You can find the least squares regression line either from the raw data or from summary statistics.
Is least squares the same as linear regression?
Not exactly: linear regression is the model, and least squares is the most common criterion used to fit it.
What is SSE and SST?
The SSE is the sum of squares error and the SST is the sum of squares total. You do not need to know these at A-level.
What is least squares regression for?
Least squares regression is used for predicting a dependent variable given an independent variable using data you have collected.
What is ordinary least squares regression analysis?
A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
What is the least squares criterion for linear regression equations?
The least squares regression line is that which minimises the sum of the squares of the residuals.