Least Squares Linear Regression

Imagine you've collected data from students on their exam mark and how many hours they studied. Plotting this information on a scatter graph, it looks like there is a positive linear relationship between the exam mark and the number of hours studied.

StudySmarter Editorial Team

Team Least Squares Linear Regression Teachers

  • 8 minutes reading time
  • Checked by StudySmarter Editorial Team

    Can you use this data to predict someone's grade based on the number of hours studied?

    Using linear regression, it is actually possible to make a reasonable estimate based on past data. This article will show you how to find the Least Squares Linear Regression line in order to make predictions based on data already collected.

    Least Squares Linear Regression explanation

    When analysing bivariate data, you have two variables: the dependent or response variable, usually denoted by \(y\), and the independent or explanatory variable, usually denoted by \(x\).

    When \(y\) is the dependent variable and \(x\) is the independent variable, you can say '\(y\) depends on \(x\)'.

    Suppose you have collected data on two variables \(y\) and \(x\) where the result of \(y\) depends on \(x\). There also appears to be a linear relationship between the variables. How would you go about predicting a value of \(y\) for a given value of \(x\)?

    At GCSE, you may have had to draw a line of best fit where you would use your own judgement to determine in which "direction" the data was going. The least squares regression line does this mathematically.

    A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.

    Residuals

    If you've seen any bivariate data you'll know that very rarely do the data points fall exactly along a straight line, even if there is a confirmed linear 'relationship' between variables.

    There could be a number of reasons for these deviations (e.g. other factors affecting the dependent variable, or inaccurate readings when collecting the data). There are so many possible causes that you can treat the deviations as entirely random.

    In the image below, you can see a 'line of best fit' for the data points \((x_1,y_1)\), \((x_2,y_2)\), \((x_3,y_3)\) and \((x_4,y_4)\). Note that the line does not touch any of these points.

    The vertical difference between these points and the line of best fit is labelled with \(\epsilon _1\), \(\epsilon _2\), \(\epsilon _3\) and \(\epsilon _4\). These are the residuals associated with each data point.

    Figure: Least squares regression line with residuals (an upward-sloping line of best fit, with vertical dotted lines labelled \(\epsilon_i\) between the data points and the line).

    The difference between the observed value of the dependent variable (\(y_i\)) and the value predicted by the line at \(x_i\) is called the residual (\(\epsilon _i\)).

    Although these residuals mean that the prediction is not 100% accurate, they are in fact crucial to how you find the least squares regression line: by minimising the squares of these residuals. Hence the name "least squares regression".

    The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals,

    $$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$

    where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
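    To make this criterion concrete, here is a minimal Python sketch that computes the residuals (observed minus predicted) and their sum of squares for a candidate line \(y=ax+b\). The function names are made up for illustration. The data are the five (hours, mark) pairs from the solved example later in this article, and the two lines compared are the regression line \(y=10.2x+46\) found there and a nearby line \(y=10x+46\) chosen arbitrarily.

```python
def residuals(xs, ys, a, b):
    """Return the residuals y_i - (a*x_i + b) for the line y = ax + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, a, b):
    """Return the sum of the squared residuals for the line y = ax + b."""
    return sum(e ** 2 for e in residuals(xs, ys, a, b))


# Data from the solved example: hours studied and exam results.
hours = [1, 2, 3, 4, 5]
marks = [49, 81, 71, 83, 99]

# The least squares line from the solved example versus a nearby line.
ssr_least_squares = sum_squared_residuals(hours, marks, 10.2, 46)
ssr_other = sum_squared_residuals(hours, marks, 10, 46)

print(ssr_least_squares < ssr_other)  # the least squares line has the smaller sum
```

    Trying other values of \(a\) and \(b\) in the same way should never give a smaller sum than the least squares line does.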

    Least Squares Linear Regression method

    The least squares linear regression method is used to find the regression line. Its objective is to minimise the sum of the squares of the residuals of the data points in a data set.

    Deriving the Least Squares Linear Regression line

    Although this may sound complicated, actually finding the regression line is pretty straightforward.

    As with finding any straight line in mathematics, you need two things: a \(y\)-intercept and a gradient. Luckily, there is a straightforward formula for finding these.

    Least Squares Linear Regression formula

    The regression line of \(y\) on \(x\) is

    $$y=ax+b$$

    where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\), where

    $$S_{xy}=\sum x_iy_i - \dfrac{\sum x_i \sum y_i}{n}$$

    $$S_{xx}=\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$$

    $$S_{yy}=\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$$

    The summary statistics \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) may be given to you in an exam, or you may also need to find them from the raw data using a calculator.
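
    As a brief sketch of where these formulae come from (a standard calculus argument that goes beyond what is needed to apply them), write the sum of the squared residuals as a function of \(a\) and \(b\). The symbol \(R\) is introduced here purely for illustration.

    $$R(a,b)=\sum (y_i - ax_i - b)^2$$

    At the minimum, both partial derivatives are zero:

    $$\dfrac{\partial R}{\partial b}=-2\sum (y_i - ax_i - b)=0 \quad\Rightarrow\quad b=\dfrac{\sum y_i}{n}-a\dfrac{\sum x_i}{n}=\bar{y}-a\bar{x}$$

    $$\dfrac{\partial R}{\partial a}=-2\sum x_i(y_i - ax_i - b)=0 \quad\Rightarrow\quad \sum x_iy_i - a\sum x_i^2 - b\sum x_i=0$$

    Substituting \(b=\dfrac{\sum y_i - a\sum x_i}{n}\) into the second equation gives

    $$\left(\sum x_iy_i - \dfrac{\sum x_i \sum y_i}{n}\right) - a\left(\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right)=0$$

    that is, \(S_{xy}-aS_{xx}=0\), so \(a=\dfrac{S_{xy}}{S_{xx}}\).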

    Least Squares Linear Regression solved example

    You are now ready to apply this method to a possible exam question.

    The number of hours students studied and their exam results are recorded in the table below.

    Time studied in hours | \(1\) | \(2\) | \(3\) | \(4\) | \(5\)
    Exam result | \(49\) | \(81\) | \(71\) | \(83\) | \(99\)

    a. Calculate \(S_{xy}\) and \(S_{xx}\).
    b. Find the regression line of \(y\) on \(x\).

    c. Plot the data points and the regression line on the same graph.

    d. Interpret the meaning of \(a=10.2\) and \(b=46\) in the context of the question.

    e. Predict the grade for a student who studies for

    i) \(2.5\) hours

    ii) \(8\) hours.

    f. Comment on your answers for part e).

    Solution

    a. Using your calculator, you can easily find the following results,

    \(\sum x=15\), \(\sum x^2=55\), \(\bar{x}=3\), \(\sum xy=1251\), \(\sum y=383\), \(\sum y^2=30693\), \(\bar{y}=76.6\).

    Simply plug these results into the formulae detailed above to get the summary statistics.

    \( \begin{align} S_{xx} &=\sum x^2 - \dfrac{(\sum x)^2}{n} \\&= 55 - \dfrac{15^2}{5} \\&= 10. \end{align}\)

    \( \begin{align} S_{xy} &= \sum xy - \dfrac{\sum x \sum y}{n}\\&= 1251 - \dfrac{15 \times 383}{5} \\&= 102. \end{align}\)

    b. Starting with \(a\), the gradient of the line,

    \[a=\dfrac{S_{xy}}{S_{xx}}=\frac{102}{10}=10.2.\]

    Then, the \(y\)-intercept is

    \(b=\bar{y}-a\bar{x}=76.6-10.2 \times 3=46\).

    Therefore, the regression line is \(y=10.2x+46\).
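
    If you would like to check parts a and b without a calculator, the following is a minimal Python sketch of the same formulae applied to the raw data. The function name `least_squares_line` and the variable names are assumptions made for illustration, not part of any exam specification or library.

```python
def least_squares_line(xs, ys):
    """Return (a, b, s_xx, s_xy) for the regression line y = ax + b of y on x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    # Summary statistics, exactly as in the formulae above.
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are equal, so the gradient is undefined")
    a = s_xy / s_xx                 # gradient
    b = sum_y / n - a * sum_x / n   # y-intercept: y-bar minus a times x-bar
    return a, b, s_xx, s_xy


hours = [1, 2, 3, 4, 5]
marks = [49, 81, 71, 83, 99]

a, b, s_xx, s_xy = least_squares_line(hours, marks)
print(f"S_xx = {s_xx:.1f}, S_xy = {s_xy:.1f}")  # S_xx = 10.0, S_xy = 102.0
print(f"y = {a:.1f}x + {b:.1f}")                # y = 10.2x + 46.0
```

    The printed values agree with the working above: \(S_{xx}=10\), \(S_{xy}=102\) and \(y=10.2x+46\).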

    c. This is a great question for double-checking your working - it'll be pretty obvious if you've made any serious calculation errors!

    Figure: Least squares regression line for the example (an upward-sloping regression line through the 5 data points).

    d. Since \(a=10.2\), for every extra hour studied, the predicted exam mark increases by \(10.2\) marks.

    Since \(b=46\), if a student weren't to study at all, they would still (according to the regression line) receive 46 marks.

    e. Simply input the above numbers for \(x\).

    i) If \(x=2.5\), \(y=10.2\times 2.5+46=71.5\).

    ii) If \(x=8\), \(y=10.2\times 8+46=127.6\).

    f. There is a fundamental problem for part ii): since the exams are graded in percentages, the grade \(127.6\) doesn't exist! The truth is, for any amount of time longer than 5 hours, the data doesn't have any information on what happens to the grades of the students.

    While you could deduce that for any length of time above 5 hours, 100% would be a good prediction, this is beyond the scope of the data and the linear regression model.

    Keep in mind that a regression line should only be used to predict values that fall within the range of the data from which it was derived, i.e. interpolation.

    Making predictions outside of this range is called extrapolation, and is less reliable since the relationship may not continue in the same way beyond the data.
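
    The distinction between interpolation and extrapolation can be sketched in Python as a prediction that also reports whether \(x\) lies inside the range of the original data. The function name `predict` and its parameters are hypothetical, chosen for illustration; the line and data are those of the solved example.

```python
def predict(x, a, b, x_min, x_max):
    """Return (prediction, is_interpolation) for the line y = ax + b.

    is_interpolation is True only when x lies within [x_min, x_max],
    the range of the x values used to find the line.
    """
    return a * x + b, x_min <= x <= x_max


hours = [1, 2, 3, 4, 5]
a, b = 10.2, 46  # regression line y = 10.2x + 46 from the solved example

for x in (2.5, 8):
    y, inside = predict(x, a, b, min(hours), max(hours))
    label = "interpolation" if inside else "extrapolation, treat with caution"
    print(f"x = {x}: y = {y:.1f} ({label})")
```

    Here \(x=2.5\) is flagged as interpolation and \(x=8\) as extrapolation, matching the comments in part f. The check only says whether a prediction is within the data's range; it does not make an extrapolated value any more trustworthy.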

    The most difficult thing in this topic is making sure you enter the correct numbers into your calculator! Make sure you double-check your calculations in the exam so you don't lose easy marks.

    Least Squares Linear Regression - Key takeaways

    • A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
    • The difference between the observed dependent variable (\(y_i\)) and the predicted dependent variable is called the residual (\(\epsilon _i\)).
    • The least squares regression line of \(y\) on \(x\) is that which minimises the sum of the squares of the residuals:

      $$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$

      where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).

    • The regression line of \(y\) on \(x\) is

      $$y=ax+b$$

      where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\).

    • The summary statistics are:
      • \(S_{xy}=\sum xy - \dfrac{\sum x \sum y}{n}\)
      • \(S_{xx}=\sum x^2 - \dfrac{(\sum x)^2}{n}\)
      • \(S_{yy}=\sum y^2 - \dfrac{(\sum y)^2}{n}\)

    Frequently Asked Questions about Least Squares Linear Regression

    How do you find the least squares regression line?

    You can find the least squares regression line either from the raw data or from summary statistics.

    Is least squares the same as linear regression?

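    Not quite. Linear regression models a linear relationship between variables, and least squares is the most common method of finding the regression line for that model.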

    What is SSE and SST?

    The SSE is the sum of squares error and the SST is the sum of squares total. You do not need to know these at A-level.

    What is least squares regression for?

    Least squares regression is used for predicting a dependent variable given an independent variable using data you have collected.

    What is ordinary least squares regression analysis?

    A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.

    What is the least squares criterion for linear regression equations?

    The least squares regression line is that which minimises the sum of the squares of the residuals.
