We’re going to look at one of the easiest forms of supervised learning – and it’s important.
It’s important because if you can grasp some of the concepts behind it – like an algorithm called Gradient Descent – it becomes easier to grasp more complicated models like Neural Networks.
Machine Learning is essentially about looking for patterns in data, without being explicitly programmed to find them. After all, we might not know in advance what patterns we’re looking for. It is one of the building blocks of artificial intelligence.
For linear regression – we’re looking at problems where the output is continuous, rather than a category like ‘cancer’ or ‘not cancer’, or this species of iris flower versus that one. It’s called linear because the model is linear in its parameters; that still allows curves as well as straight lines, because we can build curved terms (like x²) into the features.
The problem we’re trying to solve is: can I find a linear relationship between some input features and the output? And if I can, can I use what I’ve learned from the example data to make a good prediction of the output for a new, unseen example?
For example, let’s look at some house prices from back when I lived in an affordable area. We’re looking at size in square feet to see if it gives an indication as to the price of the house.
| Example | Size in square feet | Price in thousands |
|---------|---------------------|--------------------|
| 1 | 1272 | 355 |
| 2 | 1385 | 290 |
| 3 | 1877 | 290 |
| 4 | 1294 | 155 |
| 5 | 873 | 125 |
| 6 | 784 | 110 |
| 7 | 801 | 100 |
| 8 | 729 | 60 |
| 9 | 422 | 55 |
| 10 | 346 | 45 |
This is just an example – in reality you might want to look at more features, like whether there is a garage or a garden, the location, nearby schools, and so on.
We’re calling size in square feet a ‘feature’, because we think it contributes to the output: the price of the house.
The data we have is called the ‘training data’, because we use it to train the model and find a hypothesis.
Because we’re looking for a linear relationship, we can plot it on a graph like this. You can see that there might be a relationship between the size in square feet, and the price of the property.
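To make this concrete, here is a minimal sketch in Python (assuming numpy and matplotlib are installed; the variable names are mine, not standard) that puts the ten training examples into arrays and scatter-plots them:

```python
import matplotlib.pyplot as plt
import numpy as np

# Training data: the ten examples from the table above.
sizes = np.array([1272, 1385, 1877, 1294, 873, 784, 801, 729, 422, 346])  # square feet
prices = np.array([355, 290, 290, 155, 125, 110, 100, 60, 55, 45])        # thousands

# Plot price against size to eyeball whether a linear relationship is plausible.
plt.scatter(sizes, prices)
plt.xlabel('Size in square feet')
plt.ylabel('Price in thousands')
plt.show()
```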
The algorithms we’re going to use for linear regression make sense if you think about the equation of a straight line: y = mx + c.
The equation describes the line in terms of its gradient – ‘m’ here – and the point where the line crosses the y axis – ‘c’.
The hypothesis
For many ML models, you will see these ‘theta’ values, written with the Greek letter θ (it looks like a zero with a line through it). This letter is used in the same way we might use ‘m’ and ‘c’.
Reminder:
 m = the gradient = θ1
 c = the y-intercept = θ0
By convention, we use θ, which is a Greek letter. There are other options; some people use the letter ‘w’. One of the reasons to move away from ‘m’ and ‘c’ is that when we start looking at lots of input features, we run out of letters. Using θ0, θ1, θ2… helps when we need to start using vectors and matrices to manage our data and speed up calculations. If you aren’t sure what I mean by vectors and matrices, just think of arrays for now.
The ‘m’ and ‘c’, the ‘w’, and the theta values all stand for the same thing, whichever we use. They are sometimes called ‘weights’, and sometimes called the ‘parameters’. They are placeholders for the actual numerical values that would give us a good approximation to the ‘real’ answers. So when we put in 1272 square feet as ‘x’, we want the hypothesis to output something close to 355.
What is the hypothesis, h(x)? It is the transformation that happens on the input data to get the output. We think, for example, that multiplying the size in square feet by some value might give a good approximation of how much the house will sell for: take x, multiply it by some θ1, and add some θ0 to get the price. In other words, h(x) = θ0 + θ1x. The reason it is written with brackets is that it is saying: there is some function on x that produces an outcome based on those values. You might see f(x) written in maths, or in some programming examples; some function, f, is operating on x. x is your input.
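In code, the hypothesis is just a one-line function. This is a sketch, with my own (hypothetical) names rather than anything standard:

```python
def hypothesis(x, theta0, theta1):
    # h(x) = theta0 + theta1 * x: a straight line in x.
    return theta0 + theta1 * x
```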
So we want to learn how to choose the best theta values. These are the ones that, when we plug them into the hypothesis, give answers that closely approximate the example data.
Let’s start with a guess as to what they could be.
Here, we can see a blue line that represents the straight line we’ve just described.
But even from one example, we can see that this guess doesn’t fit the data very well.
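We can check the same thing numerically by plugging a training example into the hypothesis. The theta values below are made-up guesses, purely for illustration:

```python
# A made-up guess at the parameters (illustrative only, not fitted to anything).
theta0, theta1 = 10.0, 0.1

# Example 1 from the training data: 1272 square feet sold for 355 (thousand).
predicted = hypothesis(1272, theta0, theta1)
print(predicted)  # 137.2, which is a long way from the actual 355
```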
We could just keep making random guesses, but with complex problems this isn’t very efficient.
Either way, what we really need is a way to ‘rank’ a guess and see how well it does, so we can compare it with other guesses.
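One common way to score a guess is the mean squared error: average the squared differences between the hypothesis’s predictions and the actual prices, so that a lower score means a better fit. Here is a minimal sketch, reusing the sizes and prices arrays from earlier:

```python
def score(theta0, theta1, x, y):
    # Mean squared error between the hypothesis's predictions and the data.
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2)

# Lower is better, so two guesses can now be compared directly.
print(score(10.0, 0.1, sizes, prices))   # our earlier made-up guess
print(score(0.0, 0.15, sizes, prices))   # another made-up guess, for comparison
```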