## Supervised Learning Algorithms

### Introduction

Supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases, the outputs y may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.

### Probabilistic Supervised Learning

Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). We have already seen that linear regression corresponds to the family
p(y | x; θ) = N(y; θᵀx, I).
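To make the connection concrete, here is a small sketch (with synthetic data of my own choosing, not from the text) showing that maximum likelihood estimation under this Gaussian family reduces to ordinary least squares, which has a closed-form solution via the normal equations:

```python
import numpy as np

# Synthetic regression problem: targets drawn from N(theta^T x, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # design matrix: 100 samples, 3 features
true_theta = np.array([1.5, -2.0, 0.5])     # illustrative "true" parameters
y = X @ true_theta + rng.normal(size=100)   # targets with unit-variance noise

# Maximizing the Gaussian likelihood is equivalent to minimizing squared
# error, so the MLE is given by the normal equations: (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # should be close to true_theta
```

With enough samples, `theta_hat` recovers the parameters used to generate the data, up to noise.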
We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.
The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:
p(y = 1 | x; θ) = σ(θᵀx).
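A minimal sketch of this squashing step (the weights and input below are hypothetical, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter vector and input, for illustration only.
theta = np.array([2.0, -1.0])
x = np.array([0.5, 0.25])

p_class1 = sigmoid(theta @ x)  # interpreted as p(y = 1 | x; theta)
p_class0 = 1.0 - p_class1      # the two class probabilities sum to 1
```

No matter how large or small the linear output θᵀx becomes, the sigmoid keeps the resulting value inside (0, 1), so it is always a valid probability.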
This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression). In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat harder; there is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent. This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
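The procedure described above can be sketched as follows. This is a minimal gradient-descent fit of logistic regression by minimizing the NLL; the synthetic data, learning rate, and iteration count are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data: labels drawn with
# probability sigmoid(true_theta^T x).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)

# Gradient descent on the mean negative log-likelihood. For logistic
# regression the gradient has the simple form X^T (p - y) / n.
theta = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)
    theta -= learning_rate * grad
```

Because the NLL of logistic regression is convex in θ, plain gradient descent with a reasonable step size converges toward the maximum likelihood solution.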