## Supervised Learning Algorithms

### Introduction

Supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases, the outputs y may be difficult to collect automatically and must instead be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.

### Probabilistic Supervised Learning

Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). We have already seen that linear regression corresponds to the family

p(y | x; θ) = N(y; θᵀx, I).
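As a concrete sketch (not from the original text): with the variance fixed at I, maximizing the likelihood of this Gaussian model reduces to ordinary least squares, which can be solved via the normal equations. The data and variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = true_theta . x + unit-variance Gaussian noise,
# matching the model p(y | x; theta) = N(y; theta^T x, I).
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(size=100)

# Under a fixed-variance Gaussian, maximum likelihood estimation
# is equivalent to least squares, solved here via the normal
# equations: (X^T X) theta = X^T y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
```

With 100 noisy samples, `theta_mle` lands close to `true_theta`, illustrating that the maximum likelihood estimate recovers the weights of the linear model.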

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always lie between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

p(y = 1 | x; θ) = σ(θᵀx).
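A minimal sketch of this parametrization, with illustrative values for θ and x (not from the original text):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes any real z into the interval (0, 1).
    # (A numerically stable variant would branch on the sign of z.)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([2.0, -1.0])   # illustrative weights
x = np.array([0.5, 0.25])       # illustrative input

# p(y = 1 | x; theta) = sigmoid(theta^T x)
p_class1 = sigmoid(theta @ x)

# The probability of class 1 determines the probability of class 0,
# since the two must sum to 1.
p_class0 = 1.0 - p_class1
```

Whatever value θᵀx takes, the sigmoid maps it into (0, 1), so it can always be read as a valid probability for class 1.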

This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression). In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult: there is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent. This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
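The NLL-minimization strategy can be sketched as follows; the toy data, learning rate, and iteration count are arbitrary choices for illustration, not part of the original text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classification data drawn from a logistic model.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -2.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=200)).astype(float)

# Minimize the negative log-likelihood by batch gradient descent.
# For logistic regression, the gradient of the mean NLL is
# X^T (sigmoid(X theta) - y) / n.
theta = np.zeros(2)
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)
    theta -= lr * grad
```

Because the logistic regression NLL is convex in θ, gradient descent converges toward the maximum likelihood weights even though no closed-form solution exists.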
