Supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
Probabilistic Supervised Learning
Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). We have already seen that linear regression corresponds to the family
p(y | x; θ) = N(y; θᵀx, I).
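As a minimal sketch (the data and variable names here are illustrative, not from the text): under this Gaussian family with identity covariance, maximizing the likelihood over a training set reduces to ordinary least squares, which has a closed-form solution via the normal equations.

```python
import numpy as np

# Synthetic data: y = θᵀx plus Gaussian noise, matching p(y | x; θ) = N(y; θᵀx, I)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # 100 examples, 3 features
theta_true = np.array([2.0, -1.0, 0.5])               # assumed ground truth
y = X @ theta_true + rng.normal(scale=0.1, size=100)  # noisy targets

# Maximum likelihood here is least squares: solve the normal equations
# (XᵀX) θ = Xᵀ y for the parameter vector θ.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```

With a larger noise scale the estimate would be correspondingly less precise, but the estimator itself is unchanged.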
We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.
The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:
p(y = 1 | x; θ) = σ(θᵀx).
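A minimal sketch of this squashing step (the weights and input below are made up for illustration): the sigmoid σ(z) = 1 / (1 + exp(−z)) maps any real-valued score θᵀx into (0, 1), and the probability of class 0 then follows for free.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.0, -2.0])   # illustrative weights
x = np.array([0.5, 0.25])       # illustrative input; here θᵀx = 0
p_y1 = sigmoid(theta @ x)       # p(y = 1 | x; θ) = σ(0) = 0.5
p_y0 = 1.0 - p_y1               # class 0's probability is determined by class 1's
print(p_y1, p_y0)
```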
This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression). In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult: there is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent. This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional distributions over the right kind of input and output variables.
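The gradient-descent procedure can be sketched as follows. This is an assumed minimal implementation, not the text's own code; the learning rate, iteration count, and synthetic data are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data sampled from a known logistic model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
theta_true = np.array([3.0, -2.0])               # assumed ground truth
y = (rng.random(200) < sigmoid(X @ theta_true)).astype(float)

# Minimize the NLL by gradient descent; for logistic regression the
# gradient of the mean NLL is Xᵀ(σ(Xθ) − y) / n.
theta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)
    theta -= lr * grad
print(theta)  # roughly recovers the signs and ordering of theta_true
```

Note that the loop has no closed-form shortcut to replace it: unlike the normal equations for linear regression, the logistic NLL must be minimized iteratively.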