Supervised Learning Algorithms
Probabilistic Supervised Learning
Most supervised learning algorithms are based on estimating a probability distribution p(y | x). We can do this by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). We have already seen that linear regression corresponds to the family
p(y | x; θ) = N(y; θ⊤x, I).
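To make this concrete, here is a minimal sketch (in Python with NumPy; the synthetic data and variable names are illustrative, not from the text) showing that, with the covariance fixed at I, maximum likelihood estimation for this Gaussian family reduces to the familiar least-squares solution:

```python
import numpy as np

# Sketch: MLE for the Gaussian family p(y | x; theta) = N(y; theta^T x, I).
# With fixed unit covariance, maximizing the log-likelihood over theta is
# equivalent to minimizing squared error, so the MLE is the solution of
# the normal equations. Data below is synthetic, for illustration only.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(size=100)   # y ~ N(theta^T x, 1)

# Normal equations: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [2.0, -1.0, 0.5]
```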
We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.
The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:
p(y = 1 | x; θ) = σ(θ⊤x).
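The following short sketch (Python with NumPy; the weights and input are made-up values for illustration) shows the sigmoid squashing a linear score into a valid probability, with the class-0 probability recovered by complement:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2])    # illustrative weights
x = np.array([2.0, 1.0])         # one input example
p_class1 = sigmoid(theta @ x)    # p(y = 1 | x; theta)
p_class0 = 1.0 - p_class1        # p(y = 0 | x; theta), since the two sum to 1
print(p_class1, p_class0)
```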
This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression). In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult: there is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent. This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the appropriate input and output variables.
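As a sketch of that procedure (Python with NumPy; the synthetic data, learning rate, and step count are arbitrary choices for this toy example, not prescribed by the text), the code below fits logistic regression by plain gradient descent on the NLL:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    """Negative log-likelihood of the Bernoulli model p(y=1|x) = sigmoid(theta^T x)."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grad_nll(theta, X, y):
    """Gradient of the NLL with respect to theta: X^T (sigmoid(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y)

# Illustrative synthetic data: labels drawn from the model itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -2.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=200)).astype(float)

# Gradient descent on the NLL; there is no closed-form solution to jump to.
theta = np.zeros(2)
lr = 0.01  # step size, chosen by hand for this example
for step in range(1000):
    theta -= lr * grad_nll(theta, X, y)

print(theta, nll(theta, X, y))
```

In practice one would typically add a bias term, regularization, or a library optimizer; the point of the sketch is only the structure the text describes, namely minimizing the NLL by gradient descent because no closed-form solution exists.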