Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Generalization, or inferring general rules from a limited set of examples, is not logically valid: to logically infer a rule describing every member of a set, one must have information about every member of that set. In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.
Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally better than any other. The most sophisticated algorithm we can imagine has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions. This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the “real world” that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better. So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model’s capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose from. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified. The behavior of our algorithm is strongly affected not only by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input.
These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the number of those functions. We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that both functions are eligible, but one is preferred. The unpreferred solution is chosen only if it fits the training data significantly better than the preferred solution. For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize a sum comprising both the mean squared error on the training data and a criterion J(w) that expresses a preference for the weights to have a smaller squared L² norm. Specifically,
J(w) = MSE_train + λw⊤w,
where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features. In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization.
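As an illustration, the weight decay criterion above has a closed-form minimizer, so linear regression with weight decay can be sketched in a few lines of NumPy. The synthetic dataset, the helper name, and the factor-of-n scaling (which arises because the MSE averages over the n training examples) are assumptions made for this sketch, not details from the text:

```python
import numpy as np

# Illustrative synthetic data: 50 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def fit_weight_decay(X, y, lam):
    """Minimize MSE(w) + lam * w^T w.

    Setting the gradient to zero gives the closed form
    (X^T X + lam * n * I) w = X^T y; the factor n comes from
    the MSE averaging over the n training examples.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

w_no_decay = fit_weight_decay(X, y, lam=0.0)  # ordinary least squares
w_decay = fit_weight_decay(X, y, lam=1.0)     # strong preference for small weights

# Larger lambda shrinks the weights toward zero, trading some
# training-set fit for smaller weight norm.
assert np.linalg.norm(w_decay) < np.linalg.norm(w_no_decay)
```

With λ = 0 this recovers the ordinary least-squares solution; as λ grows, the fitted weights are pulled toward zero, exactly the tradeoff between fitting the training data and being small described above.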
Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization. The no free lunch theorem has made it clear that there is no best machine learning algorithm and, in particular, no best form of regularization. Instead, we must choose a form of regularization that is well suited to the particular task we wish to solve. The philosophy of deep learning in general, and of this book in particular, is that a very wide variety of tasks (such as all of the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.