Understanding Machine Learning
Machine learning is one of the fastest growing areas of computer science,
with far-reaching applications. The aim of this textbook is to introduce
machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the
fundamental ideas underlying machine learning and the mathematical
derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide
array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of
learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks,
and structured output learning; and emerging theoretical concepts such as
the PAC-Bayes approach and compression-based bounds. Designed for
an advanced undergraduate or beginning graduate course, the text makes
the fundamentals and algorithms of machine learning accessible to students and nonexpert readers in statistics, computer science, mathematics,
and engineering.
Shai Shalev-Shwartz is an Associate Professor at the School of Computer
Science and Engineering at The Hebrew University, Israel.
Shai Ben-David is a Professor in the School of Computer Science at the
University of Waterloo, Canada.

Understanding Machine Learning:
From Theory to Algorithms

Shai Shalev-Shwartz
The Hebrew University, Jerusalem

Shai Ben-David
University of Waterloo, Canada

32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.
Information on this title:
© Shai Shalev-Shwartz and Shai Ben-David 2014

This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United States of America
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data
ISBN 978-1-107-05713-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party Internet Web sites referred to in this publication,
and does not guarantee that any content on such Web sites is, or will remain,
accurate or appropriate.

Triple-S dedicates the book to triple-M




Contents

1 Introduction
1.1 What Is Learning?
1.2 When Do We Need Machine Learning?
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.6 Notation

Part 1 Foundations

2 A Gentle Start
A Formal Model – The Statistical Learning Framework
Empirical Risk Minimization
Empirical Risk Minimization with Inductive Bias

3 A Formal Learning Model
PAC Learning
A More General Learning Model
Bibliographic Remarks

4 Learning via Uniform Convergence
Uniform Convergence Is Sufficient for Learnability
Finite Classes Are Agnostic PAC Learnable
Bibliographic Remarks

5 The Bias-Complexity Tradeoff
5.1 The No-Free-Lunch Theorem
5.2 Error Decomposition
5.3 Summary
5.4 Bibliographic Remarks
5.5 Exercises

6 The VC-Dimension
Infinite-Size Classes Can Be Learnable
The VC-Dimension
The Fundamental Theorem of PAC Learning
Proof of Theorem 6.7
Bibliographic Remarks

7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam’s Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.6 Summary
7.7 Bibliographic Remarks
7.8 Exercises

8 The Runtime of Learning
Computational Complexity of Learning
Implementing the ERM Rule
Efficiently Learnable, but Not by a Proper ERM
Hardness of Learning*
Bibliographic Remarks

Part 2 From Theory to Algorithms

9 Linear Predictors
Linear Regression
Logistic Regression
Bibliographic Remarks

10 Boosting
10.1 Weak Learnability
10.2 AdaBoost
10.3 Linear Combinations of Base Hypotheses
10.4 AdaBoost for Face Recognition
10.5 Summary
10.6 Bibliographic Remarks
10.7 Exercises

11 Model Selection and Validation
Model Selection Using SRM
What to Do If Learning Fails

12 Convex Learning Problems
12.1 Convexity, Lipschitzness, and Smoothness
12.2 Convex Learning Problems
12.3 Surrogate Loss Functions
12.4 Summary
12.5 Bibliographic Remarks
12.6 Exercises

13 Regularization and Stability
13.1 Regularized Loss Minimization
13.2 Stable Rules Do Not Overfit
13.3 Tikhonov Regularization as a Stabilizer
13.4 Controlling the Fitting-Stability Tradeoff
13.5 Summary
13.6 Bibliographic Remarks
13.7 Exercises

14 Stochastic Gradient Descent
Gradient Descent
Stochastic Gradient Descent (SGD)
Learning with SGD
Bibliographic Remarks

15 Support Vector Machines
15.1 Margin and Hard-SVM
15.2 Soft-SVM and Norm Regularization
15.3 Optimality Conditions and “Support Vectors”*
Implementing Soft-SVM Using SGD
Bibliographic Remarks

16 Kernel Methods
Embeddings into Feature Spaces
The Kernel Trick
Implementing Soft-SVM with Kernels
Bibliographic Remarks

17 Multiclass, Ranking, and Complex Prediction Problems
One-versus-All and All-Pairs
Linear Multiclass Predictors
Structured Output Prediction
Bipartite Ranking and Multivariate Performance Measures
Bibliographic Remarks

18 Decision Trees
18.1 Sample Complexity
18.2 Decision Tree Algorithms
18.3 Random Forests
18.4 Summary
18.5 Bibliographic Remarks
18.6 Exercises

19 Nearest Neighbor
k Nearest Neighbors
Efficient Implementation*
Bibliographic Remarks

20 Neural Networks
Feedforward Neural Networks
Learning Neural Networks
The Expressive Power of Neural Networks
The Sample Complexity of Neural Networks
The Runtime of Learning Neural Networks
SGD and Backpropagation
20.7 Summary
20.8 Bibliographic Remarks
20.9 Exercises

Part 3 Additional Learning Models

21 Online Learning
Online Classification in the Realizable Case
Online Classification in the Unrealizable Case
Online Convex Optimization
The Online Perceptron Algorithm
Bibliographic Remarks

22 Clustering
Linkage-Based Clustering Algorithms
k-Means and Other Cost Minimization Clusterings
Spectral Clustering
Information Bottleneck*
A High Level View of Clustering
Bibliographic Remarks

23 Dimensionality Reduction
23.1 Principal Component Analysis (PCA)
23.2 Random Projections
23.3 Compressed Sensing
23.4 PCA or Compressed Sensing?
23.5 Summary
23.6 Bibliographic Remarks
23.7 Exercises

24 Generative Models
24.1 Maximum Likelihood Estimator
24.2 Naive Bayes
24.3 Linear Discriminant Analysis
24.4 Latent Variables and the EM Algorithm
24.5 Bayesian Reasoning
24.6 Summary
24.7 Bibliographic Remarks
24.8 Exercises

25 Feature Selection and Generation
25.1 Feature Selection
25.2 Feature Manipulation and Normalization
25.3 Feature Learning
25.4 Summary
25.5 Bibliographic Remarks
25.6 Exercises

Part 4 Advanced Theory

26 Rademacher Complexities
26.1 The Rademacher Complexity
26.2 Rademacher Complexity of Linear Classes
26.3 Generalization Bounds for SVM
26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
26.5 Bibliographic Remarks

27 Covering Numbers
27.1 Covering
27.2 From Covering to Rademacher Complexity via Chaining
27.3 Bibliographic Remarks

28 Proof of the Fundamental Theorem of Learning Theory
28.1 The Upper Bound for the Agnostic Case
28.2 The Lower Bound for the Agnostic Case
28.3 The Upper Bound for the Realizable Case

29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.3 Calculating the Natarajan Dimension
29.4 On Good and Bad ERMs
29.5 Bibliographic Remarks
29.6 Exercises

30 Compression Bounds
30.1 Compression Bounds
30.2 Examples
30.3 Bibliographic Remarks

31 PAC-Bayes
31.1 PAC-Bayes Bounds
31.2 Bibliographic Remarks
31.3 Exercises

Appendix A Technical Lemmas

Appendix B Measure Concentration
Markov’s Inequality
Chebyshev’s Inequality
Chernoff’s Bounds
Hoeffding’s Inequality
Bennet’s and Bernstein’s Inequalities
Slud’s Inequality
Concentration of χ² Variables

Appendix C Linear Algebra
Basic Definitions
Eigenvalues and Eigenvectors
Positive Definite Matrices
Singular Value Decomposition (SVD)

Preface

The term machine learning refers to the automated detection of meaningful patterns
in data. In the past couple of decades it has become a common tool in almost any
task that requires information extraction from large data sets. We are surrounded
by machine learning based technology: Search engines learn how to bring us the best results (while placing profitable ads), antispam software learns to filter our e-mail messages, and credit card transactions are secured by software that learns how to detect fraud. Digital cameras learn to detect faces, and intelligent personal assistant applications on smartphones learn to recognize voice commands. Cars
are equipped with accident prevention systems that are built using machine learning
algorithms. Machine learning is also widely used in scientific applications such as
bioinformatics, medicine, and astronomy.
One common feature of all of these applications is that, in contrast to more traditional uses of computers, in these cases, due to the complexity of the patterns that need to be detected, a human programmer cannot provide an explicit, fine-detailed specification of how such tasks should be executed. Taking a cue from intelligent beings, many of our skills are acquired or refined through learning from our experience (rather than by following explicit instructions given to us). Machine learning tools are concerned with endowing programs with the ability to “learn” and adapt.
The first goal of this book is to provide a rigorous, yet easy to follow, introduction
to the main concepts underlying machine learning: What is learning? How can a
machine learn? How do we quantify the resources needed to learn a given concept?
Is learning always possible? Can we know whether the learning process succeeded or failed?
The second goal of this book is to present several key machine learning algorithms. We chose to present algorithms that on the one hand are successfully used in
practice and on the other hand give a wide spectrum of different learning techniques. Additionally, we pay specific attention to algorithms appropriate for large
scale learning (a.k.a. “Big Data”), since in recent years, our world has become
increasingly “digitized” and the amount of data available for learning is dramatically increasing. As a result, in many applications data is plentiful and computation




time is the main bottleneck. We therefore explicitly quantify both the amount of
data and the amount of computation time needed to learn a given concept.
The book is divided into four parts. The first part aims at giving an initial rigorous answer to the fundamental questions of learning. We describe a generalization
of Valiant’s Probably Approximately Correct (PAC) learning model, which is a first
solid answer to the question “What is learning?” We describe the Empirical Risk
Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Description Length (MDL) learning rules, which show “how a machine can learn.” We
quantify the amount of data needed for learning using the ERM, SRM, and MDL
rules and show how learning might fail by deriving a “no-free-lunch” theorem. We
also discuss how much computation time is required for learning. In the second part
of the book we describe various learning algorithms. For some of the algorithms,
we first present a more general learning principle, and then show how the algorithm
follows the principle. While the first two parts of the book focus on the PAC model,
the third part extends the scope by presenting a wider variety of learning models.
Finally, the last part of the book is devoted to advanced theory.
We made an attempt to keep the book as self-contained as possible. However,
the reader is assumed to be comfortable with basic notions of probability, linear
algebra, analysis, and algorithms. The first three parts of the book are intended
for first year graduate students in computer science, engineering, mathematics, or
statistics. It can also be accessible to undergraduate students with an adequate background. The more advanced chapters can be used by researchers seeking a deeper theoretical understanding.

The book is based on Introduction to Machine Learning courses taught by Shai
Shalev-Shwartz at the Hebrew University and by Shai Ben-David at the University
of Waterloo. The first draft of the book grew out of the lecture notes for the course
that was taught at the Hebrew University by Shai Shalev-Shwartz during 2010–2013.
We greatly appreciate the help of Ohad Shamir, who served as a TA for the course
in 2010, and of Alon Gonen, who served as a TA for the course in 2011–2013. Ohad
and Alon prepared a few lecture notes and many of the exercises. Alon, to whom
we are indebted for his help throughout the entire making of the book, has also
prepared a solution manual.
We are deeply grateful for the most valuable work of Dana Rubinstein. Dana
has scientifically proofread and edited the manuscript, transforming it from lecture-based chapters into fluent and coherent text.
Special thanks to Amit Daniely, who helped us with a careful read of the
advanced part of the book and wrote the advanced chapter on multiclass learnability. We are also grateful to the members of a book reading club in Jerusalem who
have carefully read and constructively criticized every line of the manuscript. The
members of the reading club are Maya Alroy, Yossi Arjevani, Aharon Birnbaum,
Alon Cohen, Alon Gonen, Roi Livni, Ofer Meshi, Dan Rosenbaum, Dana Rubinstein, Shahar Somin, Alon Vinnikov, and Yoav Wald. We would also like to thank
Gal Elidan, Amir Globerson, Nika Haghtalab, Shie Mannor, Amnon Shashua, Nati
Srebro, and Ruth Urner for helpful discussions.


1 Introduction

The subject of this book is automated learning, or, as we will more often call it,
Machine Learning (ML). That is, we wish to program computers so that they can
“learn” from input available to them. Roughly speaking, learning is the process of
converting experience into expertise or knowledge. The input to a learning algorithm is training data, representing experience, and the output is some expertise,
which usually takes the form of another computer program that can perform some
task. Seeking a formal-mathematical understanding of this concept, we’ll have to
be more explicit about what we mean by each of the involved terms: What is the
training data our programs will access? How can the process of learning be automated? How can we evaluate the success of such a process (namely, the quality of
the output of a learning program)?

1.1 What Is Learning?

Let us begin by considering a couple of examples from naturally occurring animal
learning. Some of the most fundamental issues in ML arise already in that context,
which we are all familiar with.
Bait Shyness – Rats Learning to Avoid Poisonous Baits: When rats encounter food items with a novel look or smell, they will first eat very small amounts, and subsequent feeding will depend on the flavor of the food and its physiological effect.
If the food produces an ill effect, the novel food will often be associated with the
illness, and subsequently, the rats will not eat it. Clearly, there is a learning mechanism in play here – the animal used past experience with some food to acquire
expertise in detecting the safety of this food. If past experience with the food was
negatively labeled, the animal predicts that it will also have a negative effect when
encountered in the future.
Inspired by the preceding example of successful learning, let us demonstrate
a typical machine learning task. Suppose we would like to program a machine that
learns how to filter spam e-mails. A naive solution would be seemingly similar to the
way rats learn how to avoid poisonous baits. The machine will simply memorize all
previous e-mails that had been labeled as spam e-mails by the human user. When a



new e-mail arrives, the machine will search for it in the set of previous spam e-mails.
If it matches one of them, it will be trashed. Otherwise, it will be moved to the user’s
inbox folder.
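The memorization scheme just described can be sketched in a few lines of code. This is only an illustrative sketch of the naive approach, not an implementation from the book; all function and variable names are hypothetical.

```python
# Naive "learning by memorization" spam filter: store the exact text of every
# message the user labeled as spam, and trash a new message only on an exact match.

def train_memorizer(labeled_emails):
    """Collect the verbatim text of all training messages labeled 'spam'."""
    return {text for text, label in labeled_emails if label == "spam"}

def classify(seen_spam, email):
    """Trash an e-mail only if it exactly matches a previously seen spam message."""
    return "spam" if email in seen_spam else "inbox"

seen = train_memorizer([("buy cheap meds", "spam"), ("lunch tomorrow?", "ham")])
print(classify(seen, "buy cheap meds"))   # exact match -> "spam"
print(classify(seen, "buy cheap meds!"))  # unseen variant -> "inbox"
```

The second call makes the limitation concrete: any message not seen verbatim during training, however similar to known spam, slips through.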
While the preceding “learning by memorization” approach is sometimes useful,
it lacks an important aspect of learning systems – the ability to label unseen e-mail
messages. A successful learner should be able to progress from individual examples
to broader generalization. This is also referred to as inductive reasoning or inductive
inference. In the bait shyness example presented previously, after the rats encounter
an example of a certain type of food, they apply their attitude toward it on new,
unseen examples of food of similar smell and taste. To achieve generalization in the
spam filtering task, the learner can scan the previously seen e-mails, and extract a set
of words whose appearance in an e-mail message is indicative of spam. Then, when
a new e-mail arrives, the machine can check whether one of the suspicious words
appears in it, and predict its label accordingly. Such a system would potentially be
able to correctly predict the label of unseen e-mails.
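The suspicious-words idea can likewise be sketched in code. This is a minimal illustration under simplifying assumptions (whitespace tokenization, and "suspicious" taken to mean a word that appears in spam training messages but never in non-spam ones); the names are hypothetical, not from the book.

```python
# Generalizing spam filter: extract a set of words indicative of spam from the
# labeled training e-mails, then flag any new message containing one of them.

def suspicious_words(labeled_emails):
    """Words occurring in spam training messages but never in non-spam ones."""
    spam_words, ham_words = set(), set()
    for text, label in labeled_emails:
        (spam_words if label == "spam" else ham_words).update(text.lower().split())
    return spam_words - ham_words

def classify(words, email):
    """Predict 'spam' if the message contains any suspicious word."""
    return "spam" if words & set(email.lower().split()) else "inbox"

words = suspicious_words([("cheap meds now", "spam"), ("meeting now", "ham")])
print(classify(words, "get cheap pills"))  # contains "cheap" -> "spam"
print(classify(words, "meeting at noon"))  # no suspicious words -> "inbox"
```

Unlike the memorizer, this filter labels e-mails it has never seen, which is exactly the generalization step discussed above.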
However, inductive reasoning might lead us to false conclusions. To illustrate
this, let us consider again an example from animal learning.
Pigeon Superstition: In an experiment performed by the psychologist
B. F. Skinner, he placed a bunch of hungry pigeons in a cage. An automatic mechanism had been attached to the cage, delivering food to the pigeons at regular
intervals with no reference whatsoever to the birds’ behavior. The hungry pigeons
went around the cage, and when food was first delivered, it found each pigeon
engaged in some activity (pecking, turning the head, etc.). The arrival of food reinforced each bird’s specific action, and consequently, each bird tended to spend some
more time doing that very same action. That, in turn, increased the chance that the
next random food delivery would find each bird engaged in that activity again. What
results is a chain of events that reinforces the pigeons’ association of the delivery of
the food with whatever chance actions they had been performing when it was first
delivered. They subsequently continue to perform these same actions diligently.
What distinguishes learning mechanisms that result in superstition from useful
learning? This question is crucial to the development of automated learners. While
human learners can rely on common sense to filter out random meaningless learning
conclusions, once we export the task of learning to a machine, we must provide
well defined crisp principles that will protect the program from reaching senseless
or useless conclusions. The development of such principles is a central goal of the
theory of machine learning.
What, then, made the rats’ learning more successful than that of the pigeons?
As a first step toward answering this question, let us have a closer look at the bait
shyness phenomenon in rats.
Bait Shyness revisited – rats fail to acquire conditioning between food and electric
shock or between sound and nausea: The bait shyness mechanism in rats turns out to
be more complex than what one may expect. In experiments carried out by Garcia
(Garcia & Koelling 1966), it was demonstrated that if the unpleasant stimulus that
follows food consumption is replaced by, say, electrical shock (rather than nausea),
then no conditioning occurs. Even after repeated trials in which the consumption



of some food is followed by the administration of unpleasant electrical shock, the
rats do not tend to avoid that food. Similar failure of conditioning occurs when the
characteristic of the food that implies nausea (such as taste or smell) is replaced
by a vocal signal. The rats seem to have some “built in” prior knowledge telling
them that, while temporal correlation between food and nausea can be causal, it is
unlikely that there would be a causal relationship between food consumption and
electrical shocks or between sounds and nausea.
We conclude that one distinguishing feature between the bait shyness learning and the pigeon superstition is the incorporation of prior knowledge that biases
the learning mechanism. This is also referred to as inductive bias. The pigeons in
the experiment are willing to adopt any explanation for the occurrence of food.
However, the rats “know” that food cannot cause an electric shock and that the
co-occurrence of noise with some food is not likely to affect the nutritional value
of that food. The rats’ learning process is biased toward detecting some kind of
patterns while ignoring other temporal correlations between events.
It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and
proved as the “No-Free-Lunch theorem” in Chapter 5). The development of tools
for expressing domain expertise, translating it into a learning bias, and quantifying
the effect of such a bias on the success of learning is a central theme of the theory
of machine learning. Roughly speaking, the stronger the prior knowledge (or prior
assumptions) that one starts the learning process with, the easier it is to learn from
further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions.
We shall discuss these issues explicitly in Chapter 5.

1.2 When Do We Need Machine Learning?

When do we need machine learning rather than directly programming our computers to
carry out the task at hand? Two aspects of a given problem may call for the use of
programs that learn and improve on the basis of their “experience”: the problem’s
complexity and the need for adaptivity.
Tasks That Are Too Complex to Program.
• Tasks Performed by Animals/Humans: There are numerous tasks that we human beings perform routinely, yet our introspection concerning how we do them is not sufficiently elaborate to extract a well defined program. Examples of such tasks include driving, speech recognition, and image understanding. In all of these tasks, state-of-the-art machine learning programs, programs that “learn from their experience,” achieve quite satisfactory results once exposed to sufficiently many training examples.
• Tasks beyond Human Capabilities: Another wide family of tasks that benefit from machine learning techniques are related to the analysis of very large and complex data sets: astronomical data, turning medical archives into medical knowledge, weather prediction, analysis of genomic data, Web search engines, and electronic commerce. With more and more digitally recorded data available, it becomes obvious that there are treasures of meaningful information buried in data archives that are way too large and too complex for humans to make sense of. Learning to detect meaningful patterns in large and complex data sets is a promising domain in which the combination of programs that learn with the almost unlimited memory capacity and ever-increasing processing speed of computers opens up new horizons.
Adaptivity. One limiting feature of programmed tools is their rigidity – once the
program has been written down and installed, it stays unchanged. However,
many tasks change over time or from one user to another. Machine learning
tools – programs whose behavior adapts to their input data – offer a solution to
such issues; they are, by nature, adaptive to changes in the environment they
interact with. Typical successful applications of machine learning to such problems include programs that decode handwritten text, where a fixed program can
adapt to variations between the handwriting of different users; spam detection
programs, adapting automatically to changes in the nature of spam e-mails; and
speech recognition programs.

1.3 Types of Learning

Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched into several subfields dealing with different types of learning tasks. We give a rough taxonomy of learning paradigms, aiming to provide some perspective of where the content of this book sits within the wide field of machine learning tasks.
We describe four parameters along which learning paradigms can be classified.
Supervised versus Unsupervised Since learning involves an interaction between the
learner and the environment, one can divide learning tasks according to the
nature of that interaction. The first distinction to note is the difference between
supervised and unsupervised learning. As an illustrative example, consider the
task of learning to detect spam e-mail versus the task of anomaly detection.
For the spam detection task, we consider a setting in which the learner receives
training e-mails for which the label spam/not-spam is provided. On the basis of
such training the learner should figure out a rule for labeling a newly arriving
e-mail message. In contrast, for the task of anomaly detection, all the learner
gets as training is a large body of e-mail messages (with no labels) and the
learner’s task is to detect “unusual” messages.
More abstractly, viewing learning as a process of “using experience to gain
expertise,” supervised learning describes a scenario in which the “experience,”
a training example, contains significant information (say, the spam/not-spam
labels) that is missing in the unseen “test examples” to which the learned expertise is to be applied. In this setting, the acquired expertise is aimed to predict
that missing information for the test data. In such cases, we can think of the
environment as a teacher that “supervises” the learner by providing the extra
information (labels). In unsupervised learning, however, there is no distinction
between training and test data. The learner processes input data with the goal


of coming up with some summary, or compressed version, of that data. Clustering a data set into subsets of similar objects is a typical example of such a task.
There is also an intermediate learning setting in which, while the training examples contain more information than the test examples, the learner is required to predict even more information for the test examples. For example, one may try to learn a value function that describes for each setting of a chess board the degree by which White’s position is better than Black’s.
Yet, the only information available to the learner at training time is positions
that occurred throughout actual chess games, labeled by who eventually won
that game. Such learning frameworks are mainly investigated under the title of
reinforcement learning.
Active versus Passive Learners Learning paradigms can vary by the role played
by the learner. We distinguish between “active” and “passive” learners. An
active learner interacts with the environment at training time, say, by posing
queries or performing experiments, while a passive learner only observes the
information provided by the environment (or the teacher) without influencing or directing it. Note that the learner of a spam filter is usually passive
– waiting for users to mark the e-mails coming to them. In an active setting, one could imagine asking users to label specific e-mails chosen by the
learner, or even composed by the learner, to enhance its understanding of what
spam is.
Helpfulness of the Teacher When one thinks about human learning, of a baby at
home or a student at school, the process often involves a helpful teacher, who
is trying to feed the learner with the information most useful for achieving
the learning goal. In contrast, when a scientist learns about nature, the environment, playing the role of the teacher, can be best thought of as passive –
apples drop, stars shine, and the rain falls without regard to the needs of the
learner. We model such learning scenarios by postulating that the training data
(or the learner’s experience) is generated by some random process. This is the
basic building block in the branch of “statistical learning.” Finally, learning also
occurs when the learner’s input is generated by an adversarial “teacher.” This
may be the case in the spam filtering example (if the spammer makes an effort
to mislead the spam filtering designer) or in learning to detect fraud. One also
uses an adversarial teacher model as a worst-case scenario, when no milder
setup can be safely assumed. If you can learn against an adversarial teacher, you are guaranteed to succeed no matter what teacher you interact with.
Online versus Batch Learning Protocol The last parameter we mention is the distinction between situations in which the learner has to respond online, throughout the learning process, and settings in which the learner has to engage the
acquired expertise only after having a chance to process large amounts of data.
For example, a stockbroker has to make daily decisions, based on the experience collected so far. He may become an expert over time, but might have
made costly mistakes in the process. In contrast, in many data mining settings,
the learner – the data miner – has large amounts of training data to play with
before having to output conclusions.




In this book we shall discuss only a subset of the possible learning paradigms.
Our main focus is on supervised statistical batch learning with a passive learner
(for example, trying to learn how to generate patients’ prognoses, based on large
archives of records of patients that were independently collected and are already
labeled by the fate of the recorded patients). We shall also briefly discuss online
learning and batch unsupervised learning (in particular, clustering).

1.4 Relations to Other Fields

As an interdisciplinary field, machine learning shares common threads with the
mathematical fields of statistics, information theory, game theory, and optimization.
It is naturally a subfield of computer science, as our goal is to program machines so
that they will learn. In a sense, machine learning can be viewed as a branch of AI
(Artificial Intelligence), since, after all, the ability to turn experience into expertise or to detect meaningful patterns in complex sensory data is a cornerstone of
human (and animal) intelligence. However, one should note that, in contrast with
traditional AI, machine learning is not trying to build automated imitation of intelligent behavior, but rather to use the strengths and special abilities of computers
to complement human intelligence, often performing tasks that fall way beyond
human capabilities. For example, the ability to scan and process huge databases
allows machine learning programs to detect patterns that are outside the scope of
human perception.
The component of experience, or training, in machine learning often refers to
data that is randomly generated. The task of the learner is to process such randomly
generated examples toward drawing conclusions that hold for the environment from
which these examples are picked. This description of machine learning highlights its
close relationship with statistics. Indeed there is a lot in common between the two
disciplines, in terms of both the goals and techniques used. There are, however, a
few significant differences of emphasis; if a doctor comes up with the hypothesis
that there is a correlation between smoking and heart disease, it is the statistician’s
role to view samples of patients and check the validity of that hypothesis (this is the
common statistical task of hypothesis testing). In contrast, machine learning aims
to use the data gathered from samples of patients to come up with a description of
the causes of heart disease. The hope is that automated techniques may be able to
figure out meaningful patterns (or hypotheses) that may have been missed by the
human observer.
In contrast with traditional statistics, in machine learning in general, and in this
book in particular, algorithmic considerations play a major role. Machine learning
is about the execution of learning by computers; hence algorithmic issues are pivotal. We develop algorithms to perform the learning tasks and are concerned with
their computational efficiency. Another difference is that while statistics is often
interested in asymptotic behavior (like the convergence of sample-based statistical estimates as the sample sizes grow to infinity), the theory of machine learning
focuses on finite sample bounds. Namely, given the size of available samples,
machine learning theory aims to figure out the degree of accuracy that a learner
can expect on the basis of such samples.

There are further differences between these two disciplines, of which we shall
mention only one more here. While in statistics it is common to work under the
assumption of certain presubscribed data models (such as assuming the normality of
data-generating distributions, or the linearity of functional dependencies), in
machine learning the emphasis is on working under a “distribution-free” setting,
where the learner assumes as little as possible about the nature of the data
distribution and allows the learning algorithm to figure out which models best
approximate the data-generating process. A precise discussion of this issue requires
some technical preliminaries, and we will come back to it later in the book, and in
particular in Chapter 5.

1.5 How to Read This Book

The first part of the book provides the basic theoretical principles that underlie
machine learning (ML). In a sense, this is the foundation upon which the rest of
the book is built. This part could serve as a basis for a minicourse on the theoretical
foundations of ML.
The second part of the book introduces the most commonly used algorithmic
approaches to supervised machine learning. A subset of these chapters may also be
used for introducing machine learning in a general AI course to computer science,
mathematics, or engineering students.
The third part of the book extends the scope of discussion from statistical classification to other learning models. It covers online learning, unsupervised learning,
dimensionality reduction, generative models, and feature learning.
The fourth part of the book, Advanced Theory, is geared toward readers who
have interest in research and provides the more technical mathematical techniques
that serve to analyze and drive forward the field of theoretical machine learning.
The Appendixes provide some technical tools used in the book. In particular, we
list basic results from measure concentration and linear algebra.
A few sections are marked by an asterisk, which means they are addressed
to more advanced students. Each chapter is concluded with a list of exercises. A
solution manual is provided in the course Web site.

1.5.1 Possible Course Plans Based on This Book
A 14 Week Introduction Course for Graduate Students:

Chapters 2–4.
Chapter 9 (without the VC calculation).
Chapters 5–6 (without proofs).
Chapter 10.
Chapters 7, 11 (without proofs).
Chapters 12, 13 (with some of the easier proofs).
Chapter 14 (with some of the easier proofs).
Chapter 15.
Chapter 16.
Chapter 18.





Chapter 22.
Chapter 23 (without proofs for compressed sensing).
Chapter 24.
Chapter 25.

A 14 Week Advanced Course for Graduate Students:

Chapters 26, 27.
Chapters 6, 28.
Chapter 7.
Chapter 31.
Chapter 30.
Chapters 12, 13.
Chapter 14.
Chapter 8.
Chapter 17.
Chapter 29.
Chapter 19.
Chapter 20.
Chapter 21.

1.6 Notation

Most of the notation we use throughout the book is either standard or defined on
the spot. In this section we describe our main conventions and provide a table summarizing our notation (Table 1.1). The reader is encouraged to skip this section and
return to it if during the reading of the book some notation is unclear.
We denote scalars and abstract objects with lowercase letters (e.g. x and λ).
Often, we would like to emphasize that some object is a vector and then we use
boldface letters (e.g. x and λ). The ith element of a vector x is denoted by x_i. We use
uppercase letters to denote matrices, sets, and sequences. The meaning should be
clear from the context. As we will see momentarily, the input of a learning algorithm
is a sequence of training examples. We denote by z an abstract example and by
S = z_1, . . . , z_m a sequence of m examples. Historically, S is often referred to as a
training set; however, we will always assume that S is a sequence rather than a set.
A sequence of m vectors is denoted by x_1, . . . , x_m. The ith element of x_t is denoted
by x_{t,i}.
Throughout the book, we make use of basic notions from probability. We denote
by D a distribution over some set,² for example, Z. We use the notation z ∼ D to
denote that z is sampled according to D. Given a random variable f : Z → R, its
expected value is denoted by E_{z∼D}[f(z)]. We sometimes use the shorthand E[f]
when the dependence on z is clear from the context. For f : Z → {true, false} we
also use P_{z∼D}[f(z)] to denote D({z : f(z) = true}). In the next chapter we will also

To be mathematically precise, D should be defined over some σ -algebra of subsets of Z . The user who
is not familiar with measure theory can skip the few footnotes and remarks regarding more formal
measurability definitions and assumptions.

Table 1.1. Summary of notation

R                           the set of real numbers
R^d                         the set of d-dimensional vectors over R
R_+                         the set of non-negative real numbers
N                           the set of natural numbers
O, o, Θ, ω, Ω, Õ            asymptotic notation (see text)
1[Boolean expression]       indicator function (equals 1 if expression is true and 0 o.w.)
[a]_+                       = max{0, a}
[n]                         the set {1, . . . , n} (for n ∈ N)
x, v, w                     (column) vectors
x_i, v_i, w_i               the ith element of a vector
⟨x, v⟩                      = Σ_{i=1}^d x_i v_i (inner product)
‖x‖_2 or ‖x‖                = √⟨x, x⟩ (the ℓ2 norm of x)
‖x‖_1                       = Σ_{i=1}^d |x_i| (the ℓ1 norm of x)
‖x‖_∞                       = max_i |x_i| (the ℓ∞ norm of x)
‖x‖_0                       the number of nonzero elements of x
A ∈ R^{d,k}                 a d × k matrix over R
A^⊤                         the transpose of A
A_{i,j}                     the (i, j) element of A
x x^⊤                       the d × d matrix A s.t. A_{i,j} = x_i x_j (where x ∈ R^d)
x_1, . . . , x_m            a sequence of m vectors
x_{i,j}                     the jth element of the ith vector in the sequence
w^(1), . . . , w^(T)        the values of a vector w during an iterative algorithm
w_i^(t)                     the ith element of the vector w^(t)
X                           instances domain (a set)
Y                           labels domain (a set)
Z                           examples domain (a set)
H                           hypothesis class (a set)
ℓ : H × Z → R_+             loss function
D                           a distribution over some set (usually over Z or over X)
D(A)                        the probability of a set A ⊆ Z according to D
z ∼ D                       sampling z according to D
S = z_1, . . . , z_m        a sequence of m examples
S ∼ D^m                     sampling S = z_1, . . . , z_m i.i.d. according to D
P, E                        probability and expectation of a random variable
P_{z∼D}[f(z)]               = D({z : f(z) = true}) for f : Z → {true, false}
E_{z∼D}[f(z)]               expectation of the random variable f : Z → R
N(µ, C)                     Gaussian distribution with expectation µ and covariance C
f'(x)                       the derivative of a function f : R → R at x
f''(x)                      the second derivative of a function f : R → R at x
∂f(w)/∂w_i                  the partial derivative of a function f : R^d → R at w w.r.t. w_i
∇f(w)                       the gradient of a function f : R^d → R at w
∂f(w)                       the differential set of a function f : R^d → R at w
min_{x∈C} f(x)              = min{f(x) : x ∈ C} (minimal value of f over C)
max_{x∈C} f(x)              = max{f(x) : x ∈ C} (maximal value of f over C)
argmin_{x∈C} f(x)           the set {x ∈ C : f(x) = min_{z∈C} f(z)}
argmax_{x∈C} f(x)           the set {x ∈ C : f(x) = max_{z∈C} f(z)}
log                         the natural logarithm




introduce the notation D^m to denote the probability over Z^m induced by sampling
(z_1, . . . , z_m) where each point z_i is sampled from D independently of the other points.
In general, we have made an effort to avoid asymptotic notation. However, we
occasionally use it to clarify the main results. In particular, given f : R → R_+ and
g : R → R_+ we write f = O(g) if there exist x_0, α ∈ R_+ such that for all x > x_0 we
have f(x) ≤ α g(x). We write f = o(g) if for every α > 0 there exists x_0 such that for
all x > x_0 we have f(x) ≤ α g(x). We write f = Ω(g) if there exist x_0, α ∈ R_+ such that
for all x > x_0 we have f(x) ≥ α g(x). The notation f = ω(g) is defined analogously.
The notation f = Θ(g) means that f = O(g) and g = O(f). Finally, the notation
f = Õ(g) means that there exists k ∈ N such that f(x) = O(g(x) log^k(g(x))).
The inner product between vectors x and w is denoted by ⟨x, w⟩. Whenever we
do not specify the vector space we assume that it is the d-dimensional Euclidean
space and then ⟨x, w⟩ = Σ_{i=1}^d x_i w_i. The Euclidean (or ℓ2) norm of a vector w is
‖w‖_2 = √⟨w, w⟩. We omit the subscript from the ℓ2 norm when it is clear from the
context. We also use other ℓp norms, ‖w‖_p = (Σ_i |w_i|^p)^{1/p}, and in particular
‖w‖_1 = Σ_i |w_i|.
We use the notation min_{x∈C} f(x) to denote the minimum value of the set
{f(x) : x ∈ C}. To be mathematically more precise, we should use inf_{x∈C} f(x)
whenever the minimum is not achievable. However, in the context of this book the
distinction between infimum and minimum is often of little interest. Hence, to
simplify the presentation, we sometimes use the min notation even when inf is more
adequate. An analogous remark applies to max versus sup.
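These conventions translate directly into code. The following short sketch (assuming NumPy is available; the two vectors are arbitrary examples) computes the inner product and the norms listed above:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
w = np.array([1.0, 2.0, 2.0])

inner = np.dot(x, w)            # <x, w> = sum_i x_i * w_i
l2 = np.sqrt(np.dot(x, x))      # ||x||_2 = sqrt(<x, x>)
l1 = np.sum(np.abs(x))          # ||x||_1 = sum_i |x_i|
linf = np.max(np.abs(x))        # ||x||_inf = max_i |x_i|
l0 = np.count_nonzero(x)        # ||x||_0 = number of nonzero elements

print(inner, l2, l1, linf, l0)  # -5.0 5.0 7.0 4.0 2
```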



A Gentle Start

Let us begin our mathematical analysis by showing how successful learning can be
achieved in a relatively simplified setting. Imagine you have just arrived in some
small Pacific island. You soon find out that papayas are a significant ingredient in the
local diet. However, you have never before tasted papayas. You have to learn how
to predict whether a papaya you see in the market is tasty or not. First, you need
to decide which features of a papaya your prediction should be based on. On the
basis of your previous experience with other fruits, you decide to use two features:
the papaya’s color, ranging from dark green, through orange and red to dark brown,
and the papaya’s softness, ranging from rock hard to mushy. Your input for figuring
out your prediction rule is a sample of papayas that you have examined for color
and softness and then tasted and found out whether they were tasty or not. Let
us analyze this task as a demonstration of the considerations involved in learning problems.
Our first step is to describe a formal model aimed to capture such learning tasks.

2.1 A Formal Model – The Statistical Learning Framework

The learner’s input: In the basic statistical learning setting, the learner has access
to the following:
Domain set: An arbitrary set, X . This is the set of objects that we may wish
to label. For example, in the papaya learning problem mentioned before,
the domain set will be the set of all papayas. Usually, these domain
points will be represented by a vector of features (like the papaya’s color
and softness). We also refer to domain points as instances and to X as
instance space.
Label set: For our current discussion, we will restrict the label set to be a
two-element set, usually {0, 1} or {−1, +1}. Let Y denote our set of possible labels. For our papayas example, let Y be {0, 1}, where 1 represents
being tasty and 0 stands for being not-tasty.
Training data: S = ((x 1 , y1 ) . . . (x m , ym )) is a finite sequence of pairs in X × Y:
that is, a sequence of labeled domain points. This is the input that the


A Gentle Start

learner has access to (like a set of papayas that have been tasted and their
color, softness, and tastiness). Such labeled examples are often called
training examples. We sometimes also refer to S as a training set.1
The learner’s output: The learner is requested to output a prediction rule,
h : X → Y. This function is also called a predictor, a hypothesis, or a classifier.
The predictor can be used to predict the label of new domain points. In our
papayas example, it is a rule that our learner will employ to predict whether
future papayas he examines in the farmers’ market are going to be tasty or not.
We use the notation A(S) to denote the hypothesis that a learning algorithm,
A, returns upon receiving the training sequence S.
A simple data-generation model We now explain how the training data is generated. First, we assume that the instances (the papayas we encounter) are
generated by some probability distribution (in this case, representing the
environment). Let us denote that probability distribution over X by D. It is
important to note that we do not assume that the learner knows anything about
this distribution. For the type of learning tasks we discuss, this could be any
arbitrary probability distribution. As to the labels, in the current discussion
we assume that there is some “correct” labeling function, f : X → Y, and that
yi = f (x i ) for all i . This assumption will be relaxed in the next chapter. The
labeling function is unknown to the learner. In fact, this is just what the learner
is trying to figure out. In summary, each pair in the training data S is generated
by first sampling a point x i according to D and then labeling it by f .
Measures of success: We define the error of a classifier to be the probability that
it does not predict the correct label on a random data point generated by the
aforementioned underlying distribution. That is, the error of h is the probability to draw a random instance x, according to the distribution D, such that
h(x) does not equal f (x).
Formally, given a domain subset,2 A ⊂ X , the probability distribution, D,
assigns a number, D(A), which determines how likely it is to observe a point
x ∈ A. In many cases, we refer to A as an event and express it using a function
π : X → {0, 1}, namely, A = {x ∈ X : π(x) = 1}. In that case, we also use the
notation Px∼D [π(x)] to express D(A).
We define the error of a prediction rule, h : X → Y, to be

    L_{(D, f)}(h) = P_{x∼D}[h(x) ≠ f(x)] = D({x : h(x) ≠ f(x)}).

That is, the error of such h is the probability of randomly choosing an example
x for which h(x) ≠ f(x). The subscript (D, f ) indicates that the error is measured
with respect to the probability distribution D and the correct labeling
function f . We omit this subscript when it is clear from the context. L (D, f ) (h)
has several synonymous names such as the generalization error, the risk, or
the true error of h, and we will use these names interchangeably throughout

Despite the “set” notation, S is a sequence. In particular, the same example may appear twice in S and
some algorithms can take into account the order of examples in S.
Strictly speaking, we should be more careful and require that A is a member of some σ -algebra of
subsets of X , over which D is defined. We will formally define our measurability assumptions in the
next chapter.


the book. We use the letter L for the error, since we view this error as the loss
of the learner. We will later also discuss other possible formulations of such
learning tasks.
A note about the information available to the learner: The learner is blind to the
underlying distribution D over the world and to the labeling function f . In our
papayas example, we have just arrived in a new island and we have no clue
as to how papayas are distributed and how to predict their tastiness. The only
way the learner can interact with the environment is through observing the
training set.
In the next section we describe a simple learning paradigm for the preceding
setup and analyze its performance.
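To make the formal model concrete, here is a minimal simulation sketch. The uniform distribution on [0, 1], the threshold labeling rule, and all function names below are illustrative stand-ins for the unknown D and f of the model, not part of it:

```python
import random

def sample_instance():
    """Draw one domain point x according to the (stand-in) distribution D."""
    return random.random()

def f(x):
    """A stand-in 'correct' labeling function f : X -> {0, 1} (hidden from the learner)."""
    return 1 if x >= 0.5 else 0

def sample_training_set(m):
    """S = ((x_1, y_1), ..., (x_m, y_m)): each x_i ~ D, labeled y_i = f(x_i)."""
    return [(x, f(x)) for x in (sample_instance() for _ in range(m))]

def true_error(h, n=100_000):
    """Estimate L_{D,f}(h) = P_{x~D}[h(x) != f(x)] by fresh sampling.  The
    learner cannot compute this quantity, but the simulation can estimate it."""
    return sum(h(x) != f(x) for x in (sample_instance() for _ in range(n))) / n

S = sample_training_set(10)
print(true_error(lambda x: 1))  # ≈ 0.5: a constant predictor errs on about half the mass
```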

2.2 Empirical Risk Minimization

As mentioned earlier, a learning algorithm receives as input a training set S, sampled from an unknown distribution D and labeled by some target function f , and
should output a predictor h S : X → Y (the subscript S emphasizes the fact that
the output predictor depends on S). The goal of the algorithm is to find h S that
minimizes the error with respect to the unknown D and f .
Since the learner does not know what D and f are, the true error is not directly
available to the learner. A useful notion of error that can be calculated by the
learner is the training error – the error the classifier incurs over the training sample:

    L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m,

where [m] = {1, . . . , m}.
The terms empirical error and empirical risk are often used interchangeably for
this error.
Since the training sample is the snapshot of the world that is available to the
learner, it makes sense to search for a solution that works well on that data. This
learning paradigm – coming up with a predictor h that minimizes L S (h) – is called
Empirical Risk Minimization or ERM for short.
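In code, the training error and the ERM rule over an explicitly enumerated pool of candidate predictors look as follows (the tiny sample and the three threshold predictors are illustrative choices):

```python
def empirical_risk(h, S):
    """L_S(h) = |{i in [m] : h(x_i) != y_i}| / m, the training error."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(candidates, S):
    """Return a predictor from the (finite, explicitly listed) pool that
    minimizes the training error -- the ERM rule restricted to that pool."""
    return min(candidates, key=lambda h: empirical_risk(h, S))

# Illustrative sample and candidate pool: three threshold predictors on the line.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in (0.25, 0.5, 0.75)]

best = erm(thresholds, S)
print(empirical_risk(best, S))  # 0.0: the threshold at 0.5 fits this sample perfectly
```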

2.2.1 Something May Go Wrong – Overfitting
Although the ERM rule seems very natural, without being careful, this approach
may fail miserably.
To demonstrate such a failure, let us go back to the problem of learning to predict the taste of a papaya on the basis of its softness and color. Consider a sample as
depicted in the following:



A Gentle Start

Assume that the probability distribution D is such that instances are distributed
uniformly within the gray square and the labeling function, f , determines the label
to be 1 if the instance is within the inner square, and 0 otherwise. The area of the
gray square in the picture is 2 and the area of the inner square is 1. Consider the
following predictor:

    h_S(x) =  y_i   if ∃i ∈ [m] s.t. x_i = x
              0     otherwise.                                        (2.3)

While this predictor might seem rather artificial, in Exercise 2.1 we show a natural
representation of it using polynomials. Clearly, no matter what the sample is,
L_S(h_S) = 0, and therefore this predictor may be chosen by an ERM algorithm (it is
one of the empirical-minimum-cost hypotheses; no classifier can have smaller error).
On the other hand, the true error of any classifier that predicts the label 1 only on a
finite number of instances is, in this case, 1/2. Thus, L_{(D, f)}(h_S) = 1/2. We have found
a predictor whose performance on the training set is excellent, yet its performance
on the true “world” is very poor. This phenomenon is called overfitting. Intuitively,
overfitting occurs when our hypothesis fits the training data “too well” (perhaps like
the everyday experience that a person who provides a perfect detailed explanation
for each of his single actions may raise suspicion).
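The memorizing predictor h_S can be written out directly. In this sketch, a uniform distribution and a threshold rule stand in for the (unknown) D and f; a dictionary lookup plays the role of checking whether x equals some training point. The training error is zero while the true error approaches 1/2:

```python
import random

def make_memorizer(S):
    """The predictor h_S: output y_i if x equals some training point x_i, else 0."""
    table = dict(S)
    return lambda x: table.get(x, 0)

random.seed(0)
f = lambda x: 1 if x < 0.5 else 0   # stand-in for the true labeling function
S = [(x, f(x)) for x in (random.random() for _ in range(100))]
h_S = make_memorizer(S)

# Zero error on the training sample, by construction.
train_err = sum(h_S(x) != y for x, y in S) / len(S)

# Fresh points from a continuous distribution almost surely miss the lookup
# table, so h_S predicts 0 everywhere new and errs on the whole region labeled 1.
test = [random.random() for _ in range(100_000)]
test_err = sum(h_S(x) != f(x) for x in test) / len(test)

print(train_err)  # 0.0
print(test_err)   # close to 1/2
```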

2.3 Empirical Risk Minimization with Inductive Bias

We have just demonstrated that the ERM rule might lead to overfitting. Rather
than giving up on the ERM paradigm, we will look for ways to rectify it. We will
search for conditions under which there is a guarantee that ERM does not overfit,
namely, conditions under which when the ERM predictor has good performance
with respect to the training data, it is also highly likely to perform well over the
underlying data distribution.
A common solution is to apply the ERM learning rule over a restricted search
space. Formally, the learner should choose in advance (before seeing the data) a set
of predictors. This set is called a hypothesis class and is denoted by H. Each h ∈ H
is a function mapping from X to Y. For a given class H, and a training sample, S,
the ERMH learner uses the ERM rule to choose a predictor h ∈ H, with the lowest
possible error over S. Formally,
    ERM_H(S) ∈ argmin_{h∈H} L_S(h),

where argmin stands for the set of hypotheses in H that achieve the minimum value
of L_S(h) over H. By restricting the learner to choosing a predictor from H, we bias it
toward a particular set of predictors. Such restrictions are often called an inductive
bias. Since the choice of such a restriction is determined before the learner sees the
training data, it should ideally be based on some prior knowledge about the problem
to be learned. For example, for the papaya taste prediction problem we may choose
the class H to be the set of predictors that are determined by axis aligned rectangles
(in the space determined by the color and softness coordinates). We will later show
that ERMH over this class is guaranteed not to overfit. On the other hand, the
example of overfitting that we have seen previously, demonstrates that choosing H

to be a class of predictors that includes all functions that assign the value 1 to a finite
set of domain points does not suffice to guarantee that ERMH will not overfit.
A fundamental question in learning theory is, over which hypothesis classes
ERMH learning will not result in overfitting. We will study this question later in
the book.
Intuitively, choosing a more restricted hypothesis class better protects us against
overfitting but at the same time might cause us a stronger inductive bias. We will get
back to this fundamental tradeoff later.

2.3.1 Finite Hypothesis Classes
The simplest type of restriction on a class is imposing an upper bound on its size
(that is, the number of predictors h in H). In this section, we show that if H is a
finite class then ERMH will not overfit, provided it is based on a sufficiently large
training sample (this size requirement will depend on the size of H).
Limiting the learner to prediction rules within some finite hypothesis class may
be considered as a reasonably mild restriction. For example, H can be the set of all
predictors that can be implemented by a C++ program written in at most 10^9 bits
of code. In our papayas example, we mentioned previously the class of axis aligned
rectangles. While this is an infinite class, if we discretize the representation of real
numbers, say, by using a 64 bits floating-point representation, the hypothesis class
becomes a finite class.
Let us now analyze the performance of the ERMH learning rule assuming that
H is a finite class. For a training sample, S, labeled according to some f : X → Y, let
h_S denote a result of applying ERM_H to S, namely,

    h_S ∈ argmin_{h∈H} L_S(h).


In this chapter, we make the following simplifying assumption (which will be
relaxed in the next chapter).
Definition 2.1 (The Realizability Assumption). There exists h* ∈ H s.t.
L_{(D, f)}(h*) = 0. Note that this assumption implies that with probability 1 over
random samples, S, where the instances of S are sampled according to D and are labeled
by f , we have L_S(h*) = 0.
The realizability assumption implies that for every ERM hypothesis we have
that³ L_S(h_S) = 0. However, we are interested in the true risk of h_S , L_{(D, f)}(h_S),
rather than its empirical risk.
Clearly, any guarantee on the error with respect to the underlying distribution,
D, for an algorithm that has access only to a sample S should depend on the relationship between D and S. The common assumption in statistical machine learning
is that the training sample S is generated by sampling points from the distribution D
independently of each other. Formally,

Mathematically speaking, this holds with probability 1. To simplify the presentation, we sometimes
omit the “with probability 1” specifier.



A Gentle Start

The i.i.d. assumption: The examples in the training set are independently and
identically distributed (i.i.d.) according to the distribution D. That is, every
x i in S is freshly sampled according to D and then labeled according to the
labeling function, f . We denote this assumption by S ∼ D^m where m is the
size of S, and D^m denotes the probability over m-tuples induced by applying D
to pick each element of the tuple independently of the other members of the tuple.
Intuitively, the training set S is a window through which the learner gets
partial information about the distribution D over the world and the labeling
function, f . The larger the sample gets, the more likely it is to reflect more
accurately the distribution and labeling used to generate it.
Since L (D, f ) (h S ) depends on the training set, S, and that training set is picked by
a random process, there is randomness in the choice of the predictor h S and, consequently, in the risk L (D, f ) (h S ). Formally, we say that it is a random variable. It is not
realistic to expect that with full certainty S will suffice to direct the learner toward
a good classifier (from the point of view of D), as there is always some probability
that the sampled training data happens to be very nonrepresentative of the underlying D. If we go back to the papaya tasting example, there is always some (small)
chance that all the papayas we have happened to taste were not tasty, in spite of the
fact that, say, 70% of the papayas in our island are tasty. In such a case, ERMH (S)
may be the constant function that labels every papaya as “not tasty” (and has 70%
error on the true distribution of papayas in the island). We will therefore address
the probability to sample a training set for which L (D, f ) (h S ) is not too large. Usually, we denote the probability of getting a nonrepresentative sample by δ, and call
(1 − δ) the confidence parameter of our prediction.
On top of that, since we cannot guarantee perfect label prediction, we introduce
another parameter for the quality of prediction, the accuracy parameter, commonly
denoted by ε. We interpret the event L_{(D, f)}(h_S) > ε as a failure of the learner, while
if L_{(D, f)}(h_S) ≤ ε we view the output of the algorithm as an approximately correct
predictor. Therefore (fixing some labeling function f : X → Y), we are interested
in upper bounding the probability to sample an m-tuple of instances that will lead to
failure of the learner. Formally, let S|x = (x_1, . . . , x_m) be the instances of the training
set. We would like to upper bound

    D^m({S|x : L_{(D, f)}(h_S) > ε}).
Let H_B be the set of “bad” hypotheses, that is,

    H_B = {h ∈ H : L_{(D, f)}(h) > ε}.

In addition, let

    M = {S|x : ∃h ∈ H_B , L_S(h) = 0}

be the set of misleading samples: Namely, for every S|x ∈ M, there is a “bad” hypothesis,
h ∈ H_B , that looks like a “good” hypothesis on S|x . Now, recall that we would
like to bound the probability of the event L_{(D, f)}(h_S) > ε. But, since the realizability
assumption implies that L_S(h_S) = 0, it follows that the event L_{(D, f)}(h_S) > ε can
only happen if for some h ∈ H_B we have L_S(h) = 0. In other words, this event will
only happen if our sample is in the set of misleading samples, M. Formally, we have
shown that

    {S|x : L_{(D, f)}(h_S) > ε} ⊆ M.

Note that we can rewrite M as

    M = ⋃_{h∈H_B} {S|x : L_S(h) = 0}.                                  (2.5)

Hence,

    D^m({S|x : L_{(D, f)}(h_S) > ε}) ≤ D^m(M) = D^m(⋃_{h∈H_B} {S|x : L_S(h) = 0}).   (2.6)


Next, we upper bound the right-hand side of the preceding equation using the
union bound – a basic property of probabilities.
Lemma 2.2 (Union Bound).

For any two sets A, B and a distribution D we have
D(A ∪ B) ≤ D(A) + D(B).

Applying the union bound to the right-hand side of Equation (2.6) yields

    D^m({S|x : L_{(D, f)}(h_S) > ε}) ≤ Σ_{h∈H_B} D^m({S|x : L_S(h) = 0}).            (2.7)

Next, let us bound each summand of the right-hand side of the preceding inequality.
Fix some “bad” hypothesis h ∈ H_B . The event L_S(h) = 0 is equivalent to the event
∀i, h(x_i) = f(x_i). Since the examples in the training set are sampled i.i.d. we get that

    D^m({S|x : L_S(h) = 0}) = D^m({S|x : ∀i, h(x_i) = f(x_i)})
                            = ∏_{i=1}^m D({x_i : h(x_i) = f(x_i)}).                  (2.8)



For each individual sampling of an element of the training set we have

    D({x_i : h(x_i) = y_i}) = 1 − L_{(D, f)}(h) ≤ 1 − ε,

where the last inequality follows from the fact that h ∈ H_B . Combining the previous
equation with Equation (2.8) and using the inequality 1 − ε ≤ e^{−ε} we obtain that for
every h ∈ H_B ,

    D^m({S|x : L_S(h) = 0}) ≤ (1 − ε)^m ≤ e^{−εm}.                                   (2.9)


Combining this equation with Equation (2.7) we conclude that

    D^m({S|x : L_{(D, f)}(h_S) > ε}) ≤ |H_B| e^{−εm} ≤ |H| e^{−εm}.



A graphical illustration which explains how we used the union bound is given in
Figure 2.1.
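The key step above — that a single bad hypothesis looks perfect on an i.i.d. sample with probability at most (1 − ε)^m — can be checked numerically. Everything in this sketch (the distribution, the labeling function f, the bad hypothesis h, and the parameter values) is an illustrative stand-in:

```python
import random

random.seed(1)

# Stand-in setup: D uniform on [0, 1], f labels by the threshold 0.3,
# and h is a fixed "bad" hypothesis with true error 0.2 (> epsilon = 0.1).
f = lambda x: 1 if x >= 0.3 else 0
h = lambda x: 1 if x >= 0.5 else 0        # disagrees with f exactly on [0.3, 0.5)
epsilon, m, trials = 0.1, 20, 200_000

# Empirical frequency of drawing an m-sample on which h looks perfect (L_S(h) = 0).
perfect = 0
for _ in range(trials):
    if all(h(x) == f(x) for x in (random.random() for _ in range(m))):
        perfect += 1
freq = perfect / trials

# freq matches (1 - L_D(h))^m = 0.8**20 in expectation, and in particular
# stays below the looser bound (1 - epsilon)^m used in the proof.
print(freq, (1 - 0.2) ** m, (1 - epsilon) ** m)
```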
Corollary 2.3. Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ε > 0 and let m be
an integer that satisfies

    m ≥ log(|H|/δ) / ε.



A Gentle Start

Figure 2.1. Each point in the large circle represents a possible m-tuple of instances. Each
colored oval represents the set of “misleading” m-tuple of instances for some “bad” predictor h ∈ H B . The ERM can potentially overfit whenever it gets a misleading training set
S. That is, for some h ∈ H B we have L S (h) = 0. Equation (2.9) guarantees that for each
individual bad hypothesis, h ∈ H_B , at most a (1 − ε)^m fraction of the training sets would be
misleading. In particular, the larger m is, the smaller each of these colored ovals becomes.
The union bound formalizes the fact that the area representing the training sets that are
misleading with respect to some h ∈ H B (that is, the training sets in M) is at most the
sum of the areas of the colored ovals. Therefore, it is bounded by |H B | times the maximum
size of a colored oval. Any sample S outside the colored ovals cannot cause the ERM rule
to overfit.

Then, for any labeling function, f , and for any distribution, D, for which the
realizability assumption holds (that is, for some h ∈ H, L_{(D, f)}(h) = 0), with probability of
at least 1 − δ over the choice of an i.i.d. sample S of size m, we have that for every
ERM hypothesis, h_S , it holds that

    L_{(D, f)}(h_S) ≤ ε.

The preceding corollary tells us that for a sufficiently large m, the ERM_H rule
over a finite hypothesis class will be probably (with confidence 1 − δ) approximately
(up to an error of ε) correct. In the next chapter we formally define the model of
Probably Approximately Correct (PAC) learning.
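Plugging numbers into the corollary's bound shows how mild the sample requirement is. The class size below (all predictors describable in 50 bits) and the accuracy and confidence values are arbitrary illustrations:

```python
import math

def sample_complexity(H_size, epsilon, delta):
    """Smallest integer m with m >= log(|H| / delta) / epsilon (log is natural),
    as required by the corollary for a finite hypothesis class H."""
    return math.ceil(math.log(H_size / delta) / epsilon)

# Illustrative numbers: all predictors describable in 50 bits, so |H| <= 2**50.
m = sample_complexity(2 ** 50, epsilon=0.01, delta=0.05)
print(m)  # 3766 -- the dependence on |H| is only logarithmic
```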

2.4 Exercises

2.1 Overfitting of polynomial matching: We have shown that the predictor defined in
Equation (2.3) leads to overfitting. While this predictor seems to be very unnatural,
the goal of this exercise is to show that it can be described as a thresholded
polynomial. That is, show that given a training set S = {(x_i, f(x_i))}_{i=1}^m ⊆ (R^d × {0, 1})^m,
there exists a polynomial p_S such that h_S(x) = 1 if and only if p_S(x) ≥ 0, where h_S
is as defined in Equation (2.3). It follows that learning the class of all thresholded
polynomials using the ERM rule may lead to overfitting.
2.2 Let H be a class of binary classifiers over a domain X . Let D be an unknown
distribution over X , and let f be the target hypothesis in H. Fix some h ∈ H. Show that
the expected value of L_S(h) over the choice of S|x equals L_{(D, f)}(h), namely,

    E_{S|x∼D^m}[L_S(h)] = L_{(D, f)}(h).

2.3 Axis aligned rectangles: An axis aligned rectangle classifier in the plane is a classifier that assigns the value 1 to a point if and only if it is inside a certain rectangle.








Figure 2.2. Axis aligned rectangles.

Formally, given real numbers a1 ≤ b1 , a2 ≤ b2 , define the classifier h_{(a1,b1,a2,b2)} by

    h_{(a1,b1,a2,b2)}(x1, x2) =  1   if a1 ≤ x1 ≤ b1 and a2 ≤ x2 ≤ b2
                                 0   otherwise.
The class of all axis aligned rectangles in the plane is defined as
H2rec = {h (a1 ,b1 ,a2 ,b2 ) : a1 ≤ b1 , and a2 ≤ b2 }.

Note that this is an infinite size hypothesis class. Throughout this exercise we rely
on the realizability assumption.
1. Let A be the algorithm that returns the smallest rectangle enclosing all positive
examples in the training set. Show that A is an ERM.
2. Show that if A receives a training set of size ≥ 4 log(4/δ)/ε, then with probability of
at least 1 − δ it returns a hypothesis with error of at most ε.
Hint: Fix some distribution D over X, let R* = R(a_1*, b_1*, a_2*, b_2*) be the rectangle that generates the labels, and let f be the corresponding hypothesis. Let
a_1 ≥ a_1* be a number such that the probability mass (with respect to D) of the
rectangle R_1 = R(a_1*, a_1, a_2*, b_2*) is exactly ε/4. Similarly, let b_1, a_2, b_2 be numbers
such that the probability masses of the rectangles R_2 = R(b_1, b_1*, a_2*, b_2*), R_3 =
R(a_1*, b_1*, a_2*, a_2), R_4 = R(a_1*, b_1*, b_2, b_2*) are all exactly ε/4. Let R(S) be the
rectangle returned by A. See illustration in Figure 2.2.
• Show that R(S) ⊆ R*.
• Show that if S contains (positive) examples in all of the rectangles
R_1, R_2, R_3, R_4, then the hypothesis returned by A has error of at most ε.
• For each i ∈ {1, . . . , 4}, upper bound the probability that S does not contain
an example from R_i.
• Use the union bound to conclude the argument.
3. Repeat the previous question for the class of axis aligned rectangles in R^d.
4. Show that the runtime of applying the algorithm A mentioned earlier is polynomial in d, 1/ε, and in log(1/δ).
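Part 1's algorithm A (return the tightest rectangle around the positive examples) is short enough to sketch directly; the following is an illustrative implementation, with function names and the toy sample of our choosing:

```python
# Sketch of the ERM algorithm A from Exercise 2.3: return the smallest
# axis aligned rectangle enclosing all positive training examples.
# Names and the toy sample are illustrative, not from the text.

def erm_rectangle(sample):
    """sample: list of ((x1, x2), label) pairs with labels in {0, 1}."""
    positives = [x for x, y in sample if y == 1]
    if not positives:
        # No positive examples: any "empty" rectangle is consistent
        # under realizability; signal it with None.
        return None
    a1 = min(x[0] for x in positives); b1 = max(x[0] for x in positives)
    a2 = min(x[1] for x in positives); b2 = max(x[1] for x in positives)
    return (a1, b1, a2, b2)

def predict(rect, x):
    if rect is None:
        return 0
    a1, b1, a2, b2 = rect
    return 1 if (a1 <= x[0] <= b1 and a2 <= x[1] <= b2) else 0

S = [((1, 1), 1), ((2, 3), 1), ((0, 5), 0), ((1.5, 2), 1)]
rect = erm_rectangle(S)  # (1, 2, 1, 3)
assert all(predict(rect, x) == y for x, y in S)  # zero empirical error
```

Since the returned rectangle contains every positive point and, under realizability, excludes the negatives, its empirical error is zero, which is exactly what makes A an ERM.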


A Formal Learning Model

In this chapter we define our main formal learning model – the PAC learning model
and its extensions. We will consider other notions of learnability in Chapter 7.

3.1 PAC Learning

In the previous chapter we have shown that for a finite hypothesis class, if the ERM
rule with respect to that class is applied on a sufficiently large training sample (whose
size is independent of the underlying distribution or labeling function) then the output hypothesis will be probably approximately correct. More generally, we now
define Probably Approximately Correct (PAC) learning.
Definition 3.1 (PAC Learnability). A hypothesis class H is PAC learnable if there
exist a function m_H : (0, 1)² → N and a learning algorithm with the following property: For every ε, δ ∈ (0, 1), for every distribution D over X, and for every labeling
function f : X → {0, 1}, if the realizability assumption holds with respect to H, D, f,
then when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with
probability of at least 1 − δ (over the choice of the examples), L_(D, f)(h) ≤ ε.
The definition of Probably Approximately Correct learnability contains two
approximation parameters. The accuracy parameter ε determines how far the output classifier can be from the optimal one (this corresponds to the “approximately
correct” part), and the confidence parameter δ indicates how likely the classifier is to meet
that accuracy requirement (corresponding to the “probably” part of “PAC”). Under
the data access model that we are investigating, these approximations are inevitable.
Since the training set is randomly generated, there may always be a small chance that
it will happen to be noninformative (for example, there is always some chance that
the training set will contain only one domain point, sampled over and over again).
Furthermore, even when we are lucky enough to get a training sample that does
faithfully represent D, because it is just a finite sample, there may always be some
fine details of D that it fails to reflect. Our accuracy parameter, ε, allows us to forgive
the learner’s classifier for making minor errors.

Sample Complexity
The function m_H : (0, 1)² → N determines the sample complexity of learning H: that
is, how many examples are required to guarantee a probably approximately correct
solution. The sample complexity is a function of the accuracy (ε) and confidence (δ)
parameters. It also depends on properties of the hypothesis class H – for example,
for a finite class we showed that the sample complexity grows with the log of the size of H.
Note that if H is PAC learnable, there are many functions m_H that satisfy the
requirements given in the definition of PAC learnability. Therefore, to be precise,
we will define the sample complexity of learning H to be the “minimal function,”
in the sense that for any ε, δ, m_H(ε, δ) is the minimal integer that satisfies the
requirements of PAC learning with accuracy ε and confidence δ.
Let us now recall the conclusion of the analysis of finite hypothesis classes from
the previous chapter. It can be rephrased as stating:
Corollary 3.2. Every finite hypothesis class is PAC learnable with sample complexity

m_H(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉.
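For a feel of the numbers, the bound can be evaluated directly; the class size and the parameter values below are arbitrary illustrative choices:

```python
import math

def finite_class_sample_complexity(h_size, eps, delta):
    """Corollary 3.2 upper bound: ceil(log(|H| / delta) / eps)."""
    return math.ceil(math.log(h_size / delta) / eps)

# E.g., |H| = 1000 hypotheses, accuracy eps = 0.05, confidence delta = 0.01.
m = finite_class_sample_complexity(1000, 0.05, 0.01)  # 231 examples suffice
```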
There are infinite classes that are learnable as well (see, for example, Exercise
3.3). Later on we will show that what determines the PAC learnability of a class is
not its finiteness but rather a combinatorial measure called the VC dimension.

3.2 A More General Learning Model

The model we have just described can be readily generalized, so that it can be
made relevant to a wider scope of learning tasks. We consider generalizations in
two aspects:
Removing the Realizability Assumption
We have required that the learning algorithm succeeds on a pair of data distribution D and labeling function f provided that the realizability assumption is met. For
practical learning tasks, this assumption may be too strong (can we really guarantee that there is a rectangle in the color-hardness space that fully determines which
papayas are tasty?). In the next subsection, we will describe the agnostic PAC model
in which this realizability assumption is waived.
Learning Problems beyond Binary Classification
The learning task that we have been discussing so far has to do with predicting a
binary label to a given example (like being tasty or not). However, many learning
tasks take a different form. For example, one may wish to predict a real valued
number (say, the temperature at 9:00 p.m. tomorrow) or a label picked from a finite
set of labels (like the topic of the main story in tomorrow’s paper). It turns out
that our analysis of learning can be readily extended to such and many other scenarios by allowing a variety of loss functions. We shall discuss that in Section 3.2.2




3.2.1 Releasing the Realizability Assumption – Agnostic
PAC Learning
A More Realistic Model for the Data-Generating Distribution
Recall that the realizability assumption requires that there exists h ∈ H such that
Px∼D [h (x) = f (x)] = 1. In many practical problems this assumption does not hold.
Furthermore, it may be more realistic not to assume that the labels are fully determined by the features we measure on input elements (in the case of the papayas,
it is plausible that two papayas of the same color and softness will have different taste). In the following, we relax the realizability assumption by replacing the
“target labeling function” with a more flexible notion, a data-labels generating distribution.
Formally, from now on, let D be a probability distribution over X × Y, where,
as before, X is our domain set and Y is a set of labels (usually we will consider
Y = {0, 1}). That is, D is a joint distribution over domain points and labels. One can
view such a distribution as being composed of two parts: a distribution Dx over unlabeled domain points (sometimes called the marginal distribution) and a conditional
probability over labels for each domain point, D((x, y)|x). In the papaya example,
Dx determines the probability of encountering a papaya whose color and hardness
fall in some color-hardness values domain, and the conditional probability is the
probability that a papaya with color and hardness represented by x is tasty. Indeed,
such modeling allows for two papayas that share the same color and hardness to
belong to different taste categories.
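This decomposition of D into a marginal over domain points and a conditional over labels can be mimicked in a toy sampler; both probability tables below are invented purely for illustration:

```python
import random

random.seed(0)

# Invented marginal D_x over two "papaya" feature vectors, and an invented
# conditional probability of the label 1 (tasty) given each point.
marginal = {("green", "soft"): 0.7, ("brown", "hard"): 0.3}
p_tasty = {("green", "soft"): 0.9, ("brown", "hard"): 0.2}

def draw_example():
    # Draw x from the marginal, then y from the conditional given x:
    # together this is one draw from the joint distribution D.
    x = random.choices(list(marginal), weights=list(marginal.values()))[0]
    y = 1 if random.random() < p_tasty[x] else 0
    return x, y

sample = [draw_example() for _ in range(5)]
```

Because the label is drawn from a conditional probability rather than a function, two identical feature vectors can receive different labels, exactly as in the papaya discussion.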

The Empirical and the True Error Revised
For a probability distribution, D, over X × Y, one can measure how likely h is to
make an error when labeled points are randomly drawn according to D. We redefine
the true error (or risk) of a prediction rule h to be

L_D(h) = P_(x,y)∼D[h(x) ≠ y] = D({(x, y) : h(x) ≠ y}). (3.1)

We would like to find a predictor, h, for which that error will be minimized.
However, the learner does not know the data-generating distribution D. What the learner does
have access to is the training data, S. The definition of the empirical risk remains
the same as before, namely,

L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m.
Given S, a learner can compute L S (h) for any function h : X → {0, 1}. Note that
L S (h) = L D(uniform over S) (h).

The Goal
We wish to find some hypothesis, h : X → Y, that (probably approximately)
minimizes the true risk, L D (h).


The Bayes Optimal Predictor
Given any probability distribution D over X × {0, 1}, the best label predicting
function from X to {0, 1} will be

f_D(x) = 1 if P[y = 1 | x] ≥ 1/2, and 0 otherwise.
It is easy to verify (see Exercise 3.7) that for every probability distribution D, the
Bayes optimal predictor f D is optimal, in the sense that no other classifier, g : X →
{0, 1}, has a lower error. That is, for every classifier g, L D ( f D ) ≤ L D (g).
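If the conditional probabilities P[y = 1 | x] were known, the Bayes predictor would be a one-line threshold rule; the probability table here is invented for illustration (a real learner, of course, does not have it):

```python
# eta[x] = P[y = 1 | x]; an invented table for three domain points.
eta = {"a": 0.8, "b": 0.3, "c": 0.5}

def bayes_predictor(x):
    # Predict 1 exactly when the conditional probability of 1 is at least 1/2.
    return 1 if eta[x] >= 0.5 else 0

preds = [bayes_predictor(x) for x in ("a", "b", "c")]  # [1, 0, 1]
```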
Unfortunately, since we do not know D, we cannot utilize this optimal predictor
f D . What the learner does have access to is the training sample. We can now present
the formal definition of agnostic PAC learnability, which is a natural extension of
the definition of PAC learnability to the more realistic, nonrealizable, learning setup
we have just discussed.
Clearly, we cannot hope that the learning algorithm will find a hypothesis whose
error is smaller than the minimal possible error, that of the Bayes predictor. Furthermore, as we shall prove later, once we make no prior assumptions about the
data-generating distribution, no algorithm can be guaranteed to find a predictor that
is as good as the Bayes optimal one. Instead, we require that the learning algorithm
will find a predictor whose error is not much larger than the best possible error of a
predictor in some given benchmark hypothesis class. Of course, the strength of such
a requirement depends on the choice of that hypothesis class.
Definition 3.3 (Agnostic PAC Learnability). A hypothesis class H is agnostic PAC
learnable if there exist a function m_H : (0, 1)² → N and a learning algorithm with the
following property: For every ε, δ ∈ (0, 1) and for every distribution D over X × Y,
when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by
D, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ
(over the choice of the m training examples),

L_D(h) ≤ min_{h′∈H} L_D(h′) + ε.

Clearly, if the realizability assumption holds, agnostic PAC learning provides the
same guarantee as PAC learning. In that sense, agnostic PAC learning generalizes
the definition of PAC learning. When the realizability assumption does not hold, no
learner can guarantee an arbitrarily small error. Nevertheless, under the definition
of agnostic PAC learning, a learner can still declare success if its error is not much
larger than the best error achievable by a predictor from the class H. This is in
contrast to PAC learning, in which the learner is required to achieve a small error in
absolute terms and not relative to the best error achievable by the hypothesis class.

3.2.2 The Scope of Learning Problems Modeled
We next extend our model so that it can be applied to a wide variety of learning
tasks. Let us consider some examples of different learning tasks.
• Multiclass Classification: Our classification does not have to be binary. Take, for
example, the task of document classification: We wish to design a program that




will be able to classify given documents according to topics (e.g., news, sports,
biology, medicine). A learning algorithm for such a task will have access to
examples of correctly classified documents and, on the basis of these examples,
should output a program that can take as input a new document and output
a topic classification for that document. Here, the domain set is the set of all
potential documents. Once again, we would usually represent documents by a
set of features that could include counts of different key words in the document,
as well as other possibly relevant features like the size of the document or its origin. The label set in this task will be the set of possible document topics (so Y will
be some large finite set). Once we determine our domain and label sets, the other
components of our framework look exactly the same as in the papaya tasting
example: Our training sample will be a finite sequence of (feature vector, label)
pairs, the learner’s output will be a function from the domain set to the label
set, and, finally, for our measure of success, we can use the probability, over
(document, topic) pairs, of the event that our predictor suggests a wrong label.
• Regression: In this task, one wishes to find some simple pattern in the data – a
functional relationship between the X and Y components of the data. For example, one wishes to find a linear function that best predicts a baby’s birth weight
on the basis of ultrasound measures of his head circumference, abdominal circumference, and femur length. Here, our domain set X is some subset of R3 (the
three ultrasound measurements), and the set of “labels,” Y, is the set of real
numbers (the weight in grams). In this context, it is more adequate to call Y the
target set. Our training data as well as the learner’s output are as before (a finite
sequence of (x, y) pairs, and a function from X to Y respectively). However,
our measure of success is different. We may evaluate the quality of a hypothesis
function, h : X → Y, by the expected square difference between the true labels
and their predicted values, namely,

L_D(h) = E_(x,y)∼D[(h(x) − y)²].

To accommodate a wide range of learning tasks we generalize our formalism of
the measure of success as follows:
Generalized Loss Functions
Given any set H (that plays the role of our hypotheses, or models) and some domain
Z, let ℓ be any function from H × Z to the set of nonnegative real numbers, ℓ :
H × Z → R+. We call such functions loss functions.
Note that for prediction problems, we have that Z = X ×Y. However, our notion
of the loss function is generalized beyond prediction tasks, and therefore it allows
Z to be any domain of examples (for instance, in unsupervised learning tasks such
as the one described in Chapter 22, Z is not a product of an instance domain and a
label domain).
We now define the risk function to be the expected loss of a classifier, h ∈ H, with
respect to a probability distribution D over Z, namely,

L_D(h) = E_z∼D[ℓ(h, z)]. (3.3)



That is, we consider the expectation of the loss of h over objects z picked randomly according to D. Similarly, we define the empirical risk to be the expected loss
over a given sample S = (z_1, . . . , z_m) ∈ Z^m, namely,

L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i).
The loss functions used in the preceding examples of classification and regression
tasks are as follows:
• 0–1 loss: Here, our random variable z ranges over the set of pairs X × Y and the
loss function is

ℓ_0−1(h, (x, y)) = 0 if h(x) = y, and 1 if h(x) ≠ y.

This loss function is used in binary or multiclass classification problems.
One should note that, for a random variable, α, taking the values {0, 1},
Eα∼D [α] = Pα∼D [α = 1]. Consequently, for this loss function, the definitions
of L D (h) given in Equation (3.3) and Equation (3.1) coincide.
• Square Loss: Here, our random variable z ranges over the set of pairs X × Y and
the loss function is

ℓ_sq(h, (x, y)) = (h(x) − y)².
This loss function is used in regression problems.
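Both losses, and the empirical risk L_S(h) = (1/m) Σ_i ℓ(h, z_i), translate directly into code; the helper names and the tiny sample below are ours:

```python
def zero_one_loss(h, z):
    x, y = z
    return 0 if h(x) == y else 1

def square_loss(h, z):
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(loss, h, S):
    # L_S(h) = (1/m) * sum of loss(h, z_i) over the sample.
    return sum(loss(h, z) for z in S) / len(S)

h = lambda x: 1 if x > 0 else 0          # a toy threshold predictor
S = [(1, 1), (-1, 0), (2, 0), (-3, 1)]   # (x, y) pairs
risk01 = empirical_risk(zero_one_loss, h, S)  # 0.5: h errs on (2, 0) and (-3, 1)
```

Making the loss a parameter of `empirical_risk` mirrors the formalism: the same risk definition covers classification, regression, and the other instantiations discussed later.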
We will later see more examples of useful instantiations of loss functions.
To summarize, we formally define agnostic PAC learnability for general loss functions.
Definition 3.4 (Agnostic PAC Learnability for General Loss Functions). A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function
ℓ : H × Z → R+, if there exist a function m_H : (0, 1)² → N and a learning algorithm
with the following property: For every ε, δ ∈ (0, 1) and for every distribution D over
Z, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated
by D, the algorithm returns h ∈ H such that, with probability of at least 1 − δ (over
the choice of the m training examples),

L_D(h) ≤ min_{h′∈H} L_D(h′) + ε,

where L_D(h) = E_z∼D[ℓ(h, z)].
Remark 3.1 (A Note About Measurability*). In the aforementioned definition, for
every h ∈ H, we view the function ℓ(h, ·) : Z → R+ as a random variable and define
L_D(h) to be the expected value of this random variable. For that, we need to require
that the function ℓ(h, ·) is measurable. Formally, we assume that there is a σ-algebra
of subsets of Z, over which the probability D is defined, and that the preimage
of every initial segment in R+ is in this σ-algebra. In the specific case of binary
classification with the 0−1 loss, the σ-algebra is over X × {0, 1} and our assumption
on ℓ is equivalent to the assumption that for every h, the set {(x, h(x)) : x ∈ X} is in
the σ-algebra.




Remark 3.2 (Proper vs. Representation-Independent Learning*). In the preceding definition, we required that the algorithm will return a hypothesis from H. In
some situations, H is a subset of a set H′, and the loss function can be naturally
extended to be a function from H′ × Z to the reals. In this case, we may allow
the algorithm to return a hypothesis h′ ∈ H′, as long as it satisfies the requirement
L_D(h′) ≤ min_{h∈H} L_D(h) + ε. Allowing the algorithm to output a hypothesis from
H′ is called representation independent learning, while proper learning occurs when
the algorithm must output a hypothesis from H. Representation independent learning is sometimes called “improper learning,” although there is nothing improper in
representation independent learning.

3.3 Summary

In this chapter we defined our main formal learning model – PAC learning. The
basic model relies on the realizability assumption, while the agnostic variant does
not impose any restrictions on the underlying distribution over the examples. We
also generalized the PAC model to arbitrary loss functions. We will sometimes refer
to the most general model simply as PAC learning, omitting the “agnostic” prefix
and letting the reader infer what the underlying loss function is from the context.
When we would like to emphasize that we are dealing with the original PAC setting
we mention that the realizability assumption holds. In Chapter 7 we will discuss
other notions of learnability.

3.4 Bibliographic Remarks

Our most general definition of agnostic PAC learning with general loss functions
follows the works of Vladimir Vapnik and Alexey Chervonenkis (Vapnik and
Chervonenkis 1971). In particular, we follow Vapnik’s general setting of learning
(Vapnik 1982, Vapnik 1992, Vapnik 1995, Vapnik 1998).
The term PAC learning was introduced by Valiant (1984). Valiant was named
the winner of the 2010 Turing Award for the introduction of the PAC model.
Valiant’s definition requires that the sample complexity be polynomial in 1/ε
and in 1/δ, as well as in the representation size of hypotheses in the class (see also
Kearns and Vazirani (1994)). As we will see in Chapter 6, if a problem is at all PAC
learnable then the sample complexity depends polynomially on 1/ε and log(1/δ).
Valiant’s definition also requires that the runtime of the learning algorithm be
polynomial in these quantities. In contrast, we chose to distinguish between the
statistical aspect of learning and the computational aspect of learning. We will elaborate on the computational aspect later on in Chapter 8. Finally, the formalization
of agnostic PAC learning is due to Haussler (1992).

3.5 Exercises

3.1 Monotonicity of Sample Complexity: Let H be a hypothesis class for a binary classification task. Suppose that H is PAC learnable and its sample complexity is given
by m_H(·, ·). Show that m_H is monotonically nonincreasing in each of its parameters. That is, show that given δ ∈ (0, 1), and given 0 < ε_1 ≤ ε_2 < 1, we have that
m_H(ε_1, δ) ≥ m_H(ε_2, δ). Similarly, show that given ε ∈ (0, 1), and given 0 < δ_1 ≤ δ_2 < 1,
we have that m_H(ε, δ_1) ≥ m_H(ε, δ_2).
3.2 Let X be a discrete domain, and let H_Singleton = {h_z : z ∈ X} ∪ {h−}, where for each
z ∈ X, h_z is the function defined by h_z(x) = 1 if x = z and h_z(x) = 0 if x ≠ z. h−
is simply the all-negative hypothesis, namely, ∀x ∈ X, h−(x) = 0. The realizability
assumption here implies that the true hypothesis f labels negatively all examples in
the domain, perhaps except one.
1. Describe an algorithm that implements the ERM rule for learning H_Singleton in
the realizable setup.
2. Show that H_Singleton is PAC learnable. Provide an upper bound on its sample complexity.
3.3 Let X = R², Y = {0, 1}, and let H be the class of concentric circles in the plane, that
is, H = {h_r : r ∈ R+}, where h_r(x) = 1_[‖x‖≤r]. Prove that H is PAC learnable (assume
realizability), and its sample complexity is bounded by

m_H(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
3.4 In this question, we study the hypothesis class of Boolean conjunctions defined as
follows. The instance space is X = {0, 1}d and the label set is Y = {0, 1}. A literal over
the variables x1 , . . ., xd is a simple Boolean function that takes the form f (x) = xi , for
some i ∈ [d], or f (x) = 1 − xi for some i ∈ [d]. We use the notation x̄i as a shorthand
for 1 − xi . A conjunction is any product of literals. In Boolean logic, the product is
denoted using the ∧ sign. For example, the function h(x) = x1 · (1 − x2 ) is written as
x1 ∧ x̄2 .
We consider the hypothesis class of all conjunctions of literals over the d variables. The empty conjunction is interpreted as the all-positive hypothesis (namely,
the function that returns h(x) = 1 for all x). The conjunction x1 ∧ x̄1 (and similarly
any conjunction involving a literal and its negation) is allowed and interpreted as
the all-negative hypothesis (namely, the conjunction that returns h(x) = 0 for all x).
We assume realizability: Namely, we assume that there exists a Boolean conjunction
that generates the labels. Thus, each example (x, y) ∈ X × Y consists of an assignment to the d Boolean variables x_1, . . . , x_d, and its truth value (0 for false and 1 for true).
For instance, let d = 3 and suppose that the true conjunction is x_1 ∧ x̄_2. Then, the
training set S might contain the following instances:
((1, 1, 1), 0), ((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 0, 0), 1).
Prove that the hypothesis class of all conjunctions over d variables is PAC learnable and bound its sample complexity. Propose an algorithm that implements the
ERM rule, whose runtime is polynomial in d · m.
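One natural ERM implementation (a standard construction, offered as a sketch rather than the intended solution) starts from the conjunction of all 2d literals and deletes every literal falsified by some positive example, giving runtime O(d · m); names below are of our choosing:

```python
# ERM sketch for Boolean conjunctions (Exercise 3.4). Under realizability,
# keeping exactly the literals consistent with all positive examples yields
# a conjunction with zero empirical error.

def erm_conjunction(sample, d):
    """sample: list of (x, y), x a 0/1 tuple of length d, y in {0, 1}.
    Returns kept literals: ('pos', i) stands for x_i, ('neg', i) for its negation."""
    literals = {('pos', i) for i in range(d)} | {('neg', i) for i in range(d)}
    for x, y in sample:
        if y == 1:
            # Drop literals falsified by this positive example.
            literals -= {('pos', i) for i in range(d) if x[i] == 0}
            literals -= {('neg', i) for i in range(d) if x[i] == 1}
    return literals

def predict(literals, x):
    # The empty conjunction is all-positive, matching the text's convention.
    return 1 if all((x[i] == 1) if kind == 'pos' else (x[i] == 0)
                    for kind, i in literals) else 0

# The text's example: true conjunction x_1 ∧ x̄_2 (0-indexed: x_0 and not x_1).
S = [((1, 1, 1), 0), ((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 0, 0), 1)]
h = erm_conjunction(S, 3)  # {('pos', 0), ('neg', 1)}
assert all(predict(h, x) == y for x, y in S)
```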
3.5 Let X be a domain and let D1 , D2 , . . ., Dm be a sequence of distributions over X . Let
H be a finite class of binary classifiers over X and let f ∈ H. Suppose we are getting
a sample S of m examples, such that the instances are independent but are not identically distributed; the i-th instance is sampled from D_i, and then y_i is set to be f(x_i).
Let D̄m denote the average, that is, D̄m = (D1 + · · · + Dm )/m.
Fix an accuracy parameter ε ∈ (0, 1). Show that

P[∃h ∈ H s.t. L_(D̄_m, f)(h) > ε and L_(S, f)(h) = 0] ≤ |H| e^(−ε m).
Hint: Use the geometric-arithmetic mean inequality.




3.6 Let H be a hypothesis class of binary classifiers. Show that if H is agnostic PAC
learnable, then H is PAC learnable as well. Furthermore, if A is a successful agnostic
PAC learner for H, then A is also a successful PAC learner for H.
3.7 (*) The Bayes optimal predictor: Show that for every probability distribution D, the
Bayes optimal predictor fD is optimal, in the sense that for every classifier g from
X to {0, 1}, L D ( f D ) ≤ L D (g).
3.8 (*) We say that a learning algorithm A is better than B with respect to some
probability distribution, D, if
L D ( A(S)) ≤ L D (B(S))
for all samples S ∈ (X × {0, 1})m . We say that a learning algorithm A is better than B,
if it is better than B with respect to all probability distributions D over X × {0, 1}.
1. A probabilistic label predictor is a function that assigns to every domain point
x a probability value, h(x) ∈ [0, 1], that determines the probability of predicting
the label 1. That is, given such an h and an input, x, the label for x is predicted by
tossing a coin with bias h(x) toward Heads and predicting 1 iff the coin comes up
Heads. Formally, we define a probabilistic label predictor as a function, h : X →
[0, 1]. The loss of such an h on an example (x, y) is defined to be |h(x) − y|, which is
exactly the probability that the prediction of h will not be equal to y. Note that
if h is deterministic, that is, returns values in {0, 1}, then |h(x) − y| = 1_[h(x)≠y].
Prove that for every data-generating distribution D over X × {0, 1}, the Bayes
optimal predictor has the smallest risk (w.r.t. the loss function ℓ(h, (x, y)) =
|h(x) − y|) among all possible label predictors, including probabilistic ones.
2. Let X be a domain and {0, 1} be a set of labels. Prove that for every distribution
D over X × {0, 1}, there exists a learning algorithm A_D that is better than any
other learning algorithm with respect to D.
3. Prove that for every learning algorithm A there exist a probability distribution,
D, and a learning algorithm B such that A is not better than B w.r.t. D.
3.9 Consider a variant of the PAC model in which there are two example oracles: one
that generates positive examples and one that generates negative examples, both
according to the underlying distribution D on X . Formally, given a target function
f : X → {0, 1}, let D+ be the distribution over X + = {x ∈ X : f (x) = 1} defined by
D+ ( A) = D( A)/D(X + ), for every A ⊂ X + . Similarly, D− is the distribution over X −
induced by D.
The definition of PAC learnability in the two-oracle model is the same as the
standard definition of PAC learnability except that here the learner has access to
m_H^+(ε, δ) i.i.d. examples from D+ and m_H^−(ε, δ) i.i.d. examples from D−. The learner’s
goal is to output h s.t. with probability at least 1 − δ (over the choice of the two
training sets, and possibly over the nondeterministic decisions made by the learning
algorithm), both L_(D+, f)(h) ≤ ε and L_(D−, f)(h) ≤ ε.
1. (*) Show that if H is PAC learnable (in the standard one-oracle model), then H
is PAC learnable in the two-oracle model.
2. (**) Define h + to be the always-plus hypothesis and h − to be the always-minus
hypothesis. Assume that h + , h − ∈ H. Show that if H is PAC learnable in the
two-oracle model, then H is PAC learnable in the standard one-oracle model.

Learning via Uniform Convergence

The first formal learning model that we have discussed was the PAC model. In
Chapter 2 we have shown that under the realizability assumption, any finite hypothesis class is PAC learnable. In this chapter we will develop a general tool, uniform
convergence, and apply it to show that any finite class is learnable in the agnostic PAC model with general loss functions, as long as the range of the loss function is bounded.

4.1 Uniform Convergence Is Sufficient for Learnability

The idea behind the learning condition discussed in this chapter is very simple.
Recall that, given a hypothesis class, H, the ERM learning paradigm works as follows: Upon receiving a training sample, S, the learner evaluates the risk (or error)
of each h in H on the given sample and outputs a member of H that minimizes this
empirical risk. The hope is that an h that minimizes the empirical risk with respect to
S is a risk minimizer (or has risk close to the minimum) with respect to the true data
probability distribution as well. For that, it suffices to ensure that the empirical risks
of all members of H are good approximations of their true risk. Put another way, we
need that uniformly over all hypotheses in the hypothesis class, the empirical risk
will be close to the true risk, as formalized in the following.
Definition 4.1 (ε-representative sample). A training set S is called ε-representative
(w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D) if

∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε.

The next simple lemma states that whenever the sample is (ε/2)-representative,
the ERM learning rule is guaranteed to return a good hypothesis.

Lemma 4.2. Assume that a training set S is (ε/2)-representative (w.r.t. domain Z,
hypothesis class H, loss function ℓ, and distribution D). Then, any output of
ERM_H(S), namely, any h_S ∈ argmin_{h∈H} L_S(h), satisfies

L_D(h_S) ≤ min_{h∈H} L_D(h) + ε.



Proof. For every h ∈ H,

L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2 = L_D(h) + ε,

where the first and third inequalities are due to the assumption that S is
(ε/2)-representative (Definition 4.1), and the second inequality holds since h_S is an ERM hypothesis.

The preceding lemma implies that to ensure that the ERM rule is an agnostic
PAC learner, it suffices to show that with probability of at least 1 − δ over the random choice of a training set, it will be an (ε/2)-representative training set. The uniform
convergence condition formalizes this requirement.
Definition 4.3 (Uniform Convergence). We say that a hypothesis class H has the
uniform convergence property (w.r.t. a domain Z and a loss function ℓ) if there exists
a function m_H^UC : (0, 1)² → N such that for every ε, δ ∈ (0, 1) and for every probability distribution D over Z, if S is a sample of m ≥ m_H^UC(ε, δ) examples drawn i.i.d.
according to D, then, with probability of at least 1 − δ, S is ε-representative.
Similar to the definition of sample complexity for PAC learning, the function
m_H^UC measures the (minimal) sample complexity of obtaining the uniform convergence property, namely, how many examples we need to ensure that with
probability of at least 1 − δ the sample would be ε-representative.
The term uniform here refers to having a fixed sample size that works for all
members of H and over all possible probability distributions over the domain.
The following corollary follows directly from Lemma 4.2 and the definition of
uniform convergence.

Corollary 4.4. If a class H has the uniform convergence property with a function m_H^UC,
then the class is agnostically PAC learnable with the sample complexity m_H(ε, δ) ≤
m_H^UC(ε/2, δ). Furthermore, in that case, the ERM_H paradigm is a successful agnostic
PAC learner for H.

4.2 Finite Classes Are Agnostic PAC Learnable

In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC
learnable will follow once we establish that uniform convergence holds for a finite
hypothesis class.
To show that uniform convergence holds we follow a two step argument, similar
to the derivation in Chapter 2. The first step applies the union bound while the
second step employs a measure concentration inequality. We now explain these two
steps in detail.
Fix some ε, δ. We need to find a sample size m that guarantees that for any D,
with probability of at least 1 − δ over the choice of S = (z_1, . . . , z_m) sampled i.i.d. from
D, we have that for all h ∈ H, |L_S(h) − L_D(h)| ≤ ε. That is,

D^m({S : ∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.


Writing

{S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε} = ∪_{h∈H} {S : |L_S(h) − L_D(h)| > ε},

and applying the union bound (Lemma 2.2) we obtain

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈H} D^m({S : |L_S(h) − L_D(h)| > ε}). (4.1)

Our second step will be to argue that each summand of the right-hand side of
this inequality is small enough (for a sufficiently large m). That is, we will show that
for any fixed hypothesis h (chosen in advance, prior to sampling the
training set), the gap between the true and empirical risks, |L_S(h) − L_D(h)|, is likely
to be small.

Recall that L_D(h) = E_{z∼D}[ℓ(h, z)] and that L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i). Since each z_i
is sampled i.i.d. from D, the expected value of the random variable ℓ(h, z_i) is L_D(h).
By the linearity of expectation, it follows that L_D(h) is also the expected value of
L_S(h). Hence, the quantity |L_D(h) − L_S(h)| is the deviation of the random variable
L_S(h) from its expectation. We therefore need to show that the measure of L_S(h) is
concentrated around its expected value.
A basic statistical fact, the law of large numbers, states that when m goes to
infinity, empirical averages converge to their true expectation. This is true for L S (h),
since it is the empirical average of m i.i.d. random variables. However, since the law
of large numbers is only an asymptotic result, it provides no information about the
gap between the empirically estimated error and its true value for any given, finite,
sample size.
Instead, we will use a measure concentration inequality due to Hoeffding, which
quantifies the gap between empirical averages and their expected value.
Lemma 4.5 (Hoeffding’s Inequality). Let θ_1, . . . , θ_m be a sequence of i.i.d. random
variables and assume that for all i, E[θ_i] = µ and P[a ≤ θ_i ≤ b] = 1. Then, for any ε > 0,

P[ |(1/m) Σ_{i=1}^m θ_i − µ| > ε ] ≤ 2 exp(−2mε²/(b − a)²).

The proof can be found in Appendix B.
Getting back to our problem, let θ_i be the random variable ℓ(h, z_i). Since h is
fixed and z_1, . . . , z_m are sampled i.i.d., it follows that θ_1, . . . , θ_m are also i.i.d. random
variables. Furthermore, L_S(h) = (1/m) Σ_{i=1}^m θ_i and L_D(h) = µ. Let us further assume
that the range of ℓ is [0, 1] and therefore θ_i ∈ [0, 1]. We therefore obtain that

D^m({S : |L_S(h) − L_D(h)| > ε}) = P[ |(1/m) Σ_{i=1}^m θ_i − µ| > ε ] ≤ 2 exp(−2mε²).    (4.2)

Combining this with Equation (4.1) yields

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈H} 2 exp(−2mε²) = 2|H| exp(−2mε²).


Finally, if we choose

m ≥ log(2|H|/δ) / (2ε²)

then

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ δ.
Corollary 4.6. Let H be a finite hypothesis class, let Z be a domain, and let ℓ : H ×
Z → [0, 1] be a loss function. Then, H enjoys the uniform convergence property with
sample complexity

m^UC_H(ε, δ) ≤ ⌈ log(2|H|/δ) / (2ε²) ⌉.

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with
sample complexity

m_H(ε, δ) ≤ m^UC_H(ε/2, δ) ≤ ⌈ 2 log(2|H|/δ) / ε² ⌉.
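These two bounds can be turned into a small calculator. The sketch below evaluates both (the function names are ours, and we assume natural logarithms throughout):

```python
import math

def m_uc(H_size, eps, delta):
    """Uniform-convergence sample size from Corollary 4.6:
    ceil(log(2|H|/delta) / (2 eps^2))."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_agnostic_pac(H_size, eps, delta):
    """Agnostic PAC sample size via m_H(eps, delta) <= m_UC(eps/2, delta),
    which equals ceil(2 log(2|H|/delta) / eps^2)."""
    return m_uc(H_size, eps / 2, delta)

# Example: |H| = 1000, eps = 0.1, delta = 0.05.
print(m_uc(1000, 0.1, 0.05))           # uniform convergence
print(m_agnostic_pac(1000, 0.1, 0.05)) # agnostic PAC: four times larger
```

Note that halving ε quadruples the bound, since the sample complexity scales as 1/ε².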
Remark 4.1 (The “Discretization Trick”). While the preceding corollary only
applies to finite hypothesis classes, there is a simple trick that allows us to get a
very good estimate of the practical sample complexity of infinite hypothesis classes.
Consider a hypothesis class that is parameterized by d parameters. For example,
let X = R, Y = {±1}, and the hypothesis class, H, be all functions of the form
h_θ(x) = sign(x − θ). That is, each hypothesis is parameterized by one parameter,
θ ∈ R, and the hypothesis outputs 1 for all instances larger than θ and outputs −1
for instances smaller than θ. This is a hypothesis class of infinite size. However,
if we are going to learn this hypothesis class in practice, using a computer, we will
probably maintain real numbers using floating point representation, say, of 64 bits.
It follows that in practice, our hypothesis class is parameterized by the set of scalars
that can be represented using a 64-bit floating point number. There are at most 2^64
such numbers; hence the actual size of our hypothesis class is at most 2^64. More generally, if our hypothesis class is parameterized by d numbers, in practice we learn
a hypothesis class of size at most 2^(64d). Applying Corollary 4.6 we obtain that the
sample complexity of such classes is bounded by (128d + 2 log(2/δ))/ε². This upper bound
on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce
a rigorous way to analyze the sample complexity of infinite size hypothesis classes.
Nevertheless, the discretization trick can be used to get a rough estimate of the
sample complexity in many practical situations.
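The discretization trick is itself easy to evaluate. In the sketch below (the function name is ours; we use natural logarithms, so log 2^(64d) is computed as 64d · log 2), we plug |H| ≤ 2^(64d) into the agnostic PAC bound 2 log(2|H|/δ)/ε²:

```python
import math

def discretized_bound(d, eps, delta, bits=64):
    """Rough sample-complexity estimate for a class with d real parameters,
    discretized to `bits`-bit floats: |H| <= 2^(bits*d), plugged into
    m_H <= 2 * log(2|H|/delta) / eps^2 (natural log)."""
    log_H = bits * d * math.log(2)  # log |H|
    return math.ceil(2 * (log_H + math.log(2 / delta)) / eps**2)

# Example: the threshold class above has d = 1 parameter.
print(discretized_bound(1, 0.1, 0.05))
```

As the remark notes, this is only a rough, representation-dependent estimate: doubling the number of bits doubles the bound even though the learning problem itself is unchanged.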

If the uniform convergence property holds for a hypothesis class H then in most
cases the empirical risks of hypo