AI and machine learning approaches almost always have a probabilistic setup. We consider what is called a classification problem. On the real line, imagine that a threshold (a number \(T\)) splits the line into two (infinite-length) intervals. Each side of \(T\) carries a different label: the left side of \(T\) carries label -1 and the right side carries label +1. The threshold \(T\) is fixed but unknown to the learning algorithm, and the algorithm is supposed to estimate it.
There is a pdf \(f\) on the real line. A training point is a real number \(x\) drawn according to \(f\). Once the training point is generated, an oracle that knows \(T\) gives us its correct label.
Our learning algorithm takes this training point \(x\) and its label, and uses them to output a hypothesis (its best guess of the unknown threshold). The learning algorithm then uses its estimate of the threshold to classify a test point \(x'\), which is potentially different from \(x\) but generated by the same probability law \(f\).
For this problem, consider the following algorithm: if the label of the training point \(x\) is -1, the algorithm outputs \(x+\frac12\) as its estimate of the threshold, labeling the left interval -1 and the right interval +1. If the label of the training point \(x\) is +1, the algorithm outputs \(x-\frac12\), again labeling the left interval -1 and the right interval +1.
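As a concrete illustration, here is a minimal Python sketch of this learning rule. The function names (`classify`, `estimate_threshold`) and the use of the integers -1 and +1 as labels are our own choices for the illustration, not part of the problem statement.

```python
def classify(z, threshold):
    """Label induced by a threshold: -1 to its left, +1 to its right.

    (The boundary point itself has probability zero under a density,
    so the convention chosen there does not matter.)
    """
    return -1 if z < threshold else +1


def estimate_threshold(x, label):
    """The algorithm's estimate of the threshold from one labeled training point x.

    A point labeled -1 lies to the left of T, so the estimate is placed
    half a unit to its right; a point labeled +1 lies to the right of T,
    so the estimate is placed half a unit to its left.
    """
    return x + 0.5 if label == -1 else x - 0.5
```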
Since the algorithm does not know \(T\), it classifies test points generated by \(f\) using its estimate of the threshold instead. There may therefore be combinations of training and test samples (\(x\) and \(x'\) respectively) for which the true label of the test sample (set by \(T\)) and its estimated label (set by the algorithm's estimate of the threshold) differ. In that case we say that the algorithm trained on \(x\) misclassifies the test point \(x'\).
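Continuing the sketch above (and reusing its functions), the misclassification check can be written directly from the definitions: the oracle labels the training point using the true \(T\), the algorithm forms its estimate, and a test point is misclassified when the label under the estimate disagrees with the label under \(T\).

```python
def misclassifies(x, x_prime, T):
    """True when the algorithm trained on x mislabels the test point x_prime."""
    T_hat = estimate_threshold(x, classify(x, T))   # oracle labels x using the true T
    return classify(x_prime, T_hat) != classify(x_prime, T)
```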
Model the problem and identify the misclassification event: i.e., all combinations of training sample \(x\) and test sample \(x'\) such that the algorithm trained on \(x\) misclassifies \(x'\). The probability of misclassification (which we will examine later) is related to something called the generalization error, the ability of the algorithm to handle test samples potentially different from the training samples.
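For intuition about the probability of this event, one can estimate it by simulation before working it out analytically. The sketch below (which reuses `misclassifies` from above) assumes, purely for illustration, that \(f\) is the standard normal density and that \(T = 0.3\); neither choice comes from the problem statement.

```python
import random

T = 0.3        # assumed true threshold (illustrative choice)
N = 100_000    # number of simulated (training, test) pairs

errors = 0
for _ in range(N):
    x = random.gauss(0.0, 1.0)        # training point drawn from f
    x_prime = random.gauss(0.0, 1.0)  # independent test point from the same f
    if misclassifies(x, x_prime, T):
        errors += 1

print(f"estimated misclassification probability: {errors / N:.3f}")
```

Each iteration draws one training point and one independent test point from the same distribution, so the fraction of errors approximates the probability of the misclassification event for this particular \(f\) and \(T\).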