Quiz 3 and 4

Please submit on Laulima as instructed

Problems in class

The four problems here are developed from the corresponding problems in the previous module.

  1. Recall the 3-coin toss problem (Problem 1 from the previous module).

    1. Fair Coin: Suppose every sequence of coin tosses has equal probability. Show that each coin toss has probability \(1/2\) of being Heads. What is the probability that the first toss is Heads given that the second and third tosses show the same face (i.e., the event that they are both Heads or both Tails)?

    2. Zero-parity coin: This is a special coin such that if you toss it thrice, you are guaranteed to see an even number of Heads. All sequences with an even number of Heads have equal probability (regardless of whether the sequence has 0 or 2 Heads among the 3 tosses). What is the probability that the first toss is Heads given that the second and third tosses show the same face (i.e., the event that they are both Heads or both Tails)?

    3. Show that two tosses of either coin are independent. Show that three tosses of the Zero-parity coin are not independent.

    4. Can you give another probability law on three coin tosses such that the event “first toss shows Heads” is independent of every event you could construct from the second and third tosses alone?
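
    If you would like to check your hand calculations for this problem, the following sketch (in Python; the names and layout are just illustrative) enumerates both probability laws explicitly and computes the conditional probability asked about above.

    ```python
    from itertools import product

    # All 8 outcomes of three tosses; 'H' = Heads, 'T' = Tails.
    outcomes = list(product("HT", repeat=3))

    # Fair coin: every sequence has probability 1/8.
    fair = {seq: 1 / 8 for seq in outcomes}

    # Zero-parity coin: only sequences with an even number of Heads can occur,
    # and all such sequences are equally likely.
    even = [seq for seq in outcomes if seq.count("H") % 2 == 0]
    zero_parity = {seq: (1 / len(even) if seq in even else 0.0) for seq in outcomes}

    def prob(law, event):
        """Total probability, under the given law, of sequences satisfying event."""
        return sum(q for seq, q in law.items() if event(seq))

    for name, law in [("fair", fair), ("zero-parity", zero_parity)]:
        same = prob(law, lambda s: s[1] == s[2])                  # 2nd and 3rd tosses match
        both = prob(law, lambda s: s[0] == "H" and s[1] == s[2])  # 1st is Heads and they match
        print(f"{name}: P(first is Heads | second and third match) = {both / same:.4f}")
    ```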

  2. Let us build on our abstract model of a communication link. As before, there is a transmitter and a receiver, and the transmitter transmits one bit through a channel. The transmitter chooses 0 or 1, but unlike in the previous module, the probability of transmitting a 1 is now \(p\). The link is no longer perfect: given any input, the channel flips the input bit with probability \(\epsilon\). This abstraction is called a Binary Symmetric Channel.

    1. Phrase this channel model in terms of conditional probabilities.

    2. If \(\epsilon = \frac12\), show that the received bit is independent of the transmitted bit. The implication, of course, is that in this case there is nothing we could do to reconstruct the transmitted bit.
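
    The following sketch (Python; the parameter values are illustrative assumptions) simulates the channel and compares the empirical estimates of \(P(\text{received}=1)\) and \(P(\text{received}=1 \mid \text{transmitted}=1)\). With \(\epsilon = \frac12\) the two should agree up to sampling noise; try other values of \(\epsilon\) as well.

    ```python
    import random

    def bsc_samples(p, eps, n=100_000, seed=0):
        """Simulate n uses of a binary symmetric channel.

        The transmitter sends 1 with probability p; the channel flips the
        transmitted bit with probability eps. Returns (transmitted, received).
        """
        rng = random.Random(seed)
        tx = [1 if rng.random() < p else 0 for _ in range(n)]
        rx = [t ^ (rng.random() < eps) for t in tx]
        return tx, rx

    p, eps = 0.3, 0.5                     # illustrative values
    tx, rx = bsc_samples(p, eps)
    p_rx1 = sum(rx) / len(rx)
    p_rx1_given_tx1 = sum(r for t, r in zip(tx, rx) if t == 1) / sum(tx)
    print(f"P(received = 1)                   = {p_rx1:.3f}")
    print(f"P(received = 1 | transmitted = 1) = {p_rx1_given_tx1:.3f}")
    ```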

  3. We set up QuickSort in the last module; please review that setup, in which you constructed the probability space for a sequence of length 4. In the first iteration, you would choose a pivot, say \(p_0\), and break the sequence into two, calling the procedure recursively on each subsequence.

    1. Let \(L_1\) and \(L_2\) be the lengths of the two subsequences when you split the length-4 sequence. What is the distribution of \(L_1\)? Of \(L_2\)? Is it intuitive that their distributions should be related in that way?

    2. This anticipates a future topic: given \(L_1\), what is the distribution of \(L_2\)?

    3. When QuickSort is called on the two subsequences of lengths \(L_1\) and \(L_2\), each call produces a new pivot, say \(p_1\) and \(p_2\) respectively. Conditioned on any choice of \(p_0\), the two pivots are chosen independently. But are \(p_1\) and \(p_2\) independent? (You can tell whether they are independent without any calculation.)

    4. This question anticipates a future topic, but what is the joint distribution of \(L_1\) and \(L_2\)? We have not formally defined the joint distribution yet, but this problem helps you appreciate the need for one.

    We are hand-calculating everything for now because we want to understand what is happening; consequently, we are keeping the sequence length small. You could easily simulate everything we are doing here for larger sequence lengths, but such simulations not only tend to be overkill, they also may not reveal much insight, because they carry too much detail (like the distribution of \(p_1\) and \(p_2\)) that is not really that important. If we are interested in running time, we will be able to identify a few summaries to keep track of (we will learn about them in Expectations of Random Variables). We can often hand-calculate or estimate these summaries better and faster than any simulation.
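
    As a concrete example of the kind of simulation mentioned above (here really an exact enumeration, since the pivot is uniform over only four positions), the following Python sketch tabulates the distribution of \(L_1\) and the joint distribution of \((L_1, L_2)\). The particular sequence used is an illustrative assumption.

    ```python
    from collections import defaultdict
    from fractions import Fraction

    # A length-4 sequence of distinct elements; the first pivot p0 is taken
    # to be uniformly random over the four elements.
    seq = [3, 1, 4, 2]

    dist_L1 = defaultdict(Fraction)
    joint = defaultdict(Fraction)
    for p0 in seq:
        L1 = sum(x < p0 for x in seq)     # length of the "below the pivot" piece
        L2 = sum(x > p0 for x in seq)     # length of the "above the pivot" piece
        dist_L1[L1] += Fraction(1, len(seq))
        joint[(L1, L2)] += Fraction(1, len(seq))

    print("P(L1 = k):         ", dict(dist_L1))
    print("P(L1 = k, L2 = m): ", dict(joint))
    ```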

  4. AI and machine learning approaches almost always have a probabilistic setup. We considered a classification problem in the prior module, where a threshold (a fixed but unknown number \(T\)) splits the line into two (infinite-length) intervals, the left side of \(T\) carrying label -1 and the right side carrying label +1. The threshold \(T\) is unknown, and the learning algorithm is supposed to figure it out.

    Training points are generated by a pdf \(f\), which was left unspecified last module, but which we will now specify. To generate a training point from \(f\), first toss a biased coin whose two sides are \(-1\) and \(+1\), and whose probability of landing on \(+1\) is \(p\). If the outcome of the coin toss is \(-1\), we choose a number from \(T-1\) to \(T\), using a uniform pdf. If the coin toss is \(+1\), we choose a number from \(T\) to \(T+1\), using a uniform pdf.

    Our learning algorithm sees a training point \(x\) and its label. It does not know \(f\), nor does it know \(T\). If the label of the training point \(x\) is -1, the algorithm outputs \(x+\frac12\) as its estimate of the threshold; if the label of the training point \(x\) is +1, the algorithm outputs \(x-\frac12\). The idea is that since the learning algorithm does not know \(T\), it uses its estimate of the threshold to classify points instead.

    1. What is the generalization error (the probability that another point generated from \(f\) is misclassified by the threshold estimated by the learning algorithm) given a specific training point \(x\)?

    2. What is the generalization error? The distinction between this and the prior question motivates the topics in the next module.
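
    To experiment with this problem, here is a minimal Monte Carlo sketch (Python; the values of \(T\) and \(p\) are illustrative assumptions) that estimates the error conditioned on one specific training point, and then averages that conditional error over many random training points.

    ```python
    import random

    def draw_point(T, p, rng):
        """Draw (x, label) from f: label +1 w.p. p with x ~ Unif(T, T+1),
        otherwise label -1 with x ~ Unif(T-1, T)."""
        if rng.random() < p:
            return rng.uniform(T, T + 1), +1
        return rng.uniform(T - 1, T), -1

    def error_given_training_point(x, label, T, p, rng, n=20_000):
        """Monte Carlo estimate of the generalization error for a fixed training point."""
        T_hat = x + 0.5 if label == -1 else x - 0.5
        wrong = 0
        for _ in range(n):
            y, true_label = draw_point(T, p, rng)
            predicted = +1 if y > T_hat else -1
            wrong += (predicted != true_label)
        return wrong / n

    rng = random.Random(0)
    T, p = 0.0, 0.4                       # illustrative values

    # Part 1: error conditioned on a single, specific training point.
    x, label = draw_point(T, p, rng)
    err_x = error_given_training_point(x, label, T, p, rng)
    print(f"training point x = {x:.3f} (label {label:+d}): error = {err_x:.4f}")

    # Part 2: average the conditional error over many random training points.
    trials = 500
    avg = sum(error_given_training_point(*draw_point(T, p, rng), T, p, rng, n=2_000)
              for _ in range(trials)) / trials
    print(f"overall generalization error = {avg:.4f}")
    ```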