To develop a new formulation for the sequence comparison problem, we will need to travel to 19th century Japan and visit one of the Yakuza-run casinos called Bakuto. The Yakuza Crime Syndicate had a humble beginning, where their first business operation was running the network of frame shift casinos. And in these casinos, the favorite game was Cho-han, when the dealer would toss two dice, and the visitors would bet on the outcomes, the outcomes being the sum of numbers on these two dice.
So “cho” means, as you probably guessed, “even”, and “han” means “odd”. This is, of course, a fair game. If the dice are not loaded, then there is 50/50 probability of winning. However, in Bakuto casinos, the dealers often used biased dice, “loaded” dice. And in this case, the outcome may be shifted towards the casino owner or towards insider backers.
Also, it may be fun ro play Cho-han, in a Yakuza winward casino. We will play an equivalent game, where the dealer simply flips a coin, and we bet on heads or tails. The challenge here is that the dealer actually has a choice of two coins: fair and biased.
For the fair coin, the probability of heads is 1/2 of course, and for the biased coin, the probability of heads is 3/4. On this slide, I show the coins in blue and green, but in reality, the coins are completely identical, so you have no clue which coin the dealer used. The question arises: Suppose the dealer made 100 flips and 63 of them are heads; which coin did the dealer use, fair or biased? This question, of course, doesn’t make sense because any coin, fair or biased, can result in 63 heads.
The right question to ask is: What coin is more likely if 63 out of 100 flips resulted in heads? How do we answer this question? Here’s a hint: 63 is a little bit closer to 75 than 50.
Should we then assume that the biased coin is more likely in this series of flips? Before we come to a conclusion, let’s try to estimate some probabilities. Let’s consider a sequence of n flips, denoted X = x1 x2 … xn, with k heads. The probability that this sequence was generated by the fair coin is of course (1/2)^n. But the probability that it was generated by the biased coin is (3/4)^k * (1/4)^(n-k) How can we decide what coin is more likely?
Of course, if the probability of X being generated by the fair coin is larger than the probability of X being generated by the biased coin, then the fair coin is more likely. Let’s refer to these probabilities as the blue probability and green probability, respectively, when we talk about the probability of X being generated by the fair coin or biased coin. And likewise, if the blue probability is smaller than the green probability, then the biased coin is more likely. However, when these probabilities are the same, when in equilibrium state, when blue probability = green probability, we can compute, by doing some arithmetic, that in this case, k = log_2(3) * n Or, in other words, k will be approximately equal to 0.632 multiplied by n. Which means that when the number of heads, or the fraction of the number of heads is smaller than 0.632, it means that the dealer is more likely to have used the fair coin. Therefore, our original intuition actually was incorrect. If there are 63 heads in the series of 100 coin tosses, then the dealer was more likely to use the fair coin, even though 63 is closer to 75 than to 50.
The dealers in traditional Yakuza-run casinos were shirtless to avoid accusation of tampering the dice because a shirtless dealer is less likely to switch the fair dice into the biased dice.
But the reality was that even shirtless dealers were able to switch the dice, and we will model these situations with two coins up the dealer’s sleeve, and instead of the previous case, where the dealer ran a serious of coin flips, but always used either the fair or the biased coin, we will assume that the dealer now can change the coin at any moment, with probability 0.1. So, after watching a sequence of flips, can you tell when the dealer was the fair coin and when he was using the biased coin? Our attempt to read the dealer’s mind brings us to the following casino problem: Given a sequence of coin flips, determine when the dealer was using the fair coin and when he was using the biased coin.
Do you think it is a well-formulated computational problem? Of course it is not because any outcome of coin tosses could have been generated by any combination of fair and biased coins. And therefore, we need to grade different scenarios, for example BBBBB, FFFFF, BBFFBB, differently, depending on how likely they are. But if we try to do this, then the question arises: How can we explore and grade 2^n possible scenarios for coin tosses? Here’s a simple approach to reading the dealer’s mind one window at a time. Let’s choose a small window, let’s say a window of five tosses, and for each such window, let’s decide which coin is more likely to generate the outcome in this window.
We already know how to answer this question. For example, for this window, HHHTH, there are 80% heads, therefore it’s most likely to have been generated by the biased coin. For the second window, we have only 60% of heads, and therefore, this is more likely to be generated by the fair coin. We will actually use a ratio of blue and green probabilities, and we will decide what coin generated the given outcome by simply computing this ratio.
If this ratio is smaller than 1, then the biased coin is more likely. If this ratio is higher than 1, then the fair coin is more likely. And we will also introduce log-odds ratio, which is simply the base-two logarithm of this ratio, and in this case, as we know, the base-two logarithm of the ratio equals (number of tosses) – log_2(3) * (number heads) and our decision rule for deciding whether a given window was generated by the biased or fair coin will simply be: If the log-odds ratio is less than zero, then it is a biased coin. If it is larger than zero, then it is a fair coin, and it is represented by this simple rule. So, to continue reading the dealer’s mind, we will go through the whole sequence, and every time computing which coin is more likely, and then we will classify all these windows into being more likely to be generated by the biased or the fair coin.
The question however, arises: What are the disadvantages of this approach? The obvious disadvantage of the sliding window approach is that different windows may classify the same flip differently. Also, the choice of the window length always affects our conclusions. But how do we know how to choose the window length? And all that may be an interesting introduction to makeshift casinos and coin flipping, but what does it have to do with biology? To explain what coin flipping has to do with biological applications, we will start from a simple example of CG islands and then move to more complex examples.
The question we will ask: Why are CG dinucleotides more rare than GC dinucleotides in genomic sequences? Well, different species have widely different GC-content, or percentages of (G+C) nucleotides in the genome. For example, gorilla and human have GC content 46% while platypus has GC content 58%, which means that, on average, if the distribution was uniform, you would expect that both nucleotides G and C appear in the human genome with a probability of 23%. And therefore, you would expect that each of the dinucleotides (CC, CG, GC, and GG) appears in the genome with frequency 5.29%.
But the reality is that the frequency of CG in the human genome is 5 times smaller. Why? Methylation is a DNA modification that adds methyl CH3 group to the cytosine nucleotide, and it often happens to a C in a CG dinucleotide. The resulting methylated cytosine has a tendency to deaminate into thymine, and as a result of methylation, CG is the least frequent dinucleotide in many genomes. However, methylation is often suppressed in the areas around genes called CG-islands, where CG appears frequently.
For this example, this would be a CG-island. And the question arises: Suppose you want to find genes. How would you search for CG-islands as a prerequisite for finding genes? Well, if we were to use the same paradigm of Log-odds ratio, and simply classify CG-islands as more likely in areas where this Log-odds ratio is less than 0, and non-CG island as more likely in the areas where Log-odds ratio is larger than 0, then we will arrive at a classification algorithm for CG-islands. However, there are a few issues.
As before, different windows may classify the same position in the genome differently. It is not clear how to choose the length of the window to construct CG-islands, and very importantly it is not clear whether we should use the same length of windows for different regions of the genome. And in the next section, I will describe the notion of Hidden Markov Models that provides a new, better paradigm, for problems like finding CG Islands and many, many other problems in bioinformatics.