To develop a new formulation for the
sequence comparison problem, we will need
to travel to 19th century Japan
and visit one of the Yakuza-run
casinos called Bakuto.
The Yakuza Crime Syndicate had
a humble beginning, where their first
business operation was running
the network of frame shift casinos.
And in these casinos,
the favorite game was Cho-han,
when the dealer would toss two dice,
and the visitors would
bet on the outcomes,
the outcomes being the sum of
numbers on these two dice.
So “cho” means, as you probably guessed,
“even”, and “han” means “odd”.
This is, of course, a fair game.
If the dice are not loaded, then there
is 50/50 probability of winning.
However, in Bakuto casinos,
the dealers often used biased dice,
And in this case, the outcome
may be shifted towards the
casino owner or towards insider backers.
Also, it may be fun ro play Cho-han,
in a Yakuza run casino.
We will play an equivalent game,
where the dealer simply flips a coin,
and we bet on heads or tails.
The challenge here
is that the dealer actually has
a choice of two coins: fair and biased.
For the fair coin, the probability of
heads is 1/2 of course, and for
the biased coin,
the probability of heads is 3/4.
On this slide,
I show the coins in blue and
green, but in reality,
the coins are completely identical, so
you have no clue which
coin the dealer used.
The question arises:
Suppose the dealer made 100 flips and
63 of them are heads;
which coin did the dealer use,
fair or biased?
This question, of course,
doesn’t make sense
because any coin, fair or
biased, can result in 63 heads.
The right question to ask is:
What coin is more likely
if 63 out of 100 flips resulted in heads?
How do we answer this question?
Here’s a hint:
63 is a little bit closer to 75 than 50.
Should we then assume that the biased coin
is more likely in this series of flips?
Before we come to a conclusion,
let’s try to estimate some probabilities.
Let’s consider a sequence of n
flips, denoted X=x1 x2 … xn,
with k heads.
The probability that this sequence
was generated by the fair coin
is of course (1/2)^n.
But the probability that it was
generated by the biased coin
is (3/4)^k * (1/4)^(n-k)
How can we decide what
coin is more likely?
Of course, if the probability of X
being generated by the fair coin
is larger than the probability of X
being generated by the biased coin,
then the fair coin is more likely.
Let’s refer to these probabilities
as the blue probability and
green probability, respectively,
when we talk about the probability of
X being generated by the fair coin or biased coin.
And likewise, if the blue probability
is smaller than the green probability,
then the biased coin is more likely.
However, when these probabilities are the same,
when in equilibrium state, when
blue probability=green probability,
we can compute, by doing some
arithmetic, that in this case,
k=log_2(3) * n
Or, in other words, k will be approximately
equal to 0.632 multiplied by n.
Which means that when the number of heads,
or the fraction of the number of
heads is smaller than 0.632,
it means that the dealer is more
likely to have used the fair coin.
Therefore, our original intuition
actually was incorrect.
If there are 63 heads in
the series of 100 coin tosses,
then the dealer was more
likely to use the fair coin,
even though 63 is closer to 75 than to 50.
The dealers in traditional
Yakuza-run casinos were
shirtless to avoid accusation
of tampering the dice
because a shirtless dealer is less likely
to switch the fair dice into the biased dice.
But the reality was that even shirtless
dealers were able to switch the dice,
and we will model these situations with
two coins up the dealer’s sleeve,
and instead of the previous case,
where the dealer ran a serious of
coin flips, but always used
either the fair or the biased coin,
we will assume that the dealer now
can change the coin at any moment,
with probability 0.1.
So, after watching a sequence of flips,
can you tell when the dealer
was the fair coin and
when he was using the biased coin?
Our attempt to read the dealer’s
mind brings us to the following
casino problem: Given a sequence
of coin flips, determine
when the dealer was using the fair coin
and when he was using the biased coin.
Do you think it is a well-formulated
Of course it is not
because any outcome of coin tosses
could have been generated by any
combination of fair and biased coins.
And therefore, we need to grade different
scenarios, for example BBBBB, FFFFF,
depending on how likely they are.
But if we try to do this,
then the question arises:
How can we explore and grade
2^n possible scenarios for coin tosses?
Here’s a simple approach to reading
the dealer’s mind one window at a time.
Let’s choose a small window,
let’s say a window of five tosses,
and for each such window,
let’s decide which coin is more likely
to generate the outcome in this window.
We already know how to
answer this question.
For example, for this window,
HHHTH, there are 80% heads,
therefore it’s most likely to have
been generated by the biased coin.
For the second window,
we have only 60% of heads,
and therefore, this is more likely
to be generated by the fair coin.
We will actually use a ratio of blue and
and we will decide what coin
generated the given outcome
by simply computing this ratio.
If this ratio is smaller than 1,
then the biased coin is more likely.
If this ratio is higher than 1,
then the fair coin is more likely.
And we will also introduce log-odds ratio,
which is simply the base-two logarithm
of this ratio, and in this case,
as we know, the base-two logarithm
of the ratio equals
(number of tosses) – log_2(3) * (number heads)
and our decision rule for deciding whether
a given window was generated by the
biased or fair coin will simply be:
If the log-odds ratio is less than zero,
then it is a biased coin.
If it is larger than zero,
then it is a fair coin, and
it is represented by this simple rule.
So, to continue reading the dealer’s mind,
we will go through the whole sequence,
and every time computing which coin
is more likely, and then we will classify
all these windows into being more likely
to be generated by the biased or the fair coin.
The question however, arises:
What are the disadvantages of this approach?
The obvious disadvantage of the sliding
window approach is that different windows
may classify the same flip differently.
Also, the choice of the window length
always affects our conclusions.
But how do we know how to
choose the window length?
And all that may be an interesting
introduction to makeshift casinos and
but what does it have
to do with biology?
To explain what coin flipping has to do
with biological applications, we will start
from a simple example of CG islands and
then move to more complex examples.
The question we will ask:
Why are CG dinucleotides more rare than
GC dinucleotides in genomic sequences?
Well, different species have widely
different GC-content, or
percentages of (G+C)
nucleotides in the genome.
For example, gorilla and human have GC
content 46% while platypus
has GC content 58%,
which means that, on average,
if the distribution was uniform, you
would expect that both nucleotides G and C
appear in the human genome
with a probability of 23%.
And therefore, you would expect
that each of the dinucleotides
(CC, CG, GC, and GG) appears in
the genome with frequency 5.29%.
But the reality is that the frequency of
CG in the human genome is 5 times smaller.
Methylation is a DNA modification
that adds methyl CH3 group
to the cytosine nucleotide, and
it often happens to a C
in a CG dinucleotide.
The resulting methylated cytosine
has a tendency to deaminate into
thymine, and as a result of methylation,
CG is the least frequent
dinucleotide in many genomes.
However, methylation is often
suppressed in the areas around
genes called CG-islands,
where CG appears frequently.
For this example,
this would be a CG-island.
And the question arises:
Suppose you want to find genes.
How would you search for CG-islands
as a prerequisite for finding genes?
Well, if we were to use the same
paradigm of Log-odds ratio,
and simply classify CG-islands as
more likely in areas where this
Log-odds ratio is less than 0,
and non-CG island as
more likely in the areas where
Log-odds ratio is larger than 0,
then we will arrive at a classification
algorithm for CG-islands.
However, there are a few issues.
different windows may classify the same
position in the genome differently.
It is not clear how to choose the length
of the window to construct CG-islands,
and very importantly it is not clear
whether we should use the same
length of windows for different
regions of the genome.
And in the next section,
I will describe the notion of
Hidden Markov Models that provides a new,
better paradigm, for
problems like finding CG Islands and
many, many other problems in bioinformatics.