To develop a new formulation for the

sequence comparison problem, we will need

to travel to 19th century Japan

and visit one of the Yakuza-run

casinos called Bakuto.

The Yakuza Crime Syndicate had

a humble beginning, where their first

business operation was running

the network of frame shift casinos.

And in these casinos,

the favorite game was Cho-han,

when the dealer would toss two dice,

and the visitors would

bet on the outcomes,

the outcomes being the sum of

numbers on these two dice.

So “cho” means, as you probably guessed,

“even”, and “han” means “odd”.

This is, of course, a fair game.

If the dice are not loaded, then there

is 50/50 probability of winning.

However, in Bakuto casinos,

the dealers often used biased dice,

“loaded” dice.

And in this case, the outcome

may be shifted towards the

casino owner or towards insider backers.

Also, it may be fun ro play Cho-han,

in a Yakuza run casino.

We will play an equivalent game,

where the dealer simply flips a coin,

and we bet on heads or tails.

The challenge here

is that the dealer actually has

a choice of two coins: fair and biased.

For the fair coin, the probability of

heads is 1/2 of course, and for

the biased coin,

the probability of heads is 3/4.

On this slide,

I show the coins in blue and

green, but in reality,

the coins are completely identical, so

you have no clue which

coin the dealer used.

The question arises:

Suppose the dealer made 100 flips and

63 of them are heads;

which coin did the dealer use,

fair or biased?

This question, of course,

doesn’t make sense

because any coin, fair or

biased, can result in 63 heads.

The right question to ask is:

What coin is more likely

if 63 out of 100 flips resulted in heads?

How do we answer this question?

Here’s a hint:

63 is a little bit closer to 75 than 50.

Should we then assume that the biased coin

is more likely in this series of flips?

Before we come to a conclusion,

let’s try to estimate some probabilities.

Let’s consider a sequence of n

flips, denoted X=x1 x2 … xn,

with k heads.

The probability that this sequence

was generated by the fair coin

is of course (1/2)^n.

But the probability that it was

generated by the biased coin

is (3/4)^k * (1/4)^(n-k)

How can we decide what

coin is more likely?

Of course, if the probability of X

being generated by the fair coin

is larger than the probability of X

being generated by the biased coin,

then the fair coin is more likely.

Let’s refer to these probabilities

as the blue probability and

green probability, respectively,

when we talk about the probability of

X being generated by the fair coin or biased coin.

And likewise, if the blue probability

is smaller than the green probability,

then the biased coin is more likely.

However, when these probabilities are the same,

when in equilibrium state, when

blue probability=green probability,

we can compute, by doing some

arithmetic, that in this case,

k=log_2(3) * n

Or, in other words, k will be approximately

equal to 0.632 multiplied by n.

Which means that when the number of heads,

or the fraction of the number of

heads is smaller than 0.632,

it means that the dealer is more

likely to have used the fair coin.

Therefore, our original intuition

actually was incorrect.

If there are 63 heads in

the series of 100 coin tosses,

then the dealer was more

likely to use the fair coin,

even though 63 is closer to 75 than to 50.

The dealers in traditional

Yakuza-run casinos were

shirtless to avoid accusation

of tampering the dice

because a shirtless dealer is less likely

to switch the fair dice into the biased dice.

But the reality was that even shirtless

dealers were able to switch the dice,

and we will model these situations with

two coins up the dealer’s sleeve,

and instead of the previous case,

where the dealer ran a serious of

coin flips, but always used

either the fair or the biased coin,

we will assume that the dealer now

can change the coin at any moment,

with probability 0.1.

So, after watching a sequence of flips,

can you tell when the dealer

was the fair coin and

when he was using the biased coin?

Our attempt to read the dealer’s

mind brings us to the following

casino problem: Given a sequence

of coin flips, determine

when the dealer was using the fair coin

and when he was using the biased coin.

Do you think it is a well-formulated

computational problem?

Of course it is not

because any outcome of coin tosses

could have been generated by any

combination of fair and biased coins.

And therefore, we need to grade different

scenarios, for example BBBBB, FFFFF,

BBFFBB, differently,

depending on how likely they are.

But if we try to do this,

then the question arises:

How can we explore and grade

2^n possible scenarios for coin tosses?

Here’s a simple approach to reading

the dealer’s mind one window at a time.

Let’s choose a small window,

let’s say a window of five tosses,

and for each such window,

let’s decide which coin is more likely

to generate the outcome in this window.

We already know how to

answer this question.

For example, for this window,

HHHTH, there are 80% heads,

therefore it’s most likely to have

been generated by the biased coin.

For the second window,

we have only 60% of heads,

and therefore, this is more likely

to be generated by the fair coin.

We will actually use a ratio of blue and

green probabilities,

and we will decide what coin

generated the given outcome

by simply computing this ratio.

If this ratio is smaller than 1,

then the biased coin is more likely.

If this ratio is higher than 1,

then the fair coin is more likely.

And we will also introduce log-odds ratio,

which is simply the base-two logarithm

of this ratio, and in this case,

as we know, the base-two logarithm

of the ratio equals

(number of tosses) – log_2(3) * (number heads)

and our decision rule for deciding whether

a given window was generated by the

biased or fair coin will simply be:

If the log-odds ratio is less than zero,

then it is a biased coin.

If it is larger than zero,

then it is a fair coin, and

it is represented by this simple rule.

So, to continue reading the dealer’s mind,

we will go through the whole sequence,

and every time computing which coin

is more likely, and then we will classify

all these windows into being more likely

to be generated by the biased or the fair coin.

The question however, arises:

What are the disadvantages of this approach?

The obvious disadvantage of the sliding

window approach is that different windows

may classify the same flip differently.

Also, the choice of the window length

always affects our conclusions.

But how do we know how to

choose the window length?

And all that may be an interesting

introduction to makeshift casinos and

coin flipping,

but what does it have

to do with biology?

To explain what coin flipping has to do

with biological applications, we will start

from a simple example of CG islands and

then move to more complex examples.

The question we will ask:

Why are CG dinucleotides more rare than

GC dinucleotides in genomic sequences?

Well, different species have widely

different GC-content, or

percentages of (G+C)

nucleotides in the genome.

For example, gorilla and human have GC

content 46% while platypus

has GC content 58%,

which means that, on average,

if the distribution was uniform, you

would expect that both nucleotides G and C

appear in the human genome

with a probability of 23%.

And therefore, you would expect

that each of the dinucleotides

(CC, CG, GC, and GG) appears in

the genome with frequency 5.29%.

But the reality is that the frequency of

CG in the human genome is 5 times smaller.

Why?

Methylation is a DNA modification

that adds methyl CH3 group

to the cytosine nucleotide, and

it often happens to a C

in a CG dinucleotide.

The resulting methylated cytosine

has a tendency to deaminate into

thymine, and as a result of methylation,

CG is the least frequent

dinucleotide in many genomes.

However, methylation is often

suppressed in the areas around

genes called CG-islands,

where CG appears frequently.

For this example,

this would be a CG-island.

And the question arises:

Suppose you want to find genes.

How would you search for CG-islands

as a prerequisite for finding genes?

Well, if we were to use the same

paradigm of Log-odds ratio,

and simply classify CG-islands as

more likely in areas where this

Log-odds ratio is less than 0,

and non-CG island as

more likely in the areas where

Log-odds ratio is larger than 0,

then we will arrive at a classification

algorithm for CG-islands.

However, there are a few issues.

As before,

different windows may classify the same

position in the genome differently.

It is not clear how to choose the length

of the window to construct CG-islands,

and very importantly it is not clear

whether we should use the same

length of windows for different

regions of the genome.

And in the next section,

I will describe the notion of

Hidden Markov Models that provides a new,

better paradigm, for

problems like finding CG Islands and

many, many other problems in bioinformatics.