The Data Game: Visualizing IP & Gambling with Quova

TANCRED: …the kind of data experiment we
did, looking at a lot of our data in relation
to our gambling customers right around the
time of the World Cup. Basically to see what
we could see. I run Product Management at
Quova. Tobias, I’ll let you introduce yourself.
>>SPECKBACHER: All right. So, I’m Tobias
Speckbacher. I’m the VP of Emerging Technologies
at Quova, which really means that I get to
work with lots of different companies these
days that are pre-product or on their first product
line, to see how they fit into our infrastructure
to make things work there. I’ve been with
Quova for 10 years. I’ve had multiple roles
there, so I’ve run through pretty much all the
technical positions that we have there, from
research to operations. And recently, I moved
into this role. And that’s about it.
>>TANCRED: Yeah. So, a little bit about Quova.
Quova provides information about IP addresses
and we provide geographic and network information.
And what our customers do with that information
is basically provide richer, more engaging,
more relevant experiences for their users.
So, whether it’s geo-targeting and other kinds
of targeting for search and other kinds of
advertising, or financial services and e-commerce
companies helping to mitigate the risk of
fraud. You have video on-demand and sports
companies who stream live video content and
other rich content. One of the reasons they’re
able to do that with copyrighted content is
because solutions like ours allow them to
comply with the regulations and the other
contracts they have that restrict them from
streaming content in other places. Major League
Baseball is an example in the U.S. where legislation
actually prevents them from streaming live
games in markets where they’ve sold the rights
to broadcasters. So the reason they can stream
live games is because they can tell where
you are if you’re in a home market and restrict
that content. And that’s how our gaming
customers use it as well. Gambling obviously
has dif–online gambling has different restrictions
in different places around the world. The
reason you can gamble online where it is legal
is because online gambling companies can tell
whether you’re not–whether you’re in somewhere
where it’s legal or not. Yes, so, as I said
we took–we have a number of gambling customers
mostly in the U.K., but all over the world.
And we took some of their data and looked
at it in relation to right around the time
of the World Cup as I said. So, a little bit
about–Tobias will talk a little bit about
the methodology in our data.
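The access check described above can be sketched in a few lines. This is a minimal illustration, not Quova’s API: `SAMPLE_BLOCKS`, `ALLOWED_COUNTRIES`, `lookup_country` and `may_gamble` are hypothetical names, and the small table stands in for a real IP geolocation lookup.

```python
import ipaddress

# Hypothetical lookup table standing in for a real IP geolocation service.
SAMPLE_BLOCKS = {
    ipaddress.ip_network("192.0.2.0/24"): "gb",
    ipaddress.ip_network("198.51.100.0/24"): "us",
}

# Hypothetical set of countries where this operator may legally take bets.
ALLOWED_COUNTRIES = {"gb"}


def lookup_country(ip):
    """Return the country code for an address, or None if it is unknown."""
    addr = ipaddress.ip_address(ip)
    for block, country in SAMPLE_BLOCKS.items():
        if addr in block:
            return country
    return None


def may_gamble(ip):
    """Allow access only when the address resolves to an allowed country."""
    return lookup_country(ip) in ALLOWED_COUNTRIES
```

An address that cannot be located is denied here, which is the conservative choice for a restricted service.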
>>SPECKBACHER: Okay. So, the way we get the
data is we have what we call a closed feedback
system: basically, as customers use
our data, we get individual transaction data
back from them, which we use for accounting
purposes, but also to focus our research efforts.
So, if you have the IP–the Internet, the
IPv4 space, basically it’s 4.2 billion addresses.
Not all of those are assigned so there’s about
2.8 billion addresses that are assigned right
now. But again, not all of those are used
by actual users; a lot of that is infrastructure
space. And the majority of that traffic comes
from a subset of that. So, we use that feedback
data as a significant sample to target the
areas that are important for our customers
to focus our research on. The other thing
that we do is as we get that data back, we
release data–we release our IP data every
week. As we get that data back, we join individual
IP addresses back onto all the dimensions
that we have available on that specific IP
address or the network at large and we store
that. So, we can basically perform dimensional
analysis across all the feedback data that
we receive. And that’s about 30 billion queries
per month. That again is a subset of the queries
that are actually performed against our data.
The actual number is probably, you know, way
north of 100 billion a month, because some
customers have higher performance requirements
and choose to implement it in a way that
doesn’t allow them to give feedback data
back to us. What else is there?
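The join described above, where each queried address is matched back onto the dimensions known for its network block and then counted along any dimension, might look roughly like this. It is a sketch under stated assumptions: the block table, the dimension names (`country`, `connection`) and the helper names are hypothetical, and a production system would use an indexed lookup rather than a linear scan.

```python
from collections import Counter
import ipaddress

# Size of the full IPv4 address space (the "4.2 billion" quoted above).
IPV4_SPACE = 2 ** 32

# Hypothetical weekly IP data: network block -> known dimensions.
IP_DIMENSIONS = {
    ipaddress.ip_network("192.0.2.0/24"): {"country": "gb", "connection": "dsl"},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "de", "connection": "cable"},
    ipaddress.ip_network("203.0.113.0/24"): {"country": "us", "connection": "fixed"},
}


def lookup(ip):
    """Return the dimensions of the block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for block, dims in IP_DIMENSIONS.items():
        if addr in block:
            return dims
    return None


def join_feedback(queried_ips):
    """Attach dimensions to each queried address; drop unknown addresses."""
    rows = []
    for ip in queried_ips:
        dims = lookup(ip)
        if dims is not None:
            rows.append({"ip": ip, **dims})
    return rows


def count_by(rows, dimension):
    """Count joined rows along one dimension, like a GROUP BY."""
    return Counter(row[dimension] for row in rows)


feedback = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "203.0.113.9", "10.0.0.1"]
rows = join_feedback(feedback)
by_country = count_by(rows, "country")
```

Storing the joined rows, rather than the raw addresses, is what makes it possible to slice the same feedback data by country, carrier or connection type later.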
>>TANCRED: So, some of the–some of the information
that we–that we assign to an IP address includes
geographic information from continent down
to postal code. And then the network characteristics
we assign are things like the carrier or ISP,
the organization that is responsible for the
content of the network, the domain, the speed
of the connection, how the connection is routed
through the Internet, whether that IP
address is associated with anonymizing activity
and things like that. And we have this data
going back pretty much since we started, about
10 years worth of data. You can imagine there’s
a lot of data there. Because we have so much
data–well, one of the reasons we haven’t
looked at it until now in the way that we’ve started
to look at it is because dealing with all
this data is kind of onerous; there’s a lot
of data to deal with. And so, we’ll talk a
little bit about the technologies that we
used to actually aggregate some of the data,
so it’s easier to report against and also
just–just mine it. A little bit about gambling
though, so online gambling has a kind of a
storied history. I mentioned the reason you
can gamble online is because these companies
can now tell whether you’re somewhere where
it’s legal. Back in 2006, you saw stories
especially about European companies and executives
of European companies being indicted in the
U.S. because they were breaking U.S. laws
by allowing U.S. citizens to gamble
from within the U.S. So,
being able to tell where users are coming
from is critical to industries like gambling.
And gambling in general, online gambling is
a growth market, so it’s–it represents 8%
of the total market or did last year, which
is–which is significant in terms of the market.
And it’s also growing. So it’s growing 13%
per year and is projected to reach about $36 billion
by 2012, which is a large market. And this
is all according to H2, which is a sort of
industry–a gambling industry analyst. And
because of the legality, it’s mainly in Europe
and Asia that you see online gambling. That’s
not to say there isn’t any gambling in North
America though and in the U.S. In the U.S.,
gambling is traditionally legislated by states,
different states have different laws. You
can actually gamble online. You can do things
like you can bet on horse races in certain
states. And you can play Poker online for
money in some cases. But what’s happening
in the U.S., there’s legislation now being
passed to allow certain kinds of gambling
across the U.S. It will still be regulated
by states. And one of the reasons that’s happening
is because there’s other laws being passed
to allow that gambling to be taxed. And of
course, once you’ve, you know, it is a significant
market. Once you start taxing it, of course,
it represents a significant revenue stream
for the government. So it–that’s one reason
why you’re going to see that. And what you’re
seeing now is some of those companies, those
same companies that were in trouble in 2006
are coming to the U.S. and they’re either
setting up shop or buying some of the existing
gambling organizations in the U.S. So–and
gambling is interesting because it has–because
it’s worldwide it has a lot of the aspects
that make IP address geolocation interesting.
You need to localize the language to the–your
customer. You need to know where your customers
are coming from so you can market to them.
And you need to–you need to restrict the
access. And there’s also a lot of fraud involved,
especially during these big events, what you
see is online gambling houses, especially
smaller ones will be blackmailed by fraudsters
who’ll say, you know, “I’ve set up a system
that can take down your gambling site and
I’m going to do that, you know, during the
World Cup unless you, you know, unless you
pay me X amount of money.” And so it’s really
important to them, you know, that you–some
of these sites have been destroyed because
they’ve ignored these threats. Some of them
just pay out, but it’s really important for
them to be able to understand what threats
are real and also help prevent those. So,
it has a nice broad application for IP Geo.
So, a little bit about how we went about this.
We worked with a design company called Stamen
Design. And they do a lot with really interesting
visualizations and they do a lot with geography.
They did the maps for the last Olympics I
think they’re doing the 2012 Olympics in London
as well. You can see some of the projects
they’ve done here, Crimespotting in Oakland
and San Francisco. It’s a project where you
can go and see real-time crime statistics
for those cities. They’re responsible for
[INDISTINCT] labs where you can see different
visualizations of big stories, log-on or logging-in
and wireless visualization. But they’re a
fantastic design company, they do great work.
And we knew that working with them we would
see–we would see the data in ways that we
hadn’t imagined we could see the data and
see things that–that we wouldn’t see otherwise.
One of the things–one of the ways that they
were able to work with a large dataset is
through the use of Solr, which you can talk
a little bit about.
>>SPECKBACHER: So, Solr is an Apache project
and it’s built on top of the Lucene engine.
It was developed at CNET in 2004, and
CNET donated it to the
Apache Foundation in 2006. What makes Solr
interesting for a project like this is that
it allows you to rapidly dive into the data.
It’s very fast to ingest data and index
over it, and it provides faceted search and date
faceting. So, faceting basically
correlates to a group-by operation that you
can run in some [INDISTINCT]. So, we’ve used
that to explore the data with Stamen. And
we’ll present some interesting visualizations
that we built on top of Solr, and we used some
innovative, newer graphing concepts for those.
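To make the faceting idea concrete, here is a minimal sketch that only builds a Solr query URL and does not contact a server. The host, the core name `feedback` and the field names `country` and `timestamp` are assumptions for illustration. The request parameters (`facet`, `facet.field`, `facet.range` and its `start`/`end`/`gap` options) are the ones documented for current Solr releases; which Solr version and parameters were used for this project is not stated in the talk.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, start, end, gap="+1HOUR"):
    """Build a Solr select URL that counts documents per country
    and per time bucket, returning no documents themselves."""
    params = [
        ("q", "*:*"),                    # match every document
        ("rows", "0"),                   # only the facet counts are wanted
        ("facet", "true"),
        ("facet.field", "country"),      # hypothetical field: group by country
        ("facet.range", "timestamp"),    # hypothetical field: bucket by time
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


url = build_facet_query(
    "http://localhost:8983/solr/feedback/select",
    "2010-06-05T00:00:00Z",
    "2010-06-14T00:00:00Z",
)
```

The field facet plays the role of the group-by, and the range facet produces the per-hour counts that a time-based graph needs.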
>>TANCRED: So, there are two kinds of graphs
that Stamen used with the data. The first
is a Horizon Graph which I’ll talk about in
a little more detail, and the second is a
stream graph, which may–you might be a little
bit more familiar with. I’ll talk about horizon
graphs first. Horizon graphs were introduced
in 2008 in a paper by these folks. Stephen
Few is a design blogger and consultant.
His site is Perceptual Edge. And he
wrote a paper talking specifically about the use
of Horizon Graphs by Panopticon, which is a commercial
business intelligence company. A lot of the images
you see are from Stephen’s paper. But it’s
a really interesting way to see data that
you would normally look at–temporal data
you might normally look at in a line graph
in a compressed form where you can start comparing
things and seeing things differently. So,
you have a traditional line graph and this
is a very good way to look at data over time
and you can see variations in data, peaks
and valleys. It’s pretty intuitive what these
mean. But it’s hard to compare one line graph
to another. You can start overlaying line
graphs, you can start putting them beside
each other, but it gets very busy, very quickly.
And you can see that this is–this is 50 stocks
over about a year in 2006 all with different
line graphs. And it’s impossible to really
see what’s going on with these line graphs,
to really compare what’s going on with them.
So, Horizon Graphs allow you to see the same
data but in a much compressed form. And the
way you do that is you draw a zero line in
the graph; ideally, somewhere in the middle
of the graph depending on your graph and you
color the space between the zero line and
the line. You color the space above the line
in one color, the space below the line in
another. And what you have is anywhere if
you look at the red spaces, anywhere above
the line, you have empty white space. And
so, you could leverage that white space by
essentially flipping the graph up. So, now
you’ve cut the graph in half. You can still
see the peaks and valleys through the color
and you can compress it further. So, this
graph is–it has six bands of color and you
can see the darker color on top. If you look
at those parts of darker color, those polygons
fit in the polygon below in every case. And
what you can do is basically compress them
down. So, what you wind up with is a graph
that takes up less than a fifth of the space but
it still gives you a very good sense of the
data. So, you can see by the intensity of
the color and the color itself whether the
data is positive or negative and how–where
the peaks and valleys are. So obviously, where
the color’s more intense the peaks and valleys
are higher and lower. So, if you could look
at that same graph of 50 stocks with Horizon
Graphs, you get a much richer picture of the
data. You can see individually how individual
stocks have performed, which ones have
done well and which ones haven’t,
and you can start to see trends temporally.
So, you can see these stocks are all performing
negatively in this timeframe and these are
performing positively. And that maybe gives
you some indication of where you might want
to look deeper into the data. The other good
thing about this is this–all these line graphs
are–it doesn’t really matter–it’s all relative.
So, you’re seeing relative peaks and valleys
instead of absolute numbers. So that you can
see, you know, you might have one stock at–that
trades at a very low price and another stock
that trades at a very high price, but you’ll
see the same trends because the data is all
relative. So that’s Horizon Graphs. So, if
you’ll look–so, you know, we’re dealing with
countries all around the world. These are
the line graphs of the countries. You can
start to see–well, first of all, you can’t
see many countries on one page. You can start
to see maybe some trends in terms of where
the peaks and valleys are, but it’s hard to
kind of see them. So, this is actually a single
color Horizon Graph, but you–this is Internet
traffic to gambling sites from different countries
around the world. And immediately you start
to see–and this is–just in about a week
before the World Cup. Immediately, you start
to see, like if you look at the right edge
of each of these columns, you see a lot of
activity there, which correlates with, you
know, the day before and the day of the World
Cup. And you still see individually where
you have a lot of activity. Like in Germany,
there’s always a lot of activity versus Guinea,
where there’s not a lot of activity until
the World Cup. So–and you have many more
countries here on this graph than you did
before. So, it’s a really powerful way to
see data, temporal data, when you’re looking
at lots of elements. So, this was really neat.
And it does show some trends. It really gets
interesting when we start looking at the stream
graphs though, so I’ll let Tobias talk about
the stream graphs.
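The horizon graph construction described in this turn (pick a zero line, mirror the values below it, slice the result into bands, then collapse the bands on top of each other) can be sketched as a data transformation. This is an illustration only, not the code used for the graphs in the talk; `horizon_bands` and its parameters are hypothetical names, and the drawing step is left out.

```python
def horizon_bands(values, num_bands, zero=0.0):
    """Split a series into horizon-graph layers.

    Returns (positive_layers, negative_layers). Each is a list of
    num_bands lists; layer k holds how much of band k every point
    fills, from 0.0 (empty) to 1.0 (full). Drawing all layers in the
    same strip, with darker shades for higher k, gives the collapsed
    horizon graph.
    """
    offsets = [v - zero for v in values]
    peak = max((abs(o) for o in offsets), default=0.0)
    band_height = peak / num_bands if peak > 0 else 1.0

    def slice_layers(magnitudes):
        layers = []
        for k in range(num_bands):
            floor = k * band_height
            layers.append(
                [min(max(m - floor, 0.0), band_height) / band_height
                 for m in magnitudes]
            )
        return layers

    positive = slice_layers([max(o, 0.0) for o in offsets])
    # Values below the zero line are mirrored ("flipped up").
    negative = slice_layers([max(-o, 0.0) for o in offsets])
    return positive, negative


series = [0.0, 1.0, 3.0, -2.0, -3.0]
pos, neg = horizon_bands(series, num_bands=3)
```

Because every band is scaled to the series’ own peak, the layers show relative peaks and valleys, which is why series with very different absolute levels can be compared row by row.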
>>SPECKBACHER: All right. So, stream graphs
are a type of Stacked Graph, a complex layered
graph. And it was developed by Lee Byron and
he developed it out of a personal interest
to visualize his listening habits on Last.fm
–Last.fm [INDISTINCT], lots
of different data about which music you listen
to, how often you do that. So, he tried to
do that with line graphs and different standard
visualization techniques and none of these
really brought a clear picture to the table.
So, he developed the stream graph concept,
which excels really when you’re trying to
present lots of data to a mass audience. It’s
not–it’s probably not–I mean, it’s not an
accurate–it’s not a highly statistical representation
of the data, but it gives you ideas of trends
and how the different layers behave independently.
In 2008, the New York Times published a stream
graph that showed the box-office ticket sales
performance of 7,500 movies over the past
21 years. And, so this was kind of the first
publication of stream graph that was very
popular. And it evoked different kinds of
emotions. So, probably more technical people
didn’t feel that good about it because it
doesn’t really give you a good quantitative
image of what’s going on. And less technical
people really like the representation because
it is very aesthetic and it lets you visually
explore the data much, much better than a
more accurate representation of the absolute
numbers. So, here’s an example.
>>TANCRED: We’ll help you.
>>SPECKBACHER: We’ll get them.
>>TANCRED: Basically–and this is actually
what we’ll walk through. What the stream graphs
do is they let you start seeing trends and
then depending on your system, you can start
drilling down into the data either with more
stream graphs, which is what we’ll do or other
data. So, this graph is worldwide Internet
traffic to some of our gambling customers
from the fifth through the 13th. And, of course,
the World Cup started on the–of June of this
year, started on the 12th. So, what you see
is a pretty regular pattern of Internet traffic.
It’s heavily dominated by European countries
and the U.K., mostly because a lot of our
gambling customers are in the UK. But also
they have a pretty good gambling culture there,
online gambling culture anyway. And you see
there’s a lot of activity during the day.
It drops off at night, comes back during the
day. You see more activity on Saturday
than the other days of the week,
but it’s pretty regular until the day before
the World Cup where you see it spike and then
continue to stay high. So, this is interesting.
It is dominated by the U.K. and Europe. So,
what we’re going to do is drill down into
different continents and different countries
and then eventually different network characteristics
of the data to see other trends. And you can
see little examples of little anomalies in
here, but once you start drilling down they
become a little bit more apparent. So, if
we look at just Europe, it pretty much looks
the same. You start to see little weird things,
like up here you see this little chokepoint
but it pretty much looks the same. So, let’s
take a look at everything but the U.K., since
it was so heavily weighted from–with the
U.K. So, now, it starts to look a little bit
different. You start to see less of a–the
rhythm is still there, but it’s less extreme.
So, you see more activity throughout the day.
You also, on the first graph, you could see
this little blip, but this becomes a lot more
apparent here. Friday morning, there’s something
going on. And you see that’s this red band
in the middle, which is associated with the
U.S. So, there’s something going on there.
But you also see different countries behaving
differently. So, the blue up here, right above,
is the Netherlands and they have a very regular
rhythm of activity during the day and not
much at night versus some place like Denmark,
which is down here, which has pretty regular
activity throughout the day. And then you
also have like this green up here is Singapore,
where there’s not a lot of activity at all
in the week before the World Cup and then
it really just blows up. So, if we look at
>>I’m sorry, but what’s technically the buildup
with Vietnam? I don’t understand [INDISTINCT]
>>TANCRED: That’s a good question, I’m glad
you asked. Because it’s very important to
understand it. This–so the size, it’s like
a Stacked Graph, so the bigger the band of
color, the more traffic, the more queries. And what this
data represents is IP address queries from
these companies. It doesn’t necessarily mean
that people are gambling, so someone could
be coming from the U.S. and hit the site and
be denied.
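The bands described above are query counts per time interval, one band per country. A minimal sketch of that aggregation follows, assuming a hypothetical log of `(timestamp, country)` pairs; the function name and sample records are illustrative and are not taken from the actual feedback data.

```python
from collections import defaultdict
from datetime import datetime


def hourly_layers(queries):
    """Turn (iso_timestamp, country) records into one count series
    per country, aligned on a shared, sorted list of hours."""
    counts = defaultdict(lambda: defaultdict(int))
    for stamp, country in queries:
        hour = datetime.fromisoformat(stamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[country][hour] += 1
    hours = sorted({h for per_country in counts.values() for h in per_country})
    layers = {
        country: [per_country.get(h, 0) for h in hours]
        for country, per_country in counts.items()
    }
    return hours, layers


sample = [
    ("2010-06-11T09:15:00", "gb"),
    ("2010-06-11T09:40:00", "gb"),
    ("2010-06-11T10:05:00", "gb"),
    ("2010-06-11T10:30:00", "us"),  # a lookup is counted even if the bet is denied
]
hours, layers = hourly_layers(sample)
```

Each list in `layers` is the thickness of one colored band over time; a query is counted whether or not the visitor was allowed to gamble.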
>>So my question basically is, what is zero
and why is it different from a graph that
is less [INDISTINCT] stacked graph?
>>SPECKBACHER: So typically when you stack
graphs, you have a couple of issues. So first
of all if you use lots of time series, the series
that don’t contribute that much data kind
of disappear in the graph visually. So, the
other issue is, if you have two series of
equal vertical height but with different slopes,
one of the two tends to disappear visually.
So, this methodology really is to visually
pull those out and not make them disappear
and stand apart. So it’s not so much like
I need to know exactly the slope; it’s more that I want
to know what the movement of the individual
layers is.
>>Like how did you choose [INDISTINCT]
>>SPECKBACHER: It’s actually an algorithm
that you…
>>SPECKBACHER: Yes. So, yeah, so it’s a detailed–there’s
detailed documentation in the paper that was
linked on the previous slide, so.
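On the question of where zero sits: in an ordinary stacked graph the bottom layer rests on a flat zero line, while a stream graph lets that baseline move. The sketch below shows the simplest moving baseline, which centers the whole stack around the horizontal axis. It is an illustration only; the algorithm actually used for the graphs in this talk is not stated here and may use a more refined baseline. All function names are hypothetical.

```python
def zero_baseline(layers):
    """Flat baseline of an ordinary stacked area graph."""
    return [0.0] * len(layers[0])


def centered_baseline(layers):
    """Baseline that centers the total stack around the axis."""
    totals = [sum(column) for column in zip(*layers)]
    return [-total / 2.0 for total in totals]


def stack_layers(layers, baseline):
    """Return a (bottom, top) pair of edges for each layer,
    stacking the layers in order on top of the baseline."""
    edges = []
    bottom = list(baseline)
    for layer in layers:
        top = [b + v for b, v in zip(bottom, layer)]
        edges.append((bottom, top))
        bottom = top
    return edges


layers = [[1, 2, 3], [1, 2, 1]]
stacked = stack_layers(layers, zero_baseline(layers))
stream = stack_layers(layers, centered_baseline(layers))
```

The thickness of every layer is identical in both versions; only the vertical position changes. That is why the graph reads well for trends but poorly for absolute values.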
>>TANCRED: Yeah. And you’ll see–you’ll see
kind of how it differs from a stacked area
graph when we look at the U.K. specifically.
And it’s a nice example of how a stream graph
kind of changes, how it’s different from a
stacked area graph in some ways. Does that
help at all? I mean basically, what you’re
seeing here–what you’re looking for are trends
and in some cases it gives you some answers,
but in more cases it just raises additional
questions that you may or may not be
able to answer with a stream graph. So we’re
looking at Asia. So Asia looks a little bit
similar to Europe, except that you don’t have
that big spike on Saturday, because it’s–because
for the customers that we’re seeing in this
traffic, Asia isn’t as much of a gambling
culture traditionally, but you do see them
coming to these gambling sites during the
World Cup, before and during the World Cup.
So, and again, you get a much better view
here of the impact of Singapore and their
big traffic, which is represented in the middle
here where it just kind of explodes. So this
gives you an idea of gambling patterns in
Asia. If we look at the US, where you saw
that kind of weird spike, well this is North
America, but this is instead of by country,
we did it by organization because it actually
gets very interesting. So if you look at the–so
the immediate thing that you might notice
here is that regular rhythm is gone. It’s
a pretty straight graph for the most part.
You have these blips which I’ll talk about
in a second, but even in the bands, there
isn’t a regular pulse of activity. So when
you look at the actual organizations, it’s
hard for you to read that, but red is Google,
so either your counterparts in Mountain View
are staying up all night gambling everyday
or there’s something else going on. You start
looking at the other organizations like Microsoft
and Yahoo! and you realize what these are
[INDISTINCT], that are indexing the site.
So that all of a sudden makes sense, where
before, you might have seen a lot of traffic
from North America, from the States, and not been
able to explain it, because really you go to
a site once, you get denied and then you don’t
try again. This is much more understandable.
These kind of anomalies are weird. This one
on this side was Comcast Cable in Centerville,
California. And so there was just a bunch
of activity on Saturday. I don’t know why.
I don’t know–I mean, we can look at it further
and we can say, “Okay, which sites were they
going to? What IP addresses were they? Does–is
it many IP addresses or single IP addresses?”
But it’s something to look into. It can be
completely legitimate or it could be illegitimate.
It could be someone probing the site before
an attack. It could be someone probing the
site for legitimate reasons. It could be the
site itself doing some–running some tests.
You see the same thing here. This one’s in
Phoenix from a publishing company. Again,
very odd to see that level of traffic the
day before the World Cup, but it could be,
again, legitimate or illegitimate. And certainly
it’s strange. You also see that chokepoint
that I mentioned earlier, much more pronounced
here. And you see that on other graphs. That
could be an attack, maybe the servers went
down because of an attack or maybe they went
down because they crashed or maybe they’ve–maybe
some of these sites took their service down
for maintenance. It happens to be during–I
mean, it’s a bad maintenance window in that
it’s in the middle of the World Cup, but if something
bad was happening and they had to take the
site down, then it makes sense probably to
do it when traffic was low anyway. So that’s
probably what it is. But it’s interesting
looking at these graphs and kind of coming
up with theories for this. And then as a customer
of the data, you would be looking at this.
As an industry, it [INDISTINCT] about what’s
happening on the industry. Yeah?
>>TANCRED: If you can–it’s not just relative,
you can get information about how many total
queries is this and then you can start figuring
out what the traffic numbers actually are.
What I would do if I actually wanted to know
what those numbers are, I’d query the data
directly for that timeframe and find out what
the group’s in. I don’t have them for this category
graph, but we could come up with them. Yeah?
>>TANCRED: So that’s–that’s the way that
the graph works. It tries to–and maybe you
can explain it better, Tobias, but it tries
to kind of equalize the data. And you’ll see
this in some other graphs where there’s less
data, that the graph shifts more. Where there’s
more data, it’s better at equalizing.
>>TANCRED: Yeah. Right, right. And I don’t
know exactly what the graphing software’s
doing there but it’s basically an artifact
of the graph.
>>SPECKBACHER: This one?
>>TANCRED: Yeah. So this is everything but
Europe, Asia and North America. So, again,
you see this kind of shift because there’s
less data overall so the weighting is less.
But you start to see interesting things again,
like which countries outside of those three
main markets are good markets for gambling
and gaming. And so, here you have South America
in green and in gray, we got two grays, oh,
Australia. And you see, again, South America
has a good rhythm. Australia, they stay up
later or they’re gambling at different times,
but it’s more of an equal band until you get
to Wednesday. Interestingly, Australia started
betting really early. If you look at other
countries, I was looking at other countries
like Malawi. And when I was looking at Malawi,
I was just looking between Friday and Friday,
and it was just basically flat except for
a spike somewhere on Monday or Tuesday. And
I thought, well, like, “I guess they didn’t
have a team in the World Cup so they weren’t
interested in it,” until I looked at–because
every other country started betting on Friday,
and then I looked at Saturday and then there
was a huge spike. So it’s just interesting
to see the different mentality of different
countries. And I don’t think Nigeria played
until the 13th so it could be that they were
betting on African teams. I don’t know. But
it’s interesting to come up with hypotheses
about this. So now, we’ll look at three different
countries in Europe, starting with the UK
because it represented so much data. This
is just a very interesting stream graph because
it–you basically have a stream graph, and
if you take away London, you have a stacked
area graph on top of it, because London basically
creates the zero line. But this essentially
matches the European data in terms of its
pulse and again everything I talked about
with betting on the weekend and the chokepoint
and things like that. So if we take away London,
it’d be interesting to see if the U.K. is
sort of homogeneous in the way it gambles,
and the graph essentially looks the same.
You start to see a little bit more detail
in terms of what other cities in the U.K.
are gambling online but it basically looks
the same. So, let’s look at something that
looks different. So here’s Germany. This kind
of has this rhythm but it’s also a little
bit all over the place. You have, you know,
Monday morning people come into work and they
stop betting, but then they sort of get over
their guilt and they go online and continue
betting. Germany’s first game is on the 12th,
and so, you see a big spike here. But it’s
pretty consistent; they’re online all the
time betting, unlike the U.K. And you also
have this huge area that kind of looks like
London did in the U.K. except this is Karlsruhe
which is not any place I’ve heard of. So,
it’s a little bit harder to explain until
you start looking a little bit deeper into
the data. And this is actually 1&1 Internet
AG. They’re an Internet provider. They have
a big hosting facility in Karlsruhe. And so,
you know, we’re locating their traffic where
their datacenter is because that’s the last
point we see. And so, in our data, this would
be represented with the routing type of regional
proxy so, you know, we know what country it’s
in, but we can’t necessarily tell you what
city it’s in. But at least we can tell you
it’s Germany. And so, now that makes a little
bit more sense. So that’s Germany. We’ll look
at Denmark next, which also looks really crazy.
There’s really no pattern here. You have this
huge red and this huge blue. Definitely, you
see a lot of activity during the World Cup.
And so, that big red most likely represents
consumer traffic. It’s strange that this blue
is really active here and really active in
the middle of the week before the World Cup.
And then, kind of dies out completely. When
you look at the organization behind this,
that blue is basically a website that reports
odds for games and it refers traffic to the
gambling houses. So, for whatever reason,
there’s a lot of people online checking the
odds of different matches, whether it’s World
Cup or not and going to betting sites and
placing bets. The red is similar to what you
saw in Germany in Karlsruhe, it seems to be
a hosting provider, although, it also has–provides
VPN services. I don’t know why there’s a big
spike there. Maybe there was some other major
sporting event that people were betting on.
But certainly, if, you know, if I want to
learn more about the Denmark market, how
it works, this is something that would, you
know, I would start looking into, why there
might be a big spike and then a complete drop
in activity and what’s going on. So I mentioned
that we have this geographic data, we also
looked at the data in terms of the network characteristics.
Tobias will cover the next few graphs,
and they show how people are connecting and
routing to get to these gambling sites.
>>SPECKBACHER: Right. So what we see here
is a stream graph representing the connection
types. Meaning, what we do is we categorize
network blocks by how they are connected to
the Internet. So you have the DSL and cable
down here in red and yellow which are, you
know, you would expect those to be dominating.
There’s a pretty healthy amount of mobile
betting going on here that’s represented
as purple on this graph. And we have this
green band that shows this uniform traffic
coming through here on fixed connections, so
that again is most likely the U.S.
traffic that we saw earlier that originated
from the large search providers and we can
see that as fixed connections here.
>>TANCRED: Yes, you want to…
>>SPECKBACHER: Yes. All right. So, as I said
there was a pretty healthy amount of mobile
betting going on. And that’s–and now we’re
segmenting the data by mobile providers. And
since most of the traffic came from the U.K.,
we see T-Mobile U.K. and Hutchison 3G as, I think,
the dominant providers here. But this is kind
of an interesting way, you know, to slice
the data; it’s interesting to understand
which providers users are with, and you can use
that for marketing or to target ads. But just
the fact that you actually are
able to identify that it’s coming from a mobile
carrier helps you in a sense because you know
the user’s mobile, so whatever IP geo-location
tells you is probably something that you should
not rely 100% on but you can use confidence
factors and other data points that we give
our customers to understand these circumstances.
So, there was also a segment of dial-up users.
And that was actually kind of surprising because
there was a decent percentage of…
>>TANCRED: Yeah.
>>SPECKBACHER: …of the overall traffic.
And again, the U.K. has dominated in the traffic
there. There was some U.S. traffic…
>>TANCRED: Japan.
>>SPECKBACHER: Yeah, Japan.
>>TANCRED: Tanzania.
>>SPECKBACHER: And then, there’s, you know,
lots of developing countries on there, which
apparently still use modems. Anonymizers.
So when you’re operating a gambling site,
you want to make sure that your customers
are not circumventing your IP geo-location
solution. And typically, they’ll try to do
that by connecting through a proxy server that
provides a certain level
of anonymity. If you’re trying to gamble with
a U.K. provider, what better proxy to use
than the one in the U.K. and that’s basically
what we see here.
>>TANCRED: Maybe, I can say a word about…
>>TANCRED: …anonymizer in the data. So
the way that Quova identifies anonymizers,
they identify anonymizers by specific IP address
and activity receipt. We also–because we
provide our data as network blocks, we also
identify network blocks that have anonymizing
activity in them. So, a lot of this activity
is probably not anonymizer activity but is
in a network block where we’ve seen anonymizer
activity. So I certainly wouldn’t expect
that every transaction that you see here is
associated with someone using a proxy. But
you can see that, you know, the graph certainly
gets wider as it moves to the right, which
is what you’d expect during a big event that
you’d see more anonymizer activity at these
sites. And as Tobias said, more in the U.K.
because they’re trying to reach sites that
are in the U.K.
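The two levels of flagging described here, a specific IP address versus a network block where anonymizing activity has been seen, can be sketched with Python's standard `ipaddress` module. The addresses below come from ranges reserved for documentation, and the set names and status labels are hypothetical, not Quova's.

```python
import ipaddress

# Hypothetical data: one specific anonymizer IP and one flagged network block.
KNOWN_ANONYMIZER_IPS = {ipaddress.ip_address("192.0.2.10")}
FLAGGED_BLOCKS = [ipaddress.ip_network("192.0.2.0/24")]

def anonymizer_status(ip_text):
    """Classify an address as a known anonymizer, a flagged-block member, or clean."""
    ip = ipaddress.ip_address(ip_text)
    if ip in KNOWN_ANONYMIZER_IPS:
        return "known_anonymizer"
    if any(ip in block for block in FLAGGED_BLOCKS):
        # Same block as observed anonymizer activity, but not itself confirmed.
        return "flagged_block"
    return "clean"
```

An address that only matches at the block level is a weaker signal than a direct match, which is why not every transaction in the graph should be read as proxy use.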
>>SPECKBACHER: Right. So, basically what
we flag is bad neighborhoods. So, like with crimespotting
data, if you look at it, this is a network
block that had some suspicious activity going
on in the past or recently. So you should
be cautious in dealing with that type of traffic.
And so now, we segmented the anonymizer populations
by carriers and it’s not very surprising that
most of these anonymizers are actually with
hosting providers. So, they’re probably not
systems that are actively being used by actual
users, unless this is having betting with
some customers. Yes?
>>TANCRED: And this can be compromised machines
or hosts that people have set up specifically
for this?
>>SPECKBACHER: Yeah. So, someone might get
[INDISTINCT] set up with or the other possibility
is just that boxes get rooted and [INDISTINCT].
>>TANCRED: And the significance of this information
is that when you’re trying to prevent fraud,
when you’re looking at traffic coming into
your sites, the more things you can correlate
with, the better your prediction capabilities
are. So if you can correlate–if you know
that certain carriers or certain organizations
or certain countries or certain connection
types correlate better with known fraud, then
knowing all that data when–if the traffic
is coming in, lets you treat those connections
differently than you would otherwise. And
that’s what the financial institutions do,
that’s what e-commerce sites do, that’s what
gambling houses do. And that’s why it’s important
to have this information. So, you know, it
was a pretty brief look at a very small part
of our data. We’re just starting to look at
this data. We’re just starting to look
at different ways to visualize the data. What
we’d like to do is make a lot of this information
public because the more people looking at
it the more interesting things we’ll find
in the data. As people start looking at the
data, I expect that, you know, we’ll see more
trends in the data and that we can start to
use a lot of this data to do things
like predict events, predict and prevent fraud,
look at marketing trends. And there are certainly
going to be a lot of assumptions that people
have about traffic to different markets from
different places that can be either confirmed
or disproved with this data. So, we’re excited
about this. We’re going to continue looking
at it, like I said, hopefully, we’ll make
this data public pretty soon. And that’s it,
any questions? Thank you. We were so interesting
that we distracted you.
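Tancred's point about correlating carriers, organizations, countries, and connection types with known fraud can be illustrated with a per-attribute fraud rate. This is a minimal sketch on invented data: the transaction records, field names, and the function `fraud_rate_by` are all hypothetical, and a real system would combine many such signals rather than rely on a single rate.

```python
from collections import defaultdict

# Hypothetical labeled history: each transaction is marked as fraud or not.
transactions = [
    {"carrier": "hosting-a", "connection": "fixed", "fraud": True},
    {"carrier": "hosting-a", "connection": "fixed", "fraud": True},
    {"carrier": "hosting-a", "connection": "fixed", "fraud": False},
    {"carrier": "isp-b", "connection": "dsl", "fraud": False},
    {"carrier": "isp-b", "connection": "dsl", "fraud": False},
    {"carrier": "isp-b", "connection": "mobile", "fraud": True},
]

def fraud_rate_by(transactions, attribute):
    """Return the share of fraudulent transactions for each attribute value."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for tx in transactions:
        key = tx[attribute]
        totals[key] += 1
        frauds[key] += tx["fraud"]  # True counts as 1, False as 0
    return {key: frauds[key] / totals[key] for key in totals}

rates = fraud_rate_by(transactions, "carrier")
```

Values with a markedly higher rate are the ones a site would treat differently when new traffic with the same attribute arrives.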
>>Yeah I am. So this is about [INDISTINCT].
>>TANCRED: Yeah.
>>TANCRED: Right.
>>TANCRED: Well, I mean, what the laws typically
state is that you’re using, you know, industry
best practices.
>>TANCRED: And, yeah, and it’s not–and there
are certainly ways to get location data that
are not industry best practices. So if you’re
trying to–if you’re, you know, selling restricted
goods to different countries around the world
where those goods aren’t supposed to be sold…
>>TANCRED: …then, using things like user
reported data wouldn’t be sufficient. You
have to use some other kind of data or, you
know, even GPS now you see spoofing there,
so, yeah.
>>TANCRED: Yes, in our experience.
>>TANCRED: Yeah, sure.
>>TANCRED: Yes. So, let me ask–I’ll repeat
the question because I don’t know if the questions
are coming through in the recording but the
question is, “When we make this data
public, do we know what kind of visualizations
and graphs we’ll allow, whether they’ll
be static or dynamic and things like that?”
You want to take that?
>>SPECKBACHER: Sure. So, certainly, our goal
is to enable lots of people to explore the
data. So, static graphs are not going to be
very suitable for that. Obviously, we’ll have
to provide some level of pre-aggregation to
protect the innocent customers. But, you know,
we can provide dimensionally aggregated data
and let people slice and dice those datasets
however they want. So that’s the plan.
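A rough sketch of what pre-aggregated, dimensionally sliced data could look like follows. The record fields, the `aggregate` function, and the rule of suppressing cells below a minimum count are assumptions added for illustration; the talk does not say how the data will actually be aggregated or protected.

```python
from collections import Counter

def aggregate(records, dimensions, min_count=3):
    """Count records per combination of dimensions, hiding small cells."""
    counts = Counter(tuple(r[d] for d in dimensions) for r in records)
    # Dropping sparse cells is one (assumed) way to avoid exposing individuals.
    return {key: n for key, n in counts.items() if n >= min_count}

# Hypothetical raw records; the customer field is never published
# because it is simply left out of the requested dimensions.
records = (
    [{"customer": "site-1", "country": "GB", "connection": "dsl"}] * 4
    + [{"customer": "site-2", "country": "GB", "connection": "mobile"}] * 3
    + [{"customer": "site-2", "country": "TZ", "connection": "dialup"}]
)

by_country = aggregate(records, ["country"])
by_country_conn = aggregate(records, ["country", "connection"])
```

Choosing different `dimensions` lists is the slicing and dicing: the same records yield a country view, a country by connection view, and so on.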
>>TANCRED: And I would expect that we’re
going to probably provide some interesting
visualizations like this and maybe some more
traditional ones that let people get a little
bit more statistical and specific with the
>>TANCRED: Right.
>>…it’s getting anonymized and what have
you going with it all connected on. But you’ve obviously shown
these kind of neat new graph types.
>>TANCRED: Right.
>>Are you planning something in the space
where people will actually be able to navigate
these graph types?
>>TANCRED: That’s our plan. I would expect
we’re not going to–well, at least in the
first instance, the first exploration will
be through different graph types rather than
just accessing the data directly. Although, we
might, depending on how we can aggregate and
anonymize the data to make the data directly
>>And my second question is with the stream
graphs, have you done any kind of cross-dimensional
analysis back where you all are actually using
it to find support correlation and trends
into the dimensions with different methods?
>>TANCRED: It’s interesting, we’ve like–we’ve
done that with multiple stream graphs. Like
I was talking about, looking at a specific
city that shows weird activity and then looking
at different dimensions of that but that’s
by running different–well, yeah, running
different stream graphs. And it’s actually
been very interesting for us to see certain
things about our data that weren’t completely
evident to us before. But I don’t know what
the stream graph’s capabilities are to look
at multiple dimensions in the same graph,
if that’s what you’re asking.
>>TANCRED: Yeah. I mean, what we wound up
doing a lot, I mean, Tobias and I spent, we
basically spent a long time just creating
interesting graphs. And you wind up creating
graphs on specific metrics and excluding specific
things to get to the answer you’re looking
for. You know, so you look at interesting
things like routing types against cities,
against carriers and organizations until some
things start to make sense. Like that chokepoint
that we saw early Saturday morning, I think
it was. If it exists across every routing
type and across every customer that we’re
looking at and in every country, then it indicates
something maybe industry-wide. If it only
exists for one of the customers then it’s
something specific to that customer. And so,
that’s the kind of exploration you want to
>>Thanks very much.
>>TANCRED: Sure.
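The chokepoint reasoning Tancred describes, checking whether a dip shows up in every slice or in only one, can be sketched as follows. The traffic numbers, the 50% drop threshold, and the function names are hypothetical and chosen only to make the logic concrete.

```python
def dip_buckets(series, drop=0.5):
    """Indexes where a value falls below `drop` times the previous value."""
    return {
        i for i in range(1, len(series))
        if series[i - 1] > 0 and series[i] < drop * series[i - 1]
    }

def classify_dip(slices, bucket):
    """Say whether a dip at `bucket` appears in all slices, one, or some."""
    affected = [name for name, series in slices.items()
                if bucket in dip_buckets(series)]
    if len(affected) == len(slices):
        return "industry-wide"
    if len(affected) == 1:
        return f"specific to {affected[0]}"
    return "partial" if affected else "no dip"

# Hypothetical hourly transaction counts per customer.
by_customer = {
    "site-1": [100, 110, 40, 105],
    "site-2": [80, 85, 30, 90],
    "site-3": [60, 62, 20, 61],
}
one_customer = {
    "site-1": [100, 110, 108, 105],
    "site-2": [80, 85, 30, 90],
}
```

Running the same check with slices keyed by routing type or by country instead of by customer is the cross-dimensional comparison described in the answer.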
>>SPECKBACHER: So there’s actually a JavaScript
library that you can use to create this. It’s
called Protovis.
>>TANCRED: You know, I know everyone’s wondering
about the stream graph on my shirt, so I’ll
answer that question, yeah, yeah. I didn’t
plan on wearing the shirt. I brought it and
Tobias pointed out that there’s
essentially a stream graph on it, so I realized
I had to wear the shirt. So it’s not entirely
>>SPECKBACHER: It was a designer’s idea,
I think.
>>TANCRED: Yeah. It’s a–yeah, I’ll let you
decide. Thank you.
