Unraveling Big Data with Dixon Jones from Majestic SEO Ahead of ClickZ Live NY

With our digital marketing conference, ClickZ Live New York, commencing in just under two weeks, we’ve reached out to some companies who’ll have representatives speaking or exhibiting for some words of wisdom ahead of the event.

One of the agenda session tracks this year will focus delegates’ minds on all-important “Business Intel” and how marketers can make sense of the glut of insight they have at their fingertips to help them make better decisions about where they focus their time, what works and what levers they can pull to optimize their digital marketing efforts.

In this interview, we turn to Dixon Jones from UK-based analytics firm, Majestic SEO, who are no strangers to providing billions of bits of data that search marketers and digital PR professionals around the world use for competitive analysis, link building and influencer discovery and outreach.

ClickZ: The term “big data” is thrown around a lot in our industry. Marketers have always used data to make calls on where to spend money and track results, so why the focus on “big” and why is it important?

Dixon Jones: What has changed is that we are starting to derive meaning from different – unstructured – groups of disparate data.

For example, you might know all the Twitter profiles of everyone on the Internet in one data set, and a list of who links to those profiles in another data set. From these you can derive how influential each Twitter user is.

If you then work out in a third data source what each person tweets about, then you have a list of the most influential people in any industry (that uses Twitter). Do this again for Facebook profiles and LinkedIn profiles and you start to get a pretty good list of all the influencers in an industry – even though that data set does not exist independently.
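The join Jones describes can be sketched in a few lines. This is a toy illustration only – the profile names, link counts and topics below are invented, and a real pipeline would run over billions of rows rather than in-memory dicts:

```python
# Three separate, independently collected data sets (all values invented):
profiles = ["@alice", "@bob", "@carol"]                      # data set 1: profiles
inbound_links = {"@alice": 9200, "@bob": 140, "@carol": 3100}  # data set 2: who links to them
topics = {"@alice": "seo", "@bob": "seo", "@carol": "cooking"}  # data set 3: what they tweet about

def top_influencers(topic):
    """Combine the three data sets: filter profiles to a topic,
    then rank them by how many links point at each one."""
    in_topic = [p for p in profiles if topics.get(p) == topic]
    return sorted(in_topic, key=lambda p: inbound_links.get(p, 0), reverse=True)

print(top_influencers("seo"))  # ['@alice', '@bob']
```

The "influencer list" never existed as a data set of its own – it falls out of joining the three sources, which is the point Jones is making.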

CZ: Why are insights from big data different to insights from “just” data?

DJ: There are a number of ways to answer that – and the combination of multiple unstructured data sets is one. But the fact that we are using the data to direct the insight, rather than using it to verify a hypothesis, is (I think) a key factor.

In the example above, we end up with a list of people that influence a market sector. In the “old world order” you might say “Lady Gaga is probably influential on Twitter, can we check that?” In the new big data methodology she will be influential if she is in the list and won’t be influential if she is not in the list. Whether any other list says she is or isn’t becomes irrelevant, unless that other list gets fed into the unstructured data sources as well, in which case the whole list might change as a result.

CZ: Is big data relevant to small businesses or is it all about big budgets?

DJ: Both have important roles to play. Collecting and storing huge amounts of data requires big budgets – but interrogating large sets of unstructured data does not, and luckily the web itself is the largest set of unstructured data on the planet.

Small businesses can go a long way toward analyzing the web without having to be a custodian of the big data set. This does not always have to cost money – increasingly you can access this data through APIs, often for little money or for free. Other times you can crawl the web or scrape the data – however (and this is important), scraping data needs to take into account the rights to use that data.

There are two disciplines developing, though: data collection and data manipulation. The custodian of the data is rarely well equipped to glean insight quickly.

For example, the electoral register is a massive data set, but the people who compile the list day-to-day have to concentrate on the data collection. Knowing that people in one ZIP code are 30 percent richer than those living in another, or 20 percent less likely to have subsidence after merging that data onto a geological map of the USA, is not something the electoral registrar is likely to be very good at.

Small businesses have the agility to develop the insights, but being a custodian of the data comes with an inability to react quickly or create opportunity. The custodians create potential; businesses of all sizes can turn it into something valuable.

CZ: So what does it take to be a custodian of a big data set?

DJ: The first thing to understand is scale. Often understanding this takes a little more insight than raw numbers.

For example, Majestic crawls over 2 billion pages a day and in the process sees something like 7 billion links. Our big data set is that list of links, going back over seven years. Already that sounds like a lot, right?

Now consider this: Twitter tells us that on an average day there are 500 million tweets. In our industry we like to think we have a grasp of how big Twitter is. Five hundred million tweets sounds like a lot, but it is a load less than the 7 billion URLs we see in the same 24-hour period.

Then, once you have considered that, consider this. Google says that the average web page is 370KB. By my reckoning, then, to see those 7 billion URLs from 2 billion crawled pages, we have to do the equivalent of trawling through about 6,500 hours of video in data every day. My math may be off, but it makes my point.
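Taking the two figures quoted above at face value (2 billion pages a day, 370KB per average page), a back-of-envelope calculation of the daily crawl volume looks like this. The hours-of-video comparison depends entirely on what video bitrate you assume, so only the terabyte figure is computed here:

```python
# Daily crawl volume, using the figures quoted in the interview.
pages_per_day = 2_000_000_000   # pages Majestic crawls per day
avg_page_kb = 370               # quoted average web page size, in KB

# KB -> TB using decimal units (1 TB = 10^9 KB).
total_tb = pages_per_day * avg_page_kb / 1_000_000_000
print(f"~{total_tb:,.0f} TB of page data per day")  # ~740 TB
```

Roughly 740 terabytes of page data per day, before any storage or indexing overhead – which is the kind of scale Jones says a would-be custodian needs to grasp first.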

You need to understand scale at the outset. Then, when you understand the scale of the task, you hopefully realize that with that comes a strong need for responsibility. For most big data sets the biggest responsibility is one of security.

Very few data sets contain nothing personal, and if the data is anywhere near complete, not every person in it will have actively and willingly opted in to you handing it, lock, stock and barrel, to every hacker on the planet.

In Majestic’s case the data is not personal, but we do have to collect it responsibly – by not bringing down servers through over-aggressive crawling, for example. We obey the commands that websites can give us to either not crawl, or to crawl at a slower speed, and we also aim to be a very efficient crawler… but many an irresponsible data collector has inadvertently caused mayhem by being too aggressive through poor programming.
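The "commands that websites can give" are the robots.txt conventions Jones is alluding to. A polite crawler checks them before each fetch; Python's standard library includes a parser, shown here with a hypothetical rule set (the bot name and rules below are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules a site might publish:
# keep crawlers out of /private/ and ask for a 10-second delay between fetches.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler asks before fetching, and honors the requested delay.
print(rp.can_fetch("ExampleBot", "http://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "http://example.com/public/page"))   # True
print(rp.crawl_delay("ExampleBot"))                                   # 10
```

Ignoring these rules, or hammering a server faster than the stated crawl delay, is exactly the kind of "mayhem through poor programming" the interview warns about.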

CZ: If our readers want to dive in and capitalize on this trend, where should they start?

DJ: It depends on whether you want to capitalize on the technologies or the business opportunities. Of course they overlap.

If I were waking up tomorrow with a view to building a big data set, I would start by going on a Hadoop course. It’s open source and not the only tool you are going to use, but it really is designed to help deal with “big”.
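Hadoop itself is a Java framework, but the MapReduce pattern it implements can be illustrated in a few lines of plain Python – no cluster or HDFS, just the map, shuffle and reduce steps over a toy word-count input:

```python
from itertools import groupby

# Toy input: in Hadoop these lines would be split across many machines.
lines = ["big data big ideas", "big budgets"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: bring all pairs with the same key together (sort, then group).
mapped.sort()
grouped = groupby(mapped, key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(n for _, n in pairs) for word, pairs in grouped}
print(counts)  # {'big': 3, 'budgets': 1, 'data': 1, 'ideas': 1}
```

The value of Hadoop is not the pattern itself – it is that the framework runs the map and reduce steps in parallel across a cluster and handles machine failures for you, which is what makes “big” tractable.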

If I wanted to look at the business opportunities, then there are a load of conferences, and ClickZ Live has a track on “Business Intel”, which is a great start. I think the challenge, though, is finding the visionary (or being the visionary) who has the ability to see the data sources, coordinate the harnessing of them all, and has the political muscle in an organization to develop the business deals between the data sources.

CZ: Thanks Dixon and have a great show in NYC!

Companies who’ll be speaking on the “Business Intel” track at ClickZ Live New York include Covario, Chango, iProspect, SEER Interactive, GroupM, Time Warner Cable Media, Conductor and eBay, so you’ll not be bereft of intriguing insight and a whole host of inspiration if you make the journey to New York to hear what they have to say.

Registering for ClickZ Live NY is easy; just follow this link.
