The Google Spam-Jam

Google must find an effective, scalable solution to fight spammy web results, rather than just plugging holes in its old PageRank model for organic results.

Now that people have pointed out that they’re seeing more spam than usual in Google, we’re all seeing it. Correct?

But is there actually more spam? Google doesn’t seem to think so, according to a recent blog post by a Google engineer.

Fact is, this is a problem that Google has had from day one and it’s not likely to go away anytime soon.

Google came into the search world with a “we can’t be spammed” battle cry and introduced the search engine optimization (SEO) world to PageRank. The battle has been raging ever since.

The Problem With PageRank

PageRank (an eigenvector centrality measure, to be precise) gives web pages a high score if they receive links from many other pages, but does so in a way that the credit received for a link is higher if it comes from a page that is already highly ranked.
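To make the idea concrete, here is a minimal sketch of that principle in Python: a toy power-iteration PageRank over a handful of pages. It illustrates the eigenvector-centrality idea only; the link graph, damping factor, and iteration count are assumptions for the example, not Google's production implementation.

import numpy as np

def pagerank(adjacency, damping=0.85, iterations=100):
    """Toy power-iteration PageRank; adjacency[i][j] = 1 means page i links to page j."""
    n = len(adjacency)
    A = np.array(adjacency, dtype=float)
    out_degree = A.sum(axis=1)

    # Build a column-stochastic transition matrix: each page splits
    # its "vote" evenly among the pages it links to.
    M = np.zeros((n, n))
    for i in range(n):
        if out_degree[i] > 0:
            M[:, i] = A[i] / out_degree[i]
        else:
            M[:, i] = 1.0 / n  # dangling page: spread its vote evenly

    # Iterate r = d*M*r + (1-d)/n; r converges to the principal
    # eigenvector of the damped link matrix.
    r = np.full(n, 1.0 / n)
    for _ in range(iterations):
        r = damping * M @ r + (1 - damping) / n
    return r

# Hypothetical graph: pages 0, 1, and 3 all link to page 2, so page 2
# ends up with the highest score -- regardless of what it is about.
links = [[0, 0, 1, 0],
         [0, 0, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 0]]
print(pagerank(links))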

PageRank is “keyword independent,” which means Google can calculate that part of the ranking score offline. The scores are computed ahead of time, not at the moment of the query itself: whatever keywords your web pages target, they already have a PageRank score before anyone searches.

Because PageRank is so computationally intensive, it saves a lot of work to calculate it only once. The downside of PageRank not being “keyword dependent,” however, is that people may link to a given web page for any number of different reasons.

And this is where the problem lies: many pages may have a high PageRank for a reason totally unrelated to the search query at hand. Pages making reference to more than one topic, for instance (and many pages do) may be an “authority” on one topic but essentially irrelevant to another — and PageRank can’t distinguish between the two.
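A rough sketch of how that plays out: imagine a precomputed, query-independent link score blended at query time with a topical relevance score. The page names, numbers, and blending formula below are hypothetical, not Google's actual ranking function; the point is only that a heavily linked but off-topic page can still outscore a topically relevant one.

# Query-independent link scores, precomputed offline (hypothetical values).
pagerank_scores = {
    "big-link-list.example": 0.92,        # heavily linked "authority" list page
    "prominent-newspaper.example": 0.35,  # on-topic page with fewer inbound links
}

# Query-dependent relevance, computed at query time (hypothetical values).
topical_relevance = {
    ("big-link-list.example", "newspapers"): 0.20,
    ("prominent-newspaper.example", "newspapers"): 0.90,
}

def combined_score(page, query, link_weight=0.7):
    """Blend the offline link score with the query-time relevance score."""
    relevance = topical_relevance.get((page, query), 0.0)
    return link_weight * pagerank_scores[page] + (1 - link_weight) * relevance

for page in pagerank_scores:
    print(page, round(combined_score(page, "newspapers"), 3))
# With the link component weighted heavily, the off-topic list page
# (0.7*0.92 + 0.3*0.20 = 0.704) outranks the on-topic newspaper
# (0.7*0.35 + 0.3*0.90 = 0.515).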

As a result, since day one of Google on the web, it hasn’t been unusual for end users (and more so SEOs) to find a highly ranked page in the search results, even when it’s obviously irrelevant to the search topic.

Skewed Search Results

Even before the current murmurs, search results have always included a large fraction of bad results: pages that are important in some context, yet not in the context of the specific search query.

So it’s no wonder PageRank got a slight demotion and the Hilltop algorithm crept in (around 2003).

Co-citation can skew results. For a query as specific as [the beatles], it isn’t too difficult for a search engine to discover an authority result and rank it at number one.

But bibliometric cues like these, which are especially loud inside link lists, can alter results. Why is beyond the scope of this article, but lists can seemingly force Google to serve a results page that mixes on-topic and off-topic pages.

Try a search for something less specific than [the beatles], like [newspapers] for instance. The result is going to be different depending on where and when you search.

[Screenshot: Google newspapers SERP]

Look at the screenshot to see what I mean. The top-ranked results are actually lists that receive a lot of inbound links (authority pages on a non-specific subject). With such a query, you’d expect to at least see a result set of prominent newspapers.

The New York Times isn’t even above the fold (and I’m based in New York). Relevant? Not really. But not totally irrelevant either.

Game Theoretic View

There have always been obvious weaknesses to be exploited at Google.

In this kind of environment, the perfect ranking function is always going to be a moving target. The HTTP protocol and crawling the web for information discovery and indexing leave the whole process wide open to spammers.

A couple of years ago at SES New York, Andrew Tomkins, then chief scientist at Yahoo, said something along the lines of: “As content becomes more diverse, more complex, bigger, and more fragmented… getting it through HTTP and HTML may not be the right model anymore.” That wasn’t specifically related to web spam, but it does address the entire problem of a process which no longer seems to be effective or scalable as far as web search is concerned, moving forward.

Of course, as Google has such a great understanding of user intent behind so many popular queries, they could simply filter out all commercial listings inside the organic results and leave them specifically for paid advertising. That would solve a huge chunk of the problem.

In fact, make all of the commercial listings inside the organic results video. It’s harder to spam that format.

Better still, don’t have any organic listings at all, bar a link to Wikipedia (which is what the organic listings frequently feel like anyway!).

By the way, Tomkins is now engineering director at Google. Maybe the future holds a whole different way of doing things at Google than to keep trying to plug holes in the old way.
