Why Index Size is Important

UPDATE: There is an update from Microsoft at the end of this post.

One of the interesting things about Microsoft’s new Live Search update was the announcement that Microsoft had expanded its index from about 5B pages to about 20B pages, a 4x increase. At some level, the exact index size is not a big issue, unless your index is simply too small.

Google has stopped reporting its index size, but reportedly has about 24B pages in its index. In my opinion, there is little significance in the difference between 20B and 24B pages in your index, but there is a significant difference between 5B and 24B pages in your index. In short, Microsoft needed to make a move of this type to improve their relevance.

What’s at issue is coverage. People increasingly search for a highly specialized set of things on the web, and if you don’t have the related sites in the index, you can’t return the right result. During the announcement sessions, Microsoft demoed many search queries, but one that illustrates this point particularly well was a search for shelli segal.

A search for this term on Live Search will bring up the designer’s website. This happens even though the site has a relatively small number of third-party links pointing to it (106 according to Yahoo).

UPDATE: I got an email from Matt Cutts letting me know that the laundrybyshellisegal.com web site is out of operation, and has been that way for several months. This makes the specific example provided here invalid, but nonetheless the underlying point of this post is unchanged. I have asked Microsoft to provide a new example, and will update this post when I get that.

By comparison, if you search on Google, you quickly discover that Google does not have this web site in its index. Note that many counterexamples are possible: sites in the Google index that Microsoft does not have in its index. Ultimately, the point is that you can’t return the right result if the site you should be returning for a given search is not in your index.

Update from Justin Osmer of Microsoft:

“We crawled the site and did not receive any redirects. Like many other engines we rely on redirects and other Webmaster Tools to help us stay fresh (like our URL removal process) but we know we can’t rely on that alone and are still building out as scalable, broad, and updated of an index as we can and are continuing to improve.

As you state, we still stand by our original point and intention that a user won’t get the relevant site if it isn’t indexed and this particular (poor) example was used to illustrate the point that before we never would have had it, then we did with the larger index…however, unfortunately in the time we built the index the most relevant site is no longer available and we hadn’t re-crawled it yet. We have removed the site from the index now and are returning what we believe to be the most relevant set of results for that query. Our larger index will speed up its crawl frequency in time and situations like this will hopefully be minimized.

You had asked for some additional examples from the presentation Ramez gave so here they are:

Bigger index helping us: search for janet Buxman kurihara.

Core ranking examples:

Hottest temperature in the state of az:

Safeco building address Redmond: