Indexing Versus Caching & How Google Print Doesn’t Reprint

I’ve written
that legal concerns about book indexing and Google Print may have
repercussions for web indexing. Kevin Werbach and David Winer each take a fresh
look at this. Below, I review their posts and explain the crucial difference
between indexing (making something searchable) and caching (reprinting content).
Google’s library scanning program makes books searchable in Google Print, but it
does not reprint them.

Breaking Apart at the Seams
from Kevin stresses, as I’ve done, that indexing
the words on a web page isn’t that much different from indexing the words on a
printed page. He wonders if a lawsuit preventing book indexing might start a
type of unraveling of sharing content online in general.

A turning point for
the web?
from Dave goes on much longer to counter the notion that an opt-out
approach is acceptable. Unfortunately, he gets some of the key points about
what’s involved wrong. Specifically:

If you publish a site, Google reads the whole site into its cache and then
lets you find things in it. Generally people who publish sites know this, and
want Google to do this.

Google’s index and its cache are two different things, and it’s critical —
absolutely critical — they not be confused like this.

When any search engine visits a web page, it effectively makes a copy of that
page which is stored in the index. But the index literally breaks apart the
page. It stores where words were located, whether they were in bold, what other
words they were near, whether they were in a hyperlink and so on.

Nothing in the index is anything you as a human being could read. I’ve
described the index in searching classes as being like a “big book of the web.”
But it’s not, really. It’s more like a giant spreadsheet, where all the words of
a page are in one row of the spreadsheet, each word to a different column, then
the next page in the row below that, and so on. It’s not something a human being
would read.
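
To make the spreadsheet idea concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the page addresses, the token format and the `build_index` helper are all made up, and a real search engine index is far more elaborate. The point is simply that what gets stored is a lookup of words and their locations, not a readable page.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to a list of (page, position, in_bold, in_link) entries."""
    index = defaultdict(list)
    for url, tokens in pages.items():
        for position, (word, in_bold, in_link) in enumerate(tokens):
            index[word.lower()].append((url, position, in_bold, in_link))
    return index

# Hypothetical pages, already broken into (word, in_bold, in_link) tokens.
pages = {
    "example.com/a": [
        ("Google", True, False),
        ("Print", True, False),
        ("indexes", False, False),
        ("books", False, True),
    ],
    "example.com/b": [
        ("Libraries", False, False),
        ("lend", False, False),
        ("books", False, False),
    ],
}

index = build_index(pages)

# Looking up a word returns locations, not readable text.
print(index["books"])
# [('example.com/a', 3, False, True), ('example.com/b', 2, False, False)]
```

You can find which pages mention "books" and where, but nothing here hands you the page itself to read.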

Aside from the index, Google, Yahoo, MSN and Ask Jeeves also make “cached”
copies of pages available. You can see a copy of the exact page the search
engine spidered. These cached pages are kept separate from the index. They are
useful for when a page is down or when a copyright holder wants to see if someone
has stolen and cloaked their content to feed to a spider. But the legality of
showing such cached pages is also in question. No one to date has challenged them
in court. The reason seems to be that Google, which mainstreamed cached copies,
lets site owners opt out of caching if they want.

All major search engines also let you opt out of being in their indexes, as
well — a completely different thing — and another reason why the index
shouldn’t be confused with the cache. To take Google as an example, you can:

  • Have your page listed in the index (available to be found through
    searches) and have your page available as a cached copy
  • Have your page listed in the index but not cached
  • Have your page NOT listed in the index and thus also not cached.
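
As a rough sketch of those three states, assume the opt-outs are expressed through the robots meta tag values `noindex` and `noarchive` (the mechanism isn't spelled out above, so treat that as my assumption). The `robots_policy` function is a hypothetical helper for illustration, not anything a search engine publishes.

```python
def robots_policy(meta_content):
    """Return (indexed, cached) for the content value of a robots meta tag.

    Assumes "noindex" opts a page out of the index and "noarchive" opts it
    out of the cache. A page that is not indexed is also not cached.
    """
    directives = {d.strip().lower() for d in meta_content.split(",") if d.strip()}
    indexed = "noindex" not in directives
    cached = indexed and "noarchive" not in directives
    return indexed, cached

# The three cases from the list above.
for content in ["", "noarchive", "noindex"]:
    indexed, cached = robots_policy(content)
    print(repr(content), "indexed:", indexed, "cached:", cached)
```

Indexing and caching are separate switches here, which is the whole point: turning off the cache does not take a page out of the index.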

The ability to opt-out of the index is another reason why we really haven’t
had a major search engine sued over web search indexing. In addition, site
owners as Dave notes generally want to be indexed, so they can get traffic. In
fact, the reason so many are upset over the current indexing update at Google is
that they feel changes are causing them to lose traffic. But whether it is LEGAL
to do this type of indexing (as opposed to caching) still really hasn’t been
tested.

So indexing and caching are NOT the same. Back to Dave’s piece. He writes:

Google clearly does not have the right to make a copy of the book and
republish it without the permission of or compensation to the copyright owner.
The publishers appear to be on the right side of this one, and while I’m not a
lawyer, I can’t imagine that they won’t prevail in court.

I’m not a lawyer either, but I can completely imagine that Google might win.
Maybe not, but it’s hardly far-fetched, and even some lawyers
feel they may win.

Here’s the thing. Google is NOT, repeat NOT, republishing copies of books
that it scans out of libraries. This is a fundamental mistake that many people
seem to be making.

Google is scanning books into an index, just as it spiders web pages and adds
them to its index. It is making the books searchable by doing this, but that
process does not republish the books in a way you can read.

Think about it in web search terms. You can find a matching book, but there’s
NO hyperlink to click on that will take you to an online version of the book
itself. There’s just a snippet — maybe — of the text surrounding the words
matching what you looked for.
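
For a sense of how little a snippet shows, here is a minimal sketch. It is not how Google Print builds its snippets. The `snippet` function, its `radius` parameter and the sample text are all hypothetical, chosen only to show a few words of context around a match rather than the full work.

```python
def snippet(text, query, radius=5):
    """Return up to `radius` words on either side of the first match, or None."""
    words = text.split()
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and case when comparing.
        if word.strip(".,;:!?\"'").lower() == query.lower():
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            return " ".join(words[start:end])
    return None

# Hypothetical text standing in for a scanned page.
text = "Want the actual book? You have to buy it or find it in a library near you."

print(snippet(text, "library", radius=2))
# in a library near you.
```

The searcher learns the word appears and sees a sliver of context. The rest of the text stays out of view.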

Want the actual book? Google Print won’t give it to you. Instead, you have to
go someplace and buy it or find it in a library. Google Print merely tells you
the book may be what you’re looking for.

The only exception to this is if a publisher OPTS-IN. Not opt-out. If a
publisher chooses, then — and only then for books that are in copyright — will
Google display some of the actual book. The exact amount is left up to the
publisher.

So, I’ve covered that indexing means making a book (or web page) searchable
while caching means making a page (or a book) viewable online, without having to
go to the source material (the book or the page). Let’s recap then how both
systems work:

Search Type   Indexing   Caching   Snippets/Description
Web           Opt-Out    Opt-Out   Opt-Out
Books         Opt-Out    Opt-In    Opt-Out

As you can see, book search is actually more opt-in than web search is. Books
themselves aren’t cached or shown. But they are made searchable without

That system has worked on the web, because of the aforementioned feeling
that site owners want traffic. As for book publishers,
Why Don’t Book
Publishers Object To Web Indexing?
from me earlier covers how many seem not
to mind getting traffic through an opt-out system on the web, as well.

It remains for a court to decide whether opt-out is workable when it comes
to book indexing. If not, then absolutely, you might see search engines ponder
if web indexing itself — which really hasn’t been legally tested — is
something they’ll need to require an opt-in for. And if that’s the case, web
indexing will get pretty bad, since many publishers will simply fail to make the
opt-in effort.

What’s that third column, the snippets/description one? That’s the place
where I think book publishers might prevail, and certainly a change that Google
should consider.
Legal Experts Say Google Library Digitization Project Likely OK; Will It Revolve
Around Snippets?
covers how it’s possible that in some cases, even the
limited description that Google puts on pages might give away some of the value
of a book and thus real harm might be proven to a publisher. Solution? Make
showing descriptions an opt-IN thing.

Lastly, Dave makes a couple of other comments:

It’s time to realize that Google is no longer the little company we used to
love. They’re now a huge company that pushes individuals around like a lot of
other huge companies. They need some balance to their power. And it’s
ridiculous to blindly take their side on every issue. Sometimes they’re wrong,
and I believe this is one of those times. It’s certainly worth considering the
possibility that they’re wrong.

Absolutely, Google is a big giant company, not some tiny lovable start-up. If
anyone still has that idea, definitely get it out of your mind now. But whether
you think they push others around or not may depend on what area we’re talking
about. And whether a company of any type should be hated because they’re big is
another issue, as well. Nor should it be assumed that Google is always right.
They most definitely are not.

As for this:

This situation is much like the
we had with Google a few months back, when they wanted to put ads on our sites
without permission and without paying….and right now they’re putting ads on
your content without your permission, without compensating you. Now how do you
feel about that?

Dave is talking about Google’s
AutoLink. I’d
disagree that the links Google may insert if someone clicks on the right button
in the Google Toolbar are ads, so don’t freak out if you aren’t familiar with
AutoLink and are suddenly scanning your pages to find how Google got real
AdSense ads on it. They didn’t.

I would agree that Google should go the opt-out route with AutoLink, as I
wrote before.
But it’s also a harder argument to have, when there’s been the incredible
popularity of GreaseMonkey for
Firefox, which can insert links into pages. Plenty of people use
CustomizeGoogle, which inserts
links into Google’s own pages. Fair turnabout, some who hate AutoLink would say.
Yes, it is — but then it also weakens the argument that Google itself can’t let
people put links into pages with its own tools.

Postscript: Ray Gordon writes to say he has filed a complaint arguing
that web search on an opt-out basis is in violation of copyright. You can read
the filings here. I’ve
skimmed them, and he seems more concerned about usenet material (rather than web
material) that can’t be removed, apparently because others may have reprinted
his own posts.

Postscript 2: Dan Thies writes that a search index is even less readable than a spreadsheet, and he’s correct. I was trying to keep things simple yet familiar to illustrate the difference between words arranged on a page for reading and words indexed to power a search engine. As Dan says, he understands I was keeping things simple, but he also takes you deeper into how inaccessible to a “reader” a real index actually is.
