AOL Releases Search Data & Raises Privacy Concerns
Techmeme is reporting
a huge amount of concern over AOL releasing, then pulling, logs of searches done by
500,000 users over three months. The purpose of the release was to help search
researchers better understand user behavior in conjunction with an industry
event for search researchers happening in Seattle,
SIGIR. The data was posted on the
site, but has since been pulled.
Unlike what TechCrunch
suggests, this isn’t private data in that no personally identifiable
information has been released. Instead, actual usernames have been replaced with
anonymous ones. However, this still means it’s possible to track the behavior of
a particular user and potentially know who they are if their searches contained
personally identifiable information.
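To make the point concrete, here is a minimal sketch of how searches can still be grouped into a per-user profile once usernames are swapped for anonymous IDs. The log rows, ID numbers, and the `build_profiles` name are all invented for illustration; this is not the format of the AOL data.

```python
from collections import defaultdict

# Hypothetical log rows: (anonymous_id, query). Every value here is invented.
LOG = [
    (1001, "pizza delivery springfield"),
    (2002, "cheap flights"),
    (1001, "jane q example springfield"),
    (1001, "dentists near elm street"),
    (2002, "weather seattle"),
]


def build_profiles(rows):
    """Group queries by the anonymous ID that replaced the username."""
    profiles = defaultdict(list)
    for anon_id, query in rows:
        profiles[anon_id].append(query)
    return dict(profiles)


profiles = build_profiles(LOG)
```

Nothing in the rows names a user, yet all of user 1001's searches end up side by side, including one that looks like a person's name and town. That combination is what can turn an anonymous ID back into an identity.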
To understand this more,
gives some examples gleaned from the new AOL data. Also see
this example of someone who might be planning to murder his wife. Danny’s
Private Searches Versus Personally Identifiable Searches also covers the
general difference between private data and personally identifiable stuff.
How does what AOL released compare to what the Department of
Justice asked for from search engines earlier this year? It actually goes
further. The DOJ simply
not any further information that would allow a group of searches to be linked
with an individual, even if that individual was kept anonymous.
Danny may have more to say about this next week. He’s at the SES San Jose
conference this week and very busy with that, but he sent me some notes from a
brief review of the AOL move to give perspective here as he sees it.
Postscript From Danny: Just a few quick thoughts and updates in the
short time I have between sessions.
from John Battelle and
AOL apologizes for release of user search data from News.com have AOL
apologizing for the release, now said to be data involving about 658,000
individuals from March through May of this year. AOL says the release of the
data wasn’t properly vetted for privacy issues and that the release intentions
were innocent.
I believe that. Make no mistake, this was a big screw up. The researchers
providing the data didn’t think hard enough about how making it possible to
build a profile of individuals, even if they were given anonymous names, might
then make it possible to determine who those people are if they revealed enough
information in their searches.
In addition, it’s going to be very difficult for some law enforcement agency
not to want to subpoena AOL for actual user names when they read about things
that suggest a murder is being planned or may have happened, as covered above.
I’m not saying they’ll get it, but I think it’s almost inevitable that someone
will try. That will set off further privacy fireworks.
But yes, the original intention was innocent. I got an email about the
research site last week (and with my traveling all last week, simply did not
have a chance to check it out). Here’s what a researcher involved with it
wrote:
Over the last few years I have witnessed a divide developing within
Information Retrieval research – between the haves and have-nots. The ‘haves’
are the companies like Google, Yahoo, MSN, and ourselves, with lots of
resources and data. The ‘have-nots’ are people without those resources such as
academic researchers and smart guys at small companies. We want to be able to
help anyone work on great ideas by giving them the data and infrastructure
So we started building data sets and made them available for everyone to
test their ideas with. Each data set features a dynamic view, which allows you
to inspect the data without having to download it. We also built some APIs for
news, video, audio and podcasts, which will save people time from having to do
that themselves. We have tried to stay away from interfaces like web search as
those are already around.
There’s nothing evil in that. In fact, there’s much to appreciate.
We all use search engines so much, and they are so important in our daily
lives, yet they remain one of the most poorly researched media venues out there.
Yes, we’re getting new labs like
the one from
Yahoo at UC Berkeley. But most search behavior studies outside of the search
engines have depended on ancient search logs from places like Excite from back in 2001 or
so. Newer studies, if the search engines are doing them, simply don’t come out
often. So the intention to promote learning with this release was innocent, if
not honorable. The execution was poor and inexcusable.
This is the second major milestone in raising awareness of search privacy
issues this year. The first was the
Department of Justice action, which rightly focused on whether we need more safeguards
over what governments can request. Today’s upset highlights the protections that
are needed against corporate releases of data.
The good news is that perhaps it will spur better protections even more.
& Others Call For Unified Federal Privacy Protection covers how the major
search engines recently asked for better legal protections from the government.
But perhaps the search industry itself will move forward to develop better
privacy standards. I’ve hoped recently for some type of
Privacy Bill Of Rights. Since I doubt the government will act quickly,
perhaps the industry will go faster before a third incident causes searchers to
completely lose faith in them.
AOL’s Jason Calacanis, who runs Netscape, is proposing that AOL
not keep search records at all. That might sound like a nice idea, but it’s
not practical. Not keeping records raises issues with detecting click fraud, and with
the internal tracking a search engine uses to improve how it handles and
responds to queries. Putting better limits on how long data is kept might
help, as might developing ways to somehow remove personally identifiable
information that might get into search records.
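As a rough illustration of those two ideas, the sketch below drops records older than a retention window and redacts a couple of obvious patterns from the queries that remain. The 90-day window, the two patterns, and the function names are assumptions made for the example, not anything AOL or another search engine is known to use. Pattern matching like this would also miss most identifying details, such as names and addresses.

```python
import re
from datetime import datetime, timedelta

# Illustrative patterns only: a US Social Security number shape and an email shape.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_LIKE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def scrub(query):
    """Replace text matching the illustrative patterns with a placeholder."""
    query = SSN_LIKE.sub("[redacted]", query)
    return EMAIL_LIKE.sub("[redacted]", query)


def apply_retention(records, now, max_age_days=90):
    """Keep only (timestamp, query) records inside the window, scrubbing each query."""
    cutoff = now - timedelta(days=max_age_days)
    return [(ts, scrub(query)) for ts, query in records if ts >= cutoff]
```

Calling `apply_retention(records, datetime.now())` on a list of `(timestamp, query)` pairs returns the scrubbed records that fall inside the window. The harder problem, as the AOL data shows, is that a name or a street typed into a search box has no fixed shape to match against.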
Then again, Ixquick
recently tried a PR push on how it doesn’t keep records. Perhaps that’s
going to be a way for some players to win new users. Just make sure you also use
some tool like
Anonymizer to keep your ISP from logging your actions. Otherwise, your data
is still out there and being recorded in another way.
For more on search privacy issues, here’s a big giant list of recent posts: