Behind the Scenes at Yahoo Labs

Date published: June 23, 2004

Dr. Gary Flake is Principal Scientist & Head of Yahoo Research Labs. In this wide-ranging interview, he talks about the daily work of researchers at Yahoo Labs, and what they’re doing to make search better.

Dr. Flake, can you tell us about your background in IR and web search?

Gary Flake: Like a lot of people in this industry, my background is in machine learning. In the late 80s, I was working on what can now best be termed “toy” problems relative to today’s scales. In the early 90s, I started working on larger data mining problems (at first from the biomedical domain, and then later in industrial processes). Things seemed to get much more interesting with more data, so it was natural for me to switch to the Web and IR because that was where some of the most interesting data could be found.

How many people do you have on your team at Yahoo Research? How do you decide on what new products or services to work on? What’s a typical day like?

We have a couple dozen full-time members of Yahoo Research Labs (YRL), and a significantly larger number when you consider the extended R&D family within Yahoo that includes folks from the individual business units (BUs). While those two sets of researchers collaborate as often as possible, the focus of the full-time members of YRL is on areas that can impact the whole of the company, while the BU scientists focus more on problems specific to a BU.

How we pick what activities to pursue is a long story (and partially a function of my own history within industrial R&D), so please tolerate the longish answer to this question.

In general, YRL’s mission is to produce reusable R&D results, explore areas that fall between the cracks (i.e., between BUs), and look for (or perhaps produce) R&D results that could disrupt the industry. Steering the activities of a group with this sort of mission requires that one take a very holistic view of R&D and see the value of diversity. By this, I mean we explicitly choose to do a lot of things in vastly different ways. We work on short, medium, and long term projects. We have activities initiated by a scientist or engineer, but we also have some efforts which are done in response to an executive goal. We work on fundamental algorithms (occasionally producing deep theoretical results), but also ground our efforts in business problems. We also work on individual products, infrastructure improvements, or even business strategy.

The point of all this is that we mix it up. Done incorrectly, prioritizing all of these seemingly conflicting objectives could produce mediocrity. However, with the right blend, one finds a subtle interplay among these objectives that often yields something wonderful. My job is to keep the mix as interesting as possible, which requires that I look for what’s missing. If activities within YRL are chaotic, I’ll wear my dictator hat until things become less chaotic. If activities seem too focused or top-down, I’ll encourage some short-term anarchy.

That’s the philosophical answer that ignores the content of R&D. If we consider the content of the work, then my own preference is to look for activities that are eminently reusable (i.e., applicable to multiple BUs, so that we get more bang for the buck). I also believe that all of our efforts have to be interesting by at least one yardstick: scientific, mathematical, engineering, product, or strategic. The very best activities will be significant along all of those dimensions. For example, machine learning and data mining are both off the chart because both have high value no matter how they are evaluated, and a single result may be applicable to multiple BUs.

Typical day? I haven’t had one of those in a long time. In a typical week, I’ll make a trip or two, dissect code, brainstorm with product and business teams, indulge in some discussions with YRL members (which is like oxygen to me), read as much as possible, receive hundreds of emails (and write a few too), all while trying to balance and prioritize the team’s efforts in a rational way. The balancing part is perhaps the most subtle and important.

What’s wrong with web search today?

It’s easier for me to point to what web search should be and then highlight the differences. If web search were perfect, then it would produce an answer to every query that would be as good as, or better than, what the smartest people in the world could produce if they had as much time, data, and contextual information (about the user) as required to fulfill the query; and it would do all of this in a split second. In other words, the search engine would be an artificial intelligence (AI) so smart that if a correct answer could be found in theory with close to infinite resources, then it would find it. If a correct answer did not exist, then the search engine would give you the next best thing: an approximation, or perhaps even an explanation as to why your query has no perfect result. (And by the way, if we realized all of the above within my lifetime, I would consider myself lucky. That should give you an idea of what sort of time frame I am talking about.)

Alternative interfaces, like cell phones, voice, and snazzy graphical results are all nice, but in the end they represent relatively easy technology problems when compared to the challenges involved in realizing our hypothetical search engine. What really matters is what is under the hood.

Today, search engines have almost no understanding of words or language in any significant way. They exploit the statistical properties of words and links, but in no way is there anything going on akin to understanding. Search engines don’t recognize user intent, can’t distinguish goal-oriented search from browsing search, and are completely ignorant of the subtleties of how different concepts relate to one another. Moreover, they completely lack wisdom — i.e., they are very poor at distinguishing between trivia and something profound.
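To make the point about statistical matching concrete, here is a minimal sketch of term-statistics scoring using textbook TF-IDF. It is an illustration only, not a description of any real engine's ranking; the function name `tf_idf_scores` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query using plain TF-IDF term overlap.

    Hypothetical helper for illustration; whitespace tokenization only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores

# Toy corpus (made up for this sketch).
docs = [
    "jaguar speed in the wild",
    "jaguar car top speed",
    "big cats of south america",
]
scores = tf_idf_scores("jaguar speed", docs)
```

The scorer cannot tell whether the query is about the animal or the car, so both of the first two documents score above zero. The third document, which is about the animal's family but shares no query words, scores exactly zero.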

Do you still see a need for targeted crawlers and focused databases?

Certainly. Different types of data have different notions of timeliness. Moreover, besides structured and unstructured data, there is a whole universe of data best characterized as semi-structured. As long as those two observations hold, niche tools will always fill a niche, to coin a tautology. I don’t think a huge monolithic database will ever subsume all other databases. Instead, what we think of as a search engine will gradually evolve into a more subtle meta-search engine, blending its own data with other sources.
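One way to picture a meta-search engine blending its own results with other sources is rank fusion. The sketch below uses reciprocal rank fusion, a standard technique chosen here purely for illustration; nothing in the interview says Yahoo uses it. The source names (`web_index`, `news_feed`), the document IDs, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Blend several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several sources rise to the top.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document ID.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical result lists from two sources.
web_index = ["a", "b", "c"]
news_feed = ["c", "a", "d"]
blended = reciprocal_rank_fusion([web_index, news_feed])
```

Documents "a" and "c" appear in both lists, so they outrank "b" and "d", which each appear in only one.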

How can Yahoo Research Labs make search better?

Getting at the heart of user intent is very important to us and to the overall search team, with whom we work on a daily basis. I think this is how we will make the most impact in the short to medium term. I also think that current search engines have only scratched the surface of what can be done with link data. The commoditization of 64-bit hardware will also change the search engine landscape, and we intend to push on this front as well. Our long-term goal is to get as close as possible to what I described earlier as a perfect search engine. We are far, far from that goal. But that’s okay, too, because we know some of the next key steps towards realizing the larger goal.

Click here for part 2 of this interview.

Gary Price is News Editor of SearchEngineWatch.com.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.

Google reveals its caring side…
Melbourne Age Jun 24 2004 1:20PM GMT

Forging Yahoo’s Future…
The Wall Street Journal Online Jun 24 2004 1:05PM GMT

LII Releases Olympics Games Resource…
ResearchBuzz Jun 24 2004 1:03PM GMT

AskJeeves Drops Remaining Paid Inclusion Program…
ClickZ Today Jun 24 2004 12:33PM GMT

AOL Buys Advertising.com for $435 Million…
Reuters Jun 24 2004 12:15PM GMT

Ask Jeeves, Microsoft join email battle…
SiliconValley.com Jun 24 2004 11:55AM GMT

Spammers ‘know where YOU live’…
Silicon.com Jun 24 2004 7:05AM GMT

Bill to Curb Online Piracy Is Challenged as Too Broad…
New York Times Jun 24 2004 6:54AM GMT

AOL engineer sold 92m names to spammers…
Guardian Unlimited Jun 24 2004 1:59AM GMT

Text mining tools take on unstructured data…
Computerworld Jun 23 2004 9:44PM GMT

A Man, a Plan, a Pointless(?) Program…
Google Jun 23 2004 8:29PM GMT

Build First, Monetize Later? The Business of Search…
Search Engine Watch Forums Jun 23 2004 3:48PM GMT

Net pioneer predicts web future…
BBC Jun 23 2004 3:15PM GMT

Domain escapes Google…
The Times Jun 23 2004 2:07PM GMT

TechBrief: Terra gets offers for Lycos unit…
IHT Jun 23 2004 10:34AM GMT
