Google Launches Robots.txt File Checker; Now We Need Robots.txt Standardization

Author: Danny Sullivan

Date published: February 7, 2006

Categories: Industry

Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. The post "More stats and analysis of robots.txt files" on the official Inside Google Sitemaps blog explains more. Below, I’ll give you a real-life example of how nice this is in action, along with a plea that the robots.txt standard needs to become, well, more standard.

About two weeks ago, we wanted to stop Google from doing things on our Search Engine Watch Forums such as trying to reply to every thread over there. That meant blocking any URL that begins like this:

http://forums.searchenginewatch.com/newreply.php

See the newreply.php part at the end? We disallowed that in our robots.txt file, like this:

newreply.php

However, we weren’t sure if that would stop spidering of variations like this:

http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140

One of our technical people felt that the way the robots.txt protocol is written, it should do a prefix match. That means if a URL begins with what you’ve disallowed, it won’t be spidered. So neither of these URLs would get indexed:

http://forums.searchenginewatch.com/newreply.php
http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140

because they both begin with newreply.php, "beginning" meaning what comes after the domain name, forums.searchenginewatch.com.
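To make the prefix-matching idea concrete, here is a minimal Python sketch. It is not how any search engine actually implements the check; the function name is_blocked and the showthread.php URL are hypothetical, and the rule is written with the leading slash that a Disallow path normally carries.

```python
from urllib.parse import urlsplit


def is_blocked(url, disallowed_prefixes):
    """Return True if the URL's path (plus query string) begins with
    any of the disallowed prefixes. Illustrative only."""
    parts = urlsplit(url)
    # Compare against everything after the domain name.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(p) for p in disallowed_prefixes if p)


rules = ["/newreply.php"]

print(is_blocked("http://forums.searchenginewatch.com/newreply.php", rules))
print(is_blocked(
    "http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140",
    rules,
))
# Hypothetical URL that does not start with the disallowed prefix.
print(is_blocked("http://forums.searchenginewatch.com/showthread.php?t=1", rules))
```

Under this reading, both newreply.php URLs are blocked because the comparison only looks at how the path begins, not at what follows it.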

I wasn’t so certain. To be safe, I wondered if we should make use of the wildcard option that Google allows, such as:

newreply.php*

Looking around, I found one WebmasterWorld discussion where prefix matching did NOT seem to be working according to one person while another said it should be.

I contacted Google to get a definitive answer. They had to do a quick test to be certain. Yes, prefix matching does work. This was all we needed:

newreply.php

Today, the new tool means I don’t have to bug a Google contact for an answer. Even better, anyone can get the answer themselves without needing to know someone at Google.

Plug a URL from your site that you think your robots.txt file is supposed to be blocking into the robots.txt checker at Google Sitemaps. If it’s blocked, you’ll be told something like this:

Blocked by line 23: Disallow: /newreply.php

For me, that shows exactly what in my robots.txt file is keeping that content out. It’s also a helpful way to find out if there’s something in your robots.txt file accidentally blocking content that you DO want in Google.
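If you want a rough local approximation of that kind of report, the sketch below scans a robots.txt file and names the first Disallow line whose path is a prefix of the URL. It is a simplification and not Google's tool: the function name find_blocking_rule and the sample file (including the /private/ rule) are made up for illustration, and the code ignores User-agent sections, Allow lines and wildcards.

```python
from urllib.parse import urlsplit


def find_blocking_rule(robots_txt, url):
    """Return (line_number, rule_text) for the first Disallow rule whose
    path is a prefix of the URL's path and query, or None if none match.
    Simplified: User-agent grouping, Allow and wildcards are ignored."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if path and target.startswith(path):
            return number, line
    return None


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /newreply.php
"""

hit = find_blocking_rule(
    SAMPLE,
    "http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140",
)
if hit:
    print(f"Blocked by line {hit[0]}: {hit[1]}")
else:
    print("Allowed")
```

Running the same function over URLs you do want crawled is a quick way to spot a rule that blocks more than you intended.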

One odd thing. Google reports not understanding the crawl-delay values in our robots.txt file:

Crawl-Delay: 10 Syntax not understood

Google doesn’t support this option, but Ask, MSN & Yahoo do. But since the delay command is specifically called out in the robots.txt file for them (in our case for MSN and Yahoo), rather than for Google, I was surprised it bothered analyzing these sections of the robots.txt file at all. It should have just ignored them, rather than risk confusing people into thinking something was wrong.
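To illustrate why per-crawler sections matter here, the sketch below uses Python's standard urllib.robotparser module on a made-up robots.txt. The user-agent names msnbot and Slurp, the delay values and the layout are assumptions for the example, not a copy of our actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: crawl delays are set only in the sections
# addressed to specific crawlers, not in the catch-all section.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: /newreply.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A parser that honors sections only reports a delay to the crawlers
# the delay was written for.
for agent in ("msnbot", "Slurp", "Googlebot"):
    print(agent, parser.crawl_delay(agent))
```

A crawler reading only its own section, as this parser does, never sees a delay line meant for someone else, which is the behavior I expected from the checker.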

Overall, I’m thrilled with the new tool. I’d like to see the other search engines add similar ones. Even better, I’d like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:

Google allows wildcards, but others don’t.
Ask, MSN & Yahoo allow crawl delays (but don’t define minimum or maximum values). Google does not.
Ask & Google have ALLOW commands that no others support.

Postscript: Matt Cutts from Google has posted some good comments, pointing out that Google also has an allow command (I’ve updated my list above). Further, in comments to the post, he explains that Google doesn’t support crawl-delay yet because of concerns that some webmasters might set it too low by mistake.
