The robots.txt file is still the subject of conflicting opinions. For example, the good folks over at Yoast SEO have determined that you should open up everything on your website for crawling; others still contend that there are various directories that you should bar to search engine bots. Here is Moz’s take on the nefariously nuanced robots.txt file.
Here at Moz we have committed to making Link Explorer as similar to Google as possible, specifically in the way we crawl the web. I have discussed in previous articles some of the metrics we use to measure that performance, but today I wanted to spend a little bit of time talking about the impact of robots.txt on crawling the web.
Most of you are familiar with robots.txt as the method by which webmasters can direct Google and other bots to visit only certain pages on the site. Webmasters can be selective, allowing certain bots to visit some pages while denying other bots access to the same. This presents a problem for companies like Moz, Majestic, and Ahrefs: we try to crawl the web like Google, but certain websites deny access to our bots while allowing that access to Googlebot. So, why exactly does this matter?
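To make that selectivity concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleLinkBot` and `SomeOtherBot` user agents, and the `can_crawl` helper are all made up for illustration; they are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, a made-up
# link-index crawler is blocked entirely, and every other bot is only
# kept out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: ExampleLinkBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "ExampleLinkBot", "SomeOtherBot"):
        print(bot, can_crawl(ROBOTS_TXT, bot, "https://example.com/"))
```

Under these rules the same home page is open to Googlebot and closed to the hypothetical link crawler, which is exactly the asymmetry described above.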
Why does it matter?
As we crawl the web, if a site’s robots.txt file disallows our bot, we’re blocked from crawling the content it covers. We can see the links that point to the site, but we’re blind regarding the content of the site itself. We can’t see the outbound links from that site. This leads to an immediate deficiency in the link graph, at least in terms of being similar to Google (if Googlebot is not similarly blocked).
But that isn’t the only issue. Bots blocked by robots.txt also suffer a cascading failure in crawl prioritization. As a bot crawls the web, it discovers links and has to prioritize which links to crawl next. Let’s say Google finds 100 links and prioritizes the top 50 to crawl. A different bot finds those same 100 links but is blocked by robots.txt from crawling 10 of the top 50 pages. It is forced to crawl around those pages and choose a different set of 50. This different set of crawled pages will, of course, return a different set of links. In the next round of crawling, the two crawlers will not only differ in which pages they’re allowed to crawl; the pool of discovered links itself will differ because they crawled different pages in the first place.
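The 100-link example can be worked through in a few lines of Python. This is a toy sketch, not how any real crawler prioritizes: the page names, the `pick_crawl_set` helper, and the choice of which 10 pages are blocked are all assumptions made for illustration.

```python
def pick_crawl_set(ranked_links, budget, blocked=frozenset()):
    """Return the first `budget` links, in priority order, that are not blocked."""
    allowed = [link for link in ranked_links if link not in blocked]
    return set(allowed[:budget])


# 100 discovered links, already sorted from highest to lowest priority.
links = [f"page-{i}" for i in range(100)]

# Assumption: the second bot is blocked from 10 of the top 50 pages.
blocked_for_other = {f"page-{i}" for i in range(0, 50, 5)}

google = pick_crawl_set(links, budget=50)
other = pick_crawl_set(links, budget=50, blocked=blocked_for_other)

shared = google & other
only_google = google - other   # pages the second bot never sees
only_other = other - google    # lower-priority pages crawled instead

if __name__ == "__main__":
    print(len(shared), len(only_google), len(only_other))
```

After a single round the two crawl sets already differ by 10 pages in each direction, and every link found only on those pages feeds into the next round's candidates, so the gap compounds.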
Long story short, much like the proverbial butterfly that flaps its wings and eventually causes a hurricane, small changes in robots.txt that block some bots and allow others ultimately lead to very different results compared to what Google actually sees.
So, how are we doing?
You know I wasn’t going to leave you hanging. Let’s do some research. Let’s analyze the top 1,000,000 websites on the Internet according to Quantcast and determine which bots are blocked, how frequently, and what impact that might have.
The methodology is fairly straightforward:
Download the Quantcast Top Million
Download the robots.txt file, if available, from each of the top million sites
Parse the robots.txt to determine whether the home page and other pages are available
Collect link data related to blocked sites
Collect total pages on-site related to blocked sites
Report the differences among crawlers
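The parsing and reporting steps could be sketched roughly as follows. This is not the code used for the study: it assumes the robots.txt files have already been downloaded into a dictionary keyed by site, treats a missing file as "nothing blocked", and only checks the home page. The site names, the `ExampleLinkBot` user agent, and both helper functions are hypothetical.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser


def home_page_blocked(robots_txt, bot):
    """Return True if `bot` may not fetch "/" under the given robots.txt text."""
    if robots_txt is None:
        # Assumption: no robots.txt means the site blocks nothing.
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(bot, "/")


def count_blocked_sites(robots_by_site, bots):
    """Count, per bot, how many sites block that bot from the home page."""
    blocked = Counter({bot: 0 for bot in bots})
    for robots_txt in robots_by_site.values():
        for bot in bots:
            if home_page_blocked(robots_txt, bot):
                blocked[bot] += 1
    return blocked


# Made-up sample standing in for the downloaded robots.txt files.
SAMPLE = {
    "a.example": "User-agent: *\nDisallow: /\n",               # blocks every bot
    "b.example": "User-agent: ExampleLinkBot\nDisallow: /\n",  # blocks one bot only
    "c.example": None,                                         # no robots.txt found
}

if __name__ == "__main__":
    counts = count_blocked_sites(SAMPLE, ["Googlebot", "ExampleLinkBot"])
    for bot, total in counts.items():
        print(bot, total)
```

Comparing the per-bot totals is the simplest form of the "differences among crawlers" report; the link and page counts from the other steps would need a separate data source.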
Total sites blocked
The first and easiest metric to report is the number of sites