Robots.txt File Disallowed Pages Still Accrue and Pass PageRank

In a previous blog post, I talked about duplicate content and search engine optimization, and why it’s important to fix duplicate content. I personally prefer to completely remove all internal links to duplicate web pages rather than adding a “disallow” for them in the robots.txt file. Why?

PageRank. According to Matt Cutts of Google, even though you stop the crawlers from crawling a web page, that web page can still accrue PageRank. Let’s take a look at what Matt Cutts said in this old interview:

Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results.

It is important to note that even if a web page is not allowed to be crawled by the search engine, it can still show up in the search results. An example of this would be a web page that has external links (links from another web page pointing to it). If that’s the case, even though Google is told not to crawl the page, it can still show up in the search results.
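To make the point concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt rules, the example.com URLs, and the is_crawlable helper are all made up for illustration. It only shows what a “disallow” controls, namely whether a crawler may fetch a URL. It says nothing about whether other sites link to that URL, which is why a disallowed page can still turn up in results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a blog; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
Disallow: /tag/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/archives/2008/05/",
        "https://example.com/my-post/",
    ):
        status = "crawlable" if is_crawlable(url) else "disallowed"
        print(url, "->", status)
```

Note that nothing in these rules removes a URL from the link graph. They only answer the question “may I fetch this?”.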

It is also important to note that even if a web page is not allowed to be crawled by the search engine, it can still accrue PageRank. That web page can also still pass PageRank to another web page. In fact, even if the page cannot be crawled (disallowed in the robots.txt file) and even if a “noindex” meta tag is added to the page, the page can still accrue PageRank and it can still pass PageRank.

If you do not want a web page to pass PageRank, then you need to add the nofollow attribute to its links.
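If you want to check which links on a page already carry nofollow, a rough sketch like the one below can help. It uses Python's built-in html.parser. The LinkAudit class, the audit_links helper, and the sample markup are all hypothetical names made up for this example, and a real audit would fetch and parse your actual pages.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collect (href, is_nofollow) for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def audit_links(html: str):
    """Return a list of (href, is_nofollow) pairs found in html."""
    parser = LinkAudit()
    parser.feed(html)
    return parser.links


# Illustrative markup only.
SAMPLE = '<a href="/tag/seo/" rel="nofollow">seo</a> <a href="/my-post/">post</a>'

if __name__ == "__main__":
    for href, nofollow in audit_links(SAMPLE):
        print(href, "nofollow" if nofollow else "followed")
```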

A while back, my friend Aaron Wall talked about getting your blog out of Google’s Supplemental index, which, in my opinion, has to do with duplicate content on your blog. Unfortunately, the default installation of most blogs and most blog themes creates all sorts of duplicate content. If you want your blog posts to rank well in the search engines, you need to look at removing the duplicate content from your blog. The archive and tag pages of your blog tend to be duplicate content, so you need to take care of them.

There are two ways to do this:

  1. Remove the links to the duplicate content
  2. Disallow them in the robots.txt file

If you choose option two, which is to simply “disallow” the duplicate content from being crawled, those pages can still accrue and pass PageRank. If you disallow your blog archive pages from getting crawled, you also need to make sure that you add noindex to the pages. Make sure that you add the appropriate nofollow attributes to links, as well.
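As a quick way to confirm that a template really outputs the noindex directive, here is a small sketch, again using only Python's standard library. The RobotsMeta class, the has_noindex helper, and the sample HTML strings are assumptions made up for this example, not part of any blog platform.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Gather the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html: str) -> bool:
    """Return True if html contains a robots meta tag with noindex."""
    parser = RobotsMeta()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Illustrative archive-page markup only.
    page = '<head><meta name="robots" content="noindex, follow"></head>'
    print(has_noindex(page))
```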

I prefer to remove links to the archives and the tags and any other pages that I believe are duplicates, so that they cannot be crawled and people won’t link to them.

So, even if you plan a strategy to “optimize” your blog by disallowing pages on your site in the robots.txt file, it is important to consider the fact that those pages can still accrue PageRank and pass it on to other web pages. If someone can still “get to” such a web page in their web browser, you might also consider adding a noindex meta tag to the page and adding the appropriate nofollow attributes to links on your site.

  • Hi, I recently started checking my site in Google Webmaster Tools, where I could see plenty of crawl errors due to the robots.txt file. Does it actually hurt my site? How can I make sure that the robots.txt file doesn't prevent my site and content from being viewed?

    Please help me. Thanks again for making robots.txt files more understandable for novices like me.