Preventing reStructuredText Sources from Being Indexed by Search Engines

While searching Google for my website, I discovered that Google had indexed the reStructuredText (reST) sources of my posts as well.

What is reStructuredText?

reStructuredText is a lightweight markup format for writing text in a relatively simple way, which can then be converted to other formats such as HTML, LaTeX, and PDF. reST files use the .rst filename extension. As you can see from the footer of my blog, I use Nikola to generate it: I write the posts in reStructuredText, and Nikola converts them to HTML. Pretty neat, huh?
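
Under the hood, the conversion is handled by the docutils library, which Nikola builds on. As a minimal sketch (the post title and body below are made up for illustration), the reST-to-HTML step can be reproduced with a few lines of Python:

from docutils.core import publish_parts

# A made-up reST snippet standing in for a blog post.
rst_source = """
My Post Title
=============

Some *emphasized* text and a `link <https://example.com>`_.
"""

# publish_parts returns a dict of document fragments; "body" holds
# the HTML for the content without the surrounding page boilerplate.
parts = publish_parts(source=rst_source, writer_name="html")
print(parts["body"])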

On Topic: The issue is that Google is indexing those .rst files as well, which pollutes the search results. I am fairly sure nobody is searching for the reST sources of the content posted here.

So, I started looking for ways to prevent web crawlers from indexing the .rst files. I found that it can be done by adding the following rule to the site's robots.txt file, where * matches any sequence of characters and $ anchors the pattern to the end of the URL:

User-agent: *
Disallow: /*.rst$

Unfortunately, there is a catch with the rule above: wildcards are not part of the original robots.txt standard, and most web crawlers do not support them; only the major ones, such as Google, Bing, and Yahoo, do. More generally, I consider robots.txt to be about as useful as Do Not Track (DNT): both rely on the other party to behave and follow the rules, which rarely happens, and both give a false sense of security.
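
To make the wildcard caveat concrete, here is a small sketch using Python's standard-library robots.txt parser, which implements the original specification and therefore does not understand * or $ (the URL is a made-up example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*.rst$",
])

# A strict, spec-only parser treats "/*.rst$" as a literal path
# prefix, so the rule never matches and the fetch is allowed.
# A wildcard-aware crawler like Googlebot would disallow it.
print(rp.can_fetch("*", "https://example.com/posts/my-post.rst"))  # True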

Fortunately, the big search engines do respect robots.txt, wildcards included, so this should hopefully fix my problem with the search results.
