Tuesday 22 September 2015

Robots.txt

According to Google Search Console, robots.txt is a file placed at the root of your site. It gives instructions to web robots/crawlers about the parts of your website you don't want them to access, a convention referred to as the Robots Exclusion Protocol. The robots.txt file is publicly available, so it should not be used to hide any private information. Before crawling a website, say http://www.example.com, robots check for http://www.example.com/robots.txt. A robots.txt file includes directives such as:

User-agent: *  --> means the directives below apply to all robots
Disallow: /    --> tells robots not to visit any page on the site
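
For instance, a slightly fuller robots.txt might look like this (the directory names and the crawler name BadBot here are hypothetical):

# Keep all crawlers out of two directories, but allow the rest of the site
User-agent: *
Disallow: /admin/
Disallow: /tmp/

# Block one named crawler from the entire site
User-agent: BadBot
Disallow: /

The first group tells every crawler to stay out of /admin/ and /tmp/; the second blocks the single named crawler from all pages.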

The instructions in robots.txt are directives that Googlebot and other well-behaved crawlers obey, but many other crawlers pay no attention to them, especially malware robots that scan the web for security vulnerabilities and email-address harvesters used by spammers. Even if you tell robots not to access certain URLs, those pages can still be indexed through links from other websites and appear in Google SERPs. To block those pages from appearing completely, use robots.txt in combination with other URL-blocking methods such as:

  • password-protecting the files on your server (e.g. on an Apache server, edit the .htaccess file to make the directory password-protected; see the sketch after this list)
  • inserting robots directive meta tags into your HTML (the noindex tag; see the example below)
** If you are using the noindex tag, don't disallow (block) the page in robots.txt. If you do, the crawler will never see the noindex tag and the problem will still persist.
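
For example, a noindex directive is a single meta tag placed in the <head> of the page you want kept out of search results:

<meta name="robots" content="noindex">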
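
And a minimal sketch of password protection on Apache, assuming a hypothetical path for the password file, is an .htaccess file in the protected directory like this:

AuthType Basic
AuthName "Restricted Area"
AuthUserFile /path/to/.htpasswd
Require valid-user

The matching .htpasswd file can be created with Apache's htpasswd utility, e.g. htpasswd -c /path/to/.htpasswd username (the path and username are placeholders).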

