What is Robots.txt
December 28, 2012
One way to inform search engines which files and folders on your website to avoid is to use the Robots meta tag. However, since not all search engines read meta tags, the Robots meta tag can simply go unnoticed. A better way to inform search engines about your wishes is to use a robots.txt file.
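For reference, the Robots meta tag mentioned here goes in a page's head section; a typical example looks like this (the noindex, nofollow values are the standard ones, though support varies by search engine):

```html
<head>
  <!-- Ask robots not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```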
What Is Robots.txt?
Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means obligatory for search engines, but generally search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall, or a form of password protection). The fact that you put up a robots.txt file is something like putting a note "Please, do not enter" on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is too naive to rely on robots.txt to protect it from being indexed and displayed in search results.
The location of robots.txt is very important. It must be in the main directory, because otherwise user agents (search engines) will not be able to find it: they do not search the whole site for a file named robots.txt. Instead, they look first in the main directory (i.e. http://mydomain.com/robots.txt), and if they do not find it there, they simply assume that this site does not have a robots.txt file and therefore index everything they find along the way. So, if you do not put robots.txt in the right place, do not be surprised that search engines index your whole site.
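To illustrate, using the article's sample domain (the /myfolder/ path is a hypothetical example):

```
http://mydomain.com/robots.txt           <- crawlers look here, and only here
http://mydomain.com/myfolder/robots.txt  <- never checked; effectively invisible
```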
The concept and structure of robots.txt were developed more than a decade ago. If you are interested in learning more about it, visit http://www.robotstxt.org/ or go straight to the Standard for Robot Exclusion, because in this article we will deal only with the most important aspects of a robots.txt file. Next we will continue with the structure of a robots.txt file.
Structure of a Robots.txt File
The structure of a robots.txt file is pretty simple (and barely flexible): it is an endless list of user agents and disallowed files and directories. Basically, the syntax is as follows:
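In its simplest form, each record pairs a User-agent line with one or more Disallow lines:

```
User-agent: [name of the crawler, or * for all crawlers]
Disallow: [path that should not be crawled]
```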
"User-agent:" identifies the search engines' crawlers, and "Disallow:" lists the files and directories to be excluded from indexing. In addition to "User-agent:" and "Disallow:" entries, you can include comment lines; just place the # sign at the beginning of the line:
# All user agents are disallowed to see the /temp directory.
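Putting these pieces together, a small robots.txt using such a comment might read:

```
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
```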
The Traps of a Robots.txt File
When you start making complicated files, i.e. you decide to allow different user agents access to different directories, problems can begin if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user agents, directories, missing colons after User-agent and Disallow, etc. Typos can be tricky to find, but in some cases validation tools help.
The more significant problem is with logical errors. For instance:
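The description below refers to a file along these lines (a reconstruction; the directory names are taken from the discussion that follows):

```
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
Disallow: /images/
Disallow: /cgi-bin/
```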
The above example is from a robots.txt that allows all agents to access everything on the site except the /temp directory. Up to here it is fine, but later on there is another record that specifies more restrictive terms for Googlebot. When Googlebot starts reading robots.txt, it will see that all user agents (including Googlebot itself) are allowed all folders except /temp/. This is enough for Googlebot to know, so it will not read the file to the end and will index everything except /temp/, including /images/ and /cgi-bin/, which you thought you had told it not to touch. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
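Under the first-match reading described above, one defensive fix is to put the more specific Googlebot record first and repeat the shared rule inside it, so the intent survives no matter which record a crawler settles on:

```
User-agent: Googlebot
Disallow: /temp/
Disallow: /images/
Disallow: /cgi-bin/

User-agent: *
Disallow: /temp/
```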
Tools to Generate and Validate a Robots.txt File
Having in mind the simple syntax of a robots.txt file, you can always read it over to see if everything is OK, but it is much easier to use a validator, like this one: http://tool.motoricerca.info/robots-checker.phtml These tools report common mistakes like missing slashes or colons, which, if undetected, compromise your efforts. For instance, if you have typed:
User agent: *
this is wrong because there is no hyphen between "User" and "agent" and the syntax is incorrect.
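You can also check rules programmatically. Here is a minimal sketch using Python's standard urllib.robotparser module; it answers whether a given URL may be fetched under a set of rules (the domain and paths are the article's examples), rather than performing full syntax validation:

```python
from urllib.robotparser import RobotFileParser

# Rules supplied directly as text; parse() accepts an iterable of
# lines, so there is no need to fetch them from a live site.
rules = """\
User-agent: *
Disallow: /temp/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) applies the parsed rules to a URL.
print(rp.can_fetch("*", "http://mydomain.com/temp/file.html"))    # False: /temp/ is disallowed
print(rp.can_fetch("*", "http://mydomain.com/public/page.html"))  # True: no rule matches
```

This catches logical surprises like the Googlebot example earlier: you can ask the parser directly which URLs a given agent is allowed to fetch instead of reasoning about record order by hand.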
In those cases, when you have a complex robots.txt file, i.e. you give different directions to different user agents or you have a long list of directories and subdirectories to exclude, writing the file manually can be a real pain. But do not worry: there are tools that will generate the file for you. What is more, there are visual tools that let you point and select which files and folders are to be excluded. And even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a dropdown list of user agents and a text box for you to list the files you do not want indexed. Honestly, it is not much of a help, unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is better than nothing.