Do you know what robots.txt is? This post explains what the file does and why it matters for your site.
Ensuring that your site appears in user searches is essential to the success of any Digital Marketing strategy.
To achieve this goal, it is common to invest in SEO, Content Marketing, and other actions that attract the attention of search engines and increase your page traffic.
However, there are pages on your site that you do not want search engines to crawl, such as login pages and pages with files meant only for clients or members of your team.
The robots.txt file helps you keep search robots away from these pages.
What is robots.txt?
A robots.txt file is saved in the root folder of your site and tells the search robots of Google, Bing, and many others which pages of your site you do not want them to access.
As its name implies, robots.txt is a plain .txt file that you can create in Notepad or any other text editor, so no special tool is needed.
Robots.txt uses the standard Robots Exclusion Protocol format, a set of commands that search robots read to find out which directories and pages on your site they should not access.
Since the file is saved directly in the site’s root folder, viewing the robots.txt file of any site is quite simple: type the site’s address in your browser and add “/robots.txt” to the end of the URL.
Doing so can give you some interesting insights and show you which addresses your competitors want to keep out of search results.
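As a small illustration of where the file lives, the sketch below derives the robots.txt address from any page address on the same site. The function name robots_txt_url and the example.com addresses are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only for the demonstration.
print(robots_txt_url("https://www.example.com/blog/post.html?page=2"))
# prints https://www.example.com/robots.txt
```

Opening the printed address in a browser shows the file, if the site has one.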
What is robots.txt for?
As we said, robots.txt serves to give specific orders to search robots.
To help you understand a little better, we have listed its main functions.
Controls access to image files
Robots.txt can prevent image files on your site from appearing in search results.
This helps control access to some important information, such as infographics and technical product details.
Because these images are not displayed in search results, users have to visit your page to see them, which may be more interesting for your company.
However, it is important to note that robots.txt does not prevent other pages and users from copying and sharing the links of your images.
There are other tools to help you with this goal.
Controls access to web pages
Your site is also made up of non-image files: the web pages themselves.
In addition to keeping search robots away from pages that are restricted or irrelevant to your strategy, robots.txt helps prevent the server hosting your site from being overwhelmed by search engine requests, which can save your business money.
However, it is important to remember that, as with images, users can still find some of your pages if they have the direct access link to them.
Block access to resource files
In addition to blocking images and web pages, robots.txt can block access to less important script and style files, reducing the load on your servers.
However, use this function with caution: if those files are indispensable for loading your page correctly, blocking them makes it harder for crawlers to render and analyze the page.
How to create a robots.txt file
Creating a robots.txt file is very simple and requires knowing only a few specific commands.
You can create the file in Notepad or another text editor of your choice.
You will also need access to the root folder of your domain, because that is where the finished file must be saved.
Once you have that access, you need to learn some of the robots.txt syntax and commands.
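If you prefer to generate the file with a script rather than a text editor, a minimal sketch could look like the one below. The function name write_robots_txt, the “/login/” path, and the temporary folder standing in for the site root are assumptions for illustration; the commands themselves are explained in the next sections.

```python
import tempfile
from pathlib import Path


def write_robots_txt(site_root: str, lines: list[str]) -> Path:
    """Write the given command lines to <site_root>/robots.txt."""
    target = Path(site_root) / "robots.txt"
    # One command per line, plain text, ending with a newline.
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


# A temporary folder stands in for the real root folder of the site.
with tempfile.TemporaryDirectory() as site_root:
    saved = write_robots_txt(site_root, ["User-agent: *", "Disallow: /login/"])
    print(saved.name)
    print(saved.read_text(encoding="utf-8"))
```

On a real site you would pass the path of your domain’s root folder instead of the temporary one.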
The robots.txt commands
Like HTML and programming languages, robots.txt has its own syntax: each line holds one command, written as the command name, a colon, and a value.
Search robots read these commands to decide which pages of your site they may visit.
Here are some of the main commands from the robots.txt file:
The User-agent Command
Your robots.txt file can hold specific orders for each search robot. The User-agent command names the robot that the following orders apply to.
For the name of each User-agent, you can consult the Web Robots Database, which lists the robots of the main search engines in the market.
Google’s main search robot is Googlebot.
If you wanted to give it specific orders, the command you entered in your robots.txt would be this:
User-agent: Googlebot
If you wanted to leave specific orders for the Bing search robot, the command would be this:
User-agent: Bingbot
As you can see, you just change the name of the User-agent.
And if you want to give general directions to be followed by all search robots, just replace the User-agent name with an asterisk. It would be like this:
User-agent: *
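To see how User-agent groups behave, the sketch below feeds a small robots.txt to Python’s standard urllib.robotparser module and asks what each robot may fetch. The file contents and the “report.pdf” file are made up for this example, and real search engines may interpret a file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /beta.php

User-agent: *
Disallow: /files/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, so only that group applies to it.
print(parser.can_fetch("Googlebot", "/beta.php"))          # False
print(parser.can_fetch("Googlebot", "/files/report.pdf"))  # True

# Bingbot has no group of its own and falls back to the "*" group.
print(parser.can_fetch("Bingbot", "/beta.php"))            # True
print(parser.can_fetch("Bingbot", "/files/report.pdf"))    # False
```

Note that a robot with its own group does not also inherit the rules of the asterisk group in this parser, so shared rules have to be repeated in each group.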
The Disallow Command
The Disallow command tells search robots which pages or directories of your site they should not access.
As with the User-agent command, you simply enter the value, in this case the page address, after the command.
To tell the robots not to access the “beta.php” page of your site, the command would be this:
Disallow: /beta.php
You can also block access to specific folders.
If you needed to block access to the “files” folder, the command would be this:
Disallow: /files/
You can also block access to content that starts with a specific letter.
To block access to all folders and files that begin with the letter “a”, this would be the command:
Disallow: /a
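The three Disallow examples above all work by matching the beginning of the address. A quick way to check this is to parse them with Python’s standard urllib.robotparser module, as in the sketch below. The robot name “ExampleBot” and the test paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The three Disallow examples from this section, applied to every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /beta.php
Disallow: /files/
Disallow: /a
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths on the site.
paths = ["/beta.php", "/files/manual.pdf", "/about.html", "/archive/2020/", "/contact.html"]
for path in paths:
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(f"{path}: {verdict}")
```

Only “/contact.html” is reported as allowed: “/about.html” and “/archive/2020/” are caught by the rule for the letter “a”, which shows how broad a short prefix can be.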
The Allow Command
The Allow command tells search robots which pages or directories of your site they may access and index.
By default, search robots may access and index every page on your site except those you block with the Disallow command.
Therefore, the Allow command is recommended only when you block a folder or directory with the Disallow command but still want a specific file or folder inside the blocked directory to be indexed.
If you want to block access to the “files” folder but need to allow access to the “products.php” page inside it, the command would look like this:
Disallow: /files/
Allow: /files/products.php
If you want to block access to the “files” folder, but need to allow access to the “projects” folder, the command would be like this:
Disallow: /files/
Allow: /files/projects/
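When an address matches both a Disallow and an Allow rule, the Robots Exclusion Protocol standard (RFC 9309) and Google’s documentation say the most specific rule, meaning the longest matching path, should win, with Allow preferred on a tie. The sketch below is a simplified illustration of that idea, not any search engine’s actual code: the function name is_allowed and the test paths are assumptions, and wildcards are not handled.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide access using the 'most specific rule wins' convention.

    rules holds (command, path_prefix) pairs such as ("Disallow", "/files/").
    Wildcards are deliberately not handled in this sketch.
    """
    matches = [
        (len(prefix), command == "Allow")
        for command, prefix in rules
        if path.startswith(prefix)
    ]
    if not matches:
        return True  # no rule applies, so access is allowed by default
    # Longest prefix wins; on a tie, True (Allow) sorts above False (Disallow).
    return max(matches)[1]


rules = [("Disallow", "/files/"), ("Allow", "/files/projects/")]
print(is_allowed("/files/projects/plan.html", rules))  # True
print(is_allowed("/files/invoices.pdf", rules))        # False
print(is_allowed("/index.html", rules))                # True
```

The “projects” folder stays reachable because its Allow rule is longer, and therefore more specific, than the Disallow rule for the whole “files” folder.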
The Sitemap Command
Another useful command for a robots.txt file is Sitemap, which points to your site’s sitemap and helps search robots identify all the pages on your site.
However, it is a command that has fallen out of use, mainly because Google Webmaster Tools (now called Google Search Console) lets you quickly report the location of your sitemap file, among other functions.
To enter your sitemap address, your sitemap file must already be saved on your site, typically in the root folder. The command to enter this address is this:
Sitemap: http://www.yoursite.com/sitemap.xml
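To confirm that a Sitemap line is picked up, you can read it back with Python’s standard urllib.robotparser module, whose site_maps() method is available from Python 3.8. The robots.txt contents below are a made-up example built around the placeholder address used above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt containing a Sitemap line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /files/

Sitemap: http://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap addresses, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.yoursite.com/sitemap.xml']
```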
What are the limitations of robots.txt?
While it is very useful for steering search engine access to your pages, you must recognize that robots.txt has some limitations.
Knowing them is important, especially to identify when you need other mechanisms to keep your URLs from being easily found in searches.
Robots.txt file instructions are directives only
Although using robots.txt is an industry standard, search engines are not required to follow all your orders.
This means that although Google’s search robots follow the robots.txt file instructions, other search engines may not do the same.
That’s why it’s important to combine the robots.txt file with other methods of hiding your pages from search engines, such as password-protected access or noindex meta tags in your HTML code.
Each search robot can interpret syntax in different ways
Although they follow an international standard, the commands entered in robots.txt can be interpreted differently by each search robot.
Therefore, to ensure its correct use, you need to know the syntax that each search engine expects.
This means that in addition to understanding how Google interprets robots.txt information, you may also need to learn the methodology of Bing, Yahoo, and any other search engine on the market.
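A concrete case of differing interpretation is rule order. Google documents that the most specific rule wins regardless of order, while Python’s standard urllib.robotparser module has traditionally applied the first matching rule in file order. The sketch below parses the same two rules in both orders; the helper name can_fetch, the robot name “ExampleBot”, and the paths are assumptions for illustration, and the exact result for the first file depends on the parser you test with.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str) -> bool:
    """Parse robots_txt and report whether a generic robot may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)


# The same two rules, written in opposite orders.
DISALLOW_FIRST = "User-agent: *\nDisallow: /files/\nAllow: /files/projects/\n"
ALLOW_FIRST = "User-agent: *\nAllow: /files/projects/\nDisallow: /files/\n"

page = "/files/projects/plan.html"
# A first-match parser blocks the page here; a most-specific-rule parser allows it.
print("Disallow first:", can_fetch(DISALLOW_FIRST, page))
# Both kinds of parser allow the page when the Allow rule comes first.
print("Allow first:   ", can_fetch(ALLOW_FIRST, page))
```

If the two lines print different answers, the parser is order-sensitive, so placing Allow rules before the broader Disallow rule is the safer layout for robots that behave this way.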
Robots.txt directives do not prevent other sites from referencing your URLs
A very common mistake is to assume that content blocked by robots.txt cannot be found in other ways by users or even by your competitors.
If a restricted URL is linked from other websites or blogs, the page may still appear in search results.
That’s why it’s essential to insert the noindex tag, or even protect the page with a password, to ensure no unwanted visitor reaches it.
You may need to give specific orders for each search robot
Some search robots follow their own rules and logic, which may eventually require you to determine specific rules for each in your robots.txt file.
Besides increasing your workload, this can lead to errors when creating your file.
So be very careful when setting rules for specific robots, making sure the instructions are clear for each robot.
Now that you know what a robots.txt file is and how to create one, managing your site becomes easier, because you can make sure that search robots visit only the pages that matter to your business.
If you would like to learn more about how to optimize your site for search robots, read our post SEO Checklist: Learn step by step how to get your post up the search engine rankings.