Adding a robots.txt file to your website is one of the easiest technical SEO techniques to introduce, but it’s also one that you can just as easily botch.
A simple typo, incorrect URL, or misinterpreted rule could be disastrous for your website’s search visibility. In worst-case scenarios, it could even lead to key pages on your site completely disappearing from Google!
Nevertheless, a properly implemented robots.txt gives your website, especially a larger site with duplicate content, a way to control how bots and crawlers such as Googlebot and Bingbot crawl it.
In this article, we’ll give you:
The complete lowdown on what exactly a robots.txt file is and why it might affect your brand’s site.
Complete step-by-step instructions on how to craft a web-crawler-friendly robots.txt file.
All the tools and tricks for testing and optimization.
Well, what are we waiting for? Let’s give you the tools and instructions you need to crack the code on a perfect robots.txt file.
What is Robots.txt?
Robots.txt is a plain text file living on your website that tells search engines and web crawlers such as Google where they can or cannot go. Following a set of web standards called the robots exclusion protocol, robots.txt files give search engines clear directions on where they should go by ‘allowing’ or ‘disallowing’ parts of your site.
In simpler terms, think of a robots.txt file as a guide at a nature park who tells you where in the park you are and aren’t allowed to wander.
Here’s a very basic example of a robots.txt:
User-agent: *
Disallow: /
A ‘user-agent’ line indicates which web crawler or search engine needs to obey the directives that follow. In this case, the wildcard means all web bots. The forward slash after ‘disallow’ tells the robot that no page on the entire site should be visited.
Here’s another example:
User-agent: Googlebot
Disallow: /blog/
In this instance, the robots.txt is telling the user agent ‘Googlebot’ not to visit any page on the site whose URL path starts with ‘/blog/’.
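As a rough way to see how a crawler would read the two examples described above, Python’s standard-library urllib.robotparser can parse the same rules and answer "may this agent fetch this path?". This is only a sketch: the parser is a simplified stand-in for a real search engine, the paths being queried are made up, and Bingbot is used purely as an example of "some other crawler".

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# First example: every crawler is barred from the whole site.
block_everything = parse_rules("User-agent: *\nDisallow: /")

# Second example: only Googlebot is barred, and only from /blog/.
block_blog_for_google = parse_rules("User-agent: Googlebot\nDisallow: /blog/")

print(block_everything.can_fetch("Googlebot", "/any-page"))           # False
print(block_blog_for_google.can_fetch("Googlebot", "/blog/my-post"))  # False
print(block_blog_for_google.can_fetch("Googlebot", "/about"))         # True
print(block_blog_for_google.can_fetch("Bingbot", "/blog/my-post"))    # True
```

Note how the second file leaves every crawler other than Googlebot unrestricted, because no group of rules names them.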
Why is Robots.txt important?
It’s all well and good knowing what the text file can do, but how is this relevant to your site? Well...
A little thing called crawl budget
If you have an eCommerce website or if your site has been up and running for a long time, you more than likely have a decent number of pages sitting around. When search engines such as Google come around, they’ll start crawling and indexing everything they can find on your site in their eternal quest to serve the best results for users across the web.
Here’s the issue: if your website has too many pages, search engines will naturally take longer to complete their crawl. New or updated pages can then take longer to show up in search, which may in turn hurt your keyword rankings.
This is where your ‘crawl budget’ comes in: the number of pages a search engine will crawl on your website within a certain timeframe. How much crawl budget you have depends on a number of factors, including site speed, URL popularity, and more. So be sure to take care of your site with properly thought-out page speed optimizations and various other SEO strategies to maximize your budget.
Here’s another issue: if your total number of pages exceeds your crawl budget, Googlebot may not get around to crawling and indexing some of them at all. Essentially, some of the content you’ve spent countless hours optimizing for SEO may not rank for anything.
We want to make sure that your crawl budget is spent wisely, so taking the time to create a robots.txt file helps ensure that the pages that actually matter get crawled and indexed.
Prevent the public from accessing private content
While you’ll likely want most of your website indexed, there might be certain pages you don’t want seen.
For example, you’ll likely have a login page for employees to access the backend. There might be landing pages containing sensitive or confidential information. You might be staging a new website design for your brand.
Using robots.txt, you can ask crawlers to stay away from these pages, which makes it less likely that the public will accidentally come upon a page they shouldn’t through search. Bear in mind that the robots.txt file itself is publicly readable and is not a security measure, so truly confidential pages should also sit behind a login.
It’s important to note that, just like travellers who choose not to listen to a guide at the nature park, a few crawlers might choose to ignore some or all of the rules in a robots.txt file. Thankfully, Google is a good bot and will generally obey every instruction you give it.
Ready to create a robots.txt file that enhances your SEO? Let’s get started!
Finding your Robots.txt file
Finding the robots.txt file is actually rather easy if you’re purely looking to see if it exists. Moreover, it’s something you could do on most major websites across the internet if you want to do some quick competitor research on robots.txt files.
To see your robots.txt page, simply type your site’s URL into your browser’s address bar, then add /robots.txt at the end.
You’ll likely encounter one of the following scenarios:
1. There’s a complete robots.txt file, like ours!
2. There’s a robots.txt file, but it’s empty! Disney’s site is one example.
3. There’s no robots.txt, and the page returns a 404 error message (the page doesn’t exist), just like Burger King.
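The lookup and the three scenarios above can be sketched in a few lines of Python. The function names here are made up for illustration, and the actual HTTP request is deliberately left out: the assumption is that you pass in the status code and response body from whatever HTTP client you prefer.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for whatever site page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the top level.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def describe(status_code: int, body: str) -> str:
    """Map an HTTP response onto the three scenarios listed above."""
    if status_code == 404:
        return "no robots.txt (404)"
    if status_code == 200 and not body.strip():
        return "robots.txt exists but is empty"
    if status_code == 200:
        return "robots.txt exists and has content"
    return f"unexpected response ({status_code})"


print(robots_url("https://www.examplesite.com.au/blog/some-post?ref=1"))
# https://www.examplesite.com.au/robots.txt
print(describe(200, "User-agent: *\nDisallow:"))  # robots.txt exists and has content
print(describe(200, "   \n"))                     # robots.txt exists but is empty
print(describe(404, "Not Found"))                 # no robots.txt (404)
```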
After checking whether your robots.txt file exists, you can access and edit the actual file by opening your website’s root directory via an FTP client or a hosting control panel such as cPanel.
If you’re not confident about opening the backend of your website and accessing your root directory, we recommend grabbing an SEO agency or someone with knowledge of site management to handle finding and editing the robots.txt file.
How to create a Robots.txt file (using best practice)
To create a robots.txt file, we start by opening any plain text editor such as Notepad, Sublime Text, or TextEdit if you’re on a Mac. While you could make a robots.txt file using a word processor such as Microsoft Word or Google Docs and save it as a text file, the program may add extra formatting characters (curly quotes, for instance) to your robots.txt and lead to incompatibility issues. So keep it simple!
Whether or not you currently have a robots.txt file on your site, we recommend starting off with a blank file. That way you build a robots.txt file you completely understand.
Before we get started, you’ll need to understand some of the basic syntax used to ensure every rule you place works as intended.
Here’s a quick breakdown of common syntax:
User-agent: The name of the search engine robot (such as Googlebot) that the rules underneath apply to.
For example: User-agent: Googlebot
Disallow: The command that tells a user agent not to access a particular URL or URL path.
For example: Disallow: /reviews/
Allow: The command that tells a user agent that a particular child URL of a disallowed parent URL can be accessed.
For example: Allow: /reviews/product/
Sitemap: This describes the location of the xml sitemap file (or files) on your website.
For example: Sitemap: https://www.examplesite.com.au/sitemap.xml
*: Known as the wildcard, this can be used on any of the above directives except sitemap.
For example: User-agent: * designates that all bots adhere to the following rules.
For an even more thorough breakdown of robots.txt rules and syntax, we strongly recommend checking out Google’s explanation.
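To see how Disallow and Allow interact, here is a small sketch that feeds the example paths from the breakdown above into Python’s standard-library urllib.robotparser. One assumption to flag: this parser applies the first matching rule in file order, whereas Google documents that the most specific (longest) matching rule wins. Listing the Allow line first keeps both interpretations in agreement. The product and contact URLs are invented for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed first because urllib.robotparser applies the first rule
# that matches, in file order.
RULES = """\
User-agent: *
Allow: /reviews/product/
Disallow: /reviews/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("Googlebot", "/reviews/"))                # False
print(parser.can_fetch("Googlebot", "/reviews/product/"))        # True
print(parser.can_fetch("Googlebot", "/reviews/product/widget"))  # True
print(parser.can_fetch("Googlebot", "/contact"))                 # True
```

The parent path stays blocked while the child path is carved out, and anything not mentioned at all remains crawlable.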
Now that you have the syntax down, let’s go through how to construct a robots.txt file step-by-step!
First, create a plain text file and call it ‘robots.txt’. Don’t call it anything else: the filename is case-sensitive, so a typo or any uppercase letters mean a crawler won’t recognize it as a valid file at all. Then, open it up using a text editor.
Next, you’ll need to designate the user agent. To keep it simple, let’s make these directives affect all web crawlers, like this:
User-agent: *
Next, on a new line underneath the user agent, we’ll set up the disallow rule like this:
Disallow:
It’s important to write each directive on its own line to keep Googlebot from becoming confused. For now, since we don’t want to disallow anything (yet), we’ll leave the value blank.
Next, we’ll link the robots.txt file to your XML sitemap file. You can do this on a new line, separated from the user agent and disallow rules, by typing in the following (swapping in your own sitemap’s address):
Sitemap: https://www.examplesite.com.au/sitemap.xml
If you have more than one sitemap on your website, make sure to add them in as well, each on its own Sitemap line.
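As a quick sanity check, the file built in the steps above can be run through Python’s standard-library urllib.robotparser. This is a sketch under a few assumptions: the sitemap address is the placeholder from the syntax breakdown rather than a real one, the paths are invented, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The three lines built in the steps above. The sitemap URL is the
# placeholder from the syntax section, not a real address.
BASIC_FILE = """\
User-agent: *
Disallow:

Sitemap: https://www.examplesite.com.au/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(BASIC_FILE.splitlines())

# An empty Disallow value blocks nothing, so every path is crawlable.
print(parser.can_fetch("Googlebot", "/"))             # True
print(parser.can_fetch("Bingbot", "/any/page.html"))  # True
print(parser.site_maps())  # ['https://www.examplesite.com.au/sitemap.xml']
```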
Congrats! You’ve created a basic robots.txt file that allows every search engine to crawl every page on your website. Now, let’s go one step further and transform this nifty little tool into one that boosts your SEO rankings.
Optimizing Robots.txt for SEO
Before you start adding in another command, it’s vital that you take the time and do a proper audit of every page on your website. Nothing is worse than creating a disallow command for a particular URL path only to find out a while later that one of its subpages was vital to your SEO!
It’s important to note that robots.txt does not forcibly block a crawler or search engine from pages on your site; it simply provides a clear map or guide for crawlers like Googlebot to follow. While Google will generally follow every command in the text file, other bots may choose to ignore these and access directories anyway.
The best place to start is by identifying which sections of your site aren’t (or aren’t meant to be) shown to the public.
For example, websites that use WordPress might think about disallowing the login page to the backend:
User-agent: *
Disallow: /wp-admin/
Similarly, if you wish to disallow any other page of the website, it’s as simple as this:
Disallow: /page/
What’s more, if you wanted to allow access to a specific subpage from the directory ‘page’, you would add in the following command (the subpage name here is just a placeholder):
Allow: /page/subpage/
Now that we’ve learned the basics, here are some pages that you might consider adding to the disallow directives.
In general, you’d want to minimize the amount of duplicate content on your site. Not only can duplicate content cause confusion amongst potential customers visiting your website, but it can also negatively impact your SEO results.
Nevertheless, in the situation where some duplicate content needs to exist, adding in a disallow directive will tell any bot to only crawl through the version you want.
Some content on your site such as thank you pages, a login page for WordPress, or resources such as images or PDFs might contain vital and private information you’d rather the public not stumble upon. Placing a disallow rule on these pages helps protect your more private content from being easily accessed on a search engine.
Specifying a user agent
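Thinking about creating a set of commands for specific user agents such as Googlebot? All you need to do is type in that specific user agent, then list the directives you wish that user agent to follow, like so:
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /otherpage/

User-agent: *
Disallow: /wp-admin/
Disallow: /page/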
Note that each user agent’s set of commands is separated from the next by a blank line. That’s because the commands in each group only apply to the user agent named at the top of that group.
In the example above, while both user agents are disallowed from the WordPress login URL ‘wp-admin’, Googlebot is not allowed to crawl the URL ‘otherpage’, while all other user agents cannot crawl the URL ‘page’.
Also, keep in mind that a user agent will only follow the group of directives that most specifically applies to it. So, if Facebot (Facebook's own crawler) happened upon the example above, as there is no set of directives that explicitly mentions its user agent, it will follow the wildcard group.
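The two rule groups described above can be checked with a short Python sketch using the standard-library urllib.robotparser. The exact paths (with leading and trailing slashes) are an assumption about how ‘wp-admin’, ‘otherpage’ and ‘page’ would be written in a real file.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /otherpage/

User-agent: *
Disallow: /wp-admin/
Disallow: /page/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot has its own group, so only that group's rules apply to it.
print(parser.can_fetch("Googlebot", "/wp-admin/"))   # False
print(parser.can_fetch("Googlebot", "/otherpage/"))  # False
print(parser.can_fetch("Googlebot", "/page/"))       # True

# Facebot is not named anywhere, so it falls back to the wildcard group.
print(parser.can_fetch("Facebot", "/wp-admin/"))     # False
print(parser.can_fetch("Facebot", "/otherpage/"))    # True
print(parser.can_fetch("Facebot", "/page/"))         # False
```

Notice that Googlebot may crawl ‘/page/’ here: once a crawler finds a group naming it, the wildcard group no longer applies to it, so any rule you want every bot to follow has to be repeated in each group.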
Test your robots.txt file
With your robots.txt file fully optimized, it’s time to test it to make sure everything you’ve added is valid and free of errors. Luckily, Google has its very own robots.txt tester tool, which you can use on your own site.
After signing into the Webmaster account linked to your website, head on over to the testing page.
This fantastic tool will display the current robots.txt file it has found on your website, as well as any errors and warnings that might be present. Simply replace the text with your new robots.txt file, then click the ‘Test’ button at the bottom right of the screen.
If your text is valid, you should receive an ‘Allowed’ message in green text.
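If you’d like a second opinion before uploading, a draft can also be checked locally. The sketch below is not Google’s tester; it is a hypothetical helper built on Python’s standard-library urllib.robotparser, which supports plain path prefixes but not every extension Google understands. You list the outcomes you expect and it reports any that don’t hold. The function name and sample paths are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def check_draft(draft: str, expectations: dict) -> list:
    """Return a note for every (agent, path) pair whose outcome differs from
    what was expected; an empty list means the draft behaves as intended."""
    parser = RobotFileParser()
    parser.parse(draft.splitlines())
    problems = []
    for (agent, path), should_allow in expectations.items():
        if parser.can_fetch(agent, path) != should_allow:
            wanted = "allowed" if should_allow else "blocked"
            problems.append(f"{agent} {path}: expected {wanted}")
    return problems


draft = "User-agent: *\nDisallow: /wp-admin/\n"
expected = {
    ("Googlebot", "/wp-admin/"): False,    # backend should stay uncrawled
    ("Googlebot", "/blog/my-post"): True,  # key content must stay crawlable
}
print(check_draft(draft, expected))  # []
```

Keeping a list like this alongside the file makes it harder for a later edit to block an important page by accident.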
Now, your nifty little robots.txt file is ready to upload to the web. Jump back over to the root directory and upload your robots.txt file.
Make sure when uploading robots.txt files to place them in the top-level root directory or folder on your website. While you might think bots such as Googlebot could find your robots.txt no matter the location, that is most definitely not the case.
When a robot visits your site, it will only look for robots.txt in the root directory. If the file isn’t there, the robot will decide that your site does not use robots.txt and crawl your entire site with reckless abandon. So, ensure your hard work pays off by placing your robots.txt in the root directory for easy access by bots and crawlers.
Voila! You’ve just set up your website with a remarkable robots.txt file that’ll help you control which pages of content are crawled and indexed by bots across the web, including Google.
Over to you
Now that you’ve successfully uploaded a perfectly optimized robots.txt file, it’s important to maintain it so that, long-term, it stays up to date with the latest changes on your website. What’s more, with search engines like Google and other web crawlers regularly changing how they crawl and index, you’ll need to check your robots.txt file routinely.
While you could totally do this by yourself, working with a specialized team can help dramatically elevate your technical SEO alongside other channels including socials and PPC. That’s where we come in!
For eligible businesses, our team of experienced Gurus will happily conduct a comprehensive FREE 50+ page Digital Audit covering essential channels including SEO, PPC, Facebook and more.
What’s more, we’ll work with you over a fantastic strategy session, then create a 6-month multichannel game plan where we outline step-by-step instructions and tools to help you unlock immediate digital growth. It’s time to take control of your digital strategy and build long-lasting growth for your brand. Work with the Online Marketing Gurus today.