Check the robots.txt of a website
What is a robots.txt file?

Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.

Basic format:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Together, these two lines are considered a complete robots.txt file, though one robots file can contain multiple lines of user agents and directives (i.e., disallows, allows, crawl-delays, etc.). Within a robots.txt file, each set of user-agent directives appears as a discrete group, separated by a line break. In a robots.txt file with multiple user-agent directives, each disallow or allow rule only applies to the user-agent(s) specified in that particular line-break-separated group. If the file contains a rule that applies to more than one user-agent, a crawler will only pay attention to (and follow the directives in) the most specific group of instructions.

Here’s an example: if Msnbot, discobot, and Slurp are all called out specifically, those user-agents will only pay attention to the directives in their own sections of the robots.txt file. All other user-agents will follow the directives in the User-agent: * group.

Example robots.txt:

Here are a few examples of robots.txt in action for a www.example.com site:

Robots.txt file URL: www.example.com/robots.txt

Blocking all web crawlers from all content

User-agent: *
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.
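The two rule sets above can be sanity-checked programmatically. As an illustrative sketch (not something the article itself uses), Python’s standard-library urllib.robotparser can parse robots.txt rules and answer allow/disallow questions:

```python
from urllib import robotparser

# Sketch: parse the two example rule sets in memory rather than fetching them.
block_all = robotparser.RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

allow_all = robotparser.RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

print(block_all.can_fetch("*", "https://www.example.com/"))  # False: everything is blocked
print(allow_all.can_fetch("*", "https://www.example.com/"))  # True: empty Disallow allows all
```

Note that an empty Disallow line is treated as “allow everything,” which is why the two files behave so differently despite differing by a single character.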
Blocking a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages whose URL contains the string www.example.com/example-subfolder/.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.

How does robots.txt work?

Search engines have two main jobs:

1. Crawling the web to discover content
2. Indexing that content so it can be served up to searchers looking for information
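To see how a crawler picks its group of directives, here is a small sketch using Python’s standard-library urllib.robotparser (an illustrative tool choice; the URLs are the example.com placeholders from above). The Googlebot-only rule is combined with a catch-all group that allows everything:

```python
from urllib import robotparser

# Googlebot gets its own group; every other crawler falls back to "*".
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /example-subfolder/",
    "",
    "User-agent: *",
    "Disallow:",
])

url = "https://www.example.com/example-subfolder/page.html"
print(rp.can_fetch("Googlebot", url))  # False: Googlebot's group blocks the subfolder
print(rp.can_fetch("Bingbot", url))    # True: Bingbot falls back to the catch-all group
```

This mirrors the group-selection rule described above: a crawler follows only the most specific group that names it, ignoring the rest of the file.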
To crawl sites, search engines follow links to get from one site to another, ultimately crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.” After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file before continuing through the page. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), the crawler will proceed to crawl other information on the site.

Other quick robots.txt must-knows (discussed in more detail below):
Technical robots.txt syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:

- User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine crawler).
- Disallow: The command used to tell a user-agent not to crawl a particular URL path.
- Allow: The command used to tell a crawler it may access a page or subfolder even though its parent folder is disallowed (honored by Googlebot).
- Crawl-delay: How many seconds a crawler should wait between requests. Note that Googlebot does not acknowledge this directive.
- Sitemap: Used to call out the location of any XML sitemap(s) associated with the site.
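Putting the five terms together, a hypothetical file might look like this (all paths and the sitemap URL are invented for illustration):

```text
User-agent: Googlebot
Allow: /blog/public/
Disallow: /blog/

User-agent: *
Crawl-delay: 10
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap.xml
```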
Pattern-matching

When it comes to the actual URLs to block or allow, robots.txt files can get fairly complex, as they allow the use of pattern-matching to cover a range of possible URL options. Google and Bing both honor two wildcard characters that can be used to identify pages or subfolders that an SEO wants excluded: the asterisk (*), which matches any sequence of characters, and the dollar sign ($), which matches the end of a URL.
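For instance, a hypothetical rule set using both characters might look like this (the paths are invented for illustration):

```text
User-agent: *
# Block any path that begins with "/private", e.g. /private/ or /private-files/
Disallow: /private*/
# Block any URL whose path ends in .pdf
Disallow: /*.pdf$
```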
Google offers a great list of possible pattern-matching syntax and examples here.

Where does robots.txt go on a site?

Whenever they come to a site, search engines and other web-crawling robots (like Facebook’s crawler, Facebot) know to look for a robots.txt file. But they’ll only look for that file in one specific place: the main directory (typically your root domain or homepage). If a user agent visits www.example.com/robots.txt and does not find a robots file there, it will assume the site does not have one and proceed with crawling everything on the page (and maybe even on the entire site). Even if the robots.txt page did exist at, say, example.com/index/robots.txt or www.example.com/homepage/robots.txt, it would not be discovered by user agents, and the site would be treated as if it had no robots file at all. To ensure your robots.txt file is found, always include it in your main directory or root domain.

Why do you need robots.txt?

Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site, there are some situations in which a robots.txt file can be very handy. Some common use cases include:

- Keeping entire sections of a website private (for instance, a staging site)
- Keeping internal search results pages from showing up in public search results
- Preventing search engines from indexing certain files on your site (images, PDFs, etc.)
- Specifying a crawl delay to prevent your servers from being overloaded when crawlers load many pieces of content at once
- Specifying the location of your sitemap(s)
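Because the file lives only at the host root, the lookup URL can be derived mechanically from any page URL. A minimal Python sketch (the robots_url helper name is invented for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the only location where crawlers will look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post?id=7"))
# https://www.example.com/robots.txt
```

Note that the path of the page being crawled never matters: example.com/blog/ and example.com/shop/ both map to the same example.com/robots.txt.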
If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.

Checking if you have a robots.txt file

Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL. For instance, Moz’s robots file is located at moz.com/robots.txt. If no .txt page appears, you do not currently have a (live) robots.txt page.

How to create a robots.txt file

If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. This article from Google walks through the robots.txt file creation process, and this tool allows you to test whether your file is set up correctly. Looking for some practice creating robots files? This blog post walks through some interactive examples.

SEO best practices
Robots.txt vs meta robots vs x-robots

So many robots! What’s the difference between these three types of robot instructions? First off, robots.txt is an actual text file, whereas meta robots and x-robots are meta directives. Beyond what they actually are, the three serve different functions: robots.txt dictates site- or directory-wide crawl behavior, whereas meta robots and x-robots can dictate indexation behavior at the individual page (or page element) level.
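For contrast with the site-level robots.txt file, a page-level meta robots directive lives in the HTML of the individual page:

```html
<!-- Meta robots: placed in the <head> of an individual page -->
<meta name="robots" content="noindex, nofollow">
```

The equivalent x-robots directive is sent as an HTTP response header instead, e.g. X-Robots-Tag: noindex, which is useful for non-HTML resources such as PDFs or images that have no head section to carry a meta tag.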
What tools can be used for testing a robots.txt file?

You can use the robots.txt testing tool in Google’s Webmaster Tools, found within the Crawl section. There you’ll see the current robots.txt file, and can test new URLs to see whether they’re disallowed for crawling.
What is a robots.txt file in a website?

A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.