Last updated: 2024-05-11, edited by 888u
When building a website, there are inevitably pages that you do not want certain search engines to crawl. You can place a robots.txt file in the site's root directory to block search engines entirely or to define which files they may crawl and under which rules. The Robots protocol (also known as the crawler protocol or robot protocol), whose full name is the "Robots Exclusion Protocol", is how a website tells search engines which pages may be crawled and which may not.
1. How to write the Robots protocol rules
- User-agent: * The * here stands for all types of search engine crawlers; * is a wildcard character;
- Disallow: /admin/ prohibits crawling the directories under the admin directory;
- Disallow: /require/ prohibits crawling the directories under the require directory;
- Disallow: /ABC/ prohibits crawling the directories under the ABC directory;
- Disallow: /cgi-bin/*.htm prohibits access to all URLs in the /cgi-bin/ directory (including subdirectories) whose names end in ".htm";
- Disallow: /*?* prohibits access to every URL on the site that contains a question mark (?);
- Disallow: /*.jpg$ prohibits crawling all images on the site in .jpg format;
- Disallow: /ab/adc.html prohibits crawling the adc.html file in the ab directory;
- Allow: /cgi-bin/ allows crawling the directories under the cgi-bin directory;
- Allow: /tmp allows crawling the entire tmp directory;
- Allow: /*.htm$ allows access only to URLs ending in ".htm";
- Allow: /*.gif$ allows crawling of web pages and images in .gif format;
- Sitemap: sitemap address tells crawlers where the site's sitemap is located; a short sketch for testing rules like these follows this list.
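To check how a crawler actually interprets such rules, here is a minimal sketch using Python's standard urllib.robotparser module. Note that this standard-library parser only implements the plain path prefixes of the original exclusion protocol; the * and $ wildcard extensions listed above are interpreted by the major search engines' own crawlers, so the sketch sticks to simple directory rules and a hypothetical example.com site.

from urllib.robotparser import RobotFileParser

# Hypothetical rules using plain path prefixes only (no * or $),
# since urllib.robotparser does not implement wildcard matching.
rules = """
User-agent: *
Disallow: /admin/
Disallow: /require/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/admin/login.html"))  # False, under /admin/
print(rp.can_fetch("*", "https://example.com/index.html"))        # True, covered by Allow: /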
2. Robots protocol examples
Example 1. Block all search engines from accessing any part of the website
User-agent: *
Disallow: /
Example 2. Allow all robots to access the whole site (alternatively, you can just create an empty /robots.txt file)
User-agent: *
Allow: /
Example 3. Block a specific search engine (here, a crawler named BadBot)
User-agent: BadBot
Disallow: /
Example 4. Allow a specific search engine (here, Baiduspider)
User-agent: Baiduspider
Allow: /
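How a crawler picks between groups such as those in Examples 3 and 4 can be verified with the same standard-library parser. The file below is a hypothetical combination of the two examples: BadBot is blocked entirely while Baiduspider is explicitly allowed.

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining Examples 3 and 4.
rules = """
User-agent: BadBot
Disallow: /

User-agent: Baiduspider
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadBot", "https://example.com/page.html"))       # False, BadBot group applies
print(rp.can_fetch("Baiduspider", "https://example.com/page.html"))  # True, Baiduspider group applies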
3. Robots protocol used by this site
User-agent: *
Disallow: /wp-*/
Disallow: /*?connect=*
Disallow: /date/*
Disallow: /kod/*
Disallow: /api/*
Disallow: /*/trackback
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*?replytocom*
Disallow: /comments/
Disallow: /*/comments/
Disallow: /feed/*
Disallow: /*/*/feed/*
Disallow: /*/*/*/feed/*
Disallow: /articles/*
Disallow: /shuoshuo/*
Sitemap: https://yun.hrdtx.com/sitemap.xml
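For a live file such as the one above, the same parser can fetch it directly over HTTP. Because this site's rules rely on the * and $ wildcard extensions, which urllib.robotparser does not implement, the sketch below only reads the file and prints its Sitemap entry; wildcard rules are better verified with the search engines' own robots.txt testing tools.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yun.hrdtx.com/robots.txt")
rp.read()              # download and parse the live robots.txt

# site_maps() (Python 3.8+) returns the Sitemap URLs declared in the file,
# which for this site should include https://yun.hrdtx.com/sitemap.xml
print(rp.site_maps())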