WordPress Tutorial: Using the Robots Protocol to Block Search Engines from Crawling Parts of a Website

Last updated: 2024-05-11, edited by 888u

When building a website, you will sometimes want to keep search engines away from certain pages. To do this, place a robots.txt file in the site's root directory; it can block specific search engines or define which files and paths crawlers may fetch. The Robots protocol (also known as the crawler protocol or robot protocol) is formally called the "Robots Exclusion Protocol". Websites use it to tell search engines which pages may be crawled and which may not.
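Because crawlers look for the file only at the root of a host, the robots.txt address can be derived from any page URL. The following is a minimal sketch of that lookup; the function name robots_url and the example.com address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://example.com/blog/post?id=1"))
```

A file placed anywhere else, for example under /blog/robots.txt, is not consulted by crawlers that follow the protocol.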

1. How to write Robots protocol rules

User-agent: * The asterisk is a wildcard, so the rules that follow apply to all search engine crawlers;
Disallow: /admin/ prohibits crawling anything under the /admin/ directory;
Disallow: /require/ prohibits crawling anything under the /require/ directory;
Disallow: /ABC/ prohibits crawling anything under the /ABC/ directory;
Disallow: /cgi-bin/*.htm prohibits access to all URLs with the suffix ".htm" (including subdirectories) in the /cgi-bin/ directory;
Disallow: /*?* prohibits access to all URLs on the website that contain a question mark (?);
Disallow: /*.jpg$ prohibits crawling all .jpg images on the website;
Disallow: /ab/adc.html prohibits crawling the adc.html file under the ab folder;
Allow: /cgi-bin/ allows crawling anything under the /cgi-bin/ directory;
Allow: /tmp allows crawling any URL whose path begins with /tmp;
Allow: /*.htm$ allows access to URLs ending in ".htm" (to allow only those URLs, combine it with Disallow: /);
Allow: /*.gif$ allows crawling GIF images;
Sitemap: followed by the sitemap URL, tells crawlers where the sitemap is located;
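
How a crawler applies such a path pattern can be sketched as follows, assuming the common interpretation in which a rule is a prefix match, * stands for any run of characters, and a trailing $ anchors the end of the URL. The helper names rule_to_regex and matches, and the sample paths, are made up for this illustration; real crawlers may differ in details.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the URL path (with query string) is covered by the pattern."""
    return rule_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    samples = [
        ("/admin/", "/admin/login.php"),
        ("/admin/", "/administrator"),
        ("/cgi-bin/*.htm", "/cgi-bin/sub/page.htm"),
        ("/tmp", "/tmp/cache/file.txt"),
        ("/*.jpg$", "/images/photo.jpg"),
        ("/*.jpg$", "/images/photo.jpg?size=large"),
    ]
    for pattern, path in samples:
        print(pattern, path, matches(pattern, path))
```

Without a trailing $, a rule matches any URL that begins with the pattern, which is why /tmp also covers /tmp/cache/file.txt.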

2. Robots protocol examples

Example 1. Block all search engines from accessing any part of the website

User-agent: *
Disallow: /

Example 2. Allow all robots full access (an empty "/robots.txt" file has the same effect)

User-agent: *
Allow: /

Example 3. Block a specific search engine

User-agent: BadBot
Disallow: /

Example 4. Allow a specific search engine (to block all other crawlers as well, add a "User-agent: *" group containing "Disallow: /")

User-agent: Baiduspider
Allow: /
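
The four examples can be checked with the urllib.robotparser module from Python's standard library. This is only a sketch: the crawler name OtherBot and the address https://example.com/page are placeholders, and this module does simple prefix matching, so it suits these wildcard-free examples rather than patterns containing * or $.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_1 = "User-agent: *\nDisallow: /"
EXAMPLE_2 = "User-agent: *\nAllow: /"
EXAMPLE_3 = "User-agent: BadBot\nDisallow: /"
EXAMPLE_4 = "User-agent: Baiduspider\nAllow: /"

URL = "https://example.com/page"  # placeholder address


def allowed(robots_txt: str, agent: str, url: str = URL) -> bool:
    """Parse robots.txt content and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    examples = {"1": EXAMPLE_1, "2": EXAMPLE_2, "3": EXAMPLE_3, "4": EXAMPLE_4}
    for number, text in examples.items():
        for agent in ("BadBot", "Baiduspider", "OtherBot"):
            print(f"Example {number}: {agent} allowed = {allowed(text, agent)}")
```

Running it shows that Example 3 blocks only BadBot, and that Example 4 on its own leaves every other crawler allowed as well, because no rule in the file applies to them.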

3. Robots protocol used by this site

User-agent: *
Disallow: /wp-*/
Disallow: /*?connect=*
Disallow: /date/*
Disallow: /kod/*
Disallow: /api/*
Disallow: /*/trackback
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*?replytocom*
Disallow: /comments/
Disallow: /*/comments/
Disallow: /feed/*
Disallow: /*/*/feed/*
Disallow: /*/*/*/feed/*
Disallow: /articles/*
Disallow: /shuoshuo/*
Sitemap: https://yun.hrdtx.com/sitemap.xml
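
To see what this rule set does in practice, the sketch below tests a few paths against the Disallow lines listed above. It assumes the usual wildcard semantics (* matches any run of characters, a trailing $ anchors the end), and because the file contains no Allow lines, a path counts as blocked as soon as any pattern matches. The sample paths and the function name first_blocking_rule are hypothetical.

```python
import re

# Disallow patterns copied from the robots.txt listing above.
SITE_DISALLOW = [
    "/wp-*/", "/*?connect=*", "/date/*", "/kod/*", "/api/*",
    "/*/trackback", "/*.js$", "/*.css$", "/*?replytocom*",
    "/comments/", "/*/comments/", "/feed/*", "/*/*/feed/*",
    "/*/*/*/feed/*", "/articles/*", "/shuoshuo/*",
]


def to_regex(pattern: str) -> re.Pattern:
    """Convert one robots.txt pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def first_blocking_rule(path: str, patterns=SITE_DISALLOW):
    """Return the first Disallow pattern that matches path, or None if allowed."""
    for pattern in patterns:
        if to_regex(pattern).match(path):
            return pattern
    return None


if __name__ == "__main__":
    for path in (
        "/wp-admin/admin-ajax.php",
        "/2024/05/hello-world/",
        "/hello-world/?replytocom=12",
        "/assets/app.js",
        "/category/news/feed/",
    ):
        print(path, "->", first_blocking_rule(path) or "allowed")
```

A check like this is a quick way to confirm that ordinary post URLs stay crawlable before the file is uploaded.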
