WordPress Tutorial: Block search engines from crawling part of the website through the robots protocol

888u

Last updated: 2024-05-11, edited by 888u

When building a website, you will sometimes want to keep search engines away from certain pages. You can do this by placing a robots.txt file in the site's root directory, either to block particular search engines or to define which paths crawlers may and may not fetch. The Robots protocol (also called the crawler protocol or robot protocol) is formally named the "Robots Exclusion Protocol". A website uses it to tell search engines which pages may be crawled and which may not.

1. Description of how to write the Robots protocol

  • User-agent: * The rules that follow apply to every search engine crawler; * is a wildcard character. Name a specific crawler instead (for example Baiduspider) to target only that one;
  • Disallow: /admin/ prohibits crawling everything under the admin directory;
  • Disallow: /require/ prohibits crawling everything under the require directory;
  • Disallow: /ABC/ prohibits crawling everything under the ABC directory;
  • Disallow: /cgi-bin/*.htm prohibits access to all URLs under the /cgi-bin/ directory (including subdirectories) whose path contains ".htm"; append $ to match only URLs that end in ".htm";
  • Disallow: /*? prohibits access to all URLs on the website that contain a question mark (?). Note that Disallow: /? without the * only matches URLs that begin with "/?";
  • Disallow: /*.jpg$ prohibits crawling all URLs that end in .jpg, that is, all .jpg images on the site;
  • Disallow: /ab/adc.html prohibits crawling the adc.html file under the ab folder;
  • Allow: /cgi-bin/ allows crawling everything under the cgi-bin directory;
  • Allow: /tmp allows crawling every URL whose path begins with /tmp, including the entire tmp directory;
  • Allow: /*.htm$ allows access to URLs that end in ".htm"; on its own it does not block any other URL;
  • Allow: /*.gif$ allows crawling GIF images;
  • Sitemap: followed by the sitemap's full URL, tells the crawler where the sitemap is located;
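The wildcard rules in the list above can be checked with a small matcher. The following is a minimal Python sketch, not the matching code of any particular search engine. It assumes the common interpretation in which a pattern is matched from the start of the URL path, * stands for any run of characters, and a trailing $ anchors the end of the URL. The function names rule_to_regex and rule_matches are hypothetical names chosen for this illustration.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a trailing
    '$' anchors the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path (including any query string) matches."""
    # re.match anchors at the start, which gives prefix matching.
    return rule_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    checks = [
        ("/admin/", "/admin/login.php"),
        ("/*.jpg$", "/images/photo.jpg"),
        ("/*.jpg$", "/images/photo.jpg?size=large"),
        ("/*?", "/post/1?replytocom=5"),
        ("/?", "/post/1?replytocom=5"),
        ("/tmp", "/tmpfile.html"),
    ]
    for pattern, path in checks:
        print(f"{pattern!r:12} {path!r:34} -> {rule_matches(pattern, path)}")
```

Under these assumptions, /*? matches any URL containing a question mark while /? only matches URLs that start with "/?", and /tmp also matches /tmpfile.html because rules are prefix matches.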

2. Robots protocol example

Example 1. Block all search engines from accessing any part of the website

User-agent: *
Disallow: /

Example 2. Allow all robots full access (an empty "/robots.txt" file has the same effect)

User-agent: *
Allow: /

Example 3. Block a specific search engine

User-agent: BadBot
Disallow: /

Example 4. Explicitly allow a specific search engine (to admit only this crawler, combine it with the rule from Example 1)

User-agent: Baiduspider
Allow: /
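The examples above can be tried out locally before uploading a robots.txt file. The following minimal sketch uses Python's standard urllib.robotparser module to evaluate Example 1, Example 3, and an empty file. The domain example.com and the helper name build_parser are placeholders for this illustration. Note that this standard-library parser uses plain prefix matching and, as far as I know, does not interpret * or $ inside paths, so it is only used here with rules that have no wildcards in the path.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example 1: block every crawler from the whole site.
block_all = build_parser("User-agent: *\nDisallow: /\n")

# Example 3: block only the crawler that identifies itself as BadBot.
block_one = build_parser("User-agent: BadBot\nDisallow: /\n")

# An empty robots.txt places no restrictions on any crawler.
empty_file = build_parser("")

url = "https://example.com/any/page"  # placeholder URL
print(block_all.can_fetch("Baiduspider", url))   # False
print(block_one.can_fetch("BadBot", url))        # False
print(block_one.can_fetch("Baiduspider", url))   # True
print(empty_file.can_fetch("BadBot", url))       # True
```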

3. Robots protocol used by this site

User-agent: *
Disallow: /wp-*/
Disallow: /*?connect=*
Disallow: /date/*
Disallow: /kod/*
Disallow: /api/*
Disallow: /*/trackback
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*?replytocom*
Disallow: /comments/
Disallow: /*/comments/
Disallow: /feed/*
Disallow: /*/*/feed/*
Disallow: /*/*/*/feed/*
Disallow: /articles/*
Disallow: /shuoshuo/*
Sitemap: https://yun.hrdtx.com/sitemap.xml
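To confirm that a crawler can discover the Sitemap line, the file can be parsed and the declared sitemap URLs listed. The sketch below is a hedged illustration that uses RobotFileParser.site_maps() from Python's standard library (available from Python 3.8). The text in SITE_RULES is a shortened excerpt of the rules above, kept to lines without wildcards because this parser only does prefix matching; the sample comment URL is made up for the check.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the site's rules (wildcard lines left out on purpose).
SITE_RULES = """\
User-agent: *
Disallow: /comments/
Sitemap: https://yun.hrdtx.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SITE_RULES.splitlines())

# site_maps() returns the list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# A hypothetical URL under /comments/ should be refused for every crawler.
blocked = parser.can_fetch("*", "https://yun.hrdtx.com/comments/page-1")
print(blocked)
```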
