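Robots.txt - What is the proper format for a Crawl-delay with multiple user agents?

Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each one. The Crawl-delay values are for illustration only and would be different in a real robots.txt file.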

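I have searched all over the web for a proper answer but could not find one. There are too many mixed suggestions, and I do not know which is the correct method.

Questions:

(1) Can each user agent have its own Crawl-delay? (I assume yes.)

(2) Where do you put the Crawl-delay line for each user agent, before or after the Allow / Disallow lines?

(3) Does there have to be a blank line between each user agent group?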

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I am trying to find out how the final robots.txt file should look, using the values in the sample below.

Thanks in advance.

# Allow only major search spiders  
User-agent: Mediapartners-Google 
Disallow: 
Crawl-delay: 11 

User-agent: Googlebot 
Disallow: 
Crawl-delay: 12 

User-agent: Adsbot-Google 
Disallow: 
Crawl-delay: 13 

User-agent: Googlebot-Image 
Disallow: 
Crawl-delay: 14 

User-agent: Googlebot-Mobile 
Disallow: 
Crawl-delay: 15 

User-agent: MSNBot 
Disallow: 
Crawl-delay: 16 

User-agent: bingbot 
Disallow: 
Crawl-delay: 17 

User-agent: Slurp 
Disallow: 
Crawl-delay: 18 

User-agent: Yahoo! Slurp 
Disallow: 
Crawl-delay: 19 

# Block all other spiders 
User-agent: * 
Disallow:/

# Block Directories for all spiders 
User-agent: * 
Disallow: /ads/ 
Disallow: /cgi-bin/ 
Disallow: /scripts/

(4) If I want all user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders 
User-agent: * 
Crawl-delay: 10 

User-agent: Mediapartners-Google 
Disallow: 

User-agent: Googlebot 
Disallow: 

User-agent: Adsbot-Google 
Disallow: 

User-agent: Googlebot-Image 
Disallow: 

User-agent: Googlebot-Mobile 
Disallow: 

User-agent: MSNBot 
Disallow: 

User-agent: bingbot 
Disallow: 

User-agent: Slurp 
Disallow: 

User-agent: Yahoo! Slurp 
Disallow: 

# Block all other spiders 
User-agent: * 
Disallow:/

# Block Directories for all spiders 
User-agent: * 
Disallow: /ads/ 
Disallow: /cgi-bin/ 
Disallow: /scripts/

Source

2013-06-29 Sammy

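(1) Can each user agent have its own Crawl-delay?

Yes. Each record, which starts with one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification. Including it is not a problem, though: parsers that understand it will use it, and for the others the spec defines:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.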

(2) Where do you put the Crawl-delay line for each user agent, before or after the Allow / Disallow lines?

It doesn't matter.
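Points (1) and (2) can be checked with a small sketch. It assumes Python 3.6 or later and uses the standard library's urllib.robotparser as one example of a parser that understands Crawl-delay; real crawlers may behave differently. The two-record file is a shortened, hypothetical version of the sample in the question, with the Crawl-delay line placed after Disallow in one record and before it in the other.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two records, each with its own Crawl-delay.
# The first record puts Crawl-delay after Disallow, the second before it.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: bingbot
Crawl-delay: 17
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Each record keeps its own delay, regardless of where the line sits.
googlebot_delay = parser.crawl_delay("Googlebot")
bingbot_delay = parser.crawl_delay("bingbot")
print(googlebot_delay, bingbot_delay)
```

With this parser, both delays are read back as written, so the position of the line inside the record makes no difference.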

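(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL).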

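(4) If I want all user agents to have a crawl delay of 10 seconds, would the following be correct?

No. Bots look for the record that matches their user agent. Only if they do not find one do they fall back to the User-agent: * record. So in your example, all the listed bots (Googlebot, MSNBot, Yahoo! Slurp, and so on) will have no Crawl-delay.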
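A minimal sketch of that fallback rule, assuming Python 3.6 or later and its standard urllib.robotparser (one parser among many, not a statement about any particular crawler). The file is a shortened version of the second sample in the question, and SomeOtherBot is a made-up name for a bot with no record of its own.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the question's second sample:
# the delay is set only in the "*" record.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own record, which has no Crawl-delay line.
named_delay = parser.crawl_delay("Googlebot")
# "SomeOtherBot" (hypothetical) has no record, so the "*" record applies.
other_delay = parser.crawl_delay("SomeOtherBot")
print(named_delay, other_delay)
```

Here the named bot gets no delay at all (None), while only the unlisted bot picks up the 10-second value.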

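Also note that you can't have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So parsers might look (if no other record matched) for the first record with User-agent: * and ignore the following ones. For your first example, that would mean that URLs beginning with /ads/, /cgi-bin/ and /scripts/ are not blocked.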
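The effect on the first sample can be sketched as follows, again assuming Python's standard urllib.robotparser, which (as far as its implementation goes) keeps only the first User-agent: * record. The file is a shortened version of the first sample, and the URLs and the SomeOtherBot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the question's first sample, with two "*" records.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

User-agent: *
Disallow: /ads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot uses its own record (empty Disallow), so /ads/ is NOT blocked.
googlebot_ads = parser.can_fetch("Googlebot", "/ads/banner.html")
# An unlisted bot falls back to the first "*" record and is blocked everywhere.
other_home = parser.can_fetch("SomeOtherBot", "/index.html")
print(googlebot_ads, other_home)
```

The /ads/ rule never reaches Googlebot, because Googlebot never consults a User-agent: * record.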

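And even if you had only one record with User-agent: *, those Disallow lines would apply only to bots that match no other record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you would have to repeat the Disallow lines in every record.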
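One way to avoid repeating the lines by hand is to generate the file. The sketch below is only an illustration: build_robots_txt is a hypothetical helper, the bot list is cut down to two entries with the delays from the question, and the result is checked with Python's standard urllib.robotparser (Python 3.6 or later assumed).

```python
from urllib.robotparser import RobotFileParser

# Paths from the question that should be blocked for every spider.
SHARED_BLOCKS = ["/ads/", "/cgi-bin/", "/scripts/"]
# Two of the question's bots with their illustrative delays.
ALLOWED_BOTS = {"Googlebot": 12, "bingbot": 17}


def build_robots_txt(bots, blocked_paths):
    """Hypothetical helper: one record per bot, Disallow lines repeated in each."""
    records = []
    for name, delay in bots.items():
        lines = [f"User-agent: {name}"]
        lines += [f"Disallow: {path}" for path in blocked_paths]
        lines.append(f"Crawl-delay: {delay}")
        records.append("\n".join(lines))
    # A single "*" record that blocks every other spider.
    records.append("User-agent: *\nDisallow: /")
    # Records are separated by one blank line.
    return "\n\n".join(records) + "\n"


robots_txt = build_robots_txt(ALLOWED_BOTS, SHARED_BLOCKS)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

ads_blocked = not parser.can_fetch("Googlebot", "/ads/banner.html")
home_allowed = parser.can_fetch("Googlebot", "/index.html")
bing_delay = parser.crawl_delay("bingbot")
others_blocked = not parser.can_fetch("SomeOtherBot", "/index.html")
```

Each named record now carries both its own Crawl-delay and the shared Disallow lines, and there is exactly one User-agent: * record.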

Source

2013-07-14 10:00:32 unor
