What is the best way to scrape multiple domains with Scrapy?

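I have around 10 assorted websites that I want to scrape. A couple of them are WordPress blogs that follow the same HTML structure, albeit with different classes. The others are forums or blogs in other formats.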

The information I want to scrape is common to all of them: post content, timestamp, author, title and comments.

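My question is, must I create a separate spider for each domain? If not, how can I create a generic spider that lets me scrape by loading options from a configuration file or something similar?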

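I figure I could load the XPath expressions from a file whose location is passed on the command line, but that gets awkward because some domains require a regex on top of the XPath, as in select(expression_here).re(regex), while others do not.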
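
One way to handle the "regex for some domains, none for others" case is to make the regex an optional key in the per-domain config. The following is only a rough sketch under assumptions: the domains, field names and XPath expressions are made-up placeholders, `load_rules`, `rules_for` and `extract_fields` are hypothetical helper names, and `selector` is assumed to behave like a Scrapy selector (an `.xpath()` method whose result offers `.re()` and `.getall()`).

```python
import json
from urllib.parse import urlparse

# Hypothetical per-domain rules. In practice this text would be read from
# the config file whose path is given on the command line.
RULES_JSON = """
{
  "blog-one.example": {
    "title":  {"xpath": "//h1[@class='entry-title']/text()"},
    "author": {"xpath": "//span[@class='author']/text()", "regex": "by (.+)"}
  },
  "forum-two.example": {
    "title":  {"xpath": "//div[@id='topic']/h2/text()"}
  }
}
"""


def load_rules(text):
    """Parse the JSON config into a {domain: {field: rule}} dict."""
    return json.loads(text)


def rules_for(url, rules):
    """Return the field rules for the URL's domain, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return rules.get(host, {})


def extract_fields(selector, field_rules):
    """Apply each rule's XPath, then its regex only if the rule defines one."""
    item = {}
    for field, rule in field_rules.items():
        nodes = selector.xpath(rule["xpath"])
        if "regex" in rule:
            item[field] = nodes.re(rule["regex"])
        else:
            item[field] = nodes.getall()
    return item
```

Inside a spider callback this would be called roughly as `extract_fields(response, rules_for(response.url, rules))`, so a single spider can serve every domain that has an entry in the config.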

2011-03-31 goh

You should use BeautifulSoup, especially if you are using Python. It lets you find elements in the page and extract text using regular expressions.

2011-04-01 01:17:01

I do more or less the same thing using the following XPath expressions:

'/html/head/title/text()' for the title
//p[string-length(text()) > 150]/text() for the post content.

2011-06-03 21:13:54

In the Scrapy spider, set allowed_domains to a list of domains, for example:

from scrapy.spiders import CrawlSpider

class YourSpider(CrawlSpider):
    allowed_domains = ['domain1.com', 'domain2.com']

Hope it helps.

2011-06-11 05:18:19 llazzaro

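You can use an empty allowed_domains attribute to tell Scrapy not to filter out any off-site requests. In that case, though, you must be careful to return only the requests that are relevant to your spider.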
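
As a rough illustration of "return only the relevant requests": with off-site filtering turned off, the spider can check each candidate link itself before following it. This is only a sketch; `is_wanted` and `filter_links` are hypothetical helper names, and the domains are the placeholders used elsewhere in this thread.

```python
from urllib.parse import urlparse


def is_wanted(url, wanted_domains):
    """True if the URL's host is one of wanted_domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in wanted_domains)


def filter_links(links, wanted_domains):
    """Keep only the links that point at a wanted domain, preserving order."""
    return [url for url in links if is_wanted(url, wanted_domains)]


wanted = {"domain1.com", "domain2.com"}
candidates = [
    "http://domain1.com/post/1",
    "http://blog.domain2.com/archive",
    "http://other.com/page",
]
relevant = filter_links(candidates, wanted)
```

In a spider callback, only the URLs in `relevant` would be turned into new requests.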

2011-12-23 22:38:22

Well, I faced the same problem, so I create the spider classes dynamically using type():

from urllib.parse import urlparse  # Python 3; the Python 2 module was `urlparse`

from scrapy.spiders import CrawlSpider  # formerly scrapy.contrib.spiders


class GenericSpider(CrawlSpider):
    """A generic spider; uses type() to make a new spider class for each domain."""
    name = 'generic'
    allowed_domains = []
    start_urls = []

    @classmethod
    def create(cls, link):
        domain = urlparse(link).netloc.lower()
        # Generate a class name such that the domain www.google.com
        # results in the class name GoogleComGenericSpider.
        bare = domain[4:] if domain.startswith('www.') else domain
        class_name = bare.title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain,
        })

So, to create a spider for 'http://www.google.com', I just do:

In [3]: google_spider = GenericSpider.create('http://www.google.com') 

In [4]: google_spider 
Out[4]: __main__.GoogleComGenericSpider 

In [5]: google_spider.name 
Out[5]: 'www.google.com'
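
The same idea extends to a whole list of start URLs. Below is a minimal sketch of that, not the code from this answer: `StandInSpider` is a hypothetical stand-in for the Scrapy base class so the snippet runs without Scrapy installed, `spider_class_for` is a hypothetical free-function version of the factory, and the second URL is a placeholder.

```python
from urllib.parse import urlparse


class StandInSpider:
    """Stand-in for a Scrapy spider base class (assumption, for illustration)."""
    name = 'generic'
    allowed_domains = []
    start_urls = []


def spider_class_for(base, link):
    """Build a subclass of `base` whose attributes are bound to one link."""
    domain = urlparse(link).netloc.lower()
    bare = domain[4:] if domain.startswith('www.') else domain
    class_name = bare.title().replace('.', '') + base.__name__
    return type(class_name, (base,), {
        'allowed_domains': [domain],
        'start_urls': [link],
        'name': domain,
    })


links = ['http://www.google.com', 'http://blog.example.org/feed']

# One spider class per link, keyed by spider name (the domain).
spider_classes = {}
for link in links:
    spider_cls = spider_class_for(StandInSpider, link)
    spider_classes[spider_cls.name] = spider_cls
```

Each generated class carries its own `allowed_domains` and `start_urls`, while the base class attributes stay empty.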

Hope this helps.

2013-11-25 07:55:20 Optimus
