Conditional crawling in Scrapy: my HTML contains a number of divs that mostly share the same structure. Below is a snippet containing two such divs.
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<a href="www.xxxxxx.com"></a>
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<a href="www.yyyyyy.com"></a>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
Given the snippet above, here is what I want Scrapy to do:
If a div with class="outer-container" contains another div with title="verified", as in the first div above, the spider should follow the URL in that block (i.e. www.xxxxxx.com) and scrape some additional fields from that page.
If there is no div with title="verified", as in the second div above, it should scrape all the data under the div with class="mody" in place, i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyyy.com).
So how do I apply this condition/logic in a Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure....
Here is what I have tried so far:
from scrapy.spiders import CrawlSpider
from scrapy import Request
from scrapy.selector import Selector
from bs4 import BeautifulSoup
from myproject.items import NewsFields  # adjust to wherever NewsFields is defined

class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop="Address")
        loclity = soup.find_all("span", itemprop="Locality")
        region = soup.find_all("span", itemprop="Region")
        post = soup.find_all("span", itemprop="postalCode")
        nf['companyName'] = cName[0].get_text()
        nf['address'] = addrs[0].get_text()
        nf['locality'] = loclity[0].get_text()
        nf['state'] = region[0].get_text()
        nf['zipcode'] = post[0].get_text()
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the above code returns and crawls all the URLs under div class="inner-container", because no conditional crawling is specified in it: I don't know where/how to set the condition.
If anyone has done something similar before, please share. Thanks.