Crawling through a website using href references

I'm using Scrapy and I want to scrape www.rentler.com. I've gone to the site and searched for the city I'm interested in; here is the link to the search results:

https://www.rentler.com/search?Location=millcreek&MaxPrice=

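For now, all I'm interested in are the listings contained on that page, and I want to step through them one by one, recursively.

Each listing sits under: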

<body>/<div id="wrap">/<div class="container search-res">/<ul class="search-results"><li class="result">

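Each result has an <a class="search-result-link" href="/listing/288910">

I know I need to create a rule for the CrawlSpider that makes it look at the href and append it to the URL. That way it can visit each page and grab the data I'm interested in.

I think I need something like this: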

rules = (Rule(SgmlLinkExtractor(allow="not sure what to insert here, but this is where I think I need to href appending"), callback='parse_item', follow=True),)
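As a rough illustration of the two ideas in this question, here is a standard-library sketch (no Scrapy involved) of joining a relative href onto the site URL and of the kind of regular expression that could go into `allow`. The pattern `/listing/\d+` and the helper name `is_listing` are assumptions based on the sample href above, not something taken from the site's documentation.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.rentler.com/search?Location=millcreek&MaxPrice="

# Assumed pattern: listing pages look like /listing/<digits>, as in the sample href.
LISTING_PATTERN = re.compile(r"/listing/\d+")


def is_listing(url):
    """Return True if the URL looks like an individual listing page."""
    return LISTING_PATTERN.search(url) is not None


# urljoin resolves a relative href against the page it was found on.
absolute = urljoin(BASE_URL, "/listing/288910")
print(absolute)              # https://www.rentler.com/listing/288910
print(is_listing(absolute))  # True
print(is_listing(BASE_URL))  # False
```

As far as I know, Scrapy's link extractors resolve relative hrefs themselves and match `allow` against the absolute URL, so a pattern along these lines is the sort of thing that would go in the rule.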

UPDATE: Thanks for your input. Here is what I have now; it seems to run, but it doesn't scrape:

import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from KSL.items import KSLitem

class KSL(CrawlSpider):
    name = "ksl"
    allowed_domains = ["https://www.rentler.com"]
    start_urls = ["https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"]
    regex_pattern = '<a href="listing/(.*?) class="search-result-link">'

    def parse_item(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = re.findall(regex_pattern, "https://www.rentler.com/search?location=millcreek&MaxPrice=")

        for site in sites:
            item = KSLitem()
            item['price'] = site.select('//div[@class="price"]/text()').extract()
            item['address'] = site.select('//div[@class="address"]/text()').extract()
            item['stats'] = site.select('//ul[@class="basic-stats"]/li/div[@class="count"]/text()').extract()
            item['description'] = site.select('//div[@class="description"]/div/p/text()').extract()
            items.append(item)
        return items

Any thoughts?


2013-10-17 SMPLGRP

If you need to extract data from an HTML file, as is the case here, I'd recommend BeautifulSoup; it's very easy to install and use:

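from bs4 import BeautifulSoup

# `html` holds the page source as a string; a one-line sample is used here
html = '<a class="search-result-link" href="/listing/288910">Listing</a>'

bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])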

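This small script gets every href found inside an <a> HTML tag.

Edit: a fully working script:

I tested this on my computer and it gave the expected results. BeautifulSoup only needs plain HTML, and from that you can scrape whatever you need. Take a look at this code: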

import requests
from bs4 import BeautifulSoup

# Download the search page and parse it with the built-in HTML parser
html = requests.get(
    'https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html, 'html.parser')
possible_links = bs.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])

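That only shows how to pull the hrefs out of the HTML page you want to scrape. Of course you can use it inside Scrapy; as I said, BeautifulSoup only needs plain HTML, which is why I use requests.get(url).text. So I assume Scrapy can pass the plain HTML to BeautifulSoup.

Edit 2: OK, look, I don't think you need Scrapy at all. If the previous script gets you all the links you want to pull data from, you just need to do something like this:

Suppose I have a list of valid URLs from which I want specific data such as price, acres, and address. You can use the previous script, but instead of printing the URLs to the screen, append them to a list, keeping only the ones that start with /listing/. That gives you a list of valid URLs.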

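import requests
from bs4 import BeautifulSoup

# valid_urls: the list of absolute listing URLs collected earlier
for url in valid_urls:
    bs = BeautifulSoup(requests.get(url).text, 'html.parser')
    price = bs.find('span', {'class': 'amount'}).text
    print(price)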

Just look at the page source and you'll get an idea of how to scrape the data you need from each URL.
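The filtering step described above (collect every href, keep only those that start with /listing/, and turn them into full URLs) could be sketched roughly as follows. To keep it self-contained, this uses the standard library's html.parser in place of BeautifulSoup and a small hard-coded HTML sample in place of a live request; `SAMPLE_HTML`, `HrefCollector`, and `listing_urls` are names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.rentler.com/"

# Hypothetical stand-in for the downloaded search page.
SAMPLE_HTML = """
<ul class="search-results">
  <li class="result"><a class="search-result-link" href="/listing/288910">A</a></li>
  <li class="result"><a class="search-result-link" href="/listing/288911">B</a></li>
  <li><a href="/about">About</a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def listing_urls(html, base_url=BASE_URL):
    """Return absolute URLs for the hrefs that start with /listing/."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs
            if href.startswith("/listing/")]


valid_urls = listing_urls(SAMPLE_HTML)
print(valid_urls)
```

The resulting `valid_urls` list is what a loop over the individual listing pages would iterate over.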


2013-10-17 14:19:47 PepperoniPizza

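I have no experience with BeautifulSoup. Does it run inside Scrapy? I've added new code above; would you still recommend BeautifulSoup? Thanks. @PepperoniPizza – SMPLGRP

@benknighthorse take a look at this new example, try it on your computer, and see the results. – PepperoniPizza

This is great, @PepperoniPizza. I ran the script and it works as expected. Now I need to add it to Scrapy and feed it these results. I don't know how or where to start. Can you give me a pointer or a place to start? – SMPLGRP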

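You can use a regular expression to find all the rental home IDs in the links. From there, you can use those IDs to scrape each listing page instead.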

import re

# SOURCE_OF_THE_RENTLER_PAGE is a placeholder for the HTML text of the search page
regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, SOURCE_OF_THE_RENTLER_PAGE)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)

Edit: Here is a working version of the code that runs on its own. It prints all the link IDs. You can use it as is.

import re
from urllib.request import urlopen

url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="
# Download the page and decode the bytes so the str pattern can search it
page_source = urlopen(url_to_scrape).read().decode("utf-8", errors="replace")
regex_pattern = r'<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, page_source)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)


2013-10-17 14:29:43 GKBRK

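Thanks for the suggestion. I've added the code; it runs without errors but isn't scraping anything. Could you take a look? @GKBRK – SMPLGRP

I think I found the error, @benknighthorse. You put the link in re.findall(). You need to pass the page source instead. I don't know how that's done in Scrapy, but it probably isn't difficult. – GKBRK

Thanks for the quick reply, @GKBRK. What is SOURCE_OF_THE_RENTLER_PAGE? – SMPLGRP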
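To illustrate the point made in these comments: re.findall searches whatever string it is given, so passing the search URL itself finds nothing, while passing the HTML text of the page (which is what SOURCE_OF_THE_RENTLER_PAGE stands for) finds the IDs. Below is a minimal sketch using a hard-coded sample; `SAMPLE_PAGE_SOURCE` is made up for this example, and the pattern is a simplified assumption that matches on the href alone so that attribute order does not matter.

```python
import re

SEARCH_URL = "https://www.rentler.com/search?Location=millcreek&MaxPrice="

# Hypothetical fragment of the page source; a real run would download it first.
SAMPLE_PAGE_SOURCE = (
    '<li class="result"><a class="search-result-link" href="/listing/288910"></a></li>'
    '<li class="result"><a class="search-result-link" href="/listing/288911"></a></li>'
)

# Assumed pattern: capture the digits after /listing/ in any href attribute.
LISTING_ID_PATTERN = r'href="/listing/(\d+)"'

# Searching the URL string: the URL contains no <a> tags, so nothing matches.
ids_from_url = re.findall(LISTING_ID_PATTERN, SEARCH_URL)

# Searching the page source: each listing link yields one ID.
ids_from_source = re.findall(LISTING_ID_PATTERN, SAMPLE_PAGE_SOURCE)

print(ids_from_url)     # []
print(ids_from_source)  # ['288910', '288911']
```

In a Scrapy callback, the downloaded page is available on the response object (for example `response.text` in current Scrapy versions), so that is the string to search rather than the URL.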
