2014-06-20

Scrapy: my SGML link extractor does not match my regex

This is my code:

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

# AmazonScraper (the item class) is defined elsewhere in the project.


class MySpider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011"]

    rules = [Rule(SgmlLinkExtractor(allow=('.*?/\gp/\product.*?')), callback='parse_items', follow=True)]

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        item = AmazonScraper()
        print 'inside'
        print sel.css('#btAsinTitle::text').extract()
        item["title"] = ''.join(sel.css('#btAsinTitle::text').extract())
        print '-----', item["title"]
        print response.url
        # Join the extracted text fragments, then strip all whitespace.
        item["digitalprice"] = ''.join(sel.css('.digitalListPrice>.listprice::text').extract())
        item["digitalprice"] = re.sub('\s+', '', item["digitalprice"])
        item["listprice"] = ''.join(sel.css('.listPrice::text').extract())
        item["listprice"] = re.sub('\s+', '', item["listprice"])
        item["kindleprice"] = ''.join(sel.css('.priceLarge::text').extract())
        item["kindleprice"] = re.sub('\s+', '', item["kindleprice"])

        if (item["digitalprice"] is not None and item["listprice"] is not None
                and item["kindleprice"] is not None):
            items.append(item)

        print items

        return items

I am also getting URLs that do not match the regex.
Why is that? I want to crawl all the book links on the seed page.
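A side note on the final check in `parse_items`, which tests the three price fields against `None`: joining the result of `extract()` and passing it through `re.sub` always produces a string, so a missing price shows up as `''` rather than `None`. The sketch below illustrates this with the standard library only. The helper name `clean` and the sample price fragments are hypothetical, made up for the illustration.

```python
import re


def clean(parts):
    """Join text fragments and strip all whitespace, as parse_items does."""
    return re.sub(r"\s+", "", "".join(parts))


# Hypothetical inputs: an empty extract() result and a price split in two.
missing = clean([])
present = clean([" $9", ".99 \n"])

# A missing field is an empty string, so a comparison with None never filters it.
print(repr(missing), missing is None)

# Truthiness distinguishes the two cases.
print(bool(missing), bool(present))
```

If items with missing prices should be dropped, testing the fields for truthiness (for example `if all(item[k] for k in (...))`) would be one option.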

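Which User-Agent are you using? With a real browser User-Agent value I got more `/gp/product/` links than with the Scrapy default. Also, are you sure about your regex? `/gp/product/` would match Amazon product links more directly.

I am using a Mozilla user agent. USER_AGENT: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36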

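Answer

As I suggested in my comment, take another look at your regex.

Here is a rather long scrapy shell session (long because of the number of links, some of which I skipped). It was run from France, so the response may differ where you are. It seems to fetch quite a lot of product links: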

$ scrapy shell "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011" --set USER_AGENT="Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
2014-06-20 12:58:05+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot) 
... 
2014-06-20 12:58:06+0200 [default] INFO: Spider opened 
2014-06-20 12:58:08+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> (referer: None) 
[s] Available Scrapy objects: 
[s] crawler <scrapy.crawler.Crawler object at 0x7f6ec6fb4310> 
[s] item  {} 
[s] request <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> 
[s] response <200 http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> 
[s] sel  <Selector xpath=None data=u'<html>\n <head>\n  <meta http-equ'> 
[s] settings <CrawlerSettings module=None> 
[s] spider  <Spider 'default' at 0x7f6ec6740590> 
[s] Useful shortcuts: 
[s] shelp()   Shell help (print this help) 
[s] fetch(req_or_url) Fetch request (or URL) and update local objects 
[s] view(response) View response in a browser 

In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
In [2]: lx = SgmlLinkExtractor(allow=('.*?/\gp/\product.*?',)) 
In [3]: import pprint 
In [4]: pprint.pprint([link.url for link in lx.extract_links(response)]) 
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655', 
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655', 
'http://www.amazon.com/gp/product/B007HCCNJU/ref=topnav_storetab_kstore/181-5939241-1829655', 
'http://www.amazon.com/gp/product/B00FL3YL7O/ref=amb_link_410918762_2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1775973302&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-top-1&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1', 
'http://www.amazon.com/gp/product/B00HWI5OP4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product-reviews/B00HWI5OP4/ref=s9_al_bw_rs2/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1', 
'http://www.amazon.com/gp/product/B009NF6Z2K/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product-reviews/B009NF6Z2K/ref=s9_al_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1', 
... 
'http://www.amazon.com/gp/product-reviews/B00DN7BAUG/ref=s9_hps_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1', 
'http://www.amazon.com/gp/product/B00A7H2CFW/ref=s9_hps_bw_g351_i4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101', 
'http://www.amazon.com/gp/product-reviews/B00A7H2CFW/ref=s9_hps_bw_rs4/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1', 
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_t1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_t2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_t3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00DZQE2Y6/ref=amb_link_410240162_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101'] 

In [5]: lx = SgmlLinkExtractor(allow=('/gp/product/',)) 

In [6]: pprint.pprint([link.url for link in lx.extract_links(response)]) 
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655', 
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655', 
... 
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101', 
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101'] 

In [7]: len([link.url for link in lx.extract_links(response)]) 
Out[7]: 106 

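So I get 106 `/gp/product/` links with this pattern, compared with 185 with your regex. Your pattern has no slash after `product`, so it also matches the `/gp/product-reviews/` links visible in the first listing.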
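The gap between the two counts can be explored offline with the standard `re` module. This is only a sketch under a few assumptions: it assumes the link extractor applies `allow` patterns with search semantics (a match anywhere in the URL), it uses two URLs from the session above with their query strings dropped, and it leaves out the stray backslashes of the original pattern, which are unnecessary and which recent Python 3 versions of `re` reject as bad escapes.

```python
import re

# Two URLs from the shell session above (query strings dropped for brevity).
urls = [
    "http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655",
    "http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655",
]

# The original pattern without its stray backslashes, and the stricter one.
loose = re.compile(r".*?/gp/product.*?")
strict = re.compile(r"/gp/product/")


def matching(pattern, candidates):
    """Return the candidates in which the pattern matches anywhere."""
    return [u for u in candidates if pattern.search(u)]


loose_hits = matching(loose, urls)
strict_hits = matching(strict, urls)

# The loose pattern keeps the product-reviews link; the strict one drops it.
print(len(loose_hits), len(strict_hits))
```

The trailing slash in `/gp/product/` is what excludes `/gp/product-reviews/`; the leading `.*?` and trailing `.*?` add nothing when the pattern is searched for anywhere in the URL.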

Can you tell me the regex for the Kindle eBook links on this page: http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011

Well, I may be repeating myself, but doesn't `SgmlLinkExtractor(allow=('/gp/product/',))` do what you want?

No, I only want Kindle eBooks. This is my new regex: `\/dp\/B00.*digital-text`. My start URL is now http://www.amazon.com/s/ref=sr_nr_n_0?rh=n%3A133140011%2Cn%3A%21133141011%2Cn%3A154606011%2Cn%3A668010011%2Cn%3A158591011%2Cn%3A158592011&bbn=158591011&ie=UTF8&qid=1403264414&rnid=158591011 and now my crawler is not scraping the details. Can you tell me why? Please include code if possible.