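As I suggested in my comment, you may want to take another look at your regular expression.
Here is a rather long scrapy shell session (long because of the number of links, and I skipped some of them). I ran it from France, so the response may differ where you are. It seems to pick up quite a lot of product links: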
$ scrapy shell "http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011" --set USER_AGENT="Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36"
2014-06-20 12:58:05+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
...
2014-06-20 12:58:06+0200 [default] INFO: Spider opened
2014-06-20 12:58:08+0200 [default] DEBUG: Crawled (200) <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f6ec6fb4310>
[s] item {}
[s] request <GET http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s] response <200 http://www.amazon.com/Kindle-eBooks/b?ie=UTF8&node=154606011>
[s] sel <Selector xpath=None data=u'<html>\n <head>\n <meta http-equ'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7f6ec6740590>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=('.*?/\gp/\product.*?',))
In [3]: import pprint
In [4]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
'http://www.amazon.com/gp/product/B007HCCNJU/ref=topnav_storetab_kstore/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00FL3YL7O/ref=amb_link_410918762_2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1775973302&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-top-1&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00HWI5OP4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00HWI5OP4/ref=s9_al_bw_rs2/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B009NF6Z2K/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B009NF6Z2K/ref=s9_al_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1826829602&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-5&pf_rd_t=101&showViewpoints=1',
...
'http://www.amazon.com/gp/product-reviews/B00DN7BAUG/ref=s9_hps_bw_rs3/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00A7H2CFW/ref=s9_hps_bw_g351_i4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101',
'http://www.amazon.com/gp/product-reviews/B00A7H2CFW/ref=s9_hps_bw_rs4/181-5939241-1829655?ie=UTF8&pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1819075922&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-12&pf_rd_t=101&showViewpoints=1',
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQNA/ref=s9_al_bw_g351_t1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_i2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQT4/ref=s9_al_bw_g351_t2/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_i3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00B52IQSA/ref=s9_al_bw_g351_t3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711163122&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-5&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DGALTQA/ref=amb_link_409685542_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1749675842&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-6&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00FL3YL6K/ref=amb_link_410240162_3/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00DZQE2Y6/ref=amb_link_410240162_4/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1752410382&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-7&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00C7XTOMS/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1711175222&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-8&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']
In [5]: lx = SgmlLinkExtractor(allow=('/gp/product/',))
In [6]: pprint.pprint([link.url for link in lx.extract_links(response)])
['http://www.amazon.com/gp/product/B00DBYBNEE/ref=gno_joinprmlogo/181-5939241-1829655',
'http://www.amazon.com/gp/product/B00DBYBNEE/ref=nav_prime_join/181-5939241-1829655',
...
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_i1/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101',
'http://www.amazon.com/gp/product/B00JUWYGDQ/ref=s9_al_bw_g351_more/181-5939241-1829655?pf_rd_i=154606011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=1814488482&pf_rd_r=17JQXD2H3N2EZ3M7CF1R&pf_rd_s=merchandised-search-right-9&pf_rd_t=101']
In [7]: len([link.url for link in lx.extract_links(response)])
Out[7]: 106
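So I get 185 links with your regular expression versus 106 with '/gp/product/'. The extra matches include the /gp/product-reviews/ URLs visible in the first listing.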
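To see why the two patterns give different counts, here is a minimal sketch using plain Python `re` on two URLs shortened from the output above (query strings dropped). It assumes the link extractor applies each `allow` pattern as a search anywhere in the URL, which is how I understand it to work. The loose pattern is written without the stray backslashes from the original, since recent Python 3 versions reject `\g` and `\p` as unknown escapes; the helper name `matching` is just for illustration.

```python
import re

# Two URLs shortened from the shell output above (query strings dropped).
urls = [
    "http://www.amazon.com/gp/product/B00GL3MGTI/ref=s9_al_bw_g351_i1/181-5939241-1829655",
    "http://www.amazon.com/gp/product-reviews/B00GL3MGTI/ref=s9_al_bw_rs1/181-5939241-1829655",
]

# Assumption: the original pattern with its stray backslashes removed.
loose = re.compile(r".*?/gp/product.*?")
# The stricter pattern requires a slash right after "product".
strict = re.compile(r"/gp/product/")


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [url for url in candidates if pattern.search(url)]


loose_hits = matching(loose, urls)
strict_hits = matching(strict, urls)

print(len(loose_hits), len(strict_hits))
for url in strict_hits:
    print(url)
```

The loose pattern accepts both URLs because "product-reviews" also starts with "product", while the strict one keeps only the product page. The leading and trailing `.*?` add nothing when the pattern is applied as a search.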
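What User-Agent are you using? With a real browser User-Agent value I got more '/gp/product/' links than with Scrapy's default one. Also, are you sure about your regular expression? '/gp/product/' would match Amazon product pages more directly.
I am using a Mozilla user agent. User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36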