2017-07-11 47 views

How to use Scrapy with both Splash and Tor over Privoxy in Docker Compose

I would like to run a Scrapy spider with two "extensions":

  1. Splash, to render JavaScript,
  2. Tor-Privoxy, to provide anonymity.

As an example, I am using the scraper for quotes.toscrape.com from https://github.com/scrapy-plugins/scrapy-splash/tree/master/example. Here is my directory structure:

. 
├── docker-compose.yml 
└── example 
    ├── Dockerfile 
    ├── scrapy.cfg 
    └── scrashtest 
     ├── __init__.py 
     ├── settings.py 
     └── spiders 
      ├── __init__.py 
      └── quotes.py 

where the example directory is cloned from the scrapy-splash repository. I have added the following docker-compose.yml file:

version: '3'

services:
  scraper:
    build: ./example
    environment:
      - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash

and in the settings.py file I have changed SPLASH_URL:

# SPLASH_URL = 'http://127.0.0.1:8050/' 
SPLASH_URL = 'http://splash:8050' 

because Splash is not running on localhost, but in a separate linked container named splash. The Dockerfile for the scraper reads:

FROM python:alpine 
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash 
RUN pip install scrapy scrapy-splash 
COPY . /scraper 
WORKDIR /scraper 
CMD ["scrapy", "crawl", "quotes"] 

The problem is that when I run this using docker-compose build followed by docker-compose up, I get the following logs:

Starting examplecompose_tor-privoxy_1 
Starting examplecompose_splash_1 
Recreating examplecompose_scraper_1 
Attaching to examplecompose_splash_1, examplecompose_tor-privoxy_1, examplecompose_scraper_1 
splash_1  | 2017-07-11 16:10:13+0000 [-] Log opened. 
splash_1  | 2017-07-11 16:10:13.794595 [-] Splash version: 3.0 
tor-privoxy_1 | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Privoxy version 3.0.23 
tor-privoxy_1 | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Program name: privoxy 
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Tor v0.2.6.10 (git-58c51dc6087b0936) running on Linux with Libevent 2.0.22-stable, OpenSSL 1.0.2d and Zlib 1.2.8. 
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning 
splash_1  | 2017-07-11 16:10:13.795925 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2 
splash_1  | 2017-07-11 16:10:13.796204 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609] 
tor-privoxy_1 | Jul 11 16:10:13.578 [notice] Configuration file "/etc/tor/torrc" not present, using reasonable defaults. 
tor-privoxy_1 | Jul 11 16:10:13.581 [notice] Opening Socks listener on 127.0.0.1:9050 
splash_1  | 2017-07-11 16:10:13.796541 [-] Open files limit: 1048576 
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip. 
splash_1  | 2017-07-11 16:10:13.796706 [-] Can't bump open files limit 
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6. 
splash_1  | 2017-07-11 16:10:13.903844 [-] Xvfb is started: ['Xvfb', ':1896918638', '-screen', '0', '1024x768x24', '-nolisten', 'tcp'] 
splash_1  | QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root' 
tor-privoxy_1 | Jul 11 16:10:13.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't. 
splash_1  | 2017-07-11 16:10:13.984515 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles 
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 0%: Starting 
splash_1  | 2017-07-11 16:10:14.041562 [-] verbosity=1 
splash_1  | 2017-07-11 16:10:14.041732 [-] slots=50 
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 5%: Connecting to directory server 
splash_1  | 2017-07-11 16:10:14.041806 [-] argument_cache_max_entries=500 
tor-privoxy_1 | Jul 11 16:10:13.000 [notice] Bootstrapped 80%: Connecting to the Tor network 
splash_1  | 2017-07-11 16:10:14.043083 [-] Web UI: enabled, Lua: enabled (sandbox: enabled) 
splash_1  | 2017-07-11 16:10:14.044088 [-] Site starting on 8050 
splash_1  | 2017-07-11 16:10:14.044240 [-] Starting factory <twisted.web.server.Site object at 0x7f73a4e4b3c8> 
tor-privoxy_1 | Jul 11 16:10:14.000 [notice] Bootstrapped 85%: Finishing handshake with first hop 
scraper_1  | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrashtest) 
scraper_1  | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrashtest', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'scrashtest.spiders', 'SPIDER_MODULES': ['scrashtest.spiders']} 
scraper_1  | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled extensions: 
scraper_1  | ['scrapy.extensions.corestats.CoreStats', 
scraper_1  | 'scrapy.extensions.telnet.TelnetConsole', 
scraper_1  | 'scrapy.extensions.memusage.MemoryUsage', 
scraper_1  | 'scrapy.extensions.logstats.LogStats'] 
scraper_1  | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled downloader middlewares: 
scraper_1  | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
scraper_1  | 'scrapy_splash.SplashCookiesMiddleware', 
scraper_1  | 'scrapy_splash.SplashMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
scraper_1  | 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
scraper_1  | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled spider middlewares: 
scraper_1  | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
scraper_1  | 'scrapy_splash.SplashDeduplicateArgsMiddleware', 
scraper_1  | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
scraper_1  | 'scrapy.spidermiddlewares.referer.RefererMiddleware', 
scraper_1  | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
scraper_1  | 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
scraper_1  | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled item pipelines: 
scraper_1  | [] 
scraper_1  | 2017-07-11 16:10:15 [scrapy.core.engine] INFO: Spider opened 
scraper_1  | 2017-07-11 16:10:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
scraper_1  | 2017-07-11 16:10:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
tor-privoxy_1 | Jul 11 16:10:16.000 [notice] Bootstrapped 90%: Establishing a Tor circuit 
tor-privoxy_1 | Jul 11 16:10:17.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working. 
tor-privoxy_1 | Jul 11 16:10:17.000 [notice] Bootstrapped 100%: Done 
tor-privoxy_1 | Jul 11 16:10:17.000 [warn] Received http status code 404 ("Not found") from server '216.218.222.10:443' while fetching "/tor/keys/fp/585769C78764D58426B8B52B6651A5A71137189A+80550987E1D626E3EBA5E5E75A458DE0626D088C". 
scraper_1  | 2017-07-11 16:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None) 
scraper_1  | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.goodreads.com': <GET https://www.goodreads.com/quotes> 
scraper_1  | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapinghub.com': <GET https://scrapinghub.com> 
tor-privoxy_1 | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
tor-privoxy_1 | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
scraper_1  | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/adulthood/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
scraper_1  | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/be-yourself/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
tor-privoxy_1 | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
tor-privoxy_1 | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
scraper_1  | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/success/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
scraper_1  | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/books/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
tor-privoxy_1 | Jul 11 16:10:56.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
scraper_1  | 2017-07-11 16:10:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
tor-privoxy_1 | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
tor-privoxy_1 | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up. 
scraper_1  | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/classic/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 
scraper_1  | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/aliteracy/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error 

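where I interrupted the process after a short while. It seems that the scraper service is complaining about a 500 Internal Server Error, while the tor-privoxy service reports that it cannot "resolve or connect to address".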

I am struggling to figure out why http_proxy and Splash do not "work together". Can anyone point me in the right direction?

Answer


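Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to make Splash use Tor, rather than having the spider use it directly.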
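One plausible reading of the logs in the question (an interpretation, not something confirmed in this thread) is that with http_proxy set on the scraper, the spider's own requests to http://splash:8050 are also sent through Privoxy and Tor, which cannot resolve the Compose-internal hostname splash. The sketch below uses Python's standard urllib helpers to show how the http_proxy and no_proxy environment variables decide whether a host is proxied. That Scrapy's HttpProxyMiddleware applies the same rules is an assumption here, and uses_proxy is a hypothetical helper name.

```python
import os
from urllib.request import getproxies_environment, proxy_bypass_environment


def uses_proxy(host, env):
    """Return True if urllib would send an HTTP request for `host` via a proxy.

    `env` holds the proxy-related environment variables to simulate.
    The real environment is restored before returning.
    """
    saved = dict(os.environ)
    try:
        # Start from a clean slate: drop any proxy variables already set.
        for key in list(os.environ):
            if key.lower().endswith("_proxy"):
                del os.environ[key]
        os.environ.update(env)
        has_http_proxy = "http" in getproxies_environment()
        return has_http_proxy and not bool(proxy_bypass_environment(host))
    finally:
        os.environ.clear()
        os.environ.update(saved)


proxy_only = {"http_proxy": "http://tor-privoxy:8118"}
with_bypass = {"http_proxy": "http://tor-privoxy:8118", "no_proxy": "splash"}

# With only http_proxy set, even the internal host "splash" is proxied.
print(uses_proxy("splash", proxy_only))
# Listing "splash" in no_proxy exempts it, while external hosts stay proxied.
print(uses_proxy("splash", with_bypass))
print(uses_proxy("quotes.toscrape.com", with_bypass))
```

Note that exempting splash from the proxy would only change how the spider reaches Splash. Splash itself would still fetch the pages without Tor, which is why the proxy is configured on the Splash side here.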

I adapted the project to have the following structure:

. 
├── docker-compose.yml 
├── example 
│   ├── Dockerfile 
│   ├── scrapy.cfg 
│   └── scrashtest 
│    ├── __init__.py 
│    ├── settings.py 
│    └── spiders 
│     ├── __init__.py 
│     └── quotes.py 
└── splash 
    └── proxy-profiles 
     └── default.ini 

docker-compose.yml

version: '3'

services:
  scraper:
    build: ./example
    links:
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy

where I have mounted the proxy-profiles directory as a volume into the splash container, following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles. The default.ini reads

[proxy] 

host=tor-privoxy 
port=8118 

(I also noticed that it is necessary to name the file default.ini.)
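For anyone who prefers to generate the profile rather than write it by hand, here is a minimal sketch using Python's configparser. It only reproduces the [proxy] section shown above; build_proxy_profile is a hypothetical helper name, and the host and port are the values from this answer.

```python
import configparser
import io


def build_proxy_profile(host, port):
    """Return the text of a Splash proxy profile containing a [proxy] section."""
    config = configparser.ConfigParser()
    config["proxy"] = {"host": host, "port": str(port)}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the hand-written file above.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


profile_text = build_proxy_profile("tor-privoxy", 8118)
print(profile_text)
```

The resulting text would then be saved as splash/proxy-profiles/default.ini, next to docker-compose.yml, so that the volume entry above picks it up.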

With this setup, running docker-compose build followed by docker-compose up runs the scraper successfully through Splash.
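One way to check that Splash really fetches pages through Tor is to ask it to render a page that echoes the caller's IP address, such as https://httpbin.org/ip. The sketch below builds a request to Splash's render.html endpoint using only the standard library. The helper names are hypothetical, and the optional proxy argument is passed as Splash's per-request proxy parameter. When it is omitted, the default.ini profile should apply.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(splash_base, target_url, proxy=None):
    """Build a render.html URL asking Splash to fetch `target_url`."""
    params = {"url": target_url}
    if proxy is not None:
        # Optional per-request proxy (a profile name or a proxy URL).
        params["proxy"] = proxy
    return splash_base.rstrip("/") + "/render.html?" + urlencode(params)


def fetch_via_splash(splash_base, target_url, proxy=None, timeout=60):
    """Fetch `target_url` through Splash and return the rendered body as text."""
    url = splash_render_url(splash_base, target_url, proxy)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Only the URL is built here; no network request is made at import time.
check_url = splash_render_url("http://splash:8050", "https://httpbin.org/ip")
print(check_url)
```

Calling fetch_via_splash("http://splash:8050", "https://httpbin.org/ip") from inside the scraper container (the Splash port is not published to the host in this Compose file) should return an address that differs from your own public IP if the proxy profile is in effect.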


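Thanks for your help! I have only just started using Docker, and I don't understand what "I have mounted the proxy-profiles directory as a volume into the splash container" means. Did you mount the proxy profiles before running 'docker-compose build' and 'docker-compose up'? How do you mount it? I tried it as shown in the docs, with 'docker run -p 8050:8050 -v /splash/proxy-profiles:/etc/splash/filters scrapinghub/splash', but this creates another container rather than the one that 'docker-compose build' and 'docker-compose up' create. On https://httpbin.org/ip I can see that the proxy is not being used.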