Python Scrapy on offline (local) data

I have a 270MB dataset (10,000 HTML files) on my computer. Can I use Scrapy to crawl this dataset locally? If so, how?

2013-10-15 Sagi

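SimpleHTTP Server Hosting

If you truly want to host the files locally and use Scrapy, you can serve them by navigating to the directory where they are stored and running SimpleHTTPServer (port 8000 shown below):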

python -m SimpleHTTPServer 8000

Then just point Scrapy at 127.0.0.1:8000:

$ scrapy crawl 127.0.0.1:8000
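
The hosting step above uses the Python 2 module name. On Python 3 the module is called http.server, so the equivalent command is python3 -m http.server 8000. If you would rather start the server from a script, here is a minimal sketch using only the standard library. The function name serve_directory and the folder name in the usage note are placeholders of mine, not something from the question.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on 127.0.0.1 from a background thread.

    Returns the server object so the caller can stop it with shutdown().
    """
    # Bind the handler to the dataset folder instead of the current directory.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # daemon=True so the thread does not keep the interpreter alive on exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling serve_directory("html_files") should make the files reachable under http://127.0.0.1:8000/ until you call shutdown() on the returned server or the script exits.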

file://

An alternative is to point Scrapy directly at the set of files:

$ scrapy crawl file:///home/sagi/html_files # Assuming you're on a *nix system
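
One caveat on the file:// route: as far as I know, Scrapy fetches file:// URLs one file at a time and does not list a directory, and scrapy crawl takes a spider name rather than a URL. So in practice you would probably build one file:// URL per HTML file and hand that list to your spider. A small sketch of building that list follows. The helper name local_file_urls and the "*.html" pattern are my own assumptions.

```python
from pathlib import Path


def local_file_urls(directory, pattern="*.html"):
    """Return sorted file:// URLs for the files in `directory` matching `pattern`."""
    # as_uri() needs an absolute path, so resolve the directory first.
    root = Path(directory).resolve()
    return sorted(path.as_uri() for path in root.glob(pattern) if path.is_file())
```

In a spider you could then set start_urls = local_file_urls("/home/sagi/html_files") and Scrapy should read the pages straight from disk, with no web server involved.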

Wrapping up

Once you've set up your scraper for Scrapy (see the example dirbot), just run the crawler:

$ scrapy crawl 127.0.0.1:8000

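Note that if the links in the HTML files are absolute rather than relative, these approaches may not work well. You would need to adjust the files yourself.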
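
Adjusting the files could be as simple as rewriting absolute links that point at the original site into root-relative paths. The sketch below is a rough regex-based take on that, using only the standard library. The function name make_links_relative is hypothetical, and you would have to supply the real host your pages were saved from. A regex will miss unusual markup, so check the result on a few files first.

```python
import re
from urllib.parse import urlsplit

# Matches href="..." or src="..." with either quote style.
LINK_ATTR = re.compile(
    r'(?P<attr>\b(?:href|src)\s*=\s*)(?P<quote>["\'])(?P<url>.*?)(?P=quote)',
    re.IGNORECASE,
)


def make_links_relative(html, original_host):
    """Rewrite href/src values that point at `original_host` to root-relative paths."""

    def replace(match):
        parts = urlsplit(match.group("url"))
        if parts.netloc.lower() != original_host.lower():
            # Other hosts and already-relative links are left untouched.
            return match.group(0)
        relative = parts.path or "/"
        if parts.query:
            relative += "?" + parts.query
        if parts.fragment:
            relative += "#" + parts.fragment
        quote = match.group("quote")
        return f'{match.group("attr")}{quote}{relative}{quote}'

    return LINK_ATTR.sub(replace, html)
```

Root-relative paths like /a/b.html suit the local web server approach, where the dataset folder is the server root. For file:// URLs you would need paths relative to each file instead.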

2013-10-15 16:16:55

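You do know that awarding a bounty to yourself won't earn you a hat, right? :-P

@MartijnPieters I'm giving out a few bounties. Happy holidays! To some extent, I was hoping the asker would accept the answer. :-/

Your answer certainly deserves at least *some* feedback, indeed!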

Go to your dataset folder:

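import os

# Walk the current directory, which should be the dataset folder.
for filename in os.listdir(os.getcwd()):
    if not os.path.isfile(filename):
        continue  # skip subdirectories
    with open(filename, "r") as f:
        page_content = f.read()
    # Do whatever you want with page_content here, e.g. parse it with lxml or Beautiful Soup.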

No need to go for Scrapy!
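
To give an idea of what "do whatever you want with page_content" might look like, here is a small sketch that pulls the title and the links out of one page. I am using the standard library's html.parser here only to keep the example dependency-free. It is a stand-in for lxml or Beautiful Soup, which are more forgiving with messy HTML. The class and function names are my own.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the <title> text and every <a href> value from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def extract_title_and_links(page_content):
    """Return (title, links) for one page's HTML source."""
    parser = TitleAndLinkParser()
    parser.feed(page_content)
    parser.close()
    return parser.title.strip(), parser.links
```

Inside the loop you would call extract_title_and_links(page_content) for each file and store the result however you like.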

2013-10-15 17:25:03
