2016-01-07 53 views
1
Getting started with Scrapy

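I have just started working through the Scrapy documentation, and I was wondering whether anyone could give me a line-by-line explanation of the following code: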

def parse(self, response):
    filename = response.url.split("/")[-2] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

Answers

2

Have you seen http://doc.scrapy.org/en/stable/intro/tutorial.html#our-first-spider ?

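parse(): a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as its first and only argument.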

# a method called parse that takes one argument: response 
def parse(self, response): 
    # get the URL (string) from the response object [1] 
    # split [2] the string on the "/" character 
    # generate a filename from the list of split strings 
    filename = response.url.split("/")[-2] + '.html' 
    # open [3] a file called filename and write [4] into it the body 
    # of the response (i.e. the contents of the scraped page) 
    with open(filename, 'wb') as f: 
        f.write(response.body)

[1] http://doc.scrapy.org/en/stable/topics/request-response.html#scrapy.http.Response

[2] https://docs.python.org/2/library/stdtypes.html#str.split

[3] https://docs.python.org/2/library/functions.html#open

[4] https://docs.python.org/2/library/stdtypes.html#file.write
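To make the split step concrete, here is a small sketch in plain Python, with no Scrapy involved. The URL is a made-up example (an assumption, not taken from the question); the point is only to show what `split("/")` returns and why index `[-2]` is used.

```python
# Hypothetical URL, used only for illustration.
url = "http://www.example.com/Python/Books/"

parts = url.split("/")
# parts == ['http:', '', 'www.example.com', 'Python', 'Books', '']
# The trailing slash yields an empty last element, so [-2] is 'Books'.
filename = parts[-2] + ".html"
print(filename)  # Books.html

# Without a trailing slash, [-2] picks the segment before the last one.
no_slash = "http://www.example.com/Python/Books".split("/")[-2] + ".html"
print(no_slash)  # Python.html
```

So the generated filename depends on whether the URL ends with a slash, which is worth keeping in mind when reusing this snippet on other sites.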

1

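You have a spider that downloads a web page and saves the response to a file. The spider handles each response it receives with the parse method you defined: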

Line 1: defines the parse method, which receives the response as an argument. The response is what you get back from the web server.

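Line 2: defines the name of the file the response data will be saved to. The name is taken from the URL: the URL is split on the '/' character and the second-to-last string is used, not the last one. Then .html is appended to form the filename.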

Line 3: opens that file for writing in binary mode, hence wb.

Line 4: writes the HTML data taken from response.body into the file.
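The four lines can also be tried out without running a crawl. The sketch below is only an approximation under stated assumptions: `FakeResponse` is a hypothetical stand-in that carries just the `url` and `body` attributes the method reads (it is not Scrapy's Response class), `parse` is written as a plain function with an extra `directory` parameter so the file lands in a temporary folder, and the URL and HTML are invented.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for scrapy.http.Response: only the two attributes
# that parse() reads. A real Response has many more.
FakeResponse = namedtuple("FakeResponse", ["url", "body"])


def parse(response, directory="."):
    """Save the response body under a name derived from the URL."""
    filename = response.url.split("/")[-2] + ".html"
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:  # 'wb': write, binary mode
        f.write(response.body)   # body is bytes, written unchanged
    return path


response = FakeResponse(
    url="http://www.example.com/Python/Books/",  # illustrative URL
    body=b"<html><body>Hello</body></html>",
)
out_dir = tempfile.mkdtemp()
saved = parse(response, out_dir)
print(os.path.basename(saved))
with open(saved, "rb") as f:
    print(f.read())
```

Binary mode matters here because the body is raw bytes as received from the server; opening the file with 'wb' writes them out without any text encoding step.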