Empty CSV file after Scrapy crawl


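After running my code, I keep getting empty CSV files. I suspect the XPath may be the problem, but I really don't know what I'm doing. No errors are reported in the terminal output. I'm trying to scrape information from various Craigslist pages.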

from scrapy.spiders import Spider
from scrapy.selector import Selector
from craigslist_probe.items import CraigslistSampleItem

class MySpider(Spider):
    name = "why"

    allowed_domains = ["craigslist.org"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        titles = response.selector.xpath("/section[@id='pagecontainer']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["img"] = titles.xpath("./div[@class='tray']").extract()
            item["body"] = titles.xpath("./section[@id='postingbody']/text()").extract()
            item["itemID"] = titles.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
            items.append(item)
        return items

Source

2016-04-28 chrae

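Have you checked whether the item fields are actually being populated, for example with a few `print` statements or by logging them? – eLRuLL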

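You can use `scrapy shell`, or a tool such as Firebug for Firefox, to try out your XPaths. Both let you run an XPath query and see the values it returns. Very handy. On a style note, `for titles in titles:` isn't great; `for title in titles:` would be better, although Python doesn't seem to mind. – Steve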
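To make the first comment's suggestion concrete, here is a minimal sketch of a field check. The helper name `empty_fields` and the logger name are hypothetical, and a plain dict stands in for a populated `CraigslistSampleItem`; the sketch assumes the item can be converted with `dict(item)`, which should hold for Scrapy items since they behave like mappings.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("field_check")


def empty_fields(item):
    """Return the names of fields whose value is missing or empty."""
    return [name for name, value in dict(item).items() if not value]


# A plain dict stands in for an item as the spider would fill it.
# .extract() returns a list, so a failed XPath shows up as an empty list.
sample = {"img": [], "body": ["Some text"], "itemID": []}

missing = empty_fields(sample)
if missing:
    logger.warning("Empty fields: %s", ", ".join(missing))
```

Inside the spider, the same check could run just before `items.append(item)`, logging through the spider's own `self.logger` so the message appears in the crawl output.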

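I suspect your XPath doesn't correspond to the HTML structure of the page. Note that a single slash (/) selects only direct children, so /section, for example, only matches when the root element of the page is a <section> element, which is almost never the case. Try using // throughout: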

def parse(self, response):
    # "//" matches the element anywhere in the document, not only at the root
    titles = response.selector.xpath("//section[@id='pagecontainer']")
    items = []
    for title in titles:
        item = CraigslistSampleItem()
        # ".//" searches all descendants of the current node
        item["img"] = title.xpath(".//div[@class='tray']").extract()
        item["body"] = title.xpath(".//section[@id='postingbody']/text()").extract()
        item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
        items.append(item)
    return items
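The difference between a direct-child step and a descendant step can be seen on a small sample. This sketch uses the standard library's `xml.etree.ElementTree` instead of Scrapy, so it runs without a crawl; ElementTree supports only a subset of XPath and needs well-formed markup. The markup below is hypothetical and simplified, not the real Craigslist page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE = """
<html>
  <body>
    <section id="pagecontainer">
      <section class="body">
        <div class="tray">thumbnails</div>
        <section id="postingbody">Item description</section>
      </section>
    </section>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# "./section" only looks at direct children of <html>, so nothing matches.
direct = root.findall("./section[@id='pagecontainer']")

# ".//section" searches all descendants, so the nested element is found.
nested = root.findall(".//section[@id='pagecontainer']")

container = nested[0]
# Same effect one level down: the tray is a grandchild of the container.
tray_direct = container.findall("./div[@class='tray']")
tray_nested = container.findall(".//div[@class='tray']")

print(len(direct), len(nested), len(tray_direct), len(tray_nested))
```

With this sample, the direct-child queries return empty lists and the descendant queries return one element each. An empty result raises no error, which would explain an empty CSV with nothing reported in the terminal.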

Source

2016-04-28 03:46:21 har07
