2017-04-12 90 views

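How can I return NaN for websites that have no scraped information?

How can I return NaN for the URLs where ".//*[@id='object']//tbody//tr//td//span//a[2]" matches nothing? I tried: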

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = 'NaN'
        item['name'] = response.url
        return item

    for links in links:
        item = ToyItem()
        item['link'] = links.xpath('@href').extract_first()
        item['name'] = response.url # <-- see here
    yield item

    list_of_dics = []
    list_of_dics.append(item)
    df = pd.DataFrame(list_of_dics)
    print(df)
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

However, instead of returning (*):

'link1.com' 'NaN' 
'link2.com' 'NaN'
'link3.com' 'extracted3.link.com' 

I get:

'link3.com' 'extracted3.link.com' 

How can I return (*)?

Answers

1

You can rework this using Scrapy pipelines:

from scrapy import Spider

from myproject.items import ToyItem  # assumed location of the question's ToyItem


class MySpider(Spider):
    name = 'myspider'
    start_urls = ['link1', 'link2', 'link3']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        if not links:
            # No matching links: still emit a row so the URL shows up in the output.
            item = ToyItem()
            item['link'] = 'NaN'
            item['name'] = response.url
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['link'] = link.xpath('@href').extract_first()
                item['name'] = response.url # <-- see here
                yield item
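
The spider above uses `yield item` in the no-links branch where the question used `return item`. That difference matters: any function that contains `yield` is a generator, and a value passed to `return` inside a generator is not produced when the generator is iterated. Below is a minimal sketch in plain Python with no Scrapy involved; `parse_with_return`, `parse_with_yield`, and the dict items are hypothetical stand-ins for the callback and for `ToyItem`.

```python
def parse_with_return(url, links):
    """Mimics the question's callback: `return` in one branch, `yield` in the other."""
    if not links:
        # Inside a generator this only ends iteration; the dict is never emitted.
        return {"name": url, "link": "NaN"}
    for link in links:
        yield {"name": url, "link": link}


def parse_with_yield(url, links):
    """Mimics the answer's callback: every branch yields."""
    if not links:
        yield {"name": url, "link": "NaN"}
    else:
        for link in links:
            yield {"name": url, "link": link}


# Hypothetical URL with no matching links.
lost = list(parse_with_return("link1.com", []))
kept = list(parse_with_yield("link1.com", []))
print(lost)
print(kept)
```

In this sketch `lost` comes back empty while `kept` holds the 'NaN' row, which would be consistent with the 'NaN' rows never reaching the output in the question.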

Now, in your pipelines.py:

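import pandas as pd


class PandasPipeline:

    def open_spider(self, spider):
        # Collect every item from the whole crawl here.
        self.data = []

    def process_item(self, item, spider):
        # Store a plain dict so pandas builds one column per field.
        self.data.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write once, after the crawl finishes, so earlier rows are not overwritten.
        df = pd.DataFrame(self.data)
        print('saving dataframe')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)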

And in settings.py:

ITEM_PIPELINES = { 
    'myproject.pipelines.PandasPipeline': 900 
} 
+1

@tumbleweed Well, that means no links were found on the page. Is `parse` being called multiple times? – Granitosaurus

+0

Yes. For the pages where no links are found, I would like to know how to store NaN instead of `None`, keeping the reference URL in the left column. – tumbleweed
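
One possible way to get 'NaN' in place of `None`, sketched on the assumption that each item can be treated as a plain dict: replace missing values before the row is stored. `fill_missing` is a hypothetical helper, not part of Scrapy or pandas. Scrapy's `extract_first()` also accepts a `default` argument, so `extract_first(default='NaN')` may be enough when the only gap is a missing `href`.

```python
def fill_missing(item, placeholder="NaN"):
    """Return a copy of `item` with every None value replaced by `placeholder`."""
    return {key: placeholder if value is None else value
            for key, value in item.items()}


# Hypothetical rows: the first page yielded no link, the second did.
rows = [
    {"name": "link1.com", "link": None},
    {"name": "link3.com", "link": "extracted3.link.com"},
]
cleaned = [fill_missing(row) for row in rows]
print(cleaned)
```

The helper returns a copy, so the original items are left untouched; it could be called from `process_item` before the item is appended.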

+1

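Dude, every time `parse` is called, `to_csv` overwrites the old CSV with the new data, so you basically end up with only the data from the last `parse` call, i.e. the last crawled link. – Granitosaurus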
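
The overwrite described in this comment can be reproduced without Scrapy or pandas. The sketch below is an illustration under stated assumptions: it uses the standard-library `csv` module in place of `DataFrame.to_csv`, a temporary directory in place of the Desktop path, and hypothetical helpers named `write_rows` and `read_rows`.

```python
import csv
import os
import tempfile

FIELDS = ["name", "link"]


def write_rows(path, rows):
    """Write `rows` to `path`, truncating any existing file (mode "w")."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back as a list of plain dicts."""
    with open(path, newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


items = [
    {"name": "link1.com", "link": "NaN"},
    {"name": "link2.com", "link": "NaN"},
    {"name": "link3.com", "link": "extracted3.link.com"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Question's pattern: one write per parse call, each replacing the last.
    per_call = os.path.join(tmp, "per_call.csv")
    for item in items:
        write_rows(per_call, [item])
    overwritten = read_rows(per_call)

    # Pipeline's pattern: collect everything, then write once at the end.
    once = os.path.join(tmp, "once.csv")
    collected = []
    for item in items:
        collected.append(item)
    write_rows(once, collected)
    accumulated = read_rows(once)

print(len(overwritten), len(accumulated))
```

Writing per call leaves only the last row in the file, while collecting first and writing once keeps all three, which is the behaviour the pipeline in the answer relies on.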