
Crawling links from a JSON file. I'm new to the world of web crawling and I'm having a bit of trouble crawling a simple JSON file and retrieving the links from it. I'm using the Scrapy framework to try to accomplish this.

My sample JSON file:

{
    "pages": [
        {
            "address": "http://foo.bar.com/p1",
            "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p3", "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p2",
            "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p4",
            "links": ["http://foo.bar.com/p5", "http://foo.bar.com/p1", "http://foo.bar.com/p6"]
        },
        {
            "address": "http://foo.bar.com/p5",
            "links": []
        },
        {
            "address": "http://foo.bar.com/p6",
            "links": ["http://foo.bar.com/p7", "http://foo.bar.com/p4", "http://foo.bar.com/p5"]
        }
    ]
}

My items.py file:

import scrapy 
from scrapy.item import Item, Field 


class FoobarItem(Item): 
    # define the fields for your item here like: 
    title = Field() 
    link = Field() 

My spider file:

from scrapy.spider import Spider 
from scrapy.selector import Selector 
from foobar.items import FoobarItem 

class MySpider(Spider): 
    name = "foo" 
    allowed_domains = ["localhost"] 
    start_urls = ["http://localhost/testdata.json"] 


    def parse(self, response):
        yield response.url

Ultimately I want to crawl the file and return the links in objects without duplicates, but right now I'm struggling even to crawl the JSON. I thought the code above would crawl the JSON object and return the links, but my output file is empty. I'm not sure what I'm doing wrong, but any help would be appreciated.

Answer


So first you need a way to parse the JSON file; the json library should do that fine. The next step is to run your crawler against those URLs.

import json

with open("myExample.json", 'r') as infile:
    contents = json.load(infile)

# contents is now a dictionary whose "pages" key holds the list of
# page objects; iterate through each one and collect its address
# and the links it points to.
links_list = []
for page in contents["pages"]:
    links_list.append(page["address"])
    for link in page["links"]:
        links_list.append(link)

# get rid of dupes
links_list = list(set(links_list))
# do the rest of your crawling with the list of links
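If you would rather do everything inside the Scrapy spider instead of a standalone script, here is a minimal sketch of how that might look. It reuses the question's FoobarItem, spider name, and testdata.json URL; parsing the JSON inside parse() and de-duplicating with a set are my own assumptions, not something the question's code already does.

import json

from scrapy import Spider
from foobar.items import FoobarItem


class MySpider(Spider):
    name = "foo"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost/testdata.json"]

    def parse(self, response):
        # The response body is JSON, not HTML, so decode it directly
        data = json.loads(response.text)

        seen = set()  # assumption: drop duplicate links with a set
        for page in data["pages"]:
            for link in [page["address"]] + page["links"]:
                if link not in seen:
                    seen.add(link)
                    item = FoobarItem()
                    item["link"] = link
                    yield item

Yielding items (rather than a bare string, as in the question's parse method) is what lets an output file populated via something like scrapy crawl foo -o links.json actually contain data.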