
Crawling links from a JSON file. I'm new to the world of web crawling and I'm having a bit of trouble crawling a simple JSON file and retrieving the links from it. I'm using the Scrapy framework to try to accomplish this.

My sample JSON file:

{
    "pages": [
        {
            "address": "http://foo.bar.com/p1",
            "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p3", "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p2",
            "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p4",
            "links": ["http://foo.bar.com/p5", "http://foo.bar.com/p1", "http://foo.bar.com/p6"]
        },
        {
            "address": "http://foo.bar.com/p5",
            "links": []
        },
        {
            "address": "http://foo.bar.com/p6",
            "links": ["http://foo.bar.com/p7", "http://foo.bar.com/p4", "http://foo.bar.com/p5"]
        }
    ]
}

My items.py file:

import scrapy 
from scrapy.item import Item, Field 


class FoobarItem(Item): 
    # define the fields for your item here like: 
    title = Field() 
    link = Field() 

My spider file:

from scrapy.spider import Spider 
from scrapy.selector import Selector 
from foobar.items import FoobarItem 

class MySpider(Spider): 
    name = "foo" 
    allowed_domains = ["localhost"] 
    start_urls = ["http://localhost/testdata.json"] 


    def parse(self, response):
        yield response.url

Ultimately I want to crawl the file and return the links in objects without duplicates, but right now I'm struggling even to crawl the JSON. I thought the code above would crawl the JSON object and return the links, but my output file is empty. I'm not sure what I'm doing wrong, but any help would be appreciated.

Answer


So first you need a way to parse the JSON file; the json library should do that fine. The next step is to run your crawler against those URLs.

import json

with open("myExample.json", 'r') as infile:
    contents = json.load(infile)

# contents is now a dictionary whose "pages" key holds the list of
# page objects; iterate through each one and collect its address
# and the links it points to.
links_list = []
for page in contents["pages"]:
    links_list.append(page["address"])
    for link in page["links"]:
        links_list.append(link)

# get rid of dupes
links_list = list(set(links_list))
# do the rest of your crawling with the list of links
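If you would rather do everything inside the Scrapy spider instead of a standalone script, here is a minimal sketch of how that might look. It reuses the question's FoobarItem, spider name, and testdata.json URL; parsing the JSON inside parse() and de-duplicating with a set are my own assumptions, not something the question's code already does.

import json

from scrapy import Spider
from foobar.items import FoobarItem


class MySpider(Spider):
    name = "foo"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost/testdata.json"]

    def parse(self, response):
        # The response body is JSON, not HTML, so decode it directly
        data = json.loads(response.text)

        seen = set()  # assumption: drop duplicate links with a set
        for page in data["pages"]:
            for link in [page["address"]] + page["links"]:
                if link not in seen:
                    seen.add(link)
                    item = FoobarItem()
                    item["link"] = link
                    yield item

Yielding items (rather than a bare string, as in the question's parse method) is what lets an output file populated via something like scrapy crawl foo -o links.json actually contain data.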