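Scraping links from a JSON file
I'm new to the world of web crawling, and I'm having some trouble crawling a simple JSON file and retrieving the links from it. I'm using the Scrapy framework to try to accomplish this.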
My sample JSON file:
{
    "pages": [
        {
            "address": "http://foo.bar.com/p1",
            "links": ["http://foo.bar.com/p2",
                      "http://foo.bar.com/p3", "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p2",
            "links": ["http://foo.bar.com/p2",
                      "http://foo.bar.com/p4"]
        },
        {
            "address": "http://foo.bar.com/p4",
            "links": ["http://foo.bar.com/p5",
                      "http://foo.bar.com/p1", "http://foo.bar.com/p6"]
        },
        {
            "address": "http://foo.bar.com/p5",
            "links": []
        },
        {
            "address": "http://foo.bar.com/p6",
            "links": ["http://foo.bar.com/p7",
                      "http://foo.bar.com/p4", "http://foo.bar.com/p5"]
        }
    ]
}
My items.py file:
import scrapy
from scrapy.item import Item, Field


class FoobarItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
My spider file:
from scrapy.spider import Spider
from scrapy.selector import Selector
from foobar.items import FoobarItem


class MySpider(Spider):
    name = "foo"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost/testdata.json"]

    def parse(self, response):
        yield response.url
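Ultimately, I want to crawl the file and return the links from the objects without duplicates, but right now I'm struggling even to crawl the JSON. I thought the code above would fetch the JSON object and return the links, but my output file is empty. I'm not sure what I'm doing wrong; any help would be greatly appreciated.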
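For reference, the goal described above (collecting every link from the JSON without duplicates) can be sketched in plain Python, independent of Scrapy. This is only a minimal sketch: the helper name `unique_links` and the shortened `sample` string are hypothetical, and it assumes the file always has the "pages" / "links" layout shown in the sample.

```python
import json


def unique_links(json_text):
    """Return every URL found in the "links" lists, in first-seen order, without duplicates."""
    data = json.loads(json_text)
    seen = set()
    result = []
    for page in data.get("pages", []):
        for link in page.get("links", []):
            if link not in seen:  # skip URLs that were already collected
                seen.add(link)
                result.append(link)
    return result


# Hypothetical, shortened version of the sample file from the question.
sample = """
{"pages": [
  {"address": "http://foo.bar.com/p1",
   "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p3"]},
  {"address": "http://foo.bar.com/p2",
   "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p4"]}
]}
"""

for url in unique_links(sample):
    print(url)
```

Inside a spider, a helper like this could presumably be called from `parse` on the response body (for example `response.text` in recent Scrapy versions), yielding one item or dict per link. As far as I understand, Scrapy expects callbacks to yield requests or items rather than a bare string such as `response.url`, which may be why the output file stays empty.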