2013-04-07

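Python Scrapy: customizing the XML format of scraped items

I am using Scrapy to crawl web pages and want to write the results to an XML file in a specific format. My code is below.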

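Item class

from scrapy.item import Item, Field

# Note: this class reuses the name of the imported base class. That works,
# but a distinct name would be clearer.
class Item(Item):
    # Define the fields for your item here
    id = Field()
    name = Field()
    address = Field()
    birthdate = Field()
    review = Field()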

Spider class

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
# Item is the item class defined above; import it from your project's items module.

class FriendSpider(BaseSpider):
    # Identifies the spider
    name = "friend"
    count = 0
    allowed_domains = ["example.com.us"]
    start_urls = [
        "http://example.com.us/biz/friendlist/"
    ]

    def start_requests(self):
        # Step through the friend list in increments of 40
        for i in range(0, 1722, 40):
            yield self.make_requests_from_url(
                "http://example.com.us/biz/friendlist/?start=%d" % i)

    def parse(self, response):
        # Turn <br /> into newlines before parsing
        response = response.replace(body=response.body.replace('<br />', '\n'))
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []

        for site in sites:
            item = Item()
            self.count += 1
            item['id'] = str(self.count)
            # extract() returns a list, so these fields hold lists of strings
            item['name'] = site.select('.//div/div/h4/text()').extract()
            item['address'] = site.select('h4/span/text()').extract()
            item['review'] = ''.join(site.select('.//div[@class="review"]/p/text()').extract())
            item['birthdate'] = site.select('.//div/div/h5/text()').extract()

            items.append(item)
        return items

The output has the following format:

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <id>1</id>
        <name><value>Keith</value></name>
        <review>txt............</review>
        <address><value>United States</value></address>
        <birthdate><value>1988-04-03</birthdate></value>
    </item>
    .....
</items>

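How can I customize the XML so that it looks like the following, with the value tags removed and the id moved onto the item's root element as an attribute?

<?xml version="1.0" encoding="utf-8"?>
<items>
    <friend id="1">
        <name>Keith</name>
        <review>txt............</review>
        <address>United States</address>
        <birthdate>1988-04-03</birthdate>
    </friend>
    .....
</items>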

Answer

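For your problem, you can either pick one of the options listed on this page, or write your own XML serialization, for example based on the OrderedDict type. Once scraping has finished, you can simply call serialize() with the required arguments and get the XML document.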