Python Scrapy: how to remove extra parsed characters

While parsing with Scrapy, I got this output:

[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE Passante)'],"[u'VCBONAZZI\xa043 ', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'[email protected]'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?ID=0&URL=http://www.tartariniauto.it']

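As you can see, there are some extra characters like

u" \xa043 " ' [ ]

that I don't want. How can I remove these? Also, there are 5 items in this string. I want the string to look like this: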

item1, item2, item3, item4, item5

Here is my pipelines.py code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
        self.file = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, spider):
        name = item['name']
        address = item['address']
        phone = item['phone']
        email = item['email']
        category = item['category']
        website = item['website']
        self.file.writerow((name, address, phone, email, category, website))
        return item

Thanks

Asked 2012-05-01 by qmaruf

Just iterate over your string and remove every character that either A) throws an error when you `str()` it or B) is above a certain ordinal. –

@JoelCornett That is very pythonic – Edwardr

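I'm concerned that you are asking how to remove things like the square brackets and quotes. That is, are you asking how to get the strings out of the list that wraps them, or have you written them to an external file and read them back in? Either way, all of this work should be done in your item loader rather than in the pipeline, in my opinion. – Edwardr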
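To illustrate the point in the comment above: the brackets, quotes and `u` prefixes show up because each field is a list, and writing a list to CSV writes its printed representation. The following is a minimal sketch for Python 3, assuming each field is a list of strings as in the question's output. `flatten_field` is a hypothetical helper, not part of Scrapy, and the sample values are based on the question's output.

```python
import csv
import io


def flatten_field(values, sep=" "):
    """Join a list of extracted strings into one plain string.

    Hypothetical helper: assumes `values` is a list of strings.
    """
    # Turn non-breaking spaces into ordinary spaces, then trim each part.
    parts = [v.replace("\xa0", " ").strip() for v in values]
    # Drop parts that ended up empty.
    return sep.join(p for p in parts if p)


# Sample values based on the question's output.
item = {
    "address": ["VCBONAZZI\xa043 ", "40013", "CASTEL MAGGIORE"],
    "phone": ["0516322411"],
}

# Write one row of plain strings instead of lists.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([flatten_field(item[key]) for key in ("address", "phone")])
row = buffer.getvalue()
```

The same joining and stripping could instead live in the item loader's processors, which is where the comment suggests doing it.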

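The extra characters you are seeing are non-ASCII Unicode characters: the `u` prefix marks a unicode string, and `\xa0` is the non-breaking space (U+00A0). If you search online, you will find plenty written about them. Common examples include the copyright symbol © (Unicode code point U+00A9) and the trademark symbol ™ (Unicode code point U+2122).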

The quickest way to remove them is to try encoding the text as ASCII and throw away any characters that are not ASCII (none of these are):

>>> example = u"Xerox ™ printer"
>>> example
u'Xerox \u2122 printer'
>>> example.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 6: ordinal not in range(128)
>>> example.encode('ascii', errors='ignore')
'Xerox  printer'
>>>

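As you can see, when you try to encode the symbol as ASCII, a UnicodeEncodeError is raised because the character cannot be represented in ASCII. However, if you add the errors='ignore' keyword argument, it simply skips the symbols it cannot encode.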
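The session above is Python 2. As a sketch of the same idea in Python 3, where `encode` returns `bytes` and a `decode` is needed to get text back, something like the following could work. `strip_non_ascii` is a hypothetical helper name. Replacing the non-breaking space with a normal space first is an added assumption, made so that the words on either side of it are not glued together.

```python
def strip_non_ascii(text, nbsp_to_space=True):
    """Drop every character that cannot be encoded as ASCII.

    Hypothetical helper. When `nbsp_to_space` is true, U+00A0 is
    turned into an ordinary space before the rest is filtered.
    """
    if nbsp_to_space:
        text = text.replace("\xa0", " ")
    # encode() skips anything outside ASCII; decode() turns the
    # resulting bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned_name = strip_non_ascii("Xerox \u2122 printer")
cleaned_address = strip_non_ascii("VCBONAZZI\xa043 ")
```

Note that the characters are removed rather than replaced, so any spaces that surrounded a dropped symbol both remain.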

Answered 2012-05-01 17:22:41 by Edwardr

Here is the edited [code](http://paste.ubuntu.com/960572/), which works. But at one point it shows this [error](http://paste.ubuntu.com/960570/). – qmaruf

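@MarufRahman An `IndexError` at position `0` means the array is empty. If the item behaves like the built-in `dict` (I can't remember whether Scrapy `Item`s do), then you could swap in `item.get('xx', [''])` for each line. – Edwardr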

Could you please edit one line as an example? – qmaruf
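One way the suggestion above might look for a single field, as a sketch only. A plain `dict` stands in for the Scrapy item here, so whether `get` works on your item class is an assumption to check, and `first_or_default` is a hypothetical helper. A `get` default only applies when the key is missing, so the helper also covers a field that is present but empty.

```python
def first_or_default(item, key, default=""):
    """Return the first value of a list field, or `default`.

    Hypothetical helper: handles both a missing key and an empty list.
    """
    values = item.get(key) or [default]
    return values[0]


# A plain dict standing in for the scraped item (an assumption).
item = {"name": ["TARTARINI AUTO SPA"], "email": []}

# The one-line form from the comment: safe when the key is missing.
phone = item.get("phone", [""])[0]

# The helper form: also safe when the list exists but is empty.
name = first_or_default(item, "name")
email = first_or_default(item, "email")
website = first_or_default(item, "website")
```

With the one-line form, `item.get("email", [""])[0]` would still raise `IndexError` for the sample above, because the key exists and its list is empty.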
