2015-05-05 46 views

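Scrapy - messy text in CSV format

I am successfully extracting the text I need from a list of websites. The problem is that when I save it in CSV format, some rows become messy because of long text and line breaks within the text. For example: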

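(I couldn't upload the image :( )

The rows starting with 0s/1s are different websites, but in this image the last website spills over into several new rows in the CSV file. This prevents me from continuing with the text analysis.

Any help would be highly appreciated, as I have not been able to find a solution so far.

Many thanks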

Edit - adding code. This does not work:

data = "".join(sel.select("//body//text()").extract()).strip() 

Nor does this line of code:

data = " ".join(" ".join(sel.select("//body//text()").extract()).strip().split()) 

Neither worked.


Can you add some details about the extracted text, or provide some sample links and the entities you are extracting from the page? – Jithin

Answer


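You can remove newline characters and all extra whitespace by applying both join() and split() to the extracted text. Make sure you have cleaned the extracted text properly before yielding the items.

Suppose I want to extract some sports news from a URL. The session would look like this: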

In [1]: text = response.xpath('//div[@id="page-1"]/p//text()').extract() 

In [2]: text 
Out[2]: 
[u'\nThe retirement of Jonathan Trott from international cricket last night cast \nfurther doubt on the position of Peter Moores, who admitted he was uncertain \nabout his own future as head coach. The decision to recall Trott as an \nopening batsman for the series against West Indies, 18 months after his \nbreakdown in Australia, backfired spectacularly as England slid to a defeat \nin Bridgetown on Sunday that enabled the home side to level the Test series \n1-1.\n', 
u'\nThe defeat in Bridgetown added to the pressure on Moores after a disastrous \nWorld Cup earlier this year. The head coach conceded yesterday that'] 

In [3]: cleaned_text = ' '.join(' '.join(text).split()) 

In [4]: cleaned_text 
Out[4]: u'The retirement of Jonathan Trott from international cricket last night cast further doubt on the position of Peter Moores, who admitted he was uncertain about his own future as head coach. The decision to recall Trott as an opening batsman for the series against West Indies, 18 months after his breakdown in Australia, backfired spectacularly as England slid to a defeat in Bridgetown on Sunday that enabled the home side to level the Test series 1-1. The defeat in Bridgetown added to the pressure on Moores after a disastrous World Cup earlier this year. The head coach conceded yesterday that' 
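The cleanup above can be wrapped in a small helper and checked against a CSV round trip. This is only a minimal sketch in Python 3 (the session above is Python 2): the name `clean_text`, the shortened sample fragments, and the `http://example.com` placeholder are assumptions made for illustration and are not part of the original spider.

```python
import csv
import io


def clean_text(fragments):
    """Join extracted text fragments and collapse each whitespace run
    (spaces, tabs, newlines) into a single space."""
    return " ".join(" ".join(fragments).split())


# Hypothetical fragments standing in for the list returned by .extract()
fragments = [
    "\nThe retirement of Jonathan Trott \nfrom international cricket\n",
    "\nThe defeat in Bridgetown\n",
]

cleaned = clean_text(fragments)

# Write a single row to an in-memory CSV and read it back, to confirm
# that the cleaned field stays on one row.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["http://example.com", cleaned])  # placeholder URL
buffer.seek(0)
rows = list(csv.reader(buffer))

print(cleaned)
print(len(rows))
```

If the row count printed at the end is 1, the cleaned field no longer contains line breaks; any remaining breakage would then have to come from somewhere else, such as how the file is written or opened.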

Hope this helps.


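Thank you very much for your reply. I am sorry for not copying my code first, but here it is: data = "".join(sel.select("//body//text()").extract()).strip() and after your suggestion I changed it to: data = " ".join(" ".join(sel.select("//body//text()").extract()).strip().split()) Unfortunately it still causes the same problem (although in a somewhat friendlier way) – Angie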


Can I get the URL? – Jithin


It happens with websites that have long text, for example: http://www.astleyclarke.com – Angie