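You can remove the newline characters and all the extra whitespace by running both join() and split() on the extracted text. Make sure you have cleaned the extracted text properly before yielding the items.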
Suppose I want to extract some sports news from a page. It would look like this:
In [1]: text = response.xpath('//div[@id="page-1"]/p//text()').extract()
In [2]: text
Out[2]:
[u'\nThe retirement of Jonathan Trott from international cricket last night cast \nfurther doubt on the position of Peter Moores, who admitted he was uncertain \nabout his own future as head coach. The decision to recall Trott as an \nopening batsman for the series against West Indies, 18 months after his \nbreakdown in Australia, backfired spectacularly as England slid to a defeat \nin Bridgetown on Sunday that enabled the home side to level the Test series \n1-1.\n',
u'\nThe defeat in Bridgetown added to the pressure on Moores after a disastrous \nWorld Cup earlier this year. The head coach conceded yesterday that']
In [3]: cleaned_text = ' '.join(' '.join(text).split())
In [4]: cleaned_text
Out[4]: u'The retirement of Jonathan Trott from international cricket last night cast further doubt on the position of Peter Moores, who admitted he was uncertain about his own future as head coach. The decision to recall Trott as an opening batsman for the series against West Indies, 18 months after his breakdown in Australia, backfired spectacularly as England slid to a defeat in Bridgetown on Sunday that enabled the home side to level the Test series 1-1. The defeat in Bridgetown added to the pressure on Moores after a disastrous World Cup earlier this year. The head coach conceded yesterday that'
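If you need this in more than one place, the same join/split idea can be wrapped in a small helper. This is only a minimal sketch: clean_text and the sample fragments are names and data made up for illustration, not part of Scrapy, and it assumes the input is a list of strings like the one .extract() returns.

```python
def clean_text(fragments):
    """Collapse newlines and repeated whitespace across extracted text nodes."""
    # Join the fragments with a space, split on any run of whitespace
    # (str.split() with no arguments also drops leading/trailing whitespace),
    # then rejoin the words with single spaces.
    return ' '.join(' '.join(fragments).split())


# Hypothetical sample resembling the list returned by .extract()
fragments = [
    '\nThe retirement of Jonathan Trott \nfrom international cricket\n',
    '\nThe defeat in   Bridgetown\t added to the pressure \n',
]
cleaned = clean_text(fragments)
print(cleaned)
```

In a spider you would call the helper on the extracted list before assigning the result to an item field, so the cleaning happens before the item is yielded.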
Hope this helps.
Could you add some details about the extracted text, or provide some sample links and the entities you extracted from that page? – Jithin