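Extracting content from an RSS feed using Python regular expressions
I am trying to use regular expressions, specifically the re module, to extract the titles, dates, and content from an RSS feed. So far I am using the following code: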
import re

# html_code is assumed to already hold the raw text of the RSS feed.
titles = re.findall(r'<title>(.*?)</title>', html_code)
descriptions = re.findall(r'<description>(.*?)</description>', html_code)
dates = re.findall(r'<pubDate>(.*?)</pubDate>', html_code)

for title in titles:
    # Skip the feed-level title and keep only the article headlines.
    if 'The Guardian' not in title:
        print("Headline:", title)
        print()

for description in descriptions:
    # Skip the feed-level description.
    if 'Latest news and features from theguardian.com' not in description:
        print("Description:", description)
        print()

for date in dates:
    print("Date:", date)
    print()
This code gives the following output:
Headline: Tim Bresnan denies involvement in Kevin Pietersen parody Twitter account
Description: I 100% did NOT have any password, and wasnt involved<br /> ECB confirms Alec Stewart reported incident in 2012 <br /><a href="http://www.theguardian.com/sport/2014/oct/08/kevin-pietersen-parody-twitter-account-author-denies-england-players-involved" title=""> Twitter account author denies players were involved</a><br /><a href="http://www.theguardian.com/sport/blog/2014/oct/08/ecb-england-cricket-kevin-pietersen-tom-harrison" title=""> Owen Gibson: ECB at crossroads amid fallout</a><p>Tim Bresnan has denied having any involvement in the controversial @KPgenius Twitter account after Kevin Pietersens autobiography claimed his former England team-mates were behind it.</p><p>In his book, Pietersen revealed the extent to which the account had angered and upset him, and claimed that the accounts author had told the former England wicketkeeper Alec Stewart that some of the guys in the dressing room are tweeting from it.</p><p>Disappointed to be implicated in the <a href="https://twitter.com/hashtag/kpgenius?src=hash">#kpgenius</a> account. I 100% did NOT have any password. And wasn't involved In any posting.</p> <a href="http://www.theguardian.com/sport/2014/oct/09/tim-bresnan-kevin-pietersen-parody-twitter">Continue reading...</a>
Date: Thu, 09 Oct 2014 11:56:43 GMT
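These results are printed for every news article. My question is: how do I clean up the content part and remove all the HTML junk? I only need the basic text of each article, without all the tags. How can I use regular expressions to remove things such as the links and "</p>"? Thank you.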
You might be better off using an XML parser: "xml.etree.ElementTree", "lxml", or similar. – mhawke 2014-10-09 12:18:31
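As a rough illustration of the parser route suggested in this comment, here is a minimal sketch using the standard library's xml.etree.ElementTree. The SAMPLE_FEED text and the parse_items helper are made up for the example and are not from the question; the real feed may be laid out differently (namespaces, extra elements), so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed standing in for the real downloaded text.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sport | The Guardian</title>
    <description>Latest news and features from theguardian.com</description>
    <item>
      <title>Example headline</title>
      <description>&lt;p&gt;Example body&lt;/p&gt;</description>
      <pubDate>Thu, 09 Oct 2014 11:56:43 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_text):
    """Return one (title, description, date) tuple per <item> element."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("description", default=""),
            item.findtext("pubDate", default=""),
        ))
    return items


for title, description, date in parse_items(SAMPLE_FEED):
    print("Headline:", title)
    print("Description:", description)
    print("Date:", date)
    print()
```

Because only the item elements are visited, the feed-level title and description are skipped without any string checks, and each headline stays paired with its own description and date. The parser also decodes entities, so the description comes back as markup that still needs its tags stripped.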
I just want to use regular expressions, without any other modules. – user2747367 2014-10-09 16:18:22
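For the regex-only route, one possible sketch is below. It assumes the description already contains raw tags, as in the output shown in the question. The strip_html function, its drop_links flag, and the small ENTITIES table are hypothetical names introduced for this example, and the sample string is a made-up, shortened stand-in for a real description. Regular expressions cannot handle every HTML edge case (comments, a ">" inside an attribute value), so this is only a best-effort cleanup.

```python
import re

# Any tag at all, e.g. <p>, </p>, <br />, <a href="...">.
TAG_RE = re.compile(r"<[^>]+>")
# A whole link element, including the text between <a ...> and </a>.
LINK_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Assumed, deliberately small entity table; extend it as needed.
ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">",
            "&quot;": '"', "&#39;": "'", "&nbsp;": " "}
ENTITY_RE = re.compile(r"&(?:amp|lt|gt|quot|nbsp|#39);")


def strip_html(fragment, drop_links=False):
    """Return fragment with tags removed and whitespace collapsed."""
    if drop_links:
        # Remove links together with their visible text.
        fragment = LINK_RE.sub(" ", fragment)
    # Replace each tag with a space so adjacent words do not run together.
    text = TAG_RE.sub(" ", fragment)
    # Decode the listed entities in a single pass to avoid double decoding.
    text = ENTITY_RE.sub(lambda m: ENTITIES[m.group(0)], text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample in the style of the description above.
sample = ('I 100% did NOT have any password<br /> '
          '<a href="http://example.com/story" title="">Read more</a>'
          '<p>Tim Bresnan has denied having any involvement.</p>')

print(strip_html(sample))
print(strip_html(sample, drop_links=True))
```

With drop_links=True the link text is removed along with the tag, which also drops inline links in the middle of a sentence, so the default keeps the link text and removes only the markup.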