2014-10-09

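Extracting content from an RSS feed using Python regular expressions

I am trying to use regular expressions, specifically the re module, to extract the title, date, and content from an RSS feed. So far I am using the following code: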

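    # Written for Python 3, where print is a function.
    import re

    # html_code is assumed to already hold the raw text of the RSS feed.
    titles = re.findall(r'<title>(.*?)</title>', html_code)
    descriptions = re.findall(r'<description>(.*?)</description>', html_code)
    dates = re.findall(r'<pubDate>(.*?)</pubDate>', html_code)

    # Skip the channel-level title, which names the feed rather than an article.
    for title in titles:
        if 'The Guardian' not in title:
            print("Headline:", title)
            print()

    # Skip the channel-level description for the same reason.
    for description in descriptions:
        if 'Latest news and features from theguardian.com' not in description:
            print("Description:", description)
            print()

    for date in dates:
        print("Date:", date)
        print()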

This code gives the following output:

Headline: Tim Bresnan denies involvement in Kevin Pietersen parody Twitter account 

Description: I 100% did NOT have any password, and wasnt involved&lt;br /&gt; ECB confirms Alec Stewart reported incident in 2012 &lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/2014/oct/08/kevin-pietersen-parody-twitter-account-author-denies-england-players-involved" title=""&gt; Twitter account author denies players were involved&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/blog/2014/oct/08/ecb-england-cricket-kevin-pietersen-tom-harrison" title=""&gt; Owen Gibson: ECB at crossroads amid fallout&lt;/a&gt;&lt;p&gt;Tim Bresnan has denied having any involvement in the controversial @KPgenius Twitter account after Kevin Pietersens autobiography claimed his former England team-mates were behind it.&lt;/p&gt;&lt;p&gt;In his book, Pietersen revealed the extent to which the account had angered and upset him, and claimed that the accounts author had told the former England wicketkeeper Alec Stewart that some of the guys in the dressing room are tweeting from it.&lt;/p&gt;&lt;p&gt;Disappointed to be implicated in the &lt;a href="https://twitter.com/hashtag/kpgenius?src=hash"&gt;#kpgenius&lt;/a&gt; account. I 100% did NOT have any password. And wasn't involved In any posting.&lt;/p&gt; &lt;a href="http://www.theguardian.com/sport/2014/oct/09/tim-bresnan-kevin-pietersen-parody-twitter"&gt;Continue reading...&lt;/a&gt;   

Date: Thu, 09 Oct 2014 11:56:43 GMT 

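It prints these results for every news article. My question is: how do I clean up the content section and remove all the HTML junk? I only need some basic information about each article, without all the tags. How can I use regular expressions to remove these (for example, the links and "&lt;/p&gt;")? Thank you.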

Maybe you would be better off using an XML parser: "xml.etree.ElementTree" or "lxml" or similar. – mhawke 2014-10-09 12:18:31
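
For reference, the parser route suggested in this comment might look roughly like the sketch below. It uses the standard-library `xml.etree.ElementTree`. The tiny `feed` string is a made-up stand-in for the real Guardian feed, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Made-up miniature feed standing in for the real one (an assumption).
feed = """<rss version="2.0"><channel>
<title>Example feed title</title>
<description>Example feed description</description>
<item>
<title>Tim Bresnan denies involvement</title>
<description>Some &lt;p&gt;text&lt;/p&gt;</description>
<pubDate>Thu, 09 Oct 2014 11:56:43 GMT</pubDate>
</item>
</channel></rss>"""

root = ET.fromstring(feed)

# Iterating over <item> elements skips the channel-level title and
# description, so no string check on the feed name is needed.
items = [
    (item.findtext("title"), item.findtext("description"), item.findtext("pubDate"))
    for item in root.iter("item")
]

for title, description, date in items:
    print("Headline:", title)
    print("Description:", description)
    print("Date:", date)
```

The parser also decodes entities such as `&lt;` while reading the feed, so the description comes back as ordinary HTML. The tags themselves would still have to be stripped in a separate step.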

I just want to use regular expressions, without any other modules. – user2747367 2014-10-09 16:18:22

Answers
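
A minimal sketch of one way to do this with only the `re` module, as the question asks. It assumes the descriptions contain just the few entities visible in the sample output (`&lt;`, `&gt;`, `&quot;`, `&#39;`, `&amp;`) and that no tag has a literal `>` inside an attribute value. The helper names `unescape_basic`, `strip_tags`, and `clean_description` are made up for illustration.

```python
import re

# Only the entities seen in the sample output are handled (an assumption);
# a real feed may contain others.
ENTITIES = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&amp;": "&"}


def unescape_basic(text):
    """Decode the entities listed in ENTITIES in a single pass."""
    return re.sub(r"&(?:lt|gt|quot|#39|amp);", lambda m: ENTITIES[m.group(0)], text)


def strip_tags(text):
    """Turn <br> and <p> boundaries into spaces, then drop every other tag."""
    text = re.sub(r"<\s*(?:br\s*/?|/?p)\s*>", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse the runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


def clean_description(raw):
    """Decode entities first so the escaped tags become real tags, then strip them."""
    return strip_tags(unescape_basic(raw))


sample = ('I 100% did NOT have any password&lt;br /&gt;'
          '&lt;a href="http://www.theguardian.com/sport" title=""&gt; Twitter account'
          ' author denies players were involved&lt;/a&gt;'
          '&lt;p&gt;Tim Bresnan has denied having any involvement.&lt;/p&gt;')
cleaned = clean_description(sample)
print(cleaned)
```

The order matters: the tags in the feed are escaped, so a tag-stripping pattern finds nothing until the entities have been decoded. Regular expressions remain a rough tool for HTML, so unusual markup may slip through.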