2014-06-08

How to parse HTML markup in an RSS feed in Python

I have a small utility for reading RSS feeds as plain text. Here is the typical code:

#!/usr/bin/python 

# /usr/lib/xscreensaver/phosphor -scale 3 -program 'python newsfeed.py | tee /dev/stderr | festival --tts' 

import sys
import os
import feedparser

def printLine():
    # `stty size` prints "rows columns" for the controlling terminal.
    terminalRows, terminalColumns = os.popen('stty size', 'r').read().split()
    # Draw a horizontal rule across the full terminal width.
    sys.stdout.write("-" * int(terminalColumns))
    print("\n")

feed = feedparser.parse('http://home.web.cern.ch/scientists/updates/feed') 

for post in feed.entries:
    printLine()
    print(post.title + "\n")
    print(post.description + "\n")
printLine()

When this runs, the output looks like this:

----------------------------------------------------------------------------------------------------- 

LHC seminar: Higgs boson width 

<div class="field-body"> 
    <p>Constraints on the total Higgs boson width, Gamma_H, are presented using off-shell production and decay to ZZ in the 4l and 2l2nu final states. The analysis is based on data collected in 2012 by the CMS experiment at the LHC, corresponding to an integrated luminosity of L = 19.7/fb at a centre-of-mass energy of 8 TeV. The combined analysis of the 4l and 2l2nu events at high mass with the 4l measurement of the Higgs boson peak at 125.6 GeV leads to an upper limit on the Higgs boson width of Gamma_H &lt; 4.2 x Gamma_H(SM) at the 95% confidence level, assuming Gamma_H(SM) = 4.15 MeV. This result considerably improves over previous experimental constraints from direct measurements at the Higgs resonance peak.</p> 
<h2><a href="https://indico.cern.ch/event/313506/">Watch the webcast at 11am CET</a></h2> 
    </div> 

----------------------------------------------------------------------------------------------------- 

Neutrinos and nucleons 

<p class="field-byline-taxonomy"> 
<a href="http://home.web.cern.ch/authors/christine-sutton">Christine Sutton</a></p> 
    <div class="field-body"> 
    <p>On 7 April 1934 the journal <em>Nature</em> published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the <a href="http://home.web.cern.ch/about/experiments/gargamelle">Gargamelle</a> bubble chamber at CERN's <a href="http://home.web.cern.ch/about/accelerators/proton-synchrotron">Proton Synchrotron</a>, they were helping to reveal a more fundamental layer to nature - the quarks.</p> 
<p><strong>Read more:</strong> "<a href="http://cerncourier.com/cws/article/cern/56605">Neutrinos and nucleons</a>"- <em>CERN Courier</em></p> 
    </div> 

----------------------------------------------------------------------------------------------------- 

What would be a sensible, generally applicable way to turn this HTML, as found in most RSS feeds, into plain text?

Answers

You could try the Python module beautifulsoup4 (available via pip). This question may guide you in using it for your purpose.

As a start:

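from bs4 import BeautifulSoup
# Parse the entry's HTML; naming the parser explicitly avoids a bs4 warning.
soup = BeautifulSoup(post.description, 'html.parser')
# Collect every text node, dropping the tags around them.
texts = soup.find_all(string=True)
print(''.join(texts))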

This prints:

Christine Sutton 

On 7 April 1934 the journal Nature published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the Gargamelle bubble chamber at CERN's Proton Synchrotron, they were helping to reveal a more fundamental layer to nature - the quarks. 
Read more: "Neutrinos and nucleons"- CERN Courier 
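
To tidy the leftover indentation and blank lines, the same idea can be wrapped in a small helper. This is only a sketch: html_to_text is a made-up name (it is not part of feedparser or bs4), and it assumes block elements are already separated by newlines in the feed's HTML, as they are in the CERN feed above.

```python
import re

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the text of an HTML fragment, one non-empty line per source line.

    Hypothetical helper for illustration; not part of feedparser or bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes and decodes entities such as &lt;
    text = soup.get_text()
    # Collapse runs of spaces and tabs, then trim each line.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop the empty lines left behind where only markup used to be.
    return "\n".join(line for line in lines if line)


sample = (
    '<div class="field-body">\n'
    '    <p>On 7 April 1934 the journal <em>Nature</em> published a paper.</p>\n'
    '<p>Gamma_H &lt; 4.2 x Gamma_H(SM)</p>\n'
    '    </div>'
)
print(html_to_text(sample))
```

In the feed loop this would probably be used as print(html_to_text(post.description) + "\n"). Inline tags such as <em> and <a> stay on the same line as the surrounding sentence, because no separator is inserted between text nodes.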

Hmm, this works very well. Thank you very much for the guidance! – d3pd