2014-05-12 80 views
6

How can I remove "&nbsp" from HTML content? I have the following HTML page:

<div class="theater"> 
    <div class="desc" id="theater_16109207495969942346"> 
     <h2 class="name"><a href="/movies?near=pune&amp;tid=df8f66de0a592b4a" id="link_1_theater_16109207495969942346">Esquare Victory Camp</a></h2> 
     <div class="info">site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 
      <a class="fl" href="" target="_top"></a> 
     </div> 
    </div> 
    <div class="showtimes"> 
     <div class="show_left"> 
      <div class="movie"> 
       <div class="name"><a href="/movies?near=pune&amp;mid=1cdcf90092189400">Hawaa Hawaai</a> 
       </div><span class="info">Drama - Hindi</span> 
       <div class="times"><span style="color:#666"><span style="padding:0 "></span> 
        <!-- -->10:30am</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span> 
        <!-- -->3:45</span><span style="color:#666"><span style="padding:0 "> &amp;nbsp</span> 
        <!-- -->6:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span> 
        <!-- -->8:30pm</span> 
       </div> 
      </div> 
     </div> 
     <div class="show_right"> 
      <div class="movie"> 
       <div class="name"><a href="/movies?near=pune&amp;mid=6b59ad39004d895b">The Amazing Spider Man 2</a> 
       </div><span class="info">Action/Adventure/Thriller - English - <a class="fl" href="/url?q=http://www.youtube.com/watch%3Fv%3DSCjCk59PIzw&amp;sa=X&amp;oi=movies&amp;ii=0&amp;usg=AFQjCNGpVM5U04h0acABA7eApb6EIO4Ejw">Trailer</a></span> 
       <div class="times"><span style="color:#666"><span style="padding:0 "></span> 
        <!-- -->1:00</span><span style="color:"><span style="padding:0 "> &amp;nbsp</span> 
        <!-- -->10:45pm</span> 
       </div> 
      </div> 
     </div> 
     <p class="clear"></p> 
    </div> 
</div> 

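As you can see, &amp;nbsp appears in many places, and there are many other Unicode characters as well. I want to extract the content of this page. Here is what I did: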

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128) 

myName = soup.findAll("div", {"class" : "theater"}) 
for x in myName: 
    xt = str(x) 
    print removeNonAscii(xt) 
    print "<br>" 

Result:

Esquare Victory Camp 
site no 2429,general thimayya road, camp contonment,oppositekayani bakery, Pune - 020 2613 2975 
Hawaa Hawaai 
Drama - Hindi 
10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm 
The Amazing Spider Man 2 
Action/Adventure/Thriller - English - Trailer 
1:00 &nbsp10:45pm 

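Everything looks fine except for the &nbsp. I tried replacing &nbsp and looked for other solutions, but nothing has worked so far. I think the problem is that the &nbsp has no trailing ;. How can I remove the &nbsp?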

+0

Are the characters already double-escaped like this when they reach you? If you can, the best option is to start with good data. – Dan

+0

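Yes, the characters arrive like that, and I have no choice but to work with them. Is there any way to get rid of these Unicode characters and the &nbsp? – impossible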

Answers

5

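Depending on the processing stage at which you want to remove the non-breaking space, this can be quite easy. For example, when processing the HTML fragment you provided, you can remove the string "&nbsp" from the text elements: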

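from bs4 import BeautifulSoup

s = """your HTML"""
soup = BeautifulSoup(s, "html.parser")
# Collect every text node, then strip the literal "&nbsp" from each one.
texts = soup.find_all(text=True)
for t in texts:
    newtext = t.replace("&nbsp", "")
    t.replace_with(newtext)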
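
If you would rather clean the text after it has been extracted, a regular expression can do the same job. Note that removeNonAscii cannot help here: the characters &, n, b, s and p are all plain ASCII, so the filter lets them through. The following is only a minimal sketch, assuming Python 3; the helper name strip_nbsp is made up for illustration and is not part of BeautifulSoup.

```python
import re

# Literal "&nbsp" remnants: optionally double-escaped ("&amp;nbsp") and
# with or without the trailing semicolon.
_NBSP_REMNANT = re.compile(r"&(?:amp;)?nbsp;?")


def strip_nbsp(text):
    """Hypothetical helper: drop "&nbsp" remnants from already-extracted text."""
    # Remove the entity text itself.
    cleaned = _NBSP_REMNANT.sub("", text)
    # Turn any real non-breaking space (U+00A0) into an ordinary space.
    return cleaned.replace("\u00a0", " ")


# One of the showtime lines from the question's output.
line = "10:30am &nbsp3:45 &nbsp6:00 &nbsp8:30pm"
print(strip_nbsp(line))  # prints "10:30am 3:45 6:00 8:30pm"
```

The entity is replaced with an empty string rather than a space because, in the sample data, each &nbsp already follows a regular space. If your data has no such space, substituting " " instead may be the better choice.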