BeautifulSoup turns S&P into S&P; and AT&T into AT&T;?

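I'm using BeautifulSoup 4 (4.3.2) to parse some rather messy HTML documents and have run into a problem: it converts company names such as S&P (Standard and Poors), M&S (Marks and Spencer) or AT&T into S&P;, M&S; and AT&T;. So it wants to complete the &[A-Z]+ pattern into an HTML entity, but it isn't actually consulting the HTML entity lookup table, since &P; is not an HTML entity.

How do I make it stop doing this, or do I just need a regex that matches the invalid entities and changes them back?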

>>> import bs4 
>>> soup = bs4.BeautifulSoup('AT&T announces new plans') 
>>> soup.text 
u'AT&T; announces new plans' 

>>> import bs4 
>>> soup = bs4.BeautifulSoup('AT&TOP announces new plans') 
>>> soup.text 
u'AT&TOP; announces new plans'

I have tried the above with Python 2.7.5 on OS X 10.8.5 and with Python 2.7.5 on Scientific Linux 6.

Source

2013-12-16 Matti Lyra

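What version are you running? This looks like a known bug in 4.2.0 that was fixed in 4.2.1: http://stackoverflow.com/a/17168523/231316

@ChrisHaas I'm running version 4.3.2

Using your minimal example on Ubuntu 13.10 with bs4 '4.3.2', I cannot reproduce the problem. – Hooked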

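This appears to be a bug, or at least a quirk, in the way BeautifulSoup4 handles unknown HTML entity references. As Ignacio said in the comments above, it is probably better to preprocess the input and replace the bare '&' characters with the HTML entity ('&amp;').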
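For what it's worth, that preprocessing step could look roughly like the sketch below. The helper name escape_stray_ampersands is made up for illustration and is not part of bs4. It assumes Python 3 (on Python 2 the entity table lives in htmlentitydefs rather than html.entities), and it treats a reference as valid only if it ends in a semicolon and is either numeric or a name found in name2codepoint, so legacy semicolon-less references such as &copy would be escaped as well.

```python
import re
from html.entities import name2codepoint

# An ampersand, optionally followed by something shaped like a character
# reference: decimal (#38;), hexadecimal (#x26;) or named (amp;).
_REF = re.compile(r'&(#[0-9]+;|#[xX][0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)?')


def escape_stray_ampersands(markup):
    """Replace every '&' that does not start a known reference with '&amp;'.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    def fix(match):
        ref = match.group(1)
        # Numeric references and known named entities are left untouched.
        if ref is not None and (ref.startswith('#') or ref[:-1] in name2codepoint):
            return match.group(0)
        # Bare '&', or an unknown name such as 'P;': escape the ampersand only.
        return '&amp;' + (ref or '')

    return _REF.sub(fix, markup)


print(escape_stray_ampersands('AT&T announces new plans'))
print(escape_stray_ampersands('Fish &amp; chips &copy; 2013'))
```

The cleaned string can then be handed to bs4.BeautifulSoup as usual; since every remaining '&' is either escaped or part of a recognised reference, the parser should have no unknown entity to complete.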

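However, if for some reason you don't want to do that, the only way I found to fix the problem was to "monkey-patch" the code. This script worked for me (Python 2.7.3 on Mac OS X):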

import bs4

def my_handle_entityref(self, name):
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # the original code mishandles unknown entities (the following commented-out line)
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref
soup = bs4.BeautifulSoup('AT&T announces new plans')
print soup.text
soup = bs4.BeautifulSoup('AT&TOP announces new plans')
print soup.text

It produces the output:

AT&T announces new plans 
AT&TOP announces new plans

You can see the method with the problem here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81

and the offending line here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86

Source

2013-12-19 20:16:19

Excellent, this is much better than the regex hack I wrote
