Replacing all 'a' tags with their contents using BeautifulSoup's replaceWith

Edit: Basically, I'm trying to do a decompose, but instead of removing the tag and completely destroying its contents, I want to replace the tag with its contents.

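I want to replace every 'a' tag in an HTML document with the tag's contents as a string. That would make it easier for me to write the HTML to a CSV. But I can't get past the replacement step. I've been trying to do it with BeautifulSoup's replace_with(), but the results aren't coming back as expected.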

# Import modules 
from bs4 import BeautifulSoup 
from urllib2 import urlopen 

# URL to soup 
URL = 'http://www.barringtonhills-il.gov/foia/ordinances_12.htm' 
html_content = urlopen(URL).read() 
soup = BeautifulSoup(html_content) 

# Replaces links with link text 
links = soup.find_all('a') 
for link in links: 
    linkText = link.contents[0] 
    linkTextCln = '%s' % (linkText.string) 
    if linkTextCln != 'None': 
        link.replaceWith(linkTextCln) 
        print link

This returns:

<a href="index.htm">Home</a> 
<a href="instruct.htm">Instructions</a> 
<a href="requests.htm">FOIA Requests</a> 
<a href="kiosk.htm">FOIA Kiosk</a> 
<a href="geninfo.htm">Government Profile</a> 
etc etc etc

But the expected output is:

Home 
Instructions 
FOIA Requests 
FOIA Kiosk 
Government Profile 
etc etc etc

Any ideas why replaceWith isn't working as expected? Is there a better way to approach this?

Source

2013-03-11 guyute

Your result still contains non-string HTML content. It returns: [] [u'Home'] [u'Instructions'] [u'FOIA Requests'] etc... etc... – guyute 2013-03-19 20:04:55

The link.contents vs. linkTextCln difference isn't my problem, though. Trying to replace the link tag with link.contents doesn't work either. – guyute 2013-03-19 20:06:49

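Basically, I'm trying to do a decompose, but instead of removing the tag and completely destroying its contents, I want to replace the tag with its contents. – guyute 2013-03-19 20:07:37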
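For what it's worth, "replace the tag with its contents" sounds like what bs4 exposes as unwrap(): it removes the tag itself and leaves its children (text or nested tags) in place. Below is a minimal sketch, assuming bs4 with the built-in html.parser; the sample HTML is made up for illustration and is not the page from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, one plain link and one with a nested tag
html = '<p>Go <a href="index.htm">Home</a> or read the <a href="instruct.htm"><b>Instructions</b></a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like snapshot, so the tree can be edited while looping
for link in soup.find_all('a'):
    link.unwrap()  # drop the <a> wrapper, keep whatever was inside it

result = str(soup)
print(result)
```

Unlike replacing the tag with a plain string, this keeps nested markup such as the <b> above, which may or may not be what you want before writing to CSV.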

I believe that in BS4 the method is now called replace_with, but if you just want to output the contents of the tags, the following works:

from bs4 import BeautifulSoup 

s = ''' 
<a href="index.htm">Home</a> 
<a href="instruct.htm">Instructions</a> 
<a href="requests.htm">FOIA Requests</a> 
<a href="kiosk.htm">FOIA Kiosk</a> 
<a href="geninfo.htm">Government Profile</a> 
''' 
soup = BeautifulSoup(s, 'html.parser') 

for tag in soup.findAll('a'): 
    print(tag.string)

Output:

Home 
Instructions 
FOIA Requests 
FOIA Kiosk 
Government Profile 
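If the goal is to change the document rather than just print the text, something along these lines may help. It is only a sketch, assuming bs4 with html.parser; the sample HTML and the `detached` list are made up for illustration. It also shows one likely reason the loop in the question looked like it failed: after replace_with(), the `link` variable still refers to the old Tag object, which is now detached from the tree, so printing `link` shows the old markup even though the soup itself has changed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, including an empty anchor
html = '''
<a href="index.htm">Home</a>
<a href="instruct.htm">Instructions</a>
<a name="top"></a>
'''
soup = BeautifulSoup(html, 'html.parser')

detached = []  # illustrative only: keeps the old Tag objects for inspection
for link in soup.find_all('a'):
    # get_text() returns '' for an empty tag, so no contents[0] indexing is needed
    link.replace_with(link.get_text())
    detached.append(link)

# The old Tag object still renders as an <a> tag, but it has no parent any more
print(detached[0])
print(detached[0].parent)

# The tree itself is what changed, so inspect the soup instead
result = str(soup)
print(result)
```

In other words, print the soup (or write str(soup) to the CSV) after the loop instead of printing each `link` inside it.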

Source

2016-03-02 21:28:57
