Python3 BeautifulSoup returns a concatenated string

I want to pull the cast list from this website. Once I have located it:

actors_anchor = soup.find('a', href = re.compile('Actor&p')) 
parent_tag = actors_anchor.parent 
next_td_tag = parent_tag.findNext('td')

next_td_tag 

<font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert   
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>

The problem is that when I pull the text, it returns a single string with no spaces between the names:

print(next_td_tag.get_text()) 
'''this returns''' 
'Wes BentleyBryce Dallas HowardRobert RedfordKarl Urban'

I need these names as a list with one entry per name, like ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban'].

Any suggestions would be very helpful.

Source

2016-12-31 Chace Mcguyer

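Can't you use `find_all('a', ...)` and a `for` loop, without `parent` and `findNext`? – furas

Please elaborate. Thanks for the formatting edit; this is my first post. –

So the problem is that not all of the actors' names are enclosed in an `a` tag. In the HTML, many of the names appear between `br` tags, and when I use that method it doesn't let me get 'Wes Bentley'. –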

Find all the `a` elements inside the `td`:

[a.get_text() for a in next_td_tag.find_all('a')]

This won't cover the 'Wes Bentley' text, though, because it is not wrapped in an `a` element.
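To make that limitation concrete, here is a minimal sketch. The markup is a simplified stand-in for the cell in the question (the shortened `href` values and the `td` variable name are assumptions for illustration, not the site's real markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the markup in the question (hrefs shortened).
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br>'
    '<a href="?view=Actor&amp;id=robertredford.htm">Robert Redford</a><br>'
    '<a href="?view=Actor&amp;id=karlurban.htm">Karl Urban</a></font></td>'
)

td = BeautifulSoup(html, 'html.parser').find('td')

# Only text wrapped in <a> is collected; the bare "Wes Bentley" node is skipped.
linked_names = [a.get_text() for a in td.find_all('a')]
print(linked_names)
```

The first name is a plain text node directly under `font`, so it never shows up in a search for `a` elements.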

Let's go another way and find all the text nodes instead:

next_td_tag.find_all(text=True)

You may need to clean up and remove the "empty" items:

texts = [text.strip().replace("\n", " ") for text in next_td_tag.find_all(text=True)] 
texts = [text for text in texts if text] 
print(texts)

This prints:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']
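If a name is broken across lines in the source HTML (as 'Robert Redford' is in the question's markup), stripping the ends alone can leave a run of spaces inside the name. A possible variation, sketched below, collapses all whitespace with `" ".join(s.split())`. The markup and the `#` hrefs are placeholders chosen for illustration, and `string=True` is the newer spelling of the `text=True` argument in Beautiful Soup 4.4 and later:

```python
from bs4 import BeautifulSoup

# Stand-in markup: one name is broken across lines, as in the question.
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="#">Bryce Dallas Howard</a><br>'
    '<a href="#">Robert   \nRedford</a><br>\n'
    '<a href="#">Karl Urban</a></font></td>'
)

td = BeautifulSoup(html, 'html.parser').find('td')

# " ".join(s.split()) trims the ends and collapses inner runs of whitespace.
names = [" ".join(s.split()) for s in td.find_all(string=True)]
names = [n for n in names if n]  # drop whitespace-only text nodes
print(names)
```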

Source

2016-12-31 03:25:07 alecxe

This solved my problem. It's simple now that I understand it; your help is appreciated. –

You can use stripped_strings to get all the strings as a list:

html = '''<td><font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font></td>''' 

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 

next_td_tag = soup.find('td') 

print(list(next_td_tag.stripped_strings))

Result:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

stripped_strings is a generator, so you can use it in a for loop, or get all of its elements with list().
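As a rough sketch of the for-loop form, the snippet below consumes the generator one string at a time. The shortened markup, the `#` hrefs, and the `names` list are assumptions made for the example:

```python
from bs4 import BeautifulSoup

# Shortened stand-in markup; note the whitespace-only node before the last link.
html = (
    '<td>Wes Bentley<br>'
    '<a href="#">Bryce Dallas Howard</a><br> '
    '<a href="#">Karl Urban</a></td>'
)

td = BeautifulSoup(html, 'html.parser').find('td')

names = []
# Each string arrives already stripped; whitespace-only strings are skipped.
for position, name in enumerate(td.stripped_strings, start=1):
    print(position, name)
    names.append(name)
```

Looping is handy when each name needs extra processing as it arrives; `list()` is shorter when only the finished list is wanted.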

Source

2016-12-31 03:37:31 furas

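Ah, a perfect fit for this problem! – alecxe

@alecxe There is no whitespace at the head or tail, so stripped_strings does not help here. Also, the HTML code in the answer was modified: the '\n' was removed. –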

import bs4 

html = '''<font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert   
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>''' 

soup = bs4.BeautifulSoup(html, 'lxml') 

text = soup.get_text(separator='|')  # join the strings with a separator
# 'Wes Bentley|Bryce Dallas Howard|Robert   \nRedford|Karl Urban'
# split on the separator, then collapse the whitespace inside each name
split_text = [' '.join(name.split()) for name in text.split('|')]
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

# do it in one line
list_text = [' '.join(name.split()) for name in soup.get_text(separator='|').split('|')]

Or use the strings generator to avoid manually splitting the string into a list:

[' '.join(i.split()) for i in soup.strings]

Source

2016-12-31 04:00:58
