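How do I select tags with BeautifulSoup based on their children and siblings?

I'm trying to extract quotes from the 2012 Obama-Romney presidential debates. The problem is that the site is poorly organized. The structure looks like this: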

<span class="displaytext"> 
    <p> 
     <i>OBAMA</i>Obama's first quotes 
    </p> 
    <p>More quotes from Obama</p> 
    <p>Some more Obama quotes</p> 

    <p> 
     <i>Moderator</i>Moderator's quotes 
    </p> 
    <p>Some more quotes</p> 

    <p> 
     <i>ROMNEY</i>Romney's quotes 
    </p> 
    <p>More quotes from Romney</p> 
    <p>Some more Romney quotes</p> 
</span>

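Is there a way to select the <p> whose first child is an <i> with the text OBAMA, together with all of its <p> siblings, up to the next <p> whose first child is an <i> whose text is not OBAMA?

Here is what I have tried so far, but it only grabs the first p and ignores its siblings: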

input = '''<span class="displaytext"> 
     <p> 
      <i>OBAMA</i>Obama's first quotes 
     </p> 
     <p>More quotes from Obama</p> 
     <p>Some more Obama quotes</p> 

     <p> 
      <i>Moderator</i>Moderator's quotes 
     </p> 
     <p>Some more quotes</p> 

     <p> 
      <i>ROMNEY</i>Romney's quotes 
     </p> 
     <p>More quotes from Romney</p> 
     <p>Some more Romney quotes</p> 
     </span>''' 

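from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(input, "html.parser")
debate_text = soup.find("span", {"class": "displaytext"})
president_quotes = debate_text.find_all("i", text="OBAMA")

for i in president_quotes:
    # next_siblings only yields nodes inside the same <p> as the <i>
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)

This prints only Obama's first quotes.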

Source

2016-12-04 hsalama

I think a finite-state-machine style solution would work here. Something like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", {"class": "displaytext"})
obama_is_on = False
obama_tags = []
for p in debate_text("p"):
    # assuming <i> is used only to indicate the speaker
    if p.i:
        # `x in tag` tests membership in tag.contents (the direct children)
        obama_is_on = 'OBAMA' in p.i
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)

[<p> 
<i>OBAMA</i>Obama's first quotes 
     </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
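
The same state-machine idea can be extended to collect every speaker's paragraphs in one pass. The following is only a sketch under assumptions: the function name quotes_by_speaker and the shortened sample markup are made up for illustration, and it uses the built-in html.parser rather than lxml so it runs without extra dependencies.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def quotes_by_speaker(html):
    """Map each speaker label to the text of the paragraphs attributed to them."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("span", class_="displaytext")
    grouped = defaultdict(list)
    current = None  # speaker whose turn is in progress
    for p in container.find_all("p"):
        label = p.find("i")
        if label is not None:
            # An <i> marks the start of a new speaker's turn
            current = label.get_text(strip=True)
            # Detach the label so only the spoken text remains in the <p>
            label.extract()
        if current is not None:
            grouped[current].append(p.get_text(strip=True))
    return dict(grouped)


# Hypothetical, shortened version of the markup from the question
sample = '''<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
</span>'''

result = quotes_by_speaker(sample)
print(result["OBAMA"])
```

Like the answer above, this assumes <i> appears only as a speaker label; an italicized word inside a quote would be mistaken for a new speaker.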

Source

2016-12-04 17:10:54

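The other Obama quotes are siblings of the p, not of the i, so you need to look at the siblings of the i's parent. As you loop over those siblings, you can stop as soon as one of them contains an i. Something like this:

for i in president_quotes:
    # text that follows the <i> inside the same <p>
    print(i.next_sibling)
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        # a <p> containing an <i> starts the next speaker's turn
        if sibling.find("i"):
            break
        print(sibling.string)

It prints: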

Obama's first quotes 

More quotes from Obama 
Some more Obama quotes
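
For a version that runs on its own, the parent-sibling walk can be wrapped in a small function. This is a sketch, not the answerer's code: the name speaker_paragraphs and the sample markup are hypothetical, and it uses the string= argument of find_all, which is the current name for the text= argument used in the question.

```python
from bs4 import BeautifulSoup


def speaker_paragraphs(html, speaker):
    """Return the <p> tags belonging to every turn of the given speaker."""
    soup = BeautifulSoup(html, "html.parser")
    collected = []
    for label in soup.find_all("i", string=speaker):
        first = label.parent  # the <p> that opens this turn
        collected.append(first)
        for sibling in first.find_next_siblings("p"):
            # A <p> containing an <i> starts someone else's turn
            if sibling.find("i"):
                break
            collected.append(sibling)
    return collected


# Hypothetical markup in which the same speaker has two separate turns
sample = '''<span class="displaytext">
<p><i>OBAMA</i>First</p>
<p>Second</p>
<p><i>ROMNEY</i>Reply</p>
<p><i>OBAMA</i>Third</p>
</span>'''

texts = [p.get_text(" ", strip=True) for p in speaker_paragraphs(sample, "OBAMA")]
print(texts)
```

Because the outer loop visits every matching <i>, later turns by the same speaker are picked up as well. The returned tags still contain the <i> label, so the label text shows up in get_text().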

Source

2016-12-04 17:33:44 Joey
