2014-03-29 35 views
1

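I parsed the following string with BeautifulSoup to extract data, but I can't get at some of it. I have tried several approaches. So far I have managed to get the text between the 'a' tags, the links, and the text that follows each link. How can I extract the text, the link, the text after the link, and any further text after that, with Python?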

<html> 
<body> 
    <p align="left"> 
    <font face="Arial, Helvetica, sans-serif" size="2"> 
    <b> 
    <font size="4"> 
     GOVERNOR: 
    </font> 
    </b> 
    <br/> 
    </font> 
    <font face="Arial, Helvetica, sans-serif" size="2"> 
    <a href="http://governor.alabama.gov/"> 
    <strong> 
     Robert 
       Bentley (R)* 
    </strong> 
    </a> 
    - Ex-Morgan County Commissioner &amp; State Correctional Officer 
    <strong> 
    <br/> 
    <a href="http://www.facebook.com/stacy.george.3139"> 
     Stacy George 
       (R) 
    </a> 
    - Ex-Morgan County Commissioner &amp; State Correctional Officer 
    <br/> 
    Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate 
    <br/> 
    <a href="http://www.bassforbama.com/"> 
     Kevin Bass (D) 
    </a> 
    - Businessman &amp; Ex-Pro Baseball Player 
    <br/> 
    <a href="http://www.parkergriffithforcongress.com/"> 
     Parker Griffith 
       (D) 
    </a> 
    - Ex-Congressman, Ex-State Sen., Physician &amp; Ex-Republican 
    </strong> 
    </font> 
    </p> 
</body> 
</html> 

This is what I implemented with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(Above_String)

# Earlier attempt, left disabled:
# for br in soup.find_all("br"):
#     print(br)
#     # print(br.nextSibling.content)
for link in soup.find_all("a"):
    if link.string is None:
        print(link.strong.string, link.get("href"), link.next_sibling)
    else:
        print(link.string, link.get("href"), link.next_sibling, link.next_sibling)

The code above prints something like this:

> Robert 
       Bentley (R)* 
     http://governor.alabama.gov/ 

>  Stacy George 
       (R) 
     http://www.facebook.com/stacy.george.3139 
    - Ex-Morgan County Commissioner & State Correctional Officer 

>  Kevin Bass (D) 
     http://www.bassforbama.com/ 
    - Businessman & Ex-Pro Baseball Player 


>  Parker Griffith 
       (D) 
     http://www.parkergriffithforcongress.com/ 
    - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican 

It misses the third item:

Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate 

How can I solve this with BeautifulSoup? I tried using find_all("br"), but it doesn't work: the br tag returns NoneType.

Answer

1

Loop over the nodes that follow each link, up to the next link:

from itertools import takewhile
from bs4 import NavigableString

# True until the next <a> or <strong> tag; the None default covers nodes without a tag name
not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')

for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Append every sibling after the link until not_link returns False
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)

This prints:

Link contents: 
Robert 
       Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer 
Link contents: 
Stacy George 
       (R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate 
Link contents: 
Kevin Bass (D)- Businessman & Ex-Pro Baseball Player 
Link contents: 
Parker Griffith 
       (D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican 
+0

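I appreciate the help, and it works. As part of learning, is there another way to do this without itertools, ideally without importing anything else? I'm a Python beginner and have never used anything as advanced as itertools. I only started learning Python a few weeks ago and set myself this challenge.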

+1

@user3428883: You can iterate over 'next_siblings' with a 'for' loop and use 'break' to end that loop when you reach a sibling that is no longer of interest.
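The 'for'/'break' approach from the comment above could look roughly like the sketch below. It is only an illustration: the sample markup is a trimmed, hypothetical version of the question's HTML, the helper name text_until_next_link is made up, and the choice of the html.parser backend and of joining the pieces with spaces (the answer concatenates them directly) are assumptions.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, trimmed-down sample modelled on the question's markup.
html = (
    '<p>'
    '<a href="http://www.facebook.com/stacy.george.3139">Stacy George (R)</a>'
    ' - Ex-Morgan County Commissioner<br/>'
    'Bob Starkey (R) - Retired Businessman<br/>'
    '<a href="http://www.bassforbama.com/">Kevin Bass (D)</a>'
    ' - Businessman<br/>'
    '</p>'
)

soup = BeautifulSoup(html, "html.parser")


def text_until_next_link(link):
    """Return the link text plus the text of the siblings before the next <a>/<strong>."""
    parts = [link.get_text().strip()]
    for sibling in link.next_siblings:
        # Stop as soon as the next link (or <strong>) starts.
        if getattr(sibling, "name", None) in ("a", "strong"):
            break
        if isinstance(sibling, NavigableString):
            piece = str(sibling).strip()
        else:
            piece = sibling.get_text().strip()
        # Skip empty pieces such as <br/> tags.
        if piece:
            parts.append(piece)
    return " ".join(parts)


results = [(a.get("href"), text_until_next_link(a)) for a in soup.find_all("a")]
for href, text in results:
    print(href, text)
```

Because the loop only stops at the next 'a' or 'strong' tag, the unlinked "Bob Starkey" line ends up attached to the link that precedes it, which matches the behaviour of the answer's output.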

+0

@user3428883: That really is all 'takewhile' does; it loops over 'next_siblings' and gives you every item until the 'lambda' function returns 'False', which ends the loop.
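To make that equivalence concrete, here is a small sketch on a plain list rather than on BeautifulSoup nodes. The list of node names and the function name is_not_link are hypothetical stand-ins chosen for illustration.

```python
from itertools import takewhile

# Hypothetical stand-in for the names of the siblings that follow a link.
sibling_names = ["text", "br", "text", "a", "text"]


def is_not_link(name):
    """Return True while the item is not the start of the next link."""
    return name != "a"


# takewhile yields items until the predicate first returns False.
taken = list(takewhile(is_not_link, sibling_names))

# The same logic written as an explicit loop with break.
looped = []
for name in sibling_names:
    if not is_not_link(name):
        break
    looped.append(name)

print(taken == looped)
```

Both versions stop at the first item that fails the test and ignore everything after it, even items that would pass again later.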
