How can I extract the text after a <br> in Python?

I'm using Python 2.7.8 and am a bit surprised, because I get all of the text except the text that comes after the last <br>. My HTML page looks like this:

<html> 
<body> 
<div class="entry-content" > 
<p>Here is a listing of C interview questions on "Variable Names" along with answers, explanations and/or solutions: 
</p> 

<p>Which of the following is not a valid C variable name?<br> 
a) int number;<br> 
b) float rate;<br> 
c) int variable_count;<br> 
d) int $main;</p> <!--not getting--> 

<p> more </p> 

<p>Which of the following is true for variable names in C?<br> 
a) They can contain alphanumeric characters as well as special characters<br> 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br> 
c) Variable names cannot start with a digit<br> 
d) Variable can be of any length</p> <!--not getting -->! 

</div> 
</body> 
</html>

And my code:

import urllib2
from urllib2 import Request
from bs4 import BeautifulSoup, NavigableString, Tag

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    # the node right after the <br> must be plain text
    next = br.nextSibling
    if not (next and isinstance(next, NavigableString)):
        continue
    # the text is only printed when another <br> follows it
    next2 = next.nextSibling
    if next2 and isinstance(next2, Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')
            # print '...........sfsdsds.............',answ[0].encode('utf-8') #

Output:

Found: 
a) int number; 
Found: 
b) float rate; 
Found: 
c) int variable_count; 

Found: 
a) They can contain alphanumeric characters as well as special characters 
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static) 
Found: 
c) Variable names cannot start with a digit

But I don't get the last piece of text, for example:

d) int $main 
    and 
d) Variable can be of any length

which comes after the last <br>.

And this is the output I want to get:

Found: 
a) int number; 
Found: 
b) float rate; 
Found: 
c) int variable_count; 
Found: 
d) int $main 

Found: 
a) They can contain alphanumeric characters as well as special characters 
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static) 
Found: 
c) Variable names cannot start with a digit 
d) Variable can be of any length

Source

2015-12-09 user3440716

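Add more print statements, so you can see what you are skipping as you go. Put an else branch on your if statements and print what you are skipping. –

OK, I'm trying......... – user3440716

Why are you still doing it the old way instead of the way I suggested [here](http://stackoverflow.com/a/34159940/771848)? – alecxe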

You can use Requests instead of urllib2, and parse the page with lxml's html module.

from lxml import html 
import requests 

#request page 
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/") 

#get content in html format 
page_content=html.fromstring(page.content) 

#recover all text from <p> elements 
items=page_content.xpath('//p/text()')

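The code above returns a list of all the text nodes that are direct children of the <p> elements in the document.
So you can simply index into the list to print what you want.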

Source

2015-12-09 17:12:53 Valkyrie

That is because BeautifulSoup forces the text into valid XML by closing the <br> tags only just before the </p>. The prettified version makes this clear:

<p> 
Which of the following is not a valid C variable name? 
<br> 
    a) int number; 
    <br> 
    b) float rate; 
    <br> 
    c) int variable_count; 
    <br> 
    d) int $main; 
    </br> 
    </br> 
    </br> 
</br> 
</p>

So the text d) int $main; is not a sibling of the last <br> tag, but the text of that tag.

The code could then be:

... 
soup = BeautifulSoup(htmls) 
for br in soup.findAll('br'): 
    if len(br.contents) > 0: # avoid errors if a tag is correctly closed as <br/> 
        print 'Found', br.contents[0]

It gives the expected output:

Found 
a) int number; 
Found 
b) float rate; 
Found 
c) int variable_count; 
Found 
d) int $main; 
Found 
a) They can contain alphanumeric characters as well as special characters 
Found 
b) It is not an error to declare a variable to be one of the keywords(like goto, static) 
Found 
c) Variable names cannot start with a digit 
Found 
d) Variable can be of any length
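Whether the last option ends up as a sibling of the final <br> or as its content depends on how the parser chooses to close the <br> tags. As a rough, parser-independent sketch, the same items can be collected with the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique, not the approach used above. It assumes Python 3, and the names BrTextCollector and SAMPLE are made up for illustration. The idea is to treat every <br> and the closing </p> as the end of an item, so the last option is not lost.

```python
from html.parser import HTMLParser


class BrTextCollector(HTMLParser):
    """Collect each run of text that follows a <br> inside a <p>."""

    def __init__(self):
        super().__init__()
        self.in_p = False      # True while inside a <p> element
        self.after_br = False  # True once a <br> was seen in the current <p>
        self.buffer = []       # text pieces gathered since the last boundary
        self.found = []        # completed items

    def _flush(self):
        # Keep the buffered text only if it came after a <br>.
        text = "".join(self.buffer).strip()
        if self.after_br and text:
            self.found.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.after_br, self.buffer = True, False, []
        elif tag == "br" and self.in_p:
            self._flush()  # a <br> ends the previous item, if there was one
            self.after_br = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self._flush()  # the last item ends at </p>, not at a <br>
            self.in_p = False
            self.after_br = False

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


# Hypothetical sample, shortened from the HTML in the question.
SAMPLE = """
<div class="entry-content">
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>
<p> more </p>
</div>
"""

collector = BrTextCollector()
collector.feed(SAMPLE)
collector.close()
for item in collector.found:
    print("Found:", item)
```

The question text before the first <br> and the paragraph without any <br> are skipped on purpose, since only text that follows a <br> is kept.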

Source

2015-12-09 16:39:27

I get this: IndexError: list index out of range – user3440716

Any idea......? – user3440716

@user3440716: Hard to say without your real input. I think it is because of 'br.contents[0]'. My last edit should fix it –
