Extracting text from a div tag using BeautifulSoup 4 in Python

I want to extract text from a div tag using BeautifulSoup 4 and Python. The HTML below is stored in a file (example.html).

My HTML:

<table class="NZX1058422900" cols="20" style="border-collapse: collapse; width: 1496px;" cellspacing="0" cellpadding="0" border="0"> 
<tbody> 
<td class="A10dbmytr2499b"> 
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed" title="Total Cases: 5 - Level 1, Level 2, On Hold 2 - Completed">5/2</div> 
</td> 
</tbody> 
</table> 

I want the output to look like below: 
Total Cases: 
5 - Level 1, Level 2, or On Hold 
2 - Completed

My code so far is:

from bs4 import BeautifulSoup 
openFile = open("C:\\example.html") 
readFile = openFile.read() 
soup = BeautifulSoup(readFile, "lxml")

I tried the following code without any success:

soup.find("div", class_="VWP1058422499")

Can anyone help me extract the data shown above?

2017-08-13 LinuxUser

Expanding on @so1989's answer: since you also want to print the result in the format you specified, I suggest this approach:

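from bs4 import BeautifulSoup

# Read the saved HTML; the with-block closes the file afterwards.
with open("C:\\example.html") as html_file:
    soup = BeautifulSoup(html_file.read(), "lxml")

# The wanted text lives in the div's alt attribute, not in its visible text.
words = soup.find("div", {"class": "VWP1058422499"}).get("alt").split()

for i, word in enumerate(words):
    # A lone "-" follows a count ("5 -", "2 -"): start a new line before that count.
    if word == "-":
        words[i - 1] = "\n" + words[i - 1]

# Join with spaces, then drop the space left in front of each newline.
print(" ".join(words).replace(" \n", "\n"))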
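
An alternative to the word-by-word loop above is a single regular-expression substitution. This is only a sketch: it assumes every group in the alt text starts with a number followed by " - ", as in the question's sample, and the helper name format_cases is hypothetical.

```python
import re


def format_cases(alt_text):
    """Put each '<count> - <labels>' group on its own line."""
    # Replace the whitespace in front of every "<digits> - " with a newline.
    return re.sub(r"\s+(?=\d+ - )", "\n", alt_text)


# The alt value copied from the question's HTML.
alt = "Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed"
print(format_cases(alt))
```

The lookahead keeps the count itself in the output, so "Level 1," and "Level 2," are left alone because their digits are not followed by " - ".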

2017-08-13 16:55:36

Thanks everyone for the answers! @so1989 But I get an "AttributeError: 'NoneType' object has no attribute 'get'" error on this line: alt = soup.find("div", {"class":"VWP1058422499"}).get("alt"). Any idea how to fix this? I cannot call the .get method. – LinuxUser

@LinuxUser Can you post the URL of the page you are trying to scrape? – so1989

@LinuxUser I tested it with the HTML you gave us, read from a file, and it works fine. Could the error be related to the file location or the site URL? –
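
The AttributeError in the comments above means that soup.find returned None, i.e. no matching div was present in whatever was actually parsed. A minimal sketch of guarding against that case follows. It assumes the HTML is supplied inline as a string and parsed with the built-in "html.parser" rather than "lxml", and the helper name extract_alt is hypothetical.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's HTML, inlined so the sketch is self-contained.
HTML = """
<table class="NZX1058422900">
<tbody>
<td class="A10dbmytr2499b">
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed">5/2</div>
</td>
</tbody>
</table>
"""


def extract_alt(html, css_class):
    """Return the alt attribute of the first matching div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # find() returns None when nothing matches; calling .get() on that
        # None is what raises "'NoneType' object has no attribute 'get'".
        return None
    return div.get("alt")


print(extract_alt(HTML, "VWP1058422499"))
print(extract_alt("<p>no div here</p>", "VWP1058422499"))
```

If the second kind of result shows up for the real page, the parsed document probably differs from the snippet in the question, for example because a different file was opened or the class name is generated per page load.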

alt = soup.find("div", {"class":"VWP1058422499"}).get("alt")
print(alt)  # .get() returns the attribute value as a plain string
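
For context on why .get("alt") is used here: the div in the question carries two separate pieces of data, its visible text and its alt attribute. The sketch below shows both, assuming the div is given inline and parsed with the built-in "html.parser"; the attribute name "data-x" is a made-up example of an attribute that is not present.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="VWP1058422499" '
    'alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed">5/2</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="VWP1058422499")

text = div.get_text()        # visible text between the tags
alt = div.get("alt")         # attribute value, a plain str
missing = div.get("data-x")  # absent attribute: get() returns None

print(text)
print(alt)
print(missing)
```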

2017-08-13 16:39:13 so1989

Kudos to you. I hope you don't mind that I decided to improve on your answer. –
