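Accessing elements in a table that have no ID or class when using Python and BeautifulSoup on an HTML/CSS page

I am using Selenium, Python and Beautiful Soup to scrape a page, and I want each table row as comma-separated values. Unfortunately, the page's HTML is all over the place. So far I have managed to extract two columns by using the IDs of their elements. The remaining values sit in plain <td> cells that carry no identifier such as a class or id. Here is a sample of the results: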
<table id="tblResults" style="z-index: 102; left: 18px; width: 956px;
height: 547px" cellspacing="1" width="956" border="0">
<tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
<td> </td>
<td> </td>
<td>Select</td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl00','')" style="color:Black;">T</a></td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl01','')" style="color:Black;">Party</a></td>
<td>Opposite Party</td>
<td style="width:50px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl02','')" style="color:Black;">Type</a></td>
<td style="width:100px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl03','')" style="color:Black;">Book-Page</a></td>
<td style="width:70px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl04','')" style="color:Black;">Date</a></td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl05','')" style="color:Black;">Town</a></td>
</tr>
<tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
">MOSES ALBERT G</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
"></span>
</td>
<td valign="top">MAP</td>
<td valign="top">- </td>
<td valign="top">01/16/1953</td>
<td valign="top">TOWN OF BINGHAMTON</td>
</tr>
<tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">MOSES ALEXANDRA/GDN</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">GOODRICH MERLE L</span>
</td>
</table>
This is the script I have written so far, which works for two of the columns:
import re
from bs4 import BeautifulSoup

# Parse the saved results page; naming the parser avoids the bs4 warning.
with open('searched.html') as html:
    bsObj = BeautifulSoup(html, 'html.parser')

# Data rows share this inline style fragment; the header row does not.
myTable = bsObj.find_all("tr", {"style": re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")})
for table_ in myTable:
    # The span ids end in a row index, e.g. ..._lblParty1_0.
    party = table_.find("span", {"id": re.compile(r"lblParty1_\d+")})
    oppositeParty = table_.find("span", {"id": re.compile(r"lblParty2_\d+")})
    print(party.get_text() + "," + oppositeParty.get_text())
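One possible way to reach the cells that have no id or class is to address them by position within each row. The following is only a sketch under assumptions: SAMPLE_HTML is a trimmed, hypothetical stand-in for the real page, rows_as_csv and its skip_columns parameter are illustrative names, and it assumes the first three cells of every data row are the button and checkbox cells shown in the sample.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the results table in the question.
SAMPLE_HTML = """
<table id="tblResults">
  <tr><td> </td><td> </td><td>Select</td><td>T</td><td>Party</td>
      <td>Opposite Party</td><td>Type</td><td>Book-Page</td><td>Date</td><td>Town</td></tr>
  <tr>
    <td><input type="submit" value="View" /></td>
    <td><input type="submit" value="My Doc" /></td>
    <td><span><input type="checkbox" /></span></td>
    <td>1</td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty1_0">MOSES ALBERT G</span></td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty2_0"></span></td>
    <td valign="top">MAP</td>
    <td valign="top">- </td>
    <td valign="top">01/16/1953</td>
    <td valign="top">TOWN OF BINGHAMTON</td>
  </tr>
</table>
"""


def rows_as_csv(html, skip_columns=3):
    """Return the data rows of #tblResults as CSV text, one line per row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="tblResults")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The first <tr> is the header row, so start from the second one.
    for tr in table.find_all("tr")[1:]:
        # recursive=False keeps only the row's own cells, in page order.
        cells = tr.find_all("td", recursive=False)
        # Skip the leading button/checkbox cells, which hold no text.
        values = [td.get_text(strip=True) for td in cells[skip_columns:]]
        writer.writerow(values)
    return out.getvalue()


print(rows_as_csv(SAMPLE_HTML), end="")
```

Using csv.writer rather than joining with "," by hand means a value that itself contains a comma would be quoted instead of silently shifting the columns.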
I have tried to improve on this by using the children of myTable, as follows:
myTable.children
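A likely reason this attempt goes nowhere is that find_all returns a ResultSet, which behaves like a list of rows, and .children exists only on an individual Tag. The sketch below illustrates the difference on a small hypothetical fragment; SAMPLE_HTML, my_table and child_cells are illustrative names, not part of the original script.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row fragment, used only for illustration.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>MAP</td></tr>
  <tr><td>2</td><td>DEED</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
my_table = soup.find_all("tr")  # a ResultSet (list-like), not a single Tag


def child_cells(row):
    """Return the text of each element child of one row, skipping text nodes."""
    return [child.get_text(strip=True)
            for child in row.children
            if isinstance(child, Tag)]


# Iterate over the rows first, then over each row's children.
for row in my_table:
    print(",".join(child_cells(row)))
```

The isinstance check matters because .children also yields the whitespace text nodes between tags, which have no cell content.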
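What output do you want? Your question is incomplete: it stops after one line of your code. Please go back and edit the question, making sure you include all the relevant information, including what you are trying to accomplish, what you have already tried (with properly formatted code), and what the results were (with any error messages and a detailed description of what you got versus what you expected). – JeffC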