2016-07-27 19 views
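Accessing elements of a table with no ID or class on an HTML/CSS page using Python and BeautifulSoup

I am using Selenium, Python and Beautiful Soup to scrape a page, and I want to output the rows of a table as comma-separated values. Unfortunately, the page's HTML is all over the place. So far I have managed to extract two columns by using the IDs of their elements. The remaining values sit in table cells that have no identifier such as a class or id. Here is a sample of the results.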

<table id="tblResults" style="z-index: 102; left: 18px; width: 956px; 
    height: 547px" cellspacing="1" width="956" border="0"> 
    <tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;"> 
     <td>&nbsp;</td> 
     <td>&nbsp;</td> 
     <td>Select</td> 
     <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl00&#39;,&#39;&#39;)" style="color:Black;">T</a></td> 
     <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl01&#39;,&#39;&#39;)" style="color:Black;">Party</a></td> 
     <td>Opposite Party</td> 
     <td style="width:50px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl02&#39;,&#39;&#39;)" style="color:Black;">Type</a></td> 
     <td style="width:100px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl03&#39;,&#39;&#39;)" style="color:Black;">Book-Page</a></td> 
     <td style="width:70px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl04&#39;,&#39;&#39;)" style="color:Black;">Date</a></td> 
     <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl05&#39;,&#39;&#39;)" style="color:Black;">Town</a></td> 
    </tr> 
    <tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;"> 
     <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;"> 
     <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" /> 
     </td> 
     <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;"> 
     <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" /> 
     </td> 
     <td valign="top"> 
     <span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span> 
     </td> 
     <td>1</td> 
     <td> 
     <span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors: 
      ALBERT G MOSES FARM 
      MOSES ALBERT G 
      Grantees: 
      ">MOSES ALBERT G</span> 
     </td> 
     <td> 
     <span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors: 
      ALBERT G MOSES FARM 
      MOSES ALBERT G 
      Grantees: 
      "></span> 
     </td> 
     <td valign="top">MAP</td> 
     <td valign="top">- </td> 
     <td valign="top">01/16/1953</td> 
     <td valign="top">TOWN OF BINGHAMTON</td> 
    </tr> 
    <tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;"> 
     <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;"> 
     <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" /> 
     </td> 
     <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;"> 
     <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" /> 
     </td> 
     <td valign="top"> 
     <span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span> 
     </td> 
     <td>1</td> 
     <td> 
     <span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors: 
      MOSS EMMY-IND&amp;GDN 
      MOSES ALEXANDRA/GDN 
      Grantees: 
      GOODRICH MERLE L 
      GOODRICH CHARITY M 
      ">MOSES ALEXANDRA/GDN</span> 
     </td> 
     <td> 
     <span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors: 
      MOSS EMMY-IND&amp;GDN 
      MOSES ALEXANDRA/GDN 
      Grantees: 
      GOODRICH MERLE L 
      GOODRICH CHARITY M 
      ">GOODRICH MERLE L</span> 
     </td> 
</table> 

This is the script I have written so far, which works for the two columns:

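import re
from bs4 import BeautifulSoup

# Parse the saved results page; naming the parser avoids bs4's "no parser specified" warning.
with open('searched.html') as html:
    bsObj = BeautifulSoup(html, "html.parser")

# The regex is searched within each row's style attribute, so it matches the data rows
# (plain and shaded) but not the header row, which has no font-size:Smaller.
myTable = bsObj.find_all("tr", {"style": re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")})

for table_ in myTable:  # despite the name, each item is one <tr>
    party = table_.find("span", {"id": re.compile("Party1_*")})
    oppositeParty = table_.find("span", {"id": re.compile("Party2_*")})
    print(party.get_text() + "," + oppositeParty.get_text())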

I have tried to improve on this by using the children of myTable, as follows:

myTable.children
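
One likely reason this attempt goes nowhere: `find_all` returns a list-like ResultSet, while `.children` is an attribute of an individual Tag, so it has to be applied per row rather than to `myTable` as a whole. The sketch below illustrates this on a hypothetical, trimmed-down row (the `SAMPLE` markup and the variable names are assumptions for illustration, not the real page) and then reads the unlabeled cells by position instead:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one data row of the results table.
SAMPLE = """
<table id="tblResults">
  <tr><td>1</td><td><span id="x_lblParty1_0">MOSES ALBERT G</span></td>
      <td valign="top">MAP</td><td valign="top">01/16/1953</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = soup.find_all("tr")  # a ResultSet (list-like), not a single Tag

try:
    rows.children  # the attribute lives on each Tag, not on the ResultSet
except AttributeError:
    children_on_result_set = False
else:
    children_on_result_set = True

# Work on each row instead and take its direct <td> children by position.
first_row_cells = [
    td.get_text(strip=True)
    for td in rows[0].find_all("td", recursive=False)
]
print(first_row_cells)
```

Using `find_all("td", recursive=False)` rather than `.children` skips the whitespace text nodes that sit between the cells.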

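What output do you want? Your question is incomplete; it stops after one line of your code. Please go back and edit the question so that it includes all the relevant information: what you are trying to accomplish, what you have tried (with properly formatted code), and what the result was (any error messages, plus a detailed description of what you got versus what you expected). – JeffC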

Answer

If you just want to dump the contents, this should do it:

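# NOTE: find_element(s)_by_tag_name are legacy Selenium methods (removed in Selenium 4.3,
# where find_element(By.TAG_NAME, ...) replaces them), so the page must come from a
# Selenium `driver` (assumed name), not from the BeautifulSoup object `bsObj`.
myTable = driver.find_element_by_tag_name("table")
rows = myTable.find_elements_by_tag_name("tr")
for row_ in rows:
    columns = row_.find_elements_by_tag_name("td")
    # print one comma-delimited line per row, built from the text of its cells
    print(",".join(column_.text.strip() for column_ in columns))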

If you really want to scrape specific pieces of information, you will need to give us more detail about what your end goal is.
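
For a saved page parsed with BeautifulSoup, as in the question, the same dump-everything idea might look roughly like the sketch below. The function name `table_to_csv`, the `skip_cells=3` default (dropping the View, My Doc and Select cells at the start of each row) and the shortened `SAMPLE` markup are all assumptions made for illustration. The `csv` module is used instead of joining strings by hand so that a value containing a comma gets quoted.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened version of the results table from the question.
SAMPLE = """<table id="tblResults">
<tr><td>&nbsp;</td><td>&nbsp;</td><td>Select</td><td><a href="#">T</a></td><td><a href="#">Party</a></td><td>Opposite Party</td></tr>
<tr><td><input type="submit" value="View" /></td><td><input type="submit" value="My Doc" /></td><td><span><input type="checkbox" /></span></td><td>1</td><td><span id="a_lblParty1_0">MOSES ALBERT G</span></td><td><span id="a_lblParty2_0"></span></td></tr>
</table>"""


def table_to_csv(html, table_id="tblResults", skip_cells=3):
    """Return the table's rows as CSV text, skipping the leading control cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in table.find_all("tr"):
        # Direct <td> children only; cells are addressed by position, not by id.
        cells = row.find_all("td", recursive=False)[skip_cells:]
        writer.writerow([cell.get_text(" ", strip=True) for cell in cells])
    return out.getvalue()


csv_text = table_to_csv(SAMPLE)
print(csv_text)
```

Because the cells are taken by position, the approach depends on every row having the same column layout; a row with missing cells simply produces a shorter line.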

Based on his locator, "myTable" is actually already the 'TR's. – JeffC

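Good point @JeffC! (It just goes to show the importance of naming variables accurately.) I have updated my answer to get the table, and corrected the method name to find_elements_by_tag_name instead of find_element_by_tag_name –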
