2016-09-21 74 views
2

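BeautifulSoup: parsing an HTML table whose tags have no class

I have this HTML table. I need to extract specific values from it and assign each one to a variable; I don't need all of the information. For example, flag = "UNITED ARAB EMIRATES", home_port = "SHARJAH", and so on. Since the th and td elements have no 'class' attribute, how can I extract this data?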

import requests
from bs4 import BeautifulSoup

imo_number = 9492749  # example value, taken from the table below

r = requests.get('http://maritime-connector.com/ship/' + str(imo_number), headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", {"class": "ship-data-table"})
for row in table.find_all("tr"):
    tname = row.find_all("th")  # header cells of this row
    cells = row.find_all("td")  # data cells of this row

    print(type(tname))
    print(type(cells))

I am using the Python module BeautifulSoup.

<table class="ship-data-table" style="margin-bottom:3px">
    <thead>
        <tr>
            <th>IMO number</th>
            <td>9492749</td>
        </tr>
        <tr>
            <th>Name of the ship</th>
            <td>SHARIEF PILOT</td>
        </tr>
        <tr>
            <th>Type of ship</th>
            <td>ANCHOR HANDLING VESSEL</td>
        </tr>
        <tr>
            <th>MMSI</th>
            <td>470535000</td>
        </tr>
        <tr>
            <th>Gross tonnage</th>
            <td>499 tons</td>
        </tr>
        <tr>
            <th>DWT</th>
            <td>222 tons</td>
        </tr>
        <tr>
            <th>Year of build</th>
            <td>2008</td>
        </tr>
        <tr>
            <th>Builder</th>
            <td>NANYANG SHIPBUILDING - JINGJIANG, CHINA</td>
        </tr>
        <tr>
            <th>Flag</th>
            <td>UNITED ARAB EMIRATES</td>
        </tr>
        <tr>
            <th>Home port</th>
            <td>SHARJAH</td>
        </tr>
        <tr>
            <th>Manager & owner</th>
            <td>GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES</td>
        </tr>
        <tr>
            <th>Former names</th>
            <td>SUPERIOR PILOT until 2008 Sep</td>
        </tr>
    </thead>
</table>
+0

I am using the Python module BeautifulSoup, not any regular expressions.

Answers

2

Go over all the th elements in the table, and for each one take its text and the text of the td sibling that follows it:

from pprint import pprint 

from bs4 import BeautifulSoup 

data = """your HTML here""" 

soup = BeautifulSoup(data, "html.parser") 

result = {header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True) 
      for header in soup.select("table.ship-data-table tr th")} 
pprint(result) 

This builds a nice dictionary with the header texts as keys and the corresponding td texts as values:

{'Builder': 'NANYANG SHIPBUILDING - JINGJIANG, CHINA', 
'DWT': '222 tons', 
'Flag': 'UNITED ARAB EMIRATES', 
'Former names': 'SUPERIOR PILOT until 2008 Sep', 
'Gross tonnage': '499 tons', 
'Home port': 'SHARJAH', 
'IMO number': '9492749', 
'MMSI': '470535000', 
'Manager & owner': 'GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES', 
'Name of the ship': 'SHARIEF PILOT', 
'Type of ship': 'ANCHOR HANDLING VESSEL', 
'Year of build': '2008'} 
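
To get from this dictionary to the individual variables the question asks for, plain lookups should be enough. A minimal sketch, assuming the same markup as in the question (trimmed here to three rows); `.get()` is used so that a label missing from the page gives None instead of raising KeyError:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (three rows only).
data = """
<table class="ship-data-table">
  <thead>
    <tr><th>IMO number</th><td>9492749</td></tr>
    <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
    <tr><th>Home port</th><td>SHARJAH</td></tr>
  </thead>
</table>
"""

soup = BeautifulSoup(data, "html.parser")

# Same dictionary as in the answer: th text -> text of the following td.
result = {header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True)
          for header in soup.select("table.ship-data-table tr th")}

flag = result.get("Flag")
home_port = result.get("Home port")
former_names = result.get("Former names")  # absent from this trimmed table, so None

print(flag)
print(home_port)
print(former_names)
```

The keys are the exact th texts, so "Home port" has to be spelled as it appears on the page, including case.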
+1

I like this solution.

+0

Thanks @alecxe. It works.

+0

@alecxe I get an error when the value is None: AttributeError: 'NoneType' object has no attribute 'get_text'. Where can I use try and except?
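
That AttributeError most likely means that some th has no td sibling after it, so find_next_sibling("td") returns None. Instead of try/except, one option is to check for None before calling get_text. A sketch only: the row without a td below is made up for illustration, since the page that triggered the error is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "Home port" row has a <th> but no <td>.
data = """
<table class="ship-data-table">
  <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
  <tr><th>Home port</th></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")

result = {}
for header in soup.select("table.ship-data-table tr th"):
    cell = header.find_next_sibling("td")  # None when the row has no <td>
    key = header.get_text(strip=True)
    result[key] = cell.get_text(strip=True) if cell is not None else None

print(result)
```

Rows with a missing cell then show up as None values in the dictionary instead of stopping the loop.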

0

I would do something like this:

html = """ 
     <your table> 
    """ 

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 

flag = soup.find("th", string="Flag").find_next("td").get_text(strip=True) 
home_port = soup.find("th", string="Home port").find_next("td").get_text(strip=True) 


print(flag) 
print(home_port) 

This way, I only make sure that I match the text of the th element, and then get the next td.
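
If several fields are needed, the same lookup could be wrapped in a small helper that also copes with a label that is not on the page. A minimal sketch, not part of the answer above: `get_field` and the "Call sign" label are hypothetical names introduced for illustration, and the table is trimmed to two rows.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (two rows only).
html = """
<table class="ship-data-table">
  <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
  <tr><th>Home port</th><td>SHARJAH</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def get_field(soup, label, default=None):
    """Return the text of the td after the th whose text is exactly `label`."""
    header = soup.find("th", string=label)
    if header is None:  # no such label on the page
        return default
    cell = header.find_next("td")
    return cell.get_text(strip=True) if cell is not None else default


flag = get_field(soup, "Flag")
home_port = get_field(soup, "Home port")
call_sign = get_field(soup, "Call sign", default="unknown")  # label not in the table

print(flag)
print(home_port)
print(call_sign)
```

Note that find_next("td") searches everything that comes later in the document, so for a row with no td it would return the td of a following row; find_next_sibling("td") is stricter if that matters.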