2017-02-16 21 views
2

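How to preserve links when scraping a table with Beautiful Soup and pandas

I am using Beautiful Soup and pandas to scrape a web page and extract a table. One of the columns contains URLs. When I pass the HTML to pandas, the href is lost.

Is there any way to keep the URL for just that column?

Example data (edited to better fit the real case):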

<html> 
     <body> 
      <table> 
       <tr> 
       <td>customer</td> 
       <td>country</td> 
       <td>area</td> 
       <td>website link</td> 
      </tr> 
      <tr> 
       <td>IBM</td> 
       <td>USA</td> 
       <td>EMEA</td> 
       <td><a href="http://www.ibm.com">IBM site</a></td> 
      </tr> 
      <tr> 
      <td>CISCO</td> 
      <td>USA</td> 
      <td>EMEA</td> 
      <td><a href="http://www.cisco.com">cisco site</a></td> 
     </tr> 
      <tr> 
      <td>unknown company</td> 
      <td>USA</td> 
      <td>EMEA</td> 
      <td></td> 
     </tr> 
     </table> 
    </body> 
    </html> 

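My Python code:

from bs4 import BeautifulSoup 
import pandas as pd 

# open() is used here, so url is expected to be the path of a saved HTML file 
with open(url, "r") as file: 
    soup = BeautifulSoup(file, 'lxml') 

parsed_table = soup.find_all('table')[1] 

df = pd.read_html(str(parsed_table), encoding='utf-8')[0] 

df 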

Output (exported to CSV):

customer;country;area;website 
IBM;USA;EMEA;IBM site 
CISCO;USA;EMEA;cisco site 
unknown company;USA;EMEA; 

The DataFrame output is fine, but the links are lost. I need to keep the links, or at least the URLs.

Any hints?

Answers

2

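Just check whether the cell contains a link tag, like this:

import bs4 as bs 
import pandas as pd 

with open(url,"r") as f: 
    sp = bs.BeautifulSoup(f, 'lxml') 
    tb = sp.find_all('table')[56] 
    df = pd.read_html(str(tb),encoding='utf-8', header=0)[0] 
    # One entry per data row (header row skipped) so the list length matches df, 
    # even for rows that contain no <a> tag 
    links = [row.find('a', href=True) for row in tb.find_all('tr')[1:]] 
    df['href'] = [a['href'] if a else "no link" for a in links] 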
5

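pd.read_html assumes the data you are interested in is in the text, not in the tag attributes. However, it is not hard to scrape the table yourself: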

import bs4 as bs 
import pandas as pd 

with open(url,"r") as f: 
    soup = bs.BeautifulSoup(f, 'lxml') 
    parsed_table = soup.find_all('table')[1] 
    data = [[td.a['href'] if td.find('a') else 
      ''.join(td.stripped_strings) 
      for td in row.find_all('td')] 
      for row in parsed_table.find_all('tr')] 
    df = pd.DataFrame(data[1:], columns=data[0]) 
    print(df) 

yields

          customer country  area          website link 
0              IBM     USA  EMEA    http://www.ibm.com 
1            CISCO     USA  EMEA  http://www.cisco.com 
2  unknown company     USA  EMEA 
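
If you want to keep both the link text and the URL, the same "walk the cells yourself" idea can be sketched with only the standard library. This is a rough illustration, not the code from the answer above: the class name LinkTableParser and the shortened sample_html are made up for the example, and it assumes a simple table without nested tables.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Hypothetical helper: each cell becomes a (text, href) pair; href is None without a link."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._text = None   # text fragments of the current cell
        self._href = None   # first href seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only collect text that sits inside a cell
        if self._text is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append(("".join(self._text), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened version of the sample table from the question (assumption: two columns only)
sample_html = """
<table>
  <tr><td>customer</td><td>website link</td></tr>
  <tr><td>IBM</td><td><a href="http://www.ibm.com">IBM site</a></td></tr>
  <tr><td>unknown company</td><td></td></tr>
</table>
"""

parser = LinkTableParser()
parser.feed(sample_html)
header, *body = parser.rows
for row in body:
    print(row)
```

Each row is a list of (text, href) tuples, so the text and the URL can be split into separate DataFrame columns afterwards if needed.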
+0

Can you help me with a problem I have with Beautiful Soup? – Nobi

+0

@Nobi: I may not know the answer, but if you post a question, I'll take a look. – unutbu

+0

OK, thanks, I'll do that right away – Nobi