Python and pandas

I am trying to retrieve the table of results with this code:

import pandas as pd
url = 'https://www.betfair.co.uk/sport/football'
df = pd.read_html(url, header=None)
df[0]

If you are not in the UK, the URL may be different.

I thought it would behave like the following code, which works perfectly for me (I get the table):

import pandas as pd 
url = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_French_presidential_election,_2017' 
df = pd.read_html(url, skiprows=3) 
df[0]

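In the first example, the HTML is organized around <ul> and <li> elements.

In the second, it is a proper table.

How can I adapt pandas to get the data in the first case?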

2017-04-11 Quora Feans

Unfortunately, pandas.read_html (docs) only extracts data from HTML tables:

import pandas as pd 
html = '''<html> 
      <body> 
       <table> 
       <tr> 
        <th>Col1</th> 
        <th>Col2</th> 
       </tr> 
       <tr> 
        <td>Val1</td> 
        <td>Val2</td> 
       </tr> 
       </table> 
      </body> 
      </html>''' 
dfs = pd.read_html(html)  # returns a list of DataFrames, one per <table>
dfs[0]

Output:

      0     1
0  Col1  Col2
1  Val1  Val2
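Since read_html only looks at <table> elements (and raises a ValueError when it finds none), a quick diagnostic is to count the tables in the markup before calling it. The sketch below is only an illustration: count_tables is a hypothetical helper name, the two HTML strings are made-up samples, and the built-in 'html.parser' backend is assumed so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup


def count_tables(html):
    """Hypothetical helper: number of <table> elements in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("table"))


# Made-up samples: one real table, one list-based layout
table_html = "<table><tr><td>Val1</td><td>Val2</td></tr></table>"
list_html = "<ul><li>Val1</li><li>Val2</li></ul>"

table_count = count_tables(table_html)  # read_html has something to parse
list_count = count_tables(list_html)    # nothing here for read_html to extract
```

If the count for a page is zero, read_html has nothing to work with and the list has to be parsed by hand.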

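For the other case, where the HTML contains an unordered list instead of a table, the existing pandas function will not work. You can use an HTML parsing library (such as BeautifulSoup4) to parse the list (and all of its children) and build the DataFrame row by row. Here is a simple example: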

import pandas as pd 
from bs4 import BeautifulSoup 

html = '''<html> 
      <body> 
       <ul id="target"> 
       <li class="row"> 
        Name 
        <ul class="details"> 
        <li class="Col1">Val1</li> 
        <li class="Col2">Val2</li> 
        </ul> 
       </li> 
       </ul> 
      </body> 
      </html>''' 

# Parse the HTML string 
soup = BeautifulSoup(html, 'lxml') 

# Select the target <ul> and build dicts for each row 
data_dicts = [] 
target = soup.select('#target')[0] 
for row in target.select('.row'): 
    row_dict = {} 
    row_dict['name'] = row.contents[0].strip() # Remove excess whitespace 
    details = row.select('.details') 
    for col in details[0].findChildren('li'):
        col_name = col.attrs['class'][0]
        col_value = col.text.strip()
        row_dict[col_name] = col_value
    data_dicts.append(row_dict) 

# Convert list of dicts to dataframe 
df = pd.DataFrame(data_dicts)

Output:

   Col1  Col2  name
0  Val1  Val2  Name

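Some combination of findChildren and select should let you extract each of the table-like subcomponents of the site you linked. BeautifulSoup has many ways of digging through HTML, so I strongly recommend working through a few examples, and consulting the documentation if you get stuck trying to parse out a particular set of elements.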
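As a rough sketch of how the approach above might be wrapped up for reuse, the function below takes the selectors as parameters. Everything here is an assumption for illustration: nested_list_to_frame, its parameter names, and the sample markup are hypothetical, the '.details' layout simply mirrors the toy example rather than the real structure of the linked site, and the built-in 'html.parser' backend is used instead of lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup


def nested_list_to_frame(html, target_selector="#target", row_selector=".row"):
    """Hypothetical helper: build one DataFrame row per matching <li>."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.select_one(target_selector)
    if target is None:
        # No matching list in the markup: return an empty frame
        return pd.DataFrame()

    records = []
    for row in target.select(row_selector):
        # The first text node directly inside the row is treated as its name
        first_text = row.find(string=True, recursive=False)
        record = {"name": first_text.strip() if first_text else ""}
        # Each detail <li> becomes a column named after its first CSS class
        for col in row.select(".details > li"):
            record[col["class"][0]] = col.get_text(strip=True)
        records.append(record)
    return pd.DataFrame(records)


# Made-up sample markup with two rows
sample_html = """
<ul id="target">
  <li class="row">First
    <ul class="details"><li class="Col1">A</li><li class="Col2">B</li></ul>
  </li>
  <li class="row">Second
    <ul class="details"><li class="Col1">C</li><li class="Col2">D</li></ul>
  </li>
</ul>
"""

df = nested_list_to_frame(sample_html)
```

For a real page, the selectors would need to be adjusted to whatever ids and classes its markup actually uses.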

2017-04-11 18:26:20
