Handling empty cells in a web page table

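I am trying to get all the data from a Basketball Reference table (http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html). When I pull the data with XPath, it comes back as one long list. I have a "chunk" method that splits the list into multiple lists, but because the table has empty cells, the method splits the list in the wrong places. Is there any way to fix this?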

Source

2015-10-30 sam oconnell

My suggestion: use pandas.DataFrame. It can load data from many sources, including HTML.

You can easily handle empty cells with the fillna method.

Consider this example:

import pandas as pd

# read_html returns a list of DataFrames.
# In this case we know only one table on the page matches
df = pd.read_html('http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html',
                  attrs={'id': 'per_poss'})[0]

# the header row repeats every 20 lines; filter those rows out
df = df[df['Rk'] != 'Rk']

# fill empty cells with 0
# (could also pass inplace=True instead of reassigning, or pass a
# dictionary to use a different value for each column)
df = df.fillna(0)
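The dictionary form of fillna mentioned in the comment above can be sketched on a small hand-made frame. The player names and numbers here are invented purely for illustration and are not taken from the site; only the column names 3P and 3P% echo the real table.

```python
import pandas as pd

# Hypothetical stand-in for a few scraped rows; values are made up.
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [1.2, float("nan"), 0.4],
    "3P%": [0.35, float("nan"), float("nan")],
})

# A dictionary maps column names to the fill value for that column.
# Columns that are not listed keep their missing values.
filled = df.fillna({"3P": 0.0})

print(filled)
```

Leaving 3P% as missing rather than 0 may be preferable when a player has no attempts, since a percentage of 0 would suggest attempts that all missed.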

Source

2015-10-30 19:53:11 DeepSpace

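A good approach indeed! – SIslam

The table does not come through with "empty" cells; the cells simply do not appear. For example, the fourth row on the site has 0,0 and then blanks for 3P, 3PA, 3P%. In the table this shows up as 0,0,4.5 (4.5 being the next value after 3P%). I also get the error "html5lib not found, please install it" even though I have html5lib installed, but when running the code –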
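One way to keep blank cells from collapsing, if you are extracting the cells yourself rather than through pandas, is to walk the table row by row and record every cell, including empty ones, instead of flattening all cell text into one list. The sketch below uses only the standard library's html.parser. The RowCollector class and the sample markup are hypothetical, and it assumes every cell has an explicit closing tag. With XPath the same idea would be to select the tr nodes first and then the cells relative to each row.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect table cells row by row, keeping empty cells as ''."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # An empty cell still contributes one entry: the empty string.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up markup that imitates a row with a blank 3P% cell.
SAMPLE = """
<table>
  <tr><th>Player</th><th>3P</th><th>3PA</th><th>3P%</th><th>2P</th></tr>
  <tr><td>A</td><td>0.0</td><td>0.0</td><td></td><td>4.5</td></tr>
</table>
"""

parser = RowCollector()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows

for row in rows:
    print(row)
```

Because each row keeps one entry per cell, every row has the same length as the header and no later chunking step is needed.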
