Pandas - describe a table from a DB - big data

I want to use pandas' describe method on a SQL table, but I can't pull all the data into memory. Is it possible to get the same information using only SQL queries?

Thanks

2017-03-14 Lee

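As far as I know there is no convenient equivalent of df.describe() on the SQL side, but there are SQL statements that can give you all the information you want.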
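To illustrate the idea of letting the database do the aggregation, here is a minimal sketch. It assumes an in-memory SQLite table (a stand-in for the real SQL Server table, chosen only so the example runs on its own), and the helper name describe_column is hypothetical. Only one row of aggregates ever reaches Python.

```python
import math
import sqlite3

# In-memory SQLite table standing in for a table too large to load (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (price REAL)")
conn.executemany(
    "INSERT INTO test_table VALUES (?)",
    [(v,) for v in (1.0, 2.0, 3.0, 4.0, None)],
)


def describe_column(conn, table, col):
    """Compute describe()-style statistics for one column inside the database."""
    sql = (
        f"SELECT COUNT({col}), AVG({col}), MIN({col}), MAX({col}), "
        f"AVG({col} * {col}) FROM {table}"
    )
    count, mean, low, high, mean_sq = conn.execute(sql).fetchone()
    if count > 1:
        # Sample variance (ddof=1, like pandas) from the mean and the mean of squares.
        # max() guards against tiny negative values caused by rounding.
        var = max(mean_sq - mean * mean, 0.0) * count / (count - 1)
    else:
        var = float("nan")
    return {"count": count, "mean": mean, "std": math.sqrt(var), "min": low, "max": high}


stats = describe_column(conn, "test_table", "price")
print(stats)
```

NULLs are skipped by the SQL aggregates, which matches how describe() ignores missing values. Deriving the variance from the mean of squares can lose precision on large values, so on SQL Server the built-in STDEV is probably the better choice.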

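Below I use a stored procedure in SQL Server to return all columns and their data types. I loop over them to collect the names of all float-type columns, then build new queries from those names.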

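Then everything goes into a final data frame. I only included the 90th percentile, but I think you can work out how to add more. You may also want to include more data types than just float.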
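For the extra percentiles, one option is the PERCENTILE_CONT window function, available in SQL Server 2012 and later. The sketch below only builds the query text; the function name percentile_query and the column name price are hypothetical, and it assumes the column name is trusted, since it is pasted into the SQL. PERCENTILE_CONT interpolates between rows, so its result can differ slightly from a TOP 10 PERCENT approach, which returns an actual row value.

```python
def percentile_query(table, col, fractions=(0.25, 0.5, 0.75, 0.9)):
    """Build one T-SQL query returning several percentiles of `col`."""
    parts = [
        f"PERCENTILE_CONT({p}) WITHIN GROUP (ORDER BY [{col}]) OVER () "
        f"AS [p{int(round(p * 100))}]"
        for p in fractions
    ]
    # The window function repeats the same values on every row, so TOP 1 is enough.
    return f"SELECT TOP 1 {', '.join(parts)} FROM {table} WHERE [{col}] IS NOT NULL"


print(percentile_query("dbo.test_table", "price"))
```

The resulting string can be passed to pd.read_sql like any other query; it returns a single row with one column per percentile.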

This solution is ugly and slow, but it worked for me when pulling all the data into a single data frame failed for lack of memory.

import pyodbc
import pandas as pd


def sql2df(sql, connection):
    """Run a query and return the result as a DataFrame."""
    return pd.read_sql(sql=sql, con=connection)


cnx = pyodbc.connect(r'DRIVER={SQL Server};SERVER=.\SQLEXPRESS;DATABASE=TEST;Trusted_Connection=yes;')

# sp_columns returns one row per column, including its name and data type
df_columns = sql2df('exec sp_columns test_table', cnx)[['COLUMN_NAME', 'TYPE_NAME']]

# Keep only float columns; extend the check to int or other numeric types as needed
numeric_columns = []
for index, row in df_columns.iterrows():
    if row['TYPE_NAME'] == 'float':
        numeric_columns.append(row['COLUMN_NAME'])

final_df = pd.DataFrame(index=['stdev', 'count', '90%', 'mean'])

for col in numeric_columns:
    # Each query returns a single value, read with .iat[0, 0]
    # (DataFrame.get_value was removed in pandas 1.0)
    standard_dev = sql2df('SELECT STDEV(' + col + ') FROM dbo.test_table', cnx).iat[0, 0]

    cnt = sql2df('SELECT COUNT(' + col + ') FROM dbo.test_table', cnx).iat[0, 0]

    # percentile is 100-N, so the smallest value in the top 10 percent is the 90th percentile
    ninety_percentile = sql2df('SELECT MIN(subq.' + col + ') FROM (SELECT TOP 10 PERCENT ' + col +
                               ' FROM dbo.test_table ORDER BY ' + col + ' DESC) AS subq', cnx).iat[0, 0]

    mean = sql2df('SELECT AVG(' + col + ') FROM dbo.test_table', cnx).iat[0, 0]

    final_df[str(col)] = [standard_dev, cnt, ninety_percentile, mean]

print(final_df)
cnx.close()
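Part of the slowness comes from running one query per statistic per column. A possible variation, sketched here under the assumption that the column names are trusted, is to build a single query that computes every aggregate for every column in one table scan. The names describe_query, reshape and the `__` alias separator are hypothetical choices for this example, not part of pandas or pyodbc.

```python
STATS = (("count", "COUNT"), ("mean", "AVG"), ("std", "STDEV"),
         ("min", "MIN"), ("max", "MAX"))


def describe_query(table, columns):
    """Build one T-SQL query with COUNT, AVG, STDEV, MIN and MAX for every column."""
    parts = [
        f"{func}([{col}]) AS [{col}__{stat}]"
        for col in columns
        for stat, func in STATS
    ]
    return f"SELECT {', '.join(parts)} FROM {table}"


def reshape(row, columns):
    """Turn the single wide result row into {column: {stat: value}}."""
    return {
        col: {stat: row[f"{col}__{stat}"] for stat, _ in STATS}
        for col in columns
    }


print(describe_query("dbo.test_table", ["price", "weight"]))
```

With the names from the code above, something like wide = sql2df(describe_query('dbo.test_table', numeric_columns), cnx) followed by pd.DataFrame(reshape(wide.iloc[0], numeric_columns)) should give a frame laid out like describe(), with one column per table column and one row per statistic. Run it before cnx.close().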


2017-03-14 10:28:14
