Interpreting numpy polyfit on a large dataset

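I am analyzing a publicly available dataset: tax assessments of San Francisco properties (https://data.sfgov.org/Housing-and-Buildings/Historic-Secured-Property-Tax-Rolls/wv5m-vpq2). It can be downloaded as a CSV file, which I assume is saved as 'Historic_Secured_Property_Tax_Rolls.csv'.

Using this file, I am trying to work out the annual growth rate of land values, excluding zero values. The dataset is very large, and I get errors if I try to plot it, so I am first trying to rely on my understanding of how polyfit works.

I used the following code to derive a linear fit of the natural logarithm of the 'Land Value' column against the 'Fiscal Year' column: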

import numpy as np
import pandas as pd

# Read in data downloaded from https://data.sfgov.org/api/views/wv5m-vpq2/rows.csv?accessType=DOWNLOAD
df = pd.read_csv('Historic_Secured_Property_Tax_Rolls.csv')

# Only consider non-zero land values, since log(0) is undefined
df_nz = df[df['Closed Roll Assessed Land Value'] > 0]

# Degree-1 (linear) fit of log(land value) against fiscal year
p = np.polyfit(df_nz['Closed Roll Fiscal Year'], np.log(df_nz['Closed Roll Assessed Land Value']), 1)

This gives the following value of p:

In [42]: p 
Out[42]: array([ 4.18802559e-02, -7.23804441e+01])

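As far as I understand, the slope of the linear fit should be given by p[1]. However, that would represent an unreasonable growth rate of -724% per year. If it were p[0] instead, the figure would be a more reasonable 4.2% per year.

Have I misread the result somehow, and is the growth rate actually given by p[0] rather than p[1]?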

Source

2016-07-16 Kurt Peek

Data Incubator much? ;) –

"For the challenge questions, please: 1. Answer the questions yourself, without asking others for assistance." –

Returns 
------- 
p : ndarray, shape (M,) or (M, K) 
    Polynomial coefficients, highest power first. If `y` was 2-D, the 
    coefficients for `k`-th data set are in ``p[:,k]``.

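That tells me p[0], about 4.2%, is the slope: the coefficient on the year in the fit to the log values. p[1] is the intercept.

My first instinct would be to look at how the mean, median, and so on grow from year to year.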
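The "highest power first" ordering in the docstring can be checked on a small series with a known growth rate. The sketch below uses made-up data (a value growing exactly 4% per year; none of it comes from the tax roll) and also shows how a slope fitted to log values converts to a percentage: the yearly growth factor is exp(slope), so the rate is exp(slope) - 1. By the same conversion, the p[0] reported in the question corresponds to roughly 4.3% per year.

```python
import numpy as np

# Hypothetical data, not the SF tax roll: a value growing exactly 4% per year
years = np.arange(2007, 2016)
values = 1000.0 * 1.04 ** (years - 2007)

# Degree-1 fit of log(value) against year.
# np.polyfit returns coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(years, np.log(values), 1)

# The slope is the growth rate in log terms; convert it to a yearly percentage
annual_growth = np.exp(slope) - 1

print(f"slope (p[0]) = {slope:.4f}")
print(f"intercept (p[1]) = {intercept:.2f}")
print(f"annual growth = {annual_growth:.2%}")
```

The intercept comes out large and negative here too, simply because it is the fitted log value extrapolated back to year 0.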

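columns = ['Closed Roll Fiscal Year', 'Closed Roll Assessed Land Value']
df_ = df[columns].copy()
df_.columns = ['Year', 'Value']
df_ = df_[df_['Value'] > 0]  # drop zero land values before taking logs
df_['log_value'] = np.log(df_['Value'])

# Summary statistics of log value for each fiscal year
df_desc = df_.groupby('Year')['log_value'].describe()

desc_cols = ['mean', '25%', '50%', '75%']

# Recent pandas returns one column per statistic here;
# older versions needed df_desc.unstack() before selecting columns
df_desc[desc_cols].plot()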

Just a thought.

Source
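One way to take this idea a step further, sketched here as an assumption rather than as part of the original analysis, is to reduce the data to one median log value per year and fit the line through those few points. That summary is small enough to plot or print without trouble. The data below is synthetic (lognormal noise around 5% yearly growth), and the name df_demo is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tax roll (hypothetical values):
# 500 parcels per year, lognormal spread around 5% yearly growth
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2007, 2016), 500)
values = 1000.0 * 1.05 ** (years - 2007) * rng.lognormal(0.0, 0.5, size=years.size)
df_demo = pd.DataFrame({'Year': years, 'Value': values})

# One median log value per year: nine points instead of thousands of rows
yearly = np.log(df_demo['Value']).groupby(df_demo['Year']).median()

# Linear fit through the yearly medians; coefficients are [slope, intercept]
slope, intercept = np.polyfit(yearly.index.to_numpy(), yearly.to_numpy(), 1)

print(yearly.round(3))
print(f"median-based growth per year: {np.exp(slope) - 1:.2%}")
```

Fitting the medians weights every year equally and is less sensitive to a few extreme parcels, so its slope need not match the one from fitting all rows at once.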

2016-07-16 21:42:30 piRSquared

Thanks piRSquared, I must have misread the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html the first time. –
