2015-11-16 139 views

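Looping through a pandas DataFrame and creating new column values

I am trying to loop through a CSV file that I have converted into a pandas DataFrame.

I need to go through every row, check the longitude and latitude data I have (two separate columns), and add a code (0, 1, or 2) to the same row depending on whether the coordinates fall within a certain range.

I am fairly new to Python and would appreciate any help you can offer.

It is throwing a lot of errors at me.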

book = 'yellow_tripdata_2014-04.csv' 
write_book = 'yellow_04.csv' 
yank_max_long = -73.921630300 
yank_min_long = -73.931169700 
yank_max_lat = 40.832823000 
yank_min_lat = 40.825582000 
mets_max_long = 40.760523000 
mets_min_long = 40.753277000 
mets_max_lat = -73.841035400 
mets_min_lat = -73.850564600 

df = pd.read_csv(book) 


##To check for Yankee Stadium Lat's and Long's, if within gps units then Stadium_Code = 1 , if mets then Stadium_Code=2 

df['Stadium_Code'] = 0 

for i, row in df.iterrows(): 
    if yank_min_lat <= float(row['dropoff_latitude']) <= yank_max_lat and yank_min_long <=float(row('dropoff_longitude')) <=yank_max_long: 
     row['Stadium_Code'] == 1 
    elif mets_min_lat <= float(row['dropoff_latitude']) <= mets_max_lat and mets_min_long <=float(row('dropoff_longitude')) <=mets_max_long: 
     row['Stadium_Code'] == 2 

I tried using the .loc approach, but ran into this error message:

--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-33-9a9166772646> in <module>() 
----> 1 yank_mask = (df['dropoff_latitude'] > yank_min_lat) & (df['dropoff_latitude'] <= yank_max_lat) & (df['dropoff_longitude'] > yank_min_long) & (df['dropoff_longitude'] <= yank_max_long) 
     2 
     3 mets_mask = (df['dropoff_latitude'] > mets_min_lat) & (df['dropoff_latitude'] <= mets_max_lat) & (df['dropoff_longitude'] > mets_min_long) & (df['dropoff_longitude'] <= mets_max_long) 
     4 
     5 df.loc[yank_mask, 'Stadium_Code'] = 1 

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in __getitem__(self, key) 
    1795    return self._getitem_multilevel(key) 
    1796   else: 
-> 1797    return self._getitem_column(key) 
    1798 
    1799  def _getitem_column(self, key): 

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in _getitem_column(self, key) 
    1802   # get column 
    1803   if self.columns.is_unique: 
-> 1804    return self._get_item_cache(key) 
    1805 
    1806   # duplicate columns & possible reduce dimensionaility 

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/generic.py in _get_item_cache(self, item) 
    1082   res = cache.get(item) 
    1083   if res is None: 
-> 1084    values = self._data.get(item) 
    1085    res = self._box_item_values(item, values) 
    1086    cache[item] = res 

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/internals.py in get(self, item, fastpath) 
    2849 
    2850    if not isnull(item): 
-> 2851     loc = self.items.get_loc(item) 
    2852    else: 
    2853     indexer = np.arange(len(self.items))[isnull(self.items)] 

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/index.py in get_loc(self, key, method) 
    1570   """ 
    1571   if method is None: 
-> 1572    return self._engine.get_loc(_values_from_object(key)) 
    1573 
    1574   indexer = self.get_indexer([key], method=method) 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)() 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)() 

KeyError: 'dropoff_latitude' 

I am usually not too bad at working out what these error messages mean, but this one has thrown me.

In general, when you report that you are getting errors, it is useful to post the error traceback and the line on which the errors occur. – EdChum

Your error means you have misnamed the column. Can you post the output from 'df.columns.tolist()'? – EdChum

['VENDOR_ID', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_distance ', 'pickup_longitude', 'pickup_latitude', 'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'surcharge', ' mta_tax ', 'tip_amount', 'tolls_amount', 'total_amount', 'Stadium_Code'] –
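The column listing above contains at least two names with stray spaces ('trip_distance ' and ' mta_tax '). If the real header pads other names the same way, a lookup such as df['dropoff_latitude'] would raise exactly this KeyError. That is an assumption, since the listing as posted shows dropoff_latitude unpadded. A minimal sketch of checking for and removing such padding, using a hypothetical two-row sample in place of the real file:

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; note the spaces after
# the commas in the header row only.
csv_text = (
    "vendor_id, dropoff_longitude, dropoff_latitude\n"
    "CMT,-73.925,40.829\n"
    "VTS,-73.846,40.757\n"
)

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
# ['vendor_id', ' dropoff_longitude', ' dropoff_latitude']

# Strip leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

# The plain name now resolves without a KeyError.
lat = df['dropoff_latitude']
```

Passing skipinitialspace=True to pd.read_csv is another way to drop spaces that follow the delimiter, if that turns out to be the cause.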

Answers


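First of all, iterating row-wise is very wasteful when a vectorized solution exists that can operate on the whole df at once.

I would create a boolean mask for each of your two conditions and pass each mask to .loc to select the rows that meet the condition and set them to the desired value.

The masks combine the conditions with the bitwise operator & rather than `and`, and each condition is wrapped in parentheses because of operator precedence.

So the following should work: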

yank_mask = (df['dropoff_latitude'] > yank_min_lat) & (df['dropoff_latitude'] <= yank_max_lat) & (df['dropoff_longitude'] > yank_min_long) & (df['dropoff_longitude'] <= yank_max_long) 

mets_mask = (df['dropoff_latitude'] > mets_min_lat) & (df['dropoff_latitude'] <= mets_max_lat) & (df['dropoff_longitude'] > mets_min_long) & (df['dropoff_longitude'] <= mets_max_long) 

df.loc[yank_mask, 'Stadium_Code'] = 1 
df.loc[mets_mask, 'Stadium_Code'] = 2 
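A self-contained sketch of the same masking approach on a tiny hand-made frame (the three drop-off points are hypothetical). One assumption to flag: the mets_* constants in the question appear to have latitude and longitude swapped (40.75 is a latitude, -73.85 a longitude), so the sketch assigns them the other way round. With the values as posted, the Mets mask could not match any drop-off in New York.

```python
import pandas as pd

# Yankee Stadium bounds, copied from the question.
yank_min_lat, yank_max_lat = 40.825582, 40.832823
yank_min_long, yank_max_long = -73.9311697, -73.9216303

# Assumption: the question's mets_* lat/long values are swapped, so they
# are reassigned here to the axis they appear to belong to.
mets_min_lat, mets_max_lat = 40.753277, 40.760523
mets_min_long, mets_max_long = -73.8505646, -73.8410354

# Hypothetical drop-offs: inside the Yankee box, inside the Mets box, elsewhere.
df = pd.DataFrame({
    'dropoff_latitude': [40.829, 40.757, 40.700],
    'dropoff_longitude': [-73.926, -73.846, -74.000],
})
df['Stadium_Code'] = 0

# Each comparison is parenthesised because & binds tighter than > and <=.
yank_mask = (
    (df['dropoff_latitude'] > yank_min_lat) & (df['dropoff_latitude'] <= yank_max_lat)
    & (df['dropoff_longitude'] > yank_min_long) & (df['dropoff_longitude'] <= yank_max_long)
)
mets_mask = (
    (df['dropoff_latitude'] > mets_min_lat) & (df['dropoff_latitude'] <= mets_max_lat)
    & (df['dropoff_longitude'] > mets_min_long) & (df['dropoff_longitude'] <= mets_max_long)
)

df.loc[yank_mask, 'Stadium_Code'] = 1
df.loc[mets_mask, 'Stadium_Code'] = 2

print(df['Stadium_Code'].tolist())  # [1, 2, 0]
```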

If you have not done so already, I would read the docs; they will help you understand how the above works.

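There is an empty tick mark at the top left. I tried this approach earlier, but after running into the error above (added in the recent edit) I switched to a way I was more familiar with. –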
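For completeness, the loop in the question has problems of its own, separate from the KeyError: row('dropoff_longitude') calls the row instead of indexing it, == compares rather than assigns, and the row yielded by iterrows is generally a copy, so writing to it may not change df. A hedged sketch of a loop that does update the frame, on hypothetical sample data; the masking approach remains the faster choice on a full month of trips:

```python
import pandas as pd

# Yankee Stadium bounds, copied from the question.
yank_min_lat, yank_max_lat = 40.825582, 40.832823
yank_min_long, yank_max_long = -73.9311697, -73.9216303

# Hypothetical drop-offs: one inside the Yankee box, one outside.
df = pd.DataFrame({
    'dropoff_latitude': [40.829, 40.700],
    'dropoff_longitude': [-73.926, -74.000],
})
df['Stadium_Code'] = 0

for i, row in df.iterrows():
    lat = row['dropoff_latitude']    # square brackets, not row(...)
    lon = row['dropoff_longitude']
    if yank_min_lat <= lat <= yank_max_lat and yank_min_long <= lon <= yank_max_long:
        # Assign with = and write to df itself; row is only a copy.
        df.at[i, 'Stadium_Code'] = 1

print(df['Stadium_Code'].tolist())  # [1, 0]
```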