2014-12-20 135 views
2

我有一個包含經/緯度列表的數據框座標:矢量化在大熊貓的功能

d = {'Provider ID': {0: '10001', 
    1: '10005', 
    2: '10006', 
    3: '10007', 
    4: '10008', 
    5: '10011', 
    6: '10012', 
    7: '10016', 
    8: '10018', 
    9: '10019'}, 
'latitude': {0: '31.215379379000467', 
    1: '34.22133455500045', 
    2: '34.795039606000444', 
    3: '31.292159523000464', 
    4: '31.69311635000048', 
    5: '33.595265517000485', 
    6: '34.44060759100046', 
    7: '33.254429322000476', 
    8: '33.50314015000049', 
    9: '34.74643089500046'}, 
'longitude': {0: ' -85.36146587999968', 
    1: ' -86.15937514799964', 
    2: ' -87.68507485299966', 
    3: ' -86.25539902199966', 
    4: ' -86.26549483099967', 
    5: ' -86.66531866799966', 
    6: ' -85.75726760699968', 
    7: ' -86.81407933399964', 
    8: ' -86.80242858299965', 
    9: ' -87.69893502799965'}} 
df = pd.DataFrame(d) 

我的目標是使用半正矢函數計算出每個項目之間的距離在KM:

from math import radians, cos, sin, asin, sqrt 
def haversine(lon1, lat1, lon2, lat2): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) 

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 
    c = 2 * asin(sqrt(a)) 

    # 6367 km is the radius of the Earth 
    km = 6367 * c 
    return km 

我的目標是獲得一個數據幀,看起來像下面其中的值是每個供應商ID之間的距離result_df:

result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID']) 

我可以在循環中做到這一點,但速度非常慢。我在這個轉換爲量化的方法尋找一些幫助:

for first_hospital_coordinates in result_df.columns: 
    for second_hospital_coordinates in result_df['Provider ID']: 
     if first_hospital_coordinates == 'Provider ID': 
      pass 
     else: 
      L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values 
      O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values 
      L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values 
      O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values 

      distance = haversine(O1, L1, O2, L2) 

      crit = result_df['Provider ID'] == second_hospital_coordinates 
      result_df.loc[crit, first_hospital_coordinates] = distance 
+0

我回答了類似的問題:HTTP://stackoverflow.com/questions/25767596/using-haversine-formula-with-data-stored-在熊貓數據框/ 25767765#25767765 – EdChum

+0

關閉,haversine方面是相同的。這是以有效的方式創建10x10矩陣是主要的問題,儘管 – DataSwede

回答

4

向量化的代碼,你將需要在完整的數據幀,而不是對個人拉特和多頭操作。我在這方面做了一個嘗試。我需要的結果DF和新的功能H2,

import numpy as np 
def h2(df, p): 
    inrad = df.applymap(radians) 
    dlon = inrad.longitude-inrad.longitude[p] 
    dlat = inrad.latitude-inrad.latitude[p] 
    lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))]) 
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2 
    c = 2 * 1/np.sin(np.sqrt(a)) 
    km = 6367 * c 
    return km 

df = df.set_index('Provider ID') 
df = df.astype(float) 
df2 = pd.DataFrame(index = df.index, columns = df.index) 
for c in df2.columns: 
    df2[c] = h2(df, c) 

print (df2) 

這應該屈服,(我不能確定自己是否擁有正確的答案...我的目標是向量化的代碼)

Provider ID   10001   10005   10006   10007 \ 
Provider ID               
10001     inf 5.021936e+05 5.270062e+05 1.649088e+06 
10005  5.021936e+05   inf 9.294868e+05 4.985233e+05 
10006  5.270062e+05 9.294868e+05   inf 4.548412e+05 
10007  1.649088e+06 4.985233e+05 4.548412e+05   inf 
10008  1.460299e+06 5.777248e+05 5.246954e+05 3.638231e+06 
10011  6.723581e+05 2.004199e+06 1.027439e+06 6.394402e+05 
10012  4.559090e+05 3.265536e+06 7.573411e+05 4.694125e+05 
10016  7.680036e+05 1.429573e+06 9.105474e+05 7.517467e+05 
10018  7.096548e+05 1.733554e+06 1.020976e+06 6.701920e+05 
10019  5.436342e+05 9.278739e+05 2.891822e+07 4.638858e+05 

Provider ID   10008   10011   10012   10016 \ 
Provider ID               
10001  1.460299e+06 6.723581e+05 4.559090e+05 7.680036e+05 
10005  5.777248e+05 2.004199e+06 3.265536e+06 1.429573e+06 
10006  5.246954e+05 1.027439e+06 7.573411e+05 9.105474e+05 
10007  3.638231e+06 6.394402e+05 4.694125e+05 7.517467e+05 
10008     inf 7.766998e+05 5.401081e+05 9.496953e+05 
10011  7.766998e+05   inf 1.341775e+06 4.220911e+06 
10012  5.401081e+05 1.341775e+06   inf 1.119063e+06 
10016  9.496953e+05 4.220911e+06 1.119063e+06   inf 
10018  8.236437e+05 1.242451e+07 1.226941e+06 5.866259e+06 
10019  5.372119e+05 1.051748e+06 7.514774e+05 9.362341e+05 

Provider ID   10018   10019 
Provider ID        
10001  7.096548e+05 5.436342e+05 
10005  1.733554e+06 9.278739e+05 
10006  1.020976e+06 2.891822e+07 
10007  6.701920e+05 4.638858e+05 
10008  8.236437e+05 5.372119e+05 
10011  1.242451e+07 1.051748e+06 
10012  1.226941e+06 7.514774e+05 
10016  5.866259e+06 9.362341e+05 
10018     inf 1.048895e+06 
10019  1.048895e+06   inf 

[10 rows x 10 columns] 
+0

我改變了一些東西:將inrad = df.applymap(弧度)移出函數並進入預處理意味着它不會在我的實際集合中執行4000次以上。然後,而不是列表理解,data = np.empty(len(df.index)); data.fill(df.latitude [柱]); lat1 = pd.Series(index = df.index,data = data)似乎有點快。謝謝你的幫助! – DataSwede

1

你不需要任何幻想,只是幾個器官功能障礙綜合徵的功能。

首先,不要使用math庫。如果你正在做真正的數學或科學,你可能會更喜歡numpy。

其次,我們將使用數據幀方法applyapply所做的是接受一個函數,並通過它運行每一行(axis = 1)或列(axis = 0),並用所有返回的值構建一個新的熊貓物件。所以我們需要設置haversine作爲一個數據幀的行,然後解壓縮這些值。它變成:

def haversine(row): 
    """ 
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees) 
    """ 
    import numpy as np 

    # convert all of the row to radians 
    row = np.radians(row) 

    # unpack the values for convenience 
    lat1 = row['lat1'] 
    lat2 = row['lat2'] 
    lon1 = row['lon1'] 
    lon2 = row['lon2'] 

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2 
    c = 2 * np.arcsin(np.sqrt(a)) 

    # 6367 km is the radius of the Earth 
    km = 6367 * c 
    return km 

好吧,現在我們需要讓你的數據框形狀。在你的問題中,每件事都是一個字符串,這不適合做數學。因此,使用您的變量d,我說:

df = pandas.DataFrame(d).set_index('Provider ID').astype(float) 

使創建的字符串的數據幀,設置提供商爲索引,然後轉換所有列到花車,因爲我們正在做數學。

現在我們需要製作兩組座標的行。爲此,我們將使用shift方法並將結果加入原始數據框。這樣做,一下子看起來是這樣的:

df = df.join(df.shift(), lsuffix='1', rsuffix='2') 
print(df.head()) 

        lat1  lon1  lat2  lon2 
Provider ID            
10001  31.215379 -85.361466  NaN  NaN 
10005  34.221335 -86.159375 31.215379 -85.361466 
10006  34.795040 -87.685075 34.221335 -86.159375 
10007  31.292160 -86.255399 34.795040 -87.685075 
10008  31.693116 -86.265495 31.292160 -86.255399 

rsuffixlsuffix是連接操作過程中什麼是「1」和「2」追加到列名。

「2」列從df.shift()開始,您會注意到它們等於上一行的「1」列。你也會看到第一行的「2」列是NaN,因爲沒有前面的到第一行。

所以現在我們可以apply半正矢函數:

distance = df.apply(haversine, axis=1) 
print(distance) 
Provider ID 
10001     NaN 
10005   342.261590 
10006   153.567591 
10007   411.393751 
10008   44.566642 
10011   214.661170 
10012   125.775583 
10016   163.973219 
10018   27.659157 
10019   160.901128 
dtype: float64 
+0

啊,你太親近了!距離需要是每個提供商ID到每個其他提供商ID,而不僅僅是它下面的行。輸出應該是一個10x10矩陣,其距離爲值。 – DataSwede

+0

啊,我明白了。 @DataSwede。很高興你把它整理出來。 –

0

你應該能夠在整個事情操作。我對熊貓不是很熟悉,所以我只需要使用底層numpy陣列。使用您的數據d

df = pd.DataFrame(d) 
df1 = df.astype(float) 
a = np.radians(df1.values[:,1:]) 
# a.shape is 10,2, it contains the Lat/Lon only 

# transpose and subtract 
# add a new axes so they can be broadcast 
diff = a[...,np.newaxis] - a.T 
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:] 

b = np.square(np.sin(diff/2)) 
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:] 

# make this term: cos(Lat1) * cos(Lat2) 
cos_Lat = np.cos(a[:,0]) 
c = cos_Lat * cos_Lat[:, np.newaxis] # shape 10x10 

# sin^2(dLon/2) is b[:,1,:] 
b[:,1,:] = b[:,1,:] * c 
g = b.sum(axis = 1) 
h = 6367000 * 2 * np.arcsin((np.sqrt(g))) # meters 

返回一個pandas.DataFrame

df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values) 

我沒有嘗試任何性能測試。有很多中間數組的創建正在進行,並且可能會很昂貴 - 使用ufuncs的可選輸出參數可能會減輕這一點。

就地操作同樣的事情:

df = pd.DataFrame(d) 
df_A = df.astype(float) 
z = df_A.values[:,1:] 

# cos(Lat1) * cos(Lat2) 
w = np.cos(z[:,0]) 
w = w * w[:, np.newaxis] # w.shape is (10,10) 

# sin^2(dLat/2) and sin^2(dLon/2) 
np.radians(z, z) 
z = z[...,np.newaxis] - z.T 
np.divide(z, 2, z) 
np.sin(z, z) 
np.square(z,z) 
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:] 

# cos(Lat1) * cos(Lat2) * sin^2(dLon/2) 
np.multiply(z[:,1,:], w, z[:,1,:]) 

# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2) 
z = z.sum(axis = 1) 

np.sqrt(z, z) 
np.arcsin(z,z) 
np.multiply(z, 6367000 * 2, z) #meters 

df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)