2016-05-23 256 views
0

Selecting rows from a pandas DataFrame based on cKDTree indices

I'm trying to do some quick and dirty reverse geocoding.

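I have a DataFrame poi (about 50,000 rows) in which each point of interest has a latitude/longitude coordinate.

I also have a DataFrame postcode_existing (about 180,000 rows) that maps latitude/longitude coordinates to postcodes.

I pull out the relevant coordinate columns and use cKDTree to find, for each point of interest in poi, the nearest latitude/longitude coordinate in postcode_existing.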

import pandas as pd 
import numpy as np 
from scipy.spatial import cKDTree 

# read poi and postcode csv files 

# Extract subset 
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']] 

# Extract subset 
poi_coordinates = poi[['Latitude', 'Longitude']] 

# Construct tree 
tree = cKDTree(postcode_existing_coordinates) 

# Query 
distances, indices = tree.query(poi_coordinates) 

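I end up with the relevant indices. I'm now looking to use these indices to select rows from the postcode_existing DataFrame.

I tried postcode_existing.ix[indices], but this doesn't seem to return the correct rows.

For example: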

>>> postcode_existing.ix[indices].head() 
     Postcode Latitude Longitude Easting Northing GridRef \ 
78579 HA3 0NS 51.57553 -0.304296 517605.0 187658.0 TQ176876 
178499  NaN  NaN  NaN  NaN  NaN  NaN 
62392  NaN  NaN  NaN  NaN  NaN  NaN 
78662 HA3 0TA 51.58409 -0.288764 518659.0 188635.0 TQ186886 
79470  NaN  NaN  NaN  NaN  NaN  NaN 

       County District Ward DistrictCode ... Terminated \ 
78579 Greater London Brent Kenton E09000005 ...   NaN 
178499    NaN  NaN  NaN   NaN ...   NaN 
62392    NaN  NaN  NaN   NaN ...   NaN 
78662 Greater London Brent Kenton E09000005 ...   NaN 
79470    NaN  NaN  NaN   NaN ...   NaN 

     Parish NationalPark Population Households Built up area \ 
78579  NaN   NaN  72.0  25.0 Greater London 
178499 NaN   NaN  NaN  NaN    NaN 
62392  NaN   NaN  NaN  NaN    NaN 
78662  NaN   NaN  152.0  39.0 Greater London 
79470  NaN   NaN  NaN  NaN    NaN 

     Built up sub-division Lower layer super output area \ 
78579     Brent      Brent 004D 
178499     NaN       NaN 
62392     NaN       NaN 
78662     Brent      Brent 003E 
79470     NaN       NaN 

        Rural/urban Region 
78579 Urban major conurbation London 
178499      NaN  NaN 
62392      NaN  NaN 
78662 Urban major conurbation London 
79470      NaN  NaN 

[5 rows x 25 columns] 

However:

>>> postcode_existing.iloc[78579] 
Postcode             NW1 3AU 
Latitude             51.5237 
Longitude            -0.143188 
Easting             528915 
Northing             182163 
GridRef             TQ289821 
County            Greater London 
District            Westminster 
Ward          Marylebone High Street 
DistrictCode           E09000033 
WardCode            E05000641 
Country             England 
CountyCode            E11000009 
Constituency      Cities of London and Westminster 
Introduced            1980-01-01 
Terminated             NaN 
Parish              NaN 
NationalPark             NaN 
Population              7 
Households              1 
Built up area          Greater London 
Built up sub-division       City of Westminster 
Lower layer super output area     Westminster 013A 
Rural/urban        Urban major conurbation 
Region              London 
Name: 133733, dtype: object 

Also:

>>> postcode_existing.iloc[178499] 
Postcode          WC1E 6JL 
Latitude           51.5236 
Longitude          -0.135522 
Easting           529447 
Northing           182168 
GridRef           TQ294821 
County         Greater London 
District           Camden 
Ward           Bloomsbury 
DistrictCode         E09000007 
WardCode          E05000129 
Country           England 
CountyCode          E11000009 
Constituency      Holborn and St Pancras 
Introduced         1980-01-01 
Terminated           NaN 
Parish            NaN 
NationalPark           NaN 
Population            1 
Households            1 
Built up area        Greater London 
Built up sub-division        Camden 
Lower layer super output area    Camden 026D 
Rural/urban      Urban major conurbation 
Region           London 
Name: 307029, dtype: object 

This appears to be correct.

Why doesn't postcode_existing.ix[indices] select the correct rows? What should I use instead?

Answers

0

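I solved the problem. Because some rows had been dropped, the positions of the rows in the DataFrame no longer matched their index labels.

To fix this, I simply reset the index: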

postcode_existing.reset_index(inplace=True, drop=True) 

I was then able to extract the relevant rows with loc:

postcode_existing.loc[indices] 
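
A minimal sketch of the mismatch described above, on made-up data (the frame df, its Postcode values, and the positions array are hypothetical stand-ins, not the real postcode file). After a row is dropped, labels and positions diverge. Either iloc on the original frame or loc after reset_index(drop=True) then selects the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for postcode_existing.
df = pd.DataFrame({"Postcode": ["A", "B", "C", "D"]})

# Dropping a row leaves labels 0, 2, 3 while positions are 0, 1, 2.
df = df.drop([1])

# Hypothetical positional indices, of the kind a tree query returns.
positions = np.array([2, 1])

# Option 1: select by position on the frame as it is.
by_position = df.iloc[positions]

# Option 2: make labels equal positions again, then select by label.
fixed = df.reset_index(drop=True)
by_label_after_reset = fixed.loc[positions]

print(by_position["Postcode"].tolist())
print(by_label_after_reset["Postcode"].tolist())
```

Both options pick the rows at positions 2 and 1. The iloc route leaves the original index labels untouched, which can matter if other code relies on them.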
0

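The problem is that you are using integers in your index. That makes things messy when pandas is trying to keep track of list-based positions as well as labels. ix tries to figure it out. It is interpreting indices as list positions. In this case, use loc.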

Documentation

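DataFrame.ix: A primarily label-location based indexer, with integer position fallback.

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.

.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.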
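
To illustrate the label versus position distinction from the documentation, here is a small sketch on a made-up Series (the name s and its values are hypothetical). It uses only .loc and .iloc, since .ix was deprecated in pandas 0.20 and removed in pandas 1.0:

```python
import pandas as pd

# Integer labels deliberately out of step with row positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label

print(by_label, by_position)
```

With an integer index the same number can mean two different rows, which is why being explicit about .loc or .iloc avoids surprises.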

+0

I get the same problem with 'loc': '>>> postcode_existing.loc[indices].head() Postcode Latitude Longitude Easting Northing GridRef \ 78579 HA3 0NS 51.57553 -0.304296 517605.0 187658.0 TQ176876' –

+0

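'loc' is label-based. It pulls the row whose index label is '78579'. 'iloc' is position-based and will pull the row at position '78579'. Without a sample of your data, I can't verify or validate anything. I'm assuming the 'indices' object returned by 'tree.query(poi_coordinates)' refers to labels, and therefore that you should use 'loc'. If you're telling me that's wrong, I can't say, because I don't have your data. – piRSquared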
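
One way to check what tree.query returns without the original data is a small sketch on made-up coordinates (the frames postcodes and points, their values, and the index labels 10, 20, 30 are all hypothetical). The SciPy documentation describes the second return value as the index of each neighbor in the data the tree was built from, which suggests row positions rather than DataFrame labels:

```python
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical stand-in for postcode_existing, with labels that are not 0, 1, 2.
postcodes = pd.DataFrame(
    {"Latitude": [51.0, 52.0, 53.0], "Longitude": [0.0, 1.0, 2.0]},
    index=[10, 20, 30],
)

# Hypothetical stand-in for poi.
points = pd.DataFrame({"Latitude": [52.1, 50.9], "Longitude": [1.1, 0.1]})

# Build the tree from a plain array so the DataFrame index plays no part.
tree = cKDTree(postcodes[["Latitude", "Longitude"]].to_numpy())
distances, indices = tree.query(points[["Latitude", "Longitude"]].to_numpy())

# The returned values are positions in the array passed to cKDTree,
# so positional selection is the matching lookup.
nearest = postcodes.iloc[indices]

print(indices.tolist())
print(nearest.index.tolist())
```

Here the returned indices are positions 1 and 0, which correspond to labels 20 and 10. Under that reading, iloc is the lookup that matches the query result, and loc only agrees with it when the labels happen to equal the positions.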