Yes, you can use something like SpatialSpark. It only works with Spark 1.6.1, but you can use its BroadcastSpatialJoin, which creates an extremely efficient RTree.
Here is one of my examples of using SpatialSpark with PySpark to check whether different polygons are within one another or intersect:
from ast import literal_eval as make_tuple
from shapely.geometry import Polygon, Point  # provides the geometries and their .wkt used below

print "Java Spark context version:", sc._jsc.version()
spatialspark = sc._jvm.spatialspark  # SpatialSpark's JVM package, reached through Py4J
# Test geometries: rectangleC lies entirely inside rectangleA, rectangleB overlaps rectangleA
rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))
def geomABWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([
        (0L, rectangleC.wkt)
    ])

def geomABCWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt),
        (2L, rectangleC.wkt)
    ])

def geomDWithId():
    return sc.parallelize([
        (0L, pointD.wkt)
    ])
dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])
# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
joinRDD.count()
results = joinRDD.collect()
# each joined pair comes back as a JVM tuple; parse its string form into a Python tuple
map(lambda result: make_tuple(result.toString()), results)
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
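The same pattern works with the other operators listed above. As an illustrative sketch (assuming Intersects is exposed the same way as Within in the example above), an Intersects join of the ABC set against the single rectangle C would look like this:

intersectsRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfC._jdf, SpatialOperator.Intersects(), 0.0)
map(lambda result: make_tuple(result.toString()), intersectsRDD.collect())
# should pair rectangleA and rectangleC with C, i.e. something like [(0, 0), (2, 0)]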
Note that in the line

joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)

the last argument is a buffer value; in your case it would be the tolerance you want to use. If you are working with latitude/longitude it will probably be a very small number, because lat/lon is a radial system, and depending on the tolerance you want, you will need to calculate it based on lat/lon for your area of interest.
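As a rough, hedged sketch of that calculation (the helper buffer_degrees is hypothetical; the constants are the usual small-distance approximations: one degree of latitude is about 111,320 m, and a degree of longitude shrinks by cos(latitude)):

import math

def buffer_degrees(tolerance_m, latitude_deg):
    # approximate conversion, valid only for small distances
    deg_lat = tolerance_m / 111320.0
    deg_lon = tolerance_m / (111320.0 * math.cos(math.radians(latitude_deg)))
    # take the larger of the two as a conservative buffer for the join
    return max(deg_lat, deg_lon)

buffer_degrees(100.0, 45.0)  # roughly 0.00127 degrees for a 100 m tolerance at 45N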
A k-d tree could be a starting point. –
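For what it's worth, a minimal sketch of that idea with SciPy's cKDTree (illustrative only: a k-d tree indexes points, not polygons, so you would query on centroids or sampled vertices):

from scipy.spatial import cKDTree

points = [(0.0, 0.0), (7.5, 7.5), (-1.0, -1.0)]
tree = cKDTree(points)
# indices of all points within distance 2.0 of (7, 7) -> [1]
tree.query_ball_point((7.0, 7.0), r=2.0)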
I have worked on a similar use case. We used the properties of GeoHashes to define cells, with a process per cell. This is still a broad question, though. Maybe you could show the code of your current approach to drive the discussion? – maasg
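As a hedged sketch of that cell idea (assuming the python-geohash package and its geohash.encode(lat, lon, precision) call; the precision value is an arbitrary choice): points sharing a geohash prefix fall into the same cell, so the expensive pairwise check only runs within a cell.

import geohash  # python-geohash

pts = [(48.8566, 2.3522), (48.8570, 2.3530), (40.7128, -74.0060)]
# key each point by a coarse geohash cell, then group; precision 5 gives roughly 5 km cells
cells = sc.parallelize(pts) \
          .map(lambda p: (geohash.encode(p[0], p[1], precision=5), p)) \
          .groupByKey()
cells.mapValues(list).collect()

One known caveat: neighbours that straddle a cell border land in different cells, which is usually handled by also checking the adjacent cells.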
@LostInOverflow - sure, going through it. – OrangeRind