How to filter datastore data before mapping to cloud storage using the MapReduce API?

Regarding the code lab here, how can we filter the datastore data in a mapreduce job rather than fetching all objects of a given entity kind?

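In the mapper pipeline definition below, the only input reader parameter is the entity kind to process, and I can't see any other parameter of type filter in the InputReader class that could help.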

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)

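Since Google's BigQuery works better with a denormalized data model, it would be nice to be able to build one table from several datastore entity kinds (a JOIN), but I can't see how to do that either.

Source

2012-08-07 Charles

Depending on your application, you may be able to solve this by passing the filters parameter, which is an "optional list of filters to apply to the query. Each filter is a tuple: (<property_name_as_str>, <query_operation_as_str>, <value>)".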

So in your input reader parameters:

"input_reader": {
    "entity_kind": entity_type,
    "filters": [("datastore_property", "=", 12345),
                ("another_datastore_property", ">", 200)]
}
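A minimal sketch of how the reader parameters above might be assembled and sanity-checked before the pipeline is started. `build_input_reader_params` is a hypothetical helper, not part of the mapreduce library, and `"MyKind"` is a placeholder kind name. It only checks the tuple shape quoted above; it does not verify which query operations the reader actually accepts.

```python
def build_input_reader_params(entity_kind, filters=None):
    """Return an "input_reader" params dict (hypothetical helper).

    Each filter must be a 3-tuple:
    (property_name_as_str, query_operation_as_str, value).
    """
    params = {"entity_kind": entity_kind}
    if filters:
        checked = []
        for f in filters:
            # Reject anything that is not a tuple of exactly three items.
            if not isinstance(f, tuple) or len(f) != 3:
                raise ValueError("each filter must be a 3-tuple: %r" % (f,))
            prop, op, value = f
            # Property name and operation are both expected to be strings.
            if not isinstance(prop, str) or not isinstance(op, str):
                raise ValueError("property and operation must be strings: %r" % (f,))
            checked.append((prop, op, value))
        params["filters"] = checked
    return params


reader_params = build_input_reader_params(
    "MyKind",  # placeholder entity kind
    [("datastore_property", "=", 12345),
     ("another_datastore_property", ">", 200)],
)
```

The resulting dict could then be passed as the "input_reader" entry of `params` in the MapperPipeline call.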

Source

2012-08-07 18:24:51

Thanks Michael. 1/ You need to do an `svn update` on map_reduce, because according to the SVN log this feature was rolled out on August 1st. – Charles 2012-08-08 09:56:14

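2/ It seems to be bugged for now, because your list of tuples gets converted to a list of lists, which raises a BadReaderParamsError exception **Filter should be a tuple** (i.e. `[("datastore_property", "=", 12345), ("another_datastore_property", ">", 200)]` becomes `[["datastore_property", "=", 12345], ["another_datastore_property", ">", 200]]`). Still digging to find the cause. – Charles 2012-08-08 10:03:45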
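The conversion described in this comment can be reproduced without App Engine. The sketch below assumes, without this having been confirmed in the thread, that the job parameters pass through JSON somewhere between the pipeline and the input reader. JSON has no tuple type, so tuples are decoded as lists. `restore_filter_tuples` is a hypothetical helper, not a library function, showing how the tuples could be restored after such a round trip.

```python
import json

filters = [("datastore_property", "=", 12345),
           ("another_datastore_property", ">", 200)]

# JSON only has arrays, so every tuple comes back as a list.
round_tripped = json.loads(json.dumps(filters))


def restore_filter_tuples(raw_filters):
    """Convert each decoded filter (a list) back into a tuple."""
    return [tuple(f) for f in raw_filters]


restored = restore_filter_tuples(round_tripped)
```

Where such a conversion would need to be applied depends on the library internals, which this thread does not establish.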

It seems [this issue](http://code.google.com/p/appengine-mapreduce/issues/detail?id=138&q=filter&colspec=ID%20Type%20Status%20Priority%20Component%20Owner%20Summary) has actually already been reported. – Charles 2012-08-08 14:16:11
