Partitioning a table

BigQuery currently only allows partitioning by date.

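Let's suppose I have a table with 1 billion rows and an inserted_timestamp field. Let's say this field holds dates going back 1 year.

What is the right way to move the existing data into a new partitioned table?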

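Edited

I saw there was an elegant solution in Java for versions < 2.0, "Sharding BigQuery output tables", also elaborated in "BigQuery partitioning with Beam streams": parametrize the table name (or partition suffix) by windowing the data.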
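For readers wondering what "parametrize the table name (or partition suffix) by window" amounts to, here is a minimal sketch in plain Python, not the Beam API. It assumes the window start is available as a Unix timestamp in seconds, that partitions are daily in UTC, and that the target is addressed with BigQuery's `table$YYYYMMDD` partition decorator. The function names and `my_dataset.my_table` are made up for illustration.

```python
from datetime import datetime, timezone


def partition_suffix(window_start_seconds):
    """Return the YYYYMMDD day (UTC) that contains the given Unix timestamp."""
    day = datetime.fromtimestamp(window_start_seconds, tz=timezone.utc)
    return day.strftime("%Y%m%d")


def partitioned_table(table, window_start_seconds):
    """Append a partition decorator to a table name (hypothetical helper)."""
    return "{}${}".format(table, partition_suffix(window_start_seconds))


# Hypothetical usage: a window that starts at noon UTC on 2017-10-13.
start = datetime(2017, 10, 13, 12, 0, tzinfo=timezone.utc).timestamp()
print(partitioned_table("my_dataset.my_table", start))  # my_dataset.my_table$20171013
```

Any window that starts on the same UTC day maps to the same suffix, so 60-second windows would still end up in one daily partition.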

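But I miss 012x in the Beam 2.x project, and there are no examples of getting the window time from a Python serializable function.

I tried to do the partitioning in the pipeline, but it fails with a large number of partitions (it runs with 100, but fails with 1000).

This is my code, as far as I got: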

    import apache_beam as beam

    (p
     | 'lectura' >> beam.io.ReadFromText(input_table)
     | 'noheaders' >> beam.Filter(lambda s: s[0].isdigit())
     | 'addtimestamp' >> beam.ParDo(AddTimestampDoFn())
     | 'window' >> beam.WindowInto(beam.window.FixedWindows(60))
     | 'table2row' >> beam.Map(to_table_row)
     | 'write2table' >> beam.io.Write(beam.io.BigQuerySink(
         output_table,  # <-- unable to parametrize by window
         dataset=my_dataset,
         project=project,
         schema='dia:DATE, classe:STRING, cp:STRING, import:FLOAT',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
     )))

    p.run()
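The pipeline above does not show `to_table_row` or the header filter in isolation, so here is a minimal sketch of what they might look like in plain Python. The column order (dia, classe, cp, import) and the comma delimiter are assumptions taken from the schema string, not confirmed by the question, and `is_data_line` is a hypothetical stand-in for the `'noheaders'` lambda that also tolerates empty lines.

```python
def is_data_line(line):
    """Keep lines that start with a digit; skip headers and empty lines."""
    return bool(line) and line[0].isdigit()


def to_table_row(line):
    """Turn one CSV line into a dict keyed by the schema's column names."""
    # Assumed column order: dia, classe, cp, import.
    dia, classe, cp, amount = [field.strip() for field in line.split(",")]
    return {"dia": dia, "classe": classe, "cp": cp, "import": float(amount)}


# Hypothetical input: one header line, one empty line, one data line.
lines = ["dia,classe,cp,import", "", "2017-10-13,A,08001,12.5"]
rows = [to_table_row(line) for line in lines if is_data_line(line)]
print(rows)
```

The original lambda `s[0].isdigit()` raises an IndexError on an empty line, which is why the sketch checks for an empty string first.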

Source

2017-10-13 danihp

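https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables should be relevant; it covers several approaches. Also, I think you should be able to use JSON or AVRO instead of CSV to avoid using flat files. –

@NhanNguyen, I just edited my question to be more specific. There was an elegant solution in < 2.0 that I miss in 2.x. Thanks for your link; I followed it, and it is a very relevant question. Thanks again. – danihp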

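All the functionality needed to do this exists in Beam, although it may currently be limited to the Java SDK. You can use BigQueryIO. Specifically, you can use DynamicDestinations to determine the destination table for each row.

From the DynamicDestinations example: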

events.apply(BigQueryIO.<UserEvent>write()
    .to(new DynamicDestinations<UserEvent, String>() {
      // Key each element by its user id; the key selects the destination.
      public String getDestination(ValueInSingleWindow<UserEvent> element) {
        return element.getValue().getUserId();
      }
      public TableDestination getTable(String user) {
        return new TableDestination(tableForUser(user), "Table for user " + user);
      }
      public TableSchema getSchema(String user) {
        return tableSchemaForUser(user);
      }
    })
    .withFormatFunction(new SerializableFunction<UserEvent, TableRow>() {
      public TableRow apply(UserEvent event) {
        return convertUserEventToTableRow(event);
      }
    }));
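To show the per-row routing idea in the question's own language, here is a rough plain-Python analogue. It is not the Beam Python API, only an illustration of "compute a destination key per row, then write each group to its table". It assumes rows are dicts with a `dia` field formatted as YYYY-MM-DD and uses a made-up base table name.

```python
from collections import defaultdict


def destination_for(row, base_table="my_dataset.my_table"):
    """Pick the daily partition for a row from its 'dia' field (YYYY-MM-DD)."""
    return "{}${}".format(base_table, row["dia"].replace("-", ""))


def group_by_destination(rows):
    """Group rows by destination, the way a per-row router would."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[destination_for(row)].append(row)
    return dict(grouped)


# Hypothetical rows spanning two days.
rows = [
    {"dia": "2017-10-13", "import": 1.0},
    {"dia": "2017-10-14", "import": 2.0},
    {"dia": "2017-10-13", "import": 3.0},
]
for table, group in sorted(group_by_destination(rows).items()):
    print(table, len(group))
```

In the Java example, `getDestination` plays the role of `destination_for`, and `getTable` turns the key into an actual table reference.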

Source

2017-10-16 20:40:13

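Why isn't there a Python wrapper to do this? Should I use Java instead of Python for Dataflow projects? Do you know whether Google is devoting its resources to Java? I mean, if I work in Python, will I miss more features than this one? Thanks! – danihp

As this demonstrates, the Java and Python SDKs differ in functionality. Closing these gaps is part of Apache Beam's ongoing effort. This specific issue is tracked as [BEAM-2801](https://issues.apache.org/jira/browse/BEAM-2801). –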
