Reading data from Google Cloud BigQuery
I am new to pipelines and to the Google Cloud Dataflow API. I want to read data from BigQuery using a SQL query. Reading the whole table works fine:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.from("test:DataSetTest.data"));
However, when I use fromQuery instead, I get an error:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.fromQuery("SELECT * FROM DataSetTest.data"));
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Validation of query "SELECT * FROM DataSetTest.data" failed. If the query depends on an earlier stage of the pipeline, This validation can be disabled using #withoutValidation.
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:449)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.validate(BigQueryIO.java:432)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:357)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:267)
at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:151)
at Test.java.packageid.StarterPipeline.main(StarterPipeline.java:72)
Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
at com.google.api.client.util.Preconditions.checkNotNull(Preconditions.java:140)
at com.google.api.services.bigquery.Bigquery$Jobs$Query.<init>(Bigquery.java:1751)
at com.google.api.services.bigquery.Bigquery$Jobs.query(Bigquery.java:1724)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:445)
... 6 more
What is wrong here?
Update:
I now set the project via "options.setProject".
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
options.setProject("test");
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.fromQuery("SELECT * FROM DataSetTest.data"));
But now I get this message instead, saying the table was not found:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832", "reason" : "notFound" } ], "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832" }
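I have already specified the project using the Google Cloud SDK. – Jan
Unfortunately, the Google Cloud SDK changed the location where it stores the project ID. As a result, there are combinations of Cloud SDK and Dataflow SDK versions in which the project is not picked up automatically. This should be resolved in Dataflow SDK version 1.4.0 and later, which will be released in a few days. In the meantime, please specify the '--project' PipelineOption. –
Do I need a storage bucket to access data in GC BigQuery? – Jan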
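For reference, the '--project' suggestion from the comment above could look roughly like the sketch below. It assumes the Dataflow SDK 1.x package names that appear in the stack trace, reuses the class name StarterPipeline and the query from the question, and expects the program to be started with an argument such as --project=test. It only addresses the missing projectId error, not the later "table not found" message, and it needs the SDK on the classpath plus valid Google Cloud credentials to run.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.GcpOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class StarterPipeline {
  public static void main(String[] args) {
    // Parse command-line flags (e.g. --project=test) into GcpOptions,
    // the options interface that carries the project ID.
    GcpOptions options = PipelineOptionsFactory.fromArgs(args).as(GcpOptions.class);

    // Fail with a clear message if no project could be determined,
    // instead of hitting the NullPointerException during query validation.
    if (options.getProject() == null) {
      throw new IllegalArgumentException("Pass the project explicitly, e.g. --project=test");
    }

    // Create the pipeline only after the options are fully populated.
    Pipeline p = Pipeline.create(options);

    // Same query-based read as in the question.
    PCollection<TableRow> qData = p.apply(
        BigQueryIO.Read
            .named("Read")
            .fromQuery("SELECT * FROM DataSetTest.data"));

    p.run();
  }
}
```

Setting the project before calling Pipeline.create, rather than afterwards as in the update to the question, keeps all configuration in one place before any transform is validated.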