您可以使用PCollectionTuples從單個轉換中返回多個PCollections。例如,
TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");
Pipeline p = Pipeline.create(options);
PCollectionTuple all = p
.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
.apply(new SomeTransformation());
all.get(mainOutput)
.apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
.apply(TextIO.Write.named("WriteMissingData").to(...));
...
PCollectionTuples可以直接建立從現有PCollections的,或者具有側輸出從帕爾操作發射的,例如
PCollectionTuple partitioned = input.apply(ParDo
.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
if (checkOK(c.element()) {
// Shows up in partitioned.get(mainOutput).
c.output(...);
} else if (hasMissingData(c.element())) {
// Shows up in partitioned.get(missingData).
c.sideOutput(missingData, c.element());
} else {
// Shows up in partitioned.get(badValues).
c.sideOutput(badValues, c.element());
}
}
})
.withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));
注意,在一般的各種側輸出不需要具有相同的類型,並且數據可以被髮射的任何次數,以任何數量的側輸出的(而不是嚴格分區,我們在這裏)。然後
你SomeTransformation類可能看起來像
class SomeTransformation extends PTransform<PCollection<String>,
PCollectionTuple> {
public PCollectionTuple apply(PCollection<String> input) {
// Filter into good and bad data.
PCollectionTuple partitioned = ...
// Process the good data.
PCollection<String> processed =
partitioned.get(mainOutput)
.apply(...)
.apply(...)
...;
// Repackage everything into a new output tuple.
return PCollectionTuple.of(mainOutput, processed)
.and(missingData, partitioned.get(missingData))
.and(badValues, partitioned.get(badValues));
}
}