Resolving Apache Beam pipeline import error [BoundedSource objects larger than the allowable limit]

I have a large number of text files (~1M) stored on Google Cloud Storage. When I read these files into a Google Cloud Dataflow pipeline for processing, I always get the following error:

Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit

The troubleshooting page says:

You might encounter this error if you're reading from a very large number of files via TextIO, AvroIO or some other file-based source. The particular limit depends on the details of your source (e.g. embedding schema in AvroIO.Read will allow fewer files), but it is on the order of tens of thousands of files in one pipeline.

Does this mean I have to split the files into smaller batches rather than importing them all at once?

I am using the Dataflow Python SDK to develop the pipeline.
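To make the batching idea in the question concrete, here is a minimal sketch in plain Python of splitting a long list of file paths into fixed-size groups, each of which could then be fed to a separate pipeline run. The helper name `batch_paths`, the bucket and file names, and the batch size of 10,000 are all assumptions made for illustration. The troubleshooting text only says the limit is "on the order of tens of thousands of files", so this is not a documented threshold or a confirmed fix.

```python
from typing import Iterator, List, Sequence


def batch_paths(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical bucket and file names, for illustration only.
all_paths = [f"gs://example-bucket/texts/file-{i:07d}.txt" for i in range(25_000)]

# 10,000 is an assumed batch size chosen to stay below "tens of thousands";
# the real limit depends on the source and is not stated precisely.
batches = list(batch_paths(all_paths, 10_000))
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each batch would still need to be handed to the pipeline separately, so this only illustrates the splitting step, not whether splitting is actually required.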

Source: 2017-08-29, Youxun Shen

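I'm not sure why people are voting to close this question. It is a perfectly reasonable problem that people often run into when programming with Apache Beam. – jkff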