I am following the guide on the Cascading website. I have the following TSV-formatted input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
I use the following code to process it:
Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
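It looks like it splits only the second part of each line (ignoring the doc_id part). How does Cascading skip the first doc_id column and process only the second one? Is it because of TextDelimited?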
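To make the question concrete, here is a plain-Java sketch of what I suspect is happening. It does not use Cascading at all. It assumes that `TextDelimited(true, "\t")` takes the field names `doc_id` and `text` from the header row, and that the second argument to `Each` (the `text` Fields) acts as an argument selector, so only that column reaches the splitter. The class name and the helpers `select` and `tokens` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArgumentSelectorSketch {
    // Same delimiter pattern that is passed to RegexSplitGenerator in the question.
    static final String SPLIT_REGEX = "[ \\[\\]\\(\\),.]";

    // Hypothetical helper: pick one named column out of a tab-separated row,
    // using the header row to find its position (an assumption about what
    // a header-aware delimited scheme plus an argument selector amounts to).
    static String select(String headerLine, String dataLine, String field) {
        List<String> names = Arrays.asList(headerLine.split("\t"));
        String[] values = dataLine.split("\t");
        return values[names.indexOf(field)];
    }

    // Hypothetical helper: split the selected text with the same regex.
    // Note that String.split keeps empty strings between adjacent delimiters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split(SPLIT_REGEX)) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String header = "doc_id\ttext";
        String row = "doc05\tTwo Women. Secrets. A Broken Land. [DVD Australia]";

        // Only the "text" column is handed to the splitter; doc_id never gets there.
        String text = select(header, row, "text");
        for (String token : tokens(text)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

If that reading is right, doc_id is not being dropped by TextDelimited itself. It is simply never passed to the function, and `Fields.RESULTS` then keeps only the function's output (`token`).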