2013-11-20

Cascading TextDelimited log file

I am following the guide on the Cascading website. I have the following input in TSV format:

doc_id text 
doc01 A rain shadow is a dry area on the lee back side of a mountainous area. 
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover. 
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain. 
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley. 
doc05 Two Women. Secrets. A Broken Land. [DVD Australia] 

I use the following code to process it:

Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath); 
... 
Fields token = new Fields("token"); 
Fields text = new Fields("text"); 
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"); 
// only returns "token" 
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS); 

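It looks like only the second part of each line is split (the doc_id part is ignored). How does Cascading skip the doc_id part and process only the second part? Is it because of TextDelimited?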

Answers

0

Look at the pipe declaration:

Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS); 

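The second argument selects the only field that is passed to the splitter function. Here you are passing the "text" field, so only the text reaches the splitter, which returns the tokens. The doc_id field is never handed to the function.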

The Javadoc for this constructor spells out each parameter:

Each

@ConstructorProperties(value={"name","argumentSelector","function","outputSelector"}) 
public Each(String name, 
            Fields argumentSelector, 
            Function function, 
            Fields outputSelector) 

Only pass argumentFields to the given function, only return fields selected by the outputSelector. 

Parameters: 
    name - name for this branch of Pipes 
    argumentSelector - field selector that selects Function arguments from the input Tuple 
    function - Function to be applied to each input Tuple 
    outputSelector - field selector that selects the output Tuple from the input and Function results Tuples 
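
To make the selector behaviour concrete, here is a minimal sketch in plain Java. It is not Cascading code: the class and method names are made up for illustration, a Map stands in for a Tuple, and skipping empty pieces is a simplification of this sketch rather than a statement about what RegexSplitGenerator does. It only shows the idea that the function sees the selected field and nothing else, and that only the function's results come out (as with Fields.RESULTS).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ArgumentSelectorSketch {
    // Same pattern that the question passes to RegexSplitGenerator.
    static final String SPLIT_REGEX = "[ \\[\\]\\(\\),.]";

    // Hypothetical stand-in for Each(name, argumentSelector, function, Fields.RESULTS):
    // pick one field out of the tuple, split it, and return only the split results.
    static List<String> splitSelectedField(Map<String, String> tuple, String argumentField) {
        String argument = tuple.get(argumentField); // doc_id is never read here
        List<String> tokens = new ArrayList<>();
        for (String piece : argument.split(SPLIT_REGEX)) {
            if (!piece.isEmpty()) { // simplification made by this sketch
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> tuple = Map.of(
                "doc_id", "doc05",
                "text", "Two Women. Secrets. A Broken Land. [DVD Australia]");
        System.out.println(splitSelectedField(tuple, "text"));
    }
}
```

Because "doc05" is stored under a different field name, it cannot show up among the tokens, which matches what you observed.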
0

The answer is in these two lines:

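1. When the Tap is created, the program is told that the first line contains the header (the "true" argument), so the column names doc_id and text are taken from that line.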

Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);  

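2. This line supplies the column name "text". If you look closely at the input file, "text" is the name of the column that holds the data you are trying to count words in.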

Fields text = new Fields("text");
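
As a rough illustration of these two points, the following plain-Java sketch pairs a header line with a data line so that a column can be looked up by name. It is only an analogy for what a header-aware, tab-delimited scheme gives you, not Cascading's implementation; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSketch {
    // Pairs each value in a tab-separated data line with the column name
    // found at the same position in the header line.
    static Map<String, String> toTuple(String headerLine, String dataLine) {
        String[] names = headerLine.split("\t");
        String[] values = dataLine.split("\t", -1); // -1 keeps trailing empty columns
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            tuple.put(names[i], values[i]);
        }
        return tuple;
    }

    public static void main(String[] args) {
        String header = "doc_id\ttext";
        String line = "doc01\tA rain shadow is a dry area on the lee back side of a mountainous area.";
        Map<String, String> tuple = toTuple(header, line);
        // Selecting by name picks out the second column only.
        System.out.println(tuple.get("text"));
    }
}
```

Once the columns have names, asking for "text" leaves doc_id untouched, which is why the splitter only ever works on the sentence.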