2013-05-22 36 views

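Apache Pig - How to get the number of matching elements between multiple bags?

I am a new user of Apache Pig and I have a problem to solve.

I am trying to build a small search engine with Apache Pig. The idea is simple: I have a file which is the concatenation of multiple documents (one document per line). Here is an example with three documents: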

1,word1 word4 word2 word1 
2,word2 word6 word1 word5 word3 
3,word1 word3 word4 word5 

Then I create a bag of words for each document, using these lines of code:

docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray); 
B = FOREACH docs GENERATE line; 
C = FOREACH B GENERATE TOKENIZE(line) as gu; 

Then I remove the duplicate entries from each bag:

filtered = FOREACH C { 
    uniq = DISTINCT gu; 
    GENERATE uniq; 
} 

Here is the result of this code:

DUMP filtered; 

({(word1), (word4), (word2)}) 
({(word2), (word6), (word1), (word5), (word3)}) 
({(word1), (word3), (word4), (word5)}) 

So I have one bag of words per document, just as I wanted.

Now, let's consider the user's query, which is stored in a file:

word2 word7 word5 

I transform the query into a bag of words:

query = LOAD '$query' AS (line_query:chararray); 
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer; 

DUMP bag_query; 

Here is the result:

({(word2), (word7), (word5)}) 

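Now, here is my problem: I want to get the number of matches between the query and each document. For this example, I would like the following output: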

1 
2 
1 

I tried to do a join between the bags, but it did not work.

Could you help me, please?

Thanks.

Answers

1

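If you are fine with not using any UDF, you can do it by pivoting (flattening) the bags into one row per word and then doing everything SQL style: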

docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray); 
C = FOREACH docs GENERATE id, TOKENIZE(line) as gu; 
pivoted = FOREACH C { 
    uniq = DISTINCT gu; 
     GENERATE id, FLATTEN(uniq) as word; 
}; 
filtered = FILTER pivoted BY word MATCHES '(word2|word7|word5)'; 
--dump filtered; 
count_id_matched = FOREACH (GROUP filtered BY id) GENERATE group as id, COUNT(filtered) as count; 

dump count_id_matched; 

count_word_matched_in_docs = FOREACH (GROUP filtered BY word) GENERATE group as word, COUNT(filtered) as count; 

dump count_word_matched_in_docs; 
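
The filter above hardcodes the query words in a regular expression. As a minimal sketch of the same flatten-then-group idea, the query could instead be read from the '$query' file used in the question and matched with a JOIN. The aliases doc_words, doc_uniq, query_words, query_uniq, matched and match_counts are hypothetical names chosen for this illustration; the '$documents' and '$query' parameters are taken from the question.

```pig
-- Sketch only: all alias names below other than docs and query are
-- illustrative and do not come from the original answer.
docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
-- One row per (document id, word), then drop repeated words per document.
doc_words = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(line)) AS word;
doc_uniq = DISTINCT doc_words;

query = LOAD '$query' AS (line_query:chararray);
-- One row per query word, without duplicates.
query_words = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS word;
query_uniq = DISTINCT query_words;

-- Inner join keeps only the (id, word) pairs whose word is in the query.
matched = JOIN doc_uniq BY word, query_uniq BY word;
match_counts = FOREACH (GROUP matched BY doc_uniq::id)
               GENERATE group AS id, COUNT(matched) AS cnt;
DUMP match_counts;
```

With the three documents and the query from the question, this should give a count of 1 for document 1, 2 for document 2 and 1 for document 3. As with the FILTER version, a document that shares no word with the query produces no row at all rather than a row with a count of 0.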

I tried your solution and it works perfectly. Thanks! :)

1

Try using SetIntersect (a DataFu UDF - https://github.com/linkedin/datafu) together with SIZE to get the number of elements in the resulting bag.


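Thanks for your reply, but it does not work. In fact, my bags are in separate variables, and it seems that SetIntersect requires the bags to be in the same variable.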

0

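As SNeumann pointed out, you can use DataFu's SetIntersect for your example.

Building on your example, given these documents (note that document 2 here also contains word7):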

1,word1 word4 word2 word1 
2,word2 word6 word1 word5 word3 word7 
3,word1 word3 word4 word5 

And given this query:

word2 word7 word5 

Then this code gives you what you want:

define SetIntersect datafu.pig.sets.SetIntersect(); 

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray); 
B = FOREACH docs GENERATE id, line; 
C = FOREACH B GENERATE id, TOKENIZE(line) as gu; 

filtered = FOREACH C { 
    uniq = DISTINCT gu; 
    GENERATE id, uniq; 
} 

query = LOAD 'query' AS (line_query:chararray); 
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query; 
-- sort the bag of tokens, since SetIntersect requires it 
bag_query = FOREACH bag_query { 
    query_sorted = ORDER query BY token; 
    GENERATE query_sorted; 
} 

result = FOREACH filtered { 
    -- sort the tokens, since SetIntersect requires it 
    tokens_sorted = ORDER uniq BY token; 
    GENERATE id, 
      SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt; 
} 

DUMP result; 

This produces:

(1,1) 
(2,3) 
(3,1) 
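
The script above refers to `bag_query.query_sorted` inside the FOREACH over `filtered`. Pig treats this as a scalar projection, which only works at runtime when `bag_query` contains exactly one row. As a hedged alternative sketch, the two bags could be brought into the same relation explicitly with CROSS. This assumes the aliases `filtered` and `bag_query` and the `SetIntersect` definition from the script above, and that the DataFu jar is registered; `paired`, `doc_sorted` and `result_cross` are hypothetical names.

```pig
-- Sketch only: paired, doc_sorted and result_cross are illustrative names.
-- CROSS puts each document's bag next to the (already sorted) query bag,
-- so both bags live in the same tuple.
paired = CROSS filtered, bag_query;

result_cross = FOREACH paired {
    -- SetIntersect requires sorted input bags, as noted in the answer.
    -- The field names are unambiguous after the CROSS, so no prefix is used.
    doc_sorted = ORDER uniq BY token;
    GENERATE id,
      SIZE(SetIntersect(doc_sorted, query_sorted)) AS cnt;
}

DUMP result_cross;
```

With a single query row this is intended to compute the same counts as the scalar version. With several query rows, CROSS would produce one output row per document and query pair, so a query identifier would need to be carried along to tell them apart.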

Here is a fully working example that you can paste into the DataFu unit tests for SetIntersect (the original post linked to their location):

/** 
register $JAR_PATH 

define SetIntersect datafu.pig.sets.SetIntersect(); 

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray); 
B = FOREACH docs GENERATE id, line; 
C = FOREACH B GENERATE id, TOKENIZE(line) as gu; 

filtered = FOREACH C { 
    uniq = DISTINCT gu; 
    GENERATE id, uniq; 
} 

query = LOAD 'query' AS (line_query:chararray); 
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query; 
-- sort the bag of tokens, since SetIntersect requires it 
bag_query = FOREACH bag_query { 
    query_sorted = ORDER query BY token; 
    GENERATE query_sorted; 
} 

result = FOREACH filtered { 
    -- sort the tokens, since SetIntersect requires it 
    tokens_sorted = ORDER uniq BY token; 
    GENERATE id, 
      SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt; 
} 

DUMP result; 

*/ 
@Multiline 
private String setIntersectTestExample; 

@Test 
public void setIntersectTestExample() throws Exception 
{  
    PigTest test = createPigTestFromString(setIntersectTestExample);  

    writeLinesToFile("docs", 
        "1,word1 word4 word2 word1", 
        "2,word2 word6 word1 word5 word3 word7", 
        "3,word1 word3 word4 word5"); 

    writeLinesToFile("query", 
        "word2 word7 word5"); 

    test.runScript(); 

    super.getLinesForAlias(test, "filtered"); 
    super.getLinesForAlias(test, "query"); 
    super.getLinesForAlias(test, "result"); 
} 

If you have any other use cases like this one, I would love to hear about them :) We are always looking to contribute more useful UDFs to DataFu.
