2012-04-06 72 views
6

Word count in Hive

I'm learning Hive. Surprisingly, I can't find an example of how to write a simple word count job. Is the following correct?

Let's say I have an input file, input.tsv:

hello, world 
this is an example input file 

I create a splitter in Python that turns each line into words:

import sys 

# Emit one word per output line for every line read from stdin
for line in sys.stdin:
    for word in line.split():
        print(word)

Then I have the following in my Hive script:

CREATE TABLE input (line STRING); 
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input; 

-- temporary table to hold words... 
CREATE TABLE words (word STRING); 

add file splitter.py; 

INSERT OVERWRITE TABLE words 
    SELECT TRANSFORM(line) 
    USING 'python splitter.py' 
    AS word 
    FROM input; 

SELECT word, count(*) AS count FROM words GROUP BY word; 

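I'm not sure whether I'm missing something, or whether it really is this complicated. (In particular, do I need the temporary words table, and do I need to write the external splitter function?)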

Answers

12

If you want a simple one, see the following:

SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable as word GROUP BY word; 

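I use a lateral view to enable a table-valued function (explode), which takes the list produced by the split function and outputs a new row for each value. In practice, I use a UDF that wraps IBM's ICU4J word breaker. I generally don't use transform scripts and use UDFs for everything. You don't need a temporary words table.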
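To make the row expansion concrete, here is a minimal Python sketch of what the query does conceptually: each input row is split into a list, every list element becomes its own row, and the rows are then grouped and counted. It is not how Hive executes the query. The helper name `explode_split` and the sample rows are made up for illustration, and note that Hive's split takes a regular expression while Python's `str.split` takes a literal separator (the two agree for a single space).

```python
from collections import Counter


def explode_split(rows, sep=" "):
    """Yield one word per output row, mimicking explode(split(line, ' '))."""
    for line in rows:
        for word in line.split(sep):
            yield word


# Hypothetical rows standing in for the `line` column of the input table
rows = ["hello, world", "this is an example input file"]

# GROUP BY word + COUNT(*) is emulated with a Counter
word_counts = Counter(explode_split(rows))

for word, n in sorted(word_counts.items()):
    print(word, n)
```

Punctuation stays attached to the words (for example "hello,"), just as it would with a plain space split in Hive, which is one reason to use a real tokenizer in practice.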

+0

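Seeing your comment about explode and HiveQL lateral views, could you please take a look at this SO question, for which I have not been able to find a solution: [http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive](http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive). Sorry for contacting you this way. – ferhan 2012-07-07 22:44:00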

+0

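@Steve - I have loaded the data into the table, and when I run the command I get 'FAILED: Error in semantic analysis: null'. Are there any prerequisites for running this command? – 2012-09-03 01:35:28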

2
CREATE TABLE docs (line STRING); 
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; 
CREATE TABLE word_counts AS 
SELECT word, count(1) AS count FROM 
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w 
GROUP BY word 
ORDER BY word; 
1

You can use the built-in sentences UDF in Hive as follows:

1) Step 1: Create a temporary table with a single column named sentence, whose data type is array

create table temp as select sentence from docs lateral view explode(sentences(lcase(line))) ltable as sentence

2) Step 2: Select your words from the temporary table by exploding the sentence column again

select words, count(words) CntWords from 
(
  select explode(sentence) words from temp 
) i group by words order by CntWords desc
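The two steps can be sketched in Python to show the shape of the data at each stage. This is only an illustration: the `sentences` function below is a rough stand-in written for this example and does not reproduce the tokenization rules of Hive's built-in sentences UDF, and the sample `docs` rows are invented.

```python
import re
from collections import Counter


def sentences(text):
    """Rough stand-in for Hive's sentences(): a list of sentences,
    each of which is a list of words (an array of arrays)."""
    result = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9']+", sentence)
        if words:
            result.append(words)
    return result


# Hypothetical rows standing in for the `line` column of the docs table
docs = ["Hello, world. This is an example.", "This is another line!"]

# Step 1: first explode, one row per sentence (each row is a list of words)
temp = [s for line in docs for s in sentences(line.lower())]

# Step 2: second explode, one row per word, then group and count
counts = Counter(word for sentence in temp for word in sentence)

# Equivalent of ORDER BY CntWords DESC
for word, n in counts.most_common():
    print(word, n)
```

The intermediate `temp` list plays the role of the temporary table: it exists only because the nested arrays have to be flattened one level at a time.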