2014-04-01 32 views

Pig: Python UDF to search text for a list of keywords/strings

I have two files. One is a list of keywords/strings:

blue fox 
the 
lazy dog 
orange 
of 
file 

The other contains text:

The blue fox jumped 
over the lazy dog 
this file has nothing important 
lines repeat 
this line does not match 

I want to take the list of strings from the first file and find the lines in the second file that match any of those strings. So I wrote a Python UDF and a Pig script:

register match.py using jython as match; 
A = LOAD 'words.txt' AS (word:chararray); 
B = LOAD 'text.txt' AS (line:chararray); 
C = GROUP A ALL; 
D = FOREACH B generate match.match(C.$1,line); 
dump D; 

#match.py 
@outputSchema("str:chararray") 
def match(wordlist,line): 
    linestr = str(line) 
    for word in wordlist: 
      wordstr = str(word) 
      if re.search(wordstr,linestr): 
        return line 
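
Two things stand out in match.py as posted: it never imports re, so re.search would raise a NameError, and each element of the bag passed in as C.$1 should arrive as a one-field tuple rather than a plain string. Below is a minimal sketch of the UDF with both points addressed. It has not been run on a Pig cluster. It assumes that Pig supplies the outputSchema decorator when the file is registered with jython, and the fallback decorator is a hypothetical stand-in so the file can also be tried outside Pig.

```python
# match.py -- a sketch only; not run against a real Pig/CDH cluster.
import re

try:
    outputSchema  # assumed to be provided by Pig when registered "using jython"
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the file can also be run outside Pig."""
        def wrap(func):
            return func
        return wrap


@outputSchema("str:chararray")
def match(wordlist, line):
    """Return line if any keyword in wordlist occurs in it, else None."""
    if line is None or wordlist is None:
        return None
    linestr = str(line)
    for word in wordlist:
        # Assumption: each bag element reaches Python as a one-field tuple.
        if isinstance(word, (tuple, list)):
            word = word[0]
        # As in the original, the keyword is treated as a regular expression.
        if word is not None and re.search(str(word), linestr):
            return line
    return None
```

Called with the keywords and text lines shown above, this returns the first three text lines unchanged and None (a null in Pig) for the last two.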

The full error:

"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function" 

Detailed Error log: 

Backend error message 
--------------------- 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function 
     at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297) 
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) 
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) 
     at o 

Pig Stack Trace 
--------------- 
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function 

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function 
     at org.apache.pig.PigServer.openIterator(PigServer.java:828) 
     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696) 
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320) 
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194) 
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170) 
     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) 
     at org.apache.pig.Main.run(Main.java:538) 
     at org.apache.pig.Main.main(Main.java:157) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
     at java.lang.reflect.Method.invoke(Method.java:597) 
     at org.apache.hadoop.util.RunJar.main(RunJar.java:208) 
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function 
     at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372) 
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297) 
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) 
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) 
================================================================================ 

For those who found this post while looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution), here is a [generic solution](http://stackoverflow.com/a/34495086/983722).

Answers


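I suspect the 're' module is not available to Jython on my CDH4.x cluster. I did not spend much time on the Python UDF; I worked around the problem by writing a Java UDF instead. Pardon my Java, as I am a n00b. This is probably not the most efficient or prettiest Java code, and I am sure it has some bugs: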

package pigext;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Returns "line,regex" for the first regex in the bag that matches the whole
// line (case-insensitively), or an empty string when nothing matches.
public class matchList extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        try {
            String line = (String) input.get(0);   // the text line
            DataBag bag = (DataBag) input.get(1);  // one-field tuples, one regex each
            Iterator<Tuple> it = bag.iterator();
            while (it.hasNext()) {
                Tuple t = it.next();
                if (t != null && t.size() > 0 && t.get(0) != null && line != null) {
                    String cmd = t.get(0).toString();
                    // String.matches() succeeds only if the regex matches the entire line.
                    if (line.toLowerCase().matches(cmd.toLowerCase())) {
                        return line + "," + cmd;
                    }
                }
            }
            return "";
        } catch (Exception e) {
            throw new IOException("Failed to process row", e);
        }
    }
}

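To use it, you need a file full of regular expressions, one per line, that you want to search for, and obviously your target text file. Because String.matches() only succeeds when the regex matches the entire line, each pattern is wrapped in .*? on both sides. So the regex file "wordstest.txt" is: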

.*?this +blah.*? 

And your text file, text.txt, is:

this blah starts with blah 
this blah has way too many spaces 
that won't match 
thisblahshouldnotmatch 
thisblah should not match either 
the line here is this blah 
line here has this blah in the middle 
line here has this blah with extra spaces 
only has blah 
only has this 

The Pig script is:

REGISTER pigext.jar; 
A = LOAD 'wordstest.txt' AS (cmd:chararray); 
B = LOAD 'text.txt' AS (line:chararray); 
C = GROUP A ALL; 
D = FOREACH B generate pigext.matchList(line,C.$1); 
dump D; 
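
To check which lines a given pattern would select before running the job, the UDF's matching rule can be imitated locally. The sketch below is only an approximation in Python 3 of what the Java code does: it lowercases both the line and the pattern and requires the pattern to match the whole line. The function name match_list is made up for this illustration, and Python's regex dialect is not identical to Java's, so treat the results as a rough guide.

```python
import re


def match_list(line, patterns):
    """Return 'line,pattern' for the first pattern that matches the whole
    line case-insensitively, or '' when none does (mirrors the Java UDF)."""
    if line is None:
        return ""
    for pattern in patterns:
        if pattern is None:
            continue
        # re.fullmatch plays the role of Java's String.matches():
        # the pattern must cover the entire line, not just part of it.
        if re.fullmatch(pattern.lower(), line.lower()):
            return line + "," + pattern
    return ""


if __name__ == "__main__":
    patterns = [".*?this +blah.*?"]
    samples = [
        "this blah starts with blah",
        "thisblahshouldnotmatch",
        "the line here is this blah",
        "only has blah",
    ]
    for sample in samples:
        print(repr(match_list(sample, patterns)))
```

Note that lowercasing the pattern, as the UDF does, also lowercases escapes such as \S or \W and so changes their meaning. A case-insensitive flag would avoid that, but it would be a change to the original approach.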

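The UDF would probably be better written as one that extends FilterFunc, so you don't have to do a GENERATE (especially if you have a long list of fields). Left as an exercise for the reader, or for when I find time to redo the UDF :)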