2012-08-16

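I am using Pig to parse my application logs to find out which exposed methods have been called by a user since last month that the same user had not called before last month.

Tags: hadoop, pig, bag, subtraction

I managed to get the method calls grouped by user, before last month and after last month:

BEFORE last month relation sample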

u1  {(m1),(m2)} 
u2  {(m3),(m4)} 

AFTER last month relation sample

u1  {(m1),(m3)} 
u2  {(m1),(m4)} 

What I want is to find, per user, which methods in AFTER are not in BEFORE, i.e.

NEWLY_CALLED expected result

u1  {(m3)} 
u2  {(m1)} 

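Question: how can I do this in Pig? Is it possible to subtract bags?

I have tried the DIFF function, but it does not perform the expected subtraction.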
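To illustrate why DIFF falls short here: Pig's documentation describes DIFF as returning the tuples that are in one bag but not the other, which is a symmetric difference, whereas the question needs a one-directional subtraction. The plain-Java sketch below models bags as sets of strings and runs both operations on the sample data from the question. The class and method names are hypothetical and exist only for this illustration; nothing here calls Pig.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BagDiffVersusSubtract {

    // Elements of 'after' that are absent from 'before' (what the question asks for).
    static Set<String> subtract(Set<String> after, Set<String> before) {
        Set<String> result = new TreeSet<>(after);
        result.removeAll(before);
        return result;
    }

    // Elements present in exactly one of the two sets (a symmetric difference,
    // which is how DIFF is documented to behave on two bags).
    static Set<String> symmetricDifference(Set<String> a, Set<String> b) {
        Set<String> result = subtract(a, b);
        result.addAll(subtract(b, a));
        return result;
    }

    private static Set<String> setOf(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        // Sample data copied from the BEFORE and AFTER relations in the question.
        Map<String, Set<String>> before = new LinkedHashMap<>();
        before.put("u1", setOf("m1", "m2"));
        before.put("u2", setOf("m3", "m4"));

        Map<String, Set<String>> after = new LinkedHashMap<>();
        after.put("u1", setOf("m1", "m3"));
        after.put("u2", setOf("m1", "m4"));

        for (String user : after.keySet()) {
            Set<String> a = after.get(user);
            Set<String> b = before.get(user);
            System.out.println(user
                    + " subtract=" + subtract(a, b)
                    + " symmetricDifference=" + symmetricDifference(a, b));
        }
    }
}
```

For u1 the subtraction yields only m3, while the symmetric difference also contains m2, a method that was called before but not after.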

Regards,

Joel

Answers

2

I think you need to write a UDF; inside it you can use something like

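Set<T> setA ... 
Set<T> setB ... 
// java.util.Set has no subtract(); copy setA, then remove everything in setB
Set<T> setAminusB = new HashSet<T>(setA); 
setAminusB.removeAll(setB); 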

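I did exactly that a few minutes ago :) Thanks for the suggestion, Mark! I am going to propose what I did to Pig/Piggybank, since I think it could help others. – 2012-08-17 09:00:29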

@JoelCostigliola There is a built-in function [`SUBTRACT`](http://search-hadoop.com/c/Pig:src/org/apache/pig/builtin/SUBTRACT.java%7C%7C+%252B%2528private+static%2529). Is that what you need? – wenlong 2013-07-22 07:40:34

@wenlong Is SUBTRACT supported in Pig 0.11.1? – 2014-01-22 17:46:02

2

For those who might be interested, here is the subtract function I wrote and proposed to Pig (PIG-2881):

import static java.lang.String.format;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/**
 * Subtract takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.
 * Null bag arguments are replaced by empty bags.
 * <p>
 * The implementation assumes that both bags being passed to this function will fit entirely into memory simultaneously.
 * If that is not the case the UDF will still function, but it will be <strong>very</strong> slow.
 */
public class Subtract extends EvalFunc<DataBag> {

    /**
     * Compares the two bag fields from the input Tuple and returns a new bag composed of the elements of the first bag that are not in the second bag.
     * @param input a tuple with exactly two bag fields.
     * @throws IOException if there are not exactly two fields in the tuple or if they are not {@link DataBag}.
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input.size() != 2) {
            throw new ExecException("Subtract expected two inputs but received " + input.size() + " inputs.");
        }
        DataBag bag1 = toDataBag(input.get(0));
        DataBag bag2 = toDataBag(input.get(1));
        return subtract(bag1, bag2);
    }

    private static String classNameOf(Object o) {
        return o == null ? "null" : o.getClass().getSimpleName();
    }

    // Treat a null field as an empty bag; reject anything that is not a DataBag.
    private static DataBag toDataBag(Object o) throws ExecException {
        if (o == null) {
            return BagFactory.getInstance().newDefaultBag();
        }
        if (o instanceof DataBag) {
            return (DataBag) o;
        }
        throw new ExecException(format("Expecting input to be DataBag only but was '%s'", classNameOf(o)));
    }

    private static DataBag subtract(DataBag bag1, DataBag bag2) {
        DataBag subtractBag2FromBag1 = BagFactory.getInstance().newDefaultBag();
        // convert bag1 to a Set; this assumes the set fits in memory and collapses duplicate tuples of bag1.
        Set<Tuple> set1 = toSet(bag1);
        // remove elements of bag2 from set1
        Iterator<Tuple> bag2Iterator = bag2.iterator();
        while (bag2Iterator.hasNext()) {
            set1.remove(bag2Iterator.next());
        }
        // set1 now contains all elements of bag1 not in bag2 => we can build the resulting DataBag.
        for (Tuple tuple : set1) {
            subtractBag2FromBag1.add(tuple);
        }
        return subtractBag2FromBag1;
    }

    private static Set<Tuple> toSet(DataBag bag) {
        Set<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iterator = bag.iterator();
        while (iterator.hasNext()) {
            set.add(iterator.next());
        }
        return set;
    }

}
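
One consequence of going through a HashSet, as the function above does, is that a tuple appearing several times in the first bag appears only once in the result. The sketch below shows that effect with plain Java lists standing in for bags, next to a hypothetical variant that keeps every occurrence. Both method names are made up for this illustration, and the second variant is not part of PIG-2881; it only indicates what one might change if duplicates matter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateHandlingSketch {

    // Mirrors the set-based approach: duplicates in bag1 collapse to a single entry.
    static List<String> subtractViaSet(List<String> bag1, List<String> bag2) {
        Set<String> remaining = new LinkedHashSet<>(bag1);
        remaining.removeAll(bag2);
        return new ArrayList<>(remaining);
    }

    // Hypothetical alternative: keep every occurrence in bag1 whose value is absent from bag2.
    static List<String> subtractKeepingDuplicates(List<String> bag1, List<String> bag2) {
        Set<String> exclude = new HashSet<>(bag2);
        List<String> result = new ArrayList<>();
        for (String item : bag1) {
            if (!exclude.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> after = Arrays.asList("m1", "m3", "m3");
        List<String> before = Arrays.asList("m1");
        System.out.println(subtractViaSet(after, before));
        System.out.println(subtractKeepingDuplicates(after, before));
    }
}
```

For the use case in the question (which methods are newly called) the collapsed result is what is wanted, so the set-based approach fits.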