2014-10-09 42 views

I have a Pig dataset that looks like this:

6009544 "NY" 6009545 "NY" 
6009544 "NY" 6009545 "NY" 
6009548 "NY" 6009546 "OR" 
6009546 "OR" 6009546 "OR" 
6009545 "NY" 6009546 "OR" 
6009548 "NY" 6009547 "AZ" 
6009547 "AZ" 6009547 "AZ" 
6009547 "AZ" 6009548 "NY" 
6009544 "NY" 6009548 "NY" 

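The first line reads as: "Patent 6009544 originated in NY and cites patent 6009545, which originated in NY." For each state, I am trying to find the percentage of its citations that go to patents originating in the same state. So my desired output would be: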

NY: .5 
OR: 1 
AZ: .5 

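This is because, of the 6 citations made by patents originating in New York, 3 cite patents that also originate in New York. The 1 citation made by a patent originating in Oregon cites a patent that also originates in Oregon. Of the 2 citations made by patents originating in Arizona, 1 cites a patent that also originates in Arizona.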

Can anyone suggest a good way to do this in Pig?

Answers


Can you try this?

input.txt 
6009544 "NY" 6009545 "NY" 
6009544 "NY" 6009545 "NY" 
6009548 "NY" 6009546 "OR" 
6009546 "OR" 6009546 "OR" 
6009545 "NY" 6009546 "OR" 
6009548 "NY" 6009547 "AZ" 
6009547 "AZ" 6009547 "AZ" 
6009547 "AZ" 6009548 "NY" 
6009544 "NY" 6009548 "NY" 

PigScript: 
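-- Load each row as one string; the default tab delimiter leaves a space-separated row intact.
A = LOAD 'input.txt' AS (line:chararray); 
-- Split into citing patent, citing state, cited patent, cited state (quotes dropped).
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray); 
C = GROUP B BY f2; 
-- Per citing state: rows whose cited state matches, divided by all rows for that state.
D = FOREACH C { 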
       FilterByPatent = FILTER B BY f2==f4; 
       CityPatentCount = COUNT(B.f2); 
       GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount); 
       } 
DUMP D; 

Output: 
(AZ,0.5) 
(NY,0.5) 
(OR,1.0) 

This approach works great - thank you! – Luke 2014-10-09 14:33:55


I used the sample data with the fields separated by single spaces:

A = load '/padata' using PigStorage(' ') as (pno:int,pcity:chararray,pci:int,pccity:chararray); 
b = group A by pcity ; 
r = foreach b { 
       copcity = COUNT(A.pcity) ; 
       samdata = FILTER A by pcity==pccity; 
       csamdata = COUNT(samdata); 
       percent = (float)csamdata/(float)copcity; 
       generate group,percent ; 
       } 
dump r ; 

Output (the state names keep their double quotes, since PigStorage loads them as part of the field):

("AZ",0.5) 
("NY",0.5) 
("OR",1.0)
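
Both scripts compute, for each citing state, the number of rows whose cited state matches divided by the total number of rows for that state. As a quick cross-check outside Pig, here is a minimal Python sketch of the same ratio on the sample data. It is not part of either answer; the names `ROWS` and `same_state_share` are made up for illustration, and it assumes every row has exactly four whitespace-separated fields.

```python
from collections import Counter

# Sample rows from the question:
# citing patent, citing state, cited patent, cited state.
ROWS = """\
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
"""


def same_state_share(text):
    """Return {citing state: fraction of its rows citing the same state}."""
    total = Counter()  # rows per citing state
    same = Counter()   # rows where citing state == cited state
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows (assumption)
        _, citing_state, _, cited_state = parts
        citing_state = citing_state.strip('"')
        cited_state = cited_state.strip('"')
        total[citing_state] += 1
        if citing_state == cited_state:
            same[citing_state] += 1
    return {state: same[state] / total[state] for state in total}


shares = same_state_share(ROWS)
for state in sorted(shares):
    print(state, shares[state])
```

On the sample data this gives 0.5 for AZ, 0.5 for NY and 1.0 for OR, matching the output of both Pig scripts.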