2014-11-14 29 views

How can I verify that my Pig input data conforms to the DML (the declared schema)?

Input data:

Jorge Posada | Yankees | {(Catcher,2000),(Designated_hitter,2001)} | [games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell | Oakland | {(Catcher,2000),(First_baseman,2001)} | [on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado | Atlanta | {(Second_baseman,2002),(Infielder,2003),(Left_fielder)} | [games#258,hit_by_pitch#3]

Note the last tuple of the third record, (Left_fielder): I have left out the year field there.

bfile = LOAD 'basketball1.txt' USING PigStorage('|') AS (name:chararray, team:chararray, pos:bag{t:tuple(point:chararray, year:int)}, bat:map[]);

dump bfile;

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7])
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7])
(Martin Prado,Atlanta,,[games#258,hit_by_pitch#3])

Regards,
Sanjeeb


Can you add more samples to validate the input against, both valid and invalid? – 2014-11-15 18:59:44

Answer


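Here is a regex-based script for your schema; it validates essentially every field. Please run it against your input and let me know if you need any other validation.

Regular expression (split across lines here for readability; the script uses it as a single line):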

'^ 
    ([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s* 
    ([A-Za-z]+)\\s*\\|\\s* 
    (\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s* 
    (\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\]) 
$' 
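
To try the pattern outside Pig first, here is a minimal Python sketch. It assumes Python's re module behaves like the Java regex engine behind REGEX_EXTRACT_ALL for this pattern, which should hold for the constructs used here (character classes, non-capturing groups, anchors). The names RECORD_PATTERN and parse_record are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as above, joined piece by piece. Backslashes are single here
# because these are Python raw strings; inside a Pig string literal they are doubled.
RECORD_PATTERN = re.compile(
    r"^([A-Za-z]+\s+[A-Za-z]+)\s*\|\s*"                                    # name: two words
    r"([A-Za-z]+)\s*\|\s*"                                                 # team: one word
    r"(\{(?:\([A-Za-z_]+,[0-9]+\))(?:,\([A-Za-z_]+,[0-9]+\))*\})\s*\|\s*"  # bag of (position,year)
    r"(\[(?:[A-Za-z_]+#[0-9\.]+)(?:,[A-Za-z_]+#[0-9\.]+)*\])$"             # map of key#number
)


def parse_record(line):
    """Return (name, team, pos, bat) as strings, or None if the line is malformed."""
    match = RECORD_PATTERN.match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    good = "Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297]"
    bad = "Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3]"
    print(parse_record(good))  # four captured fields
    print(parse_record(bad))   # None, because the position is missing
```

The pattern is built from adjacent string pieces rather than with re.VERBOSE, because in verbose mode the # that separates map keys from values would start a comment.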

input.txt
I have marked each input line below as valid or invalid (the "-->" annotations are not part of the file):

Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7] --> Valid
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3] --> Invalid: year is missing
Martin Prado |Atlanta| {(Second_baseman,2002)(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: no comma between two tuples
Martin Prado |Atlanta| {,(Second_baseman,2002),(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: comma before the first tuple
Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3] --> Invalid: position is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: delimiter | is missing
Martin Prado || {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: team name is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#,hit_by_pitch#3] --> Invalid: value is missing for key games
Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001),(test,3000)}|[on_base_percentage#0.297,games#26,home_runs#7,test#1.2] --> Valid

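Pig script:

-- Load each record as a single field (the default loader splits on tabs, which these lines do not contain).
A = LOAD 'input.txt' AS line;
-- Extract the four captured groups; a line that does not match the pattern yields an empty tuple.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$')) AS (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
DUMP B;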

Output: any input line that does not match the pattern is printed as an empty tuple, ().

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7]) --> Valid
() --> Invalid: year is missing
() --> Invalid: no comma between two tuples
() --> Invalid: comma before the first tuple
() --> Invalid: position is missing
() --> Invalid: delimiter | is missing
() --> Invalid: team name is missing
() --> Invalid: value is missing for key games
(Landon Powell,Oakland,{(Catcher,2000)},[on_base_percentage#0.297]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001),(test,3000)},[on_base_percentage#0.297,games#26,home_runs#7,test#1.2]) --> Valid
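
The empty tuples show that a line is invalid but not why. As a rough sketch of how the reason could be recovered, the Python snippet below checks each field on its own. It is not part of the Pig script; FIELD_PATTERNS and diagnose are hypothetical names, the per-field patterns are simply the pieces of the regex above, and stripping whitespace around each field makes it slightly more lenient than the full pattern about leading and trailing spaces.

```python
import re

# Hypothetical per-field patterns, split out of the full-record regex.
FIELD_PATTERNS = [
    ("name", re.compile(r"[A-Za-z]+\s+[A-Za-z]+")),
    ("team", re.compile(r"[A-Za-z]+")),
    ("pos", re.compile(r"\{\([A-Za-z_]+,[0-9]+\)(?:,\([A-Za-z_]+,[0-9]+\))*\}")),
    ("bat", re.compile(r"\[[A-Za-z_]+#[0-9.]+(?:,[A-Za-z_]+#[0-9.]+)*\]")),
]


def diagnose(line):
    """Return a list of problems found in one record; an empty list means valid."""
    fields = [part.strip() for part in line.split("|")]
    if len(fields) != len(FIELD_PATTERNS):
        return ["expected 4 '|'-separated fields, found %d" % len(fields)]
    return [
        "bad %s: %r" % (label, value)
        for (label, pattern), value in zip(FIELD_PATTERNS, fields)
        if not pattern.fullmatch(value)
    ]


if __name__ == "__main__":
    for record in [
        "Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297]",
        "Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3]",
        "Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3]",
    ]:
        print(diagnose(record) or "valid")
```

A record with a missing | is reported as having the wrong number of fields, while a malformed bag or map is reported against the pos or bat field.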