2017-07-05 24 views

I am creating an external table, customer, that holds a customer ID, a name, and a spouse name.

CREATE TABLE customer(cust_id int, name struct<fname:string,lname:string>,spouse_name struct<fname:string,lname:string>
    )row format delimited 
    fields terminated by ',' 
    collection items terminated by '$'; 

I would like to know what happens if the incoming data looks like this:

1,FNAME1$LNAME1,SPOUSE_FNAME1#SPOUSE_LNAME1 
2,FNAME2$LNAME2,SPOUSE_FNAME2#SPOUSE_LNAME2 

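I cannot specify two delimiters in the `collection items terminated by` clause. The '$' delimiter only separates FNAME* from LNAME*; it does nothing for SPOUSE_FNAME* and SPOUSE_LNAME*. Do we need to write a custom SerDe for this? I am not sure what the data will look like in the real world, but it is quite likely that we will receive data like this at some point.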


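Data handling should be carefully planned and managed. Text fields may themselves contain the symbols ',', '$' or '#'. A "we'll take whatever we get and deal with it when the time comes" approach won't get you far.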

Answers


One possible approach is to load the structs as plain strings and do the data manipulation in a view.

create external table customer 
(
    cust_id  int 
    ,name  string 
    ,spouse_name string 
) 
    row format delimited 
    fields terminated by ',' 
; 

select * from customer 
; 

+---------+---------------+-----------------------------+
| cust_id |     name      |         spouse_name         |
+---------+---------------+-----------------------------+
|       1 | FNAME1$LNAME1 | SPOUSE_FNAME1#SPOUSE_LNAME1 |
|       2 | FNAME2$LNAME2 | SPOUSE_FNAME2#SPOUSE_LNAME2 |
+---------+---------------+-----------------------------+

create view customer_v 
as 
select cust_id 
     ,named_struct('fname',name[0]  ,'lname',name[1])  as name 
     ,named_struct('fname',spouse_name[0],'lname',spouse_name[1]) as spouse_name 

from (select cust_id 
       ,split(name,'\\$')  as name 
       ,split(spouse_name,'#') as spouse_name 

     from customer 
     ) c 
; 

select * from customer_v 
; 

+---------+-------------------------------------+---------------------------------------------------+
| cust_id |                name                 |                    spouse_name                    |
+---------+-------------------------------------+---------------------------------------------------+
|       1 | {"fname":"FNAME1","lname":"LNAME1"} | {"fname":"SPOUSE_FNAME1","lname":"SPOUSE_LNAME1"} |
|       2 | {"fname":"FNAME2","lname":"LNAME2"} | {"fname":"SPOUSE_FNAME2","lname":"SPOUSE_LNAME2"} |
+---------+-------------------------------------+---------------------------------------------------+
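
If either delimiter could turn up in either column, which is the situation the question worries about, the split pattern can be widened to a character class. This is only a minimal sketch under that assumption: the view name customer_v2 is hypothetical, and it reads from the string-typed customer table defined above.

```sql
-- Hypothetical variant of customer_v: accept '$' or '#' as the
-- struct delimiter in both columns. Inside a character class the
-- '$' is literal, so no backslash escaping is needed.
create view customer_v2
as
select  cust_id
       ,named_struct('fname',name[0]       ,'lname',name[1])        as name
       ,named_struct('fname',spouse_name[0],'lname',spouse_name[1]) as spouse_name
from   (select  cust_id
               ,split(name       ,'[$#]') as name
               ,split(spouse_name,'[$#]') as spouse_name
        from    customer
       ) c
;

-- Struct fields are then addressed with dot notation.
select  cust_id
       ,name.fname
       ,spouse_name.lname
from    customer_v2
;
```

Keeping the raw strings in the base table means a new delimiter only requires changing the view, not reloading the data.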

Try this:

CREATE TABLE customer(cust_id int, name String, spouse_name string) row format delimited fields terminated by ',' stored as textfile; 
load data inpath '<hdfs path of input file>' overwrite into table customer; 

CREATE external TABLE customer_tmp(cust_id int, name string,spouse_name string) 
row format delimited 
fields terminated by ',' 
stored as textfile location '/hdfs_location_of_customer_tmp'; 

insert overwrite table customer_tmp 
select cust_id,regexp_replace(name,'\\W\\b',':') as name,regexp_replace(spouse_name,'\\W\\b',':') as spouse_name from customer; 

CREATE TABLE customer_final(cust_id int, name struct<fname:string,lname:string>,spouse_name struct<fname:string,lname:string>) 
row format delimited 
fields terminated by ',' 
collection items terminated by ':' 
stored as textfile; 

load data inpath '/hdfs_location_of_customer_tmp/*' overwrite into table customer_final; 
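
One thing to keep in mind with the steps above: the pattern '\\W\\b' matches any non-word character that is followed by a word boundary, not only '$' and '#'. A space or an apostrophe inside a name (for example "Mary Ann" or "O'Brien") would therefore also be replaced by ':'. The sketch below first checks the pattern against the sample values, then shows a narrower replacement for the insert step, assuming '$' and '#' are the only delimiters that need normalizing. A SELECT without a FROM clause needs a reasonably recent Hive version.

```sql
-- What the pattern from the answer does to the sample values.
select regexp_replace('FNAME1$LNAME1','\\W\\b',':');               -- FNAME1:LNAME1
select regexp_replace('SPOUSE_FNAME1#SPOUSE_LNAME1','\\W\\b',':'); -- SPOUSE_FNAME1:SPOUSE_LNAME1

-- Assumed alternative for the insert step: replace only the two
-- known delimiters and leave every other character untouched.
insert overwrite table customer_tmp
select  cust_id
       ,regexp_replace(name       ,'[$#]',':') as name
       ,regexp_replace(spouse_name,'[$#]',':') as spouse_name
from    customer
;
```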

Please don't forget to let us know whether it works :)