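Create a table from a CSV with values containing commas enclosed in quotes
I am trying to create a table in Impala from a CSV that I uploaded to an HDFS directory. The CSV contains values with commas enclosed inside quotes.
Example: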
1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
The Impala documentation says that this can be solved with the ESCAPED BY clause. Here is my current code:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
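I also tried the ESCAPED BY '"' clause. In both cases, Impala treats the comma inside the quotes as a delimiter and splits the value into two columns.
Any ideas on how to fix the code so that this does not happen?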
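To make the symptom concrete, here is a minimal Python sketch. It is an illustration only, not a description of Impala's internals. It contrasts a naive split on every comma, which is roughly how a delimited-text parse behaves when it has no notion of quoting, with a quote-aware parse from Python's standard `csv` module. The variable names are hypothetical, and the row is the first sample row above.

```python
import csv
import io

# First sample row from the question.
row = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."'

# Naive split: every comma is a delimiter and quotes are ordinary characters.
naive_fields = row.split(",")

# Quote-aware parse: commas inside double quotes stay part of the field.
quoted_fields = next(csv.reader(io.StringIO(row)))

print(len(naive_fields), naive_fields)
print(len(quoted_fields), quoted_fields)
```

The naive split yields seven fragments for a five-column table, so `INC."` lands in the third column and every later value shifts, which matches the misaligned columns described above.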
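EDIT (June 9, 2015)
So, I have tried the following variations, based on suggestions from @K S Nidhin and @JTUP. However, every variation returns the same result as the query without the SERDEPROPERTIES operator, with the commas still causing values to end up in the wrong columns:
Variation 1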
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\")
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 2
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES ('quoteChar' = '"', 'escapeChar' = '\\')
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 3
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
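Are there any other ideas, or other variations of the SERDEPROPERTIES operator to try?
EDIT (June 10, 2016)
I was able to get a different variation of the query working in Hive using the SERDE and SERDEPROPERTIES operators (based on code provided in the Hive Documentation), and the table is created correctly: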
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
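Since the SERDE operator is not available in Impala, this solution will not work there. I am fine with creating the table in Hive, but I still have not been able to find a workable solution in Impala.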
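One possible workaround, sketched here under stated assumptions rather than verified on a cluster, is to pre-process the file before uploading it: parse the quoted CSV with a quote-aware reader and re-write it with a delimiter that never occurs in the data, such as a tab. A plain delimited text table declared with FIELDS TERMINATED BY '\t' would then have no quoting to handle. The helper name `csv_to_tsv` is hypothetical, and the sketch assumes no field contains a tab or a newline (it raises an error if one does).

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Re-write quoted CSV read from src as tab-separated text on dst."""
    for fields in csv.reader(src):
        # Fail loudly rather than emit a row that would be split wrongly later.
        if any("\t" in f or "\n" in f for f in fields):
            raise ValueError(f"field contains a tab or newline: {fields!r}")
        dst.write("\t".join(fields) + "\n")


# Two sample rows from the question, held in memory for the demonstration.
sample = (
    '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
    '1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
)
out = io.StringIO()
csv_to_tsv(io.StringIO(sample), out)
tsv_text = out.getvalue()
print(tsv_text, end="")
```

For a real file, `src` and `dst` would be file objects opened with `newline=""`, and the resulting file would be uploaded to the HDFS directory in place of the original CSV.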
Try adding SerDe properties with WITH SERDEPROPERTIES ("quoteChar" = "'", "escapeChar" = "\\")