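Sqoop import: composite primary key and textual primary key
Stack: HDP-2.3.2.0-2950 installed using Ambari 2.1
The source DB schema is on SQL Server and contains several tables whose primary key is one of the following:
- A single varchar column
- Composite: two varchar columns, or one varchar + one int column, or two int columns
There is also a large table with ? rows that has three columns in its PK: one int + two varchar columns.
As per the Sqoop documentation:
Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
The first question is: what is expected by "manually choose a splitting column"? How can I sacrifice the PK and use just one column, or am I missing some concept?
The SQL Server table (only two columns, which together form a composite primary key):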
ChassiNo varchar(8) Unchecked
ECU_Name nvarchar(15) Unchecked
I proceeded with the import; the source table has 7909097 records:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|' --table ChassiECU --num-mappers 8 --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
The worrisome warnings, and the incorrect mapper input and record counts:
16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1168400
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1128
HDFS: Number of bytes written=209961941
HDFS: Number of read operations=32
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Launched map tasks=8
Other local map tasks=8
Total time spent by all maps in occupied slots (ms)=62785
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=62785
Total vcore-seconds taken by all map tasks=62785
Total megabyte-seconds taken by all map tasks=128583680
Map-Reduce Framework
Map input records=15818167
Map output records=15818167
Input split bytes=1128
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=780
CPU time spent (ms)=45280
Physical memory (bytes) snapshot=2219433984
Virtual memory (bytes) snapshot=20014182400
Total committed heap usage (bytes)=9394716672
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.
The table I created:
CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
`ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';
The awful result (with no errors). PROBLEM: 15818167 records vs. 7909097 records (SQL Server):
> select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 14 14 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167
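Surprisingly, when the composite key included an int column (which was used for splitting), I got either an exact match or a mismatch of fewer than 10 records, but I am still apprehensive about those tables too!
How should I proceed?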
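Hi, from what I understand of your requirement, you simply want to move the contents of ChassiECU into Hive, with composite keys of multiple types. Instead of '--table
One option is to create a view on top of the table that includes all the key columns plus a new key column holding their values concatenated into a single column, which can then be used in the sqoop import. – Kfactor21
I have added the table info. I am not sure whether the suggestion you provided applies to this table; can you check? –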
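Answer
Specify the split column manually. The split column does not have to be the PK: you can have a complex PK and a separate int split column. You can specify any integer column, or even a simple function (something like substring or cast, not an aggregate or analytic function). The split column should preferably be an evenly distributed integer.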
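To make the "evenly distributed" point concrete: the Sqoop user guide describes the splitter as fetching the minimum and maximum of the split column and handing each map task an equal-width slice of that range. The following is a minimal sketch of that arithmetic only. It is not Sqoop's code, and `split_ranges` is a hypothetical helper name introduced here for illustration.

```shell
# Sketch only: approximates the equal-width range splitting described in the
# Sqoop user guide. split_ranges is a hypothetical helper, not part of Sqoop.
# Usage: split_ranges MIN MAX NUM_MAPPERS
# Prints one "lo hi" pair per mapper, meaning: WHERE col >= lo AND col < hi
split_ranges() {
  min=$1; max=$2; n=$3
  width=$(( ($max - $min) / $n ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( $min + $i * $width ))
    if [ "$i" -eq $(( $n - 1 )) ]; then
      hi=$(( $max + 1 ))       # last range is extended so MAX itself is included
    else
      hi=$(( $lo + $width ))
    fi
    echo "$lo $hi"
    i=$(( $i + 1 ))
  done
}

# The example from the Sqoop user guide: ids 0..1000 split across 4 map tasks.
split_ranges 0 1000 4
```

This prints the pairs 0 250, 250 500, 500 750 and 750 1001. The slices are equal in width, not in row count, so how many rows each mapper receives depends entirely on how the column's values are distributed inside the range.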
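For example, if your split column contains a few rows with the value -1 and 10M rows with values in the range 10000 - 10000000, and num-mappers=8, then sqoop will not split the dataset evenly between the mappers. This results in data skew, and the 8th mapper will run forever or even fail. I have also got duplicates when using a non-integer split column with MS-SQL. So, use an integer split column. In your case, with a table that has only two varchar columns, you can either
(1) add a surrogate int PK and use it as the split column as well, or
(2) split your data manually using a custom query with a WHERE clause and run sqoop several times with num-mappers=1, or
(3) apply some deterministic, non-aggregate integer function to your varchar column, for example a cast, and use it as the split column.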
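The three options could be sketched roughly as follows. This is only an illustration under stated assumptions: the connection values are copied from the question, but the surrogate column `ChassiECU_Id`, the `'M'` slice boundary, the `ChassiECU_part1`/`ChassiECU_part2` directories and the premise that the first four characters of `ChassiNo` are digits are all hypothetical and must be adapted to the real schema. By default the script only prints the commands.

```shell
# Sketch only. Everything marked "hypothetical" is an assumption, not taken
# from the question. Commands are printed, not run, unless DRY_RUN is set to
# the empty string (e.g. DRY_RUN= sh this_script.sh).
DRY_RUN=${DRY_RUN-1}
CONNECT='jdbc:sqlserver://somedbserver;database=somedb'
BASE_DIR=/dataload/tohdfs/reio/odpdw/may2016

# Option 1: split on a surrogate integer key. ChassiECU_Id is a hypothetical
# column assumed to have been added to the table beforehand.
import_by_surrogate_key() {
  ${DRY_RUN:+echo} sqoop import --connect "$CONNECT" \
    --username someusname --password somepass \
    --table ChassiECU --split-by ChassiECU_Id --num-mappers 8 \
    --warehouse-dir "$BASE_DIR"
}

# Option 2: one single-mapper import per manually chosen slice.
# $1 = WHERE condition for the slice, $2 = target directory for that slice.
# A --query import needs the literal token $CONDITIONS and a --target-dir.
import_slice() {
  ${DRY_RUN:+echo} sqoop import --connect "$CONNECT" \
    --username someusname --password somepass \
    --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE $1 AND \$CONDITIONS" \
    --num-mappers 1 --target-dir "$2"
}

# Option 3: split on a deterministic integer expression over the varchar key.
# ASSUMPTION: the first four characters of ChassiNo are always digits;
# otherwise the CAST fails on the SQL Server side.
import_by_expression() {
  ${DRY_RUN:+echo} sqoop import --connect "$CONNECT" \
    --username someusname --password somepass \
    --table ChassiECU \
    --split-by 'CAST(SUBSTRING(ChassiNo, 1, 4) AS INT)' --num-mappers 8 \
    --warehouse-dir "$BASE_DIR"
}

# Example for option 2: two slices (hypothetical boundary 'M') that must not
# overlap and must together cover every key.
import_slice "ChassiNo < 'M'"  "$BASE_DIR/ChassiECU_part1"
import_slice "ChassiNo >= 'M'" "$BASE_DIR/ChassiECU_part2"
```

Whichever option is used, comparing the "Retrieved N records" line in the Sqoop log with the row count on SQL Server is a quick check that no duplicates or gaps were introduced.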
2016-05-23 10:56:35 leftjoin