
Sqoop import: composite primary key and textual primary key

Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1. The source DB schema is on SQL Server, and it contains several tables whose primary keys are either:

  • a varchar
  • composite: two varchar columns, one varchar + one int column, or two int columns

One of these tables is huge. As per the Sqoop documentation:

    Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column. 
    

The first question is: what about a table whose PK consists of three columns - one int + two varchar columns?

As per the Sqoop documentation lines above, what does "manually choose a splitting column" entail - do I sacrifice the PK and use just one of its columns, or am I missing some concept?

The table in SQL Server (it has only two columns, which form a composite primary key):

ChassiNo varchar(8) Unchecked 
ECU_Name nvarchar(15) Unchecked 

I proceeded with the import (the source table has 7909097 records):

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|' --table ChassiECU --num-mappers 8 --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose 

Worrying warnings, and incorrect mapper input and record counts:

16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job. 
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column. 
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records. 
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column. 
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30 
     File System Counters 
       FILE: Number of bytes read=0 
       FILE: Number of bytes written=1168400 
       FILE: Number of read operations=0 
       FILE: Number of large read operations=0 
       FILE: Number of write operations=0 
       HDFS: Number of bytes read=1128 
       HDFS: Number of bytes written=209961941 
       HDFS: Number of read operations=32 
       HDFS: Number of large read operations=0 
       HDFS: Number of write operations=16 
     Job Counters 
       Launched map tasks=8 
       Other local map tasks=8 
       Total time spent by all maps in occupied slots (ms)=62785 
       Total time spent by all reduces in occupied slots (ms)=0 
       Total time spent by all map tasks (ms)=62785 
       Total vcore-seconds taken by all map tasks=62785 
       Total megabyte-seconds taken by all map tasks=128583680 
     Map-Reduce Framework 
       Map input records=15818167 
       Map output records=15818167 
       Input split bytes=1128 
       Spilled Records=0 
       Failed Shuffles=0 
       Merged Map outputs=0 
       GC time elapsed (ms)=780 
       CPU time spent (ms)=45280 
       Physical memory (bytes) snapshot=2219433984 
       Virtual memory (bytes) snapshot=20014182400 
       Total committed heap usage (bytes)=9394716672 
     File Input Format Counters 
       Bytes Read=0 
     File Output Format Counters 
       Bytes Written=209961941 
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec) 
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records. 

The table created:

CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8), 
`ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU'; 

The dreadful result (no errors) - PROBLEM: 15818167 vs 7909097 (SQL Server) records:

> select count(1) from ChassiECU; 
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac 
Total jobs = 1 
Launching Job 1 out of 1 
Tez session was closed. Reopening... 
Session re-established. 
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059) 
-------------------------------------------------------------------------------- 
     VERTICES  STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED 
-------------------------------------------------------------------------------- 
Map 1 .......... SUCCEEDED  14   14  0  0  0  0 
Reducer 2 ...... SUCCEEDED  1   1  0  0  0  0 
-------------------------------------------------------------------------------- 
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.12 s 
-------------------------------------------------------------------------------- 
OK 
_c0 
15818167 

Surprisingly, I got either an accurate count or a mismatch of fewer than 10 records when the composite key included an int (which was used for splitting), but I am still apprehensive about those imports as well!

How should I proceed?


Hi, what I understand from your requirement is that you just want to move the contents of ChassiECU into Hive, for the various kinds of composite keys. Instead of the --table option, you can use --query "select * from ChassiECU where \$CONDITIONS" and pick one key column (preferably the one with the lowest cardinality) in the --split-by option. Also, please confirm that the column delimiter used in the sqoop import matches the one used in the Hive DDL. – Kfactor21
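A hedged sketch of this suggestion (choosing ChassiNo as the split column is only illustrative - it is still textual, so the TextSplitter warnings would remain; note that a free-form --query requires --target-dir instead of --warehouse-dir):

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --query "SELECT * FROM ChassiECU WHERE \$CONDITIONS" --split-by ChassiNo --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU --num-mappers 8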


Another option is to create a view on top of the table with a new key column that concatenates the values of all the key columns into a single column; that column can then be used in the sqoop import. – Kfactor21
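On the SQL Server side that could look like the following sketch (the view and column names are hypothetical); note that the concatenated key is still textual, so it would still be handled by the TextSplitter:

CREATE VIEW dbo.ChassiECU_v AS
SELECT ChassiNo + '_' + ECU_Name AS CombinedKey, -- all key columns concatenated into one
    ChassiNo, ECU_Name
FROM dbo.ChassiECU;

sqoop could then import the view with --table ChassiECU_v --split-by CombinedKey.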


I have added the table information; I am not sure whether the suggestions you provided will work for this table - could you check? –

Answer


Specify the split column manually. The split column is not necessarily equal to the PK. You can have a complex PK and some int split column: you can specify any integer column, or even a simple function (a simple function such as substring or cast, not an aggregation or analytic function). Ideally, the split column should be an evenly distributed integer.
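For instance, a minimal sketch (SomeTable and SomeIntColumn are hypothetical placeholders; ChassiECU itself has no integer column):

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --table SomeTable --split-by SomeIntColumn --num-mappers 8 --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016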

To see why even distribution matters: if your split column contains a few rows with the value -1 and 10M rows with values 10000 - 10000000, and num-mappers=8, then sqoop will not distribute the dataset evenly across the mappers:

  • the 1st mapper will get the few rows with -1,
  • the 2nd through 7th mappers will get 0 rows,
  • the 8th mapper will get almost 10M rows,

and this will result in data skew: the 8th mapper will run forever or even fail. When using a non-integer split column with MS-SQL, I also got duplicates. So, use an integer split column.
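For context, sqoop derives the splits from a bounding query and cuts the [min, max] range into num-mappers equal-width intervals. A sketch of the mechanics for the example above (split_col and sometable are placeholders):

SELECT MIN(split_col), MAX(split_col) FROM sometable; -- sqoop's bounding query returns -1 and 10000000
-- [-1, 10000000] is then cut into 8 equal-width intervals, one per mapper;
-- intervals that cover few or none of the actual values match few or no rows,
-- which produces exactly the skew described above.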

In your case, with a table of only two varchar columns, you can (see the sketches after this list):

(1) add a surrogate int PK and use it as the split column as well, or

(2) split the data manually using custom queries with WHERE clauses, running sqoop several times with num-mappers=1, or

(3) apply some deterministic, non-aggregate integer function to your varchar column, for example a cast of a substring, as the split column.
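Hedged sketches of the three options; the surrogate column name Id, the boundary value 'N', and the assumption in (3) that ChassiNo starts with digits are all hypothetical:

(1) Add a surrogate integer PK on the SQL Server side, then split on it with --split-by Id as in the earlier sketch:

ALTER TABLE dbo.ChassiECU ADD Id INT IDENTITY(1,1);

(2) Partition manually with WHERE clauses, one single-mapper import per slice:

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --query "SELECT * FROM ChassiECU WHERE ChassiNo < 'N' AND \$CONDITIONS" --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU_part1 --num-mappers 1

then repeat with ChassiNo >= 'N' into a second directory, and so on.

(3) Split on a deterministic integer expression over the varchar column:

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --table ChassiECU --split-by "CAST(SUBSTRING(ChassiNo, 1, 4) AS INT)" --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --num-mappers 8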