
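Problem statement: the order of the output does not match the order of the given input

I am merging two datasets that have different column names in Apache Spark. After applying the isin() function, the order of the rows in the dataset changes.

I also tried sort and orderBy, but neither worked.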

Input data 1:

RowFactory.create("405-048011-62815", "CRC Industries"), 
RowFactory.create("630-0746","Dixon value"), 
RowFactory.create("4444-444","3M INdustries"), 
RowFactory.create("555-55","Dixon coupling valve") 

Input data 2:

RowFactory.create("222-2222-5555", "Tata"), 
RowFactory.create("7777-88886","WestSide"), 
RowFactory.create("22222-22224","Reliance"), 
RowFactory.create("33333-3333","V industries") 


import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// `spark` is an existing SparkSession.

// First dataset: columns label1 / sentence1.
List<Row> data = Arrays.asList(
    RowFactory.create("405-048011-62815", "CRC Industries"),
    RowFactory.create("630-0746", "Dixon value"),
    RowFactory.create("4444-444", "3M INdustries"),
    RowFactory.create("555-55", "Dixon coupling valve"));

StructType schema = new StructType(new StructField[] {
    new StructField("label1", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence1", DataTypes.StringType, false, Metadata.empty()) });

Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Keep only the rows whose label1 is in the lookup list.
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
listStrings.add("4444-444");
listStrings.add("555-55");

Dataset<Row> matchFound1 = sentenceDataFrame.filter(
    col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();

// Index label1 so that the index can be used as a join key.
StringIndexer indexer = new StringIndexer()
    .setInputCol("label1")
    .setOutputCol("label1Index1");
Dataset<Row> Dataset1 = indexer.fit(matchFound1).transform(matchFound1);
Dataset1.show();

// Second dataset: columns label2 / sentence2.
List<Row> data2 = Arrays.asList(
    RowFactory.create("222-2222-5555", "Tata"),
    RowFactory.create("7777-88886", "WestSide"),
    RowFactory.create("22222-22224", "Reliance"),
    RowFactory.create("33333-3333", "V industries"));

StructType schema2 = new StructType(new StructField[] {
    new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) });

Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);

// Reuse the lookup list for the second dataset.
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
listStrings.add("22222-22224");
listStrings.add("33333-3333");

Dataset<Row> matchFound2 = sentenceDataFrame2.filter(
    col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();

StringIndexer indexer1 = new StringIndexer()
    .setInputCol("label2")
    .setOutputCol("label2Index1");
Dataset<Row> Dataset2 = indexer1.fit(matchFound2).transform(matchFound2);
Dataset2.show();

// Join on the two index columns, then drop them from the result.
Dataset<Row> Finalresult = Dataset1
    .join(Dataset2, Dataset1.col("label1Index1").equalTo(Dataset2.col("label2Index1")))
    .drop(Dataset1.col("label1Index1"))
    .drop(Dataset2.col("label2Index1"));
Finalresult.show();

Actual output:

+----------------+--------------------+-------------+------------+
|          label1|           sentence1|       label2|   sentence2|
+----------------+--------------------+-------------+------------+
|405-048011-62815|      CRC Industries|   33333-3333|V industries|
|        630-0746|         Dixon value|222-2222-5555|        Tata|
|        4444-444|       3M INdustries|   7777-88886|    WestSide|
|          555-55|Dixon coupling valve|  22222-22224|    Reliance|
+----------------+--------------------+-------------+------------+

Expected output:

+----------------+--------------------+-------------+------------+
|          label1|           sentence1|       label2|   sentence2|
+----------------+--------------------+-------------+------------+
|405-048011-62815|      CRC Industries|222-2222-5555|        Tata|
|        630-0746|         Dixon value|   7777-88886|    WestSide|
|        4444-444|       3M INdustries|  22222-22224|    Reliance|
|          555-55|Dixon coupling valve|   33333-3333|V industries|
+----------------+--------------------+-------------+------------+
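
A likely reason for the mismatch: StringIndexer assigns indices by label frequency (its default ordering is frequencyDesc), not by row position, so when every label occurs exactly once the index says nothing about where the row sat in the input. The plain-Java sketch below only illustrates that idea and is not Spark's implementation. `indexLabels` is a hypothetical helper, and the alphabetical tie-break is an assumption based on what newer Spark releases document; older releases may order ties differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexByFrequencyDemo {

    // Hypothetical stand-in for frequency-based label indexing:
    // most frequent label gets index 0; ties are broken alphabetically (assumption).
    static Map<String, Integer> indexLabels(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }

        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // higher count first
            return byCount != 0 ? byCount : a.compareTo(b);              // then alphabetical
        });

        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        // label2 values in their original input order.
        List<String> label2 = Arrays.asList(
            "222-2222-5555", "7777-88886", "22222-22224", "33333-3333");
        Map<String, Integer> index = indexLabels(label2);

        // Print the input row position next to the index each label receives.
        for (int row = 0; row < label2.size(); row++) {
            System.out.println("row " + row + " -> index " + index.get(label2.get(row)));
        }
    }
}
```

Under these assumptions "7777-88886" sits in input row 1 but receives index 3, so a join on the index columns cannot be expected to pair rows by position.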

Answer


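Instead of using StringIndexer, you can use monotonically_increasing_id() to add a persistent column of unique sequence numbers and recreate the DataFrames as follows: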

// Requires: import static org.apache.spark.sql.functions.monotonically_increasing_id;
Dataset<Row> Test2 = Dataset2.withColumn("rowId2", monotonically_increasing_id());
Dataset<Row> Test1 = Dataset1.withColumn("rowId1", monotonically_increasing_id());

Then join the two datasets:

Dataset<Row> Finalresult = Test1.join(Test2 , Test1.col("rowId1").equalTo(Test2.col("rowId2")));
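
A caveat worth checking before relying on this join: monotonically_increasing_id() is documented as unique and increasing but not consecutive, because the partition index goes into the upper 31 bits of the id and the record number within the partition into the lower 33 bits. The two id columns therefore line up only if both datasets are split into partitions the same way. The plain-Java sketch below simply simulates that bit layout. `simulateIds` and the partition sizes are made up for illustration, and no Spark code is involved.

```java
import java.util.ArrayList;
import java.util.List;

class MonotonicIdLayoutDemo {

    // Simulates the documented id layout: partition index in the upper 31 bits,
    // record number within the partition in the lower 33 bits.
    // rowsPerPartition[p] is the (hypothetical) number of rows in partition p.
    static List<Long> simulateIds(int[] rowsPerPartition) {
        List<Long> ids = new ArrayList<>();
        for (int p = 0; p < rowsPerPartition.length; p++) {
            for (long r = 0; r < rowsPerPartition[p]; r++) {
                ids.add(((long) p << 33) + r);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Four rows in a single partition.
        System.out.println(simulateIds(new int[] {4}));    // [0, 1, 2, 3]
        // The same four rows split across two partitions.
        System.out.println(simulateIds(new int[] {2, 2})); // [0, 1, 8589934592, 8589934593]
    }
}
```

If one dataset ends up in a single partition and the other in two, only the first two rows would find a join partner. When the layouts can differ, one option is to derive a consecutive index instead, for example with zipWithIndex() on the underlying RDD or row_number() over a window ordered by the generated id; neither is tested here against this exact data.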