Jaro-Winkler score calculation in Apache Spark

We need to compute the Jaro-Winkler distance across the strings in a Dataset in Apache Spark. We are new to Spark, and after searching the web we could not find much. It would be great if you could point us in the right direction. We thought of using flatMap, then realized it would not help; we then tried a foreach loop but could not figure out how to move forward, since each string has to be compared with every other string, as in the dataset below:
RowFactory.create(0, "Hi I heard about Spark"),
RowFactory.create(1,"I wish Java could use case classes"),
RowFactory.create(2,"Logistic,regression,models,are,neat"));
Sample Jaro-Winkler scores across all the strings found in the above data frame:

Distance score between labels 0,1 -> 0.56
Distance score between labels 0,2 -> 0.77
Distance score between labels 0,3 -> 0.45
Distance score between labels 1,2 -> 0.77
Distance score between labels 2,3 -> 0.79
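For clarity, this is the computation in its plain, non-distributed form. Below is a minimal sketch using the same info.debatty JaroWinkler class over a nested loop (the class name PairwiseJaroWinkler is ours, and the printed scores depend on the library version, so they may not match the figures above exactly):

import info.debatty.java.stringsimilarity.JaroWinkler;

public class PairwiseJaroWinkler {
    public static void main(String[] args) {
        String[] sentences = {
                "Hi I heard about Spark",
                "I wish Java could use case classes",
                "Logistic,regression,models,are,neat" };
        JaroWinkler jw = new JaroWinkler();
        // Compare every string with every other string exactly once.
        for (int i = 0; i < sentences.length; i++) {
            for (int j = i + 1; j < sentences.length; j++) {
                System.out.printf("Distance score between labels %d,%d -> %.2f%n",
                        i, j, jw.similarity(sentences[i], sentences[j]));
            }
        }
    }
}

The question is how to do the same thing when the strings live in a Spark Dataset that may not fit on one machine.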
Our code so far:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import info.debatty.java.stringsimilarity.JaroWinkler;

public class JaroTestExample {

    public static void main(String[] args) {
        // Windows workaround for winutils.exe
        System.setProperty("hadoop.home.dir", "C:\\winutil");

        SparkSession spark = SparkSession.builder()
                .appName("JaroTestExample")
                .master("local[*]")
                .getOrCreate();

        // Sanity checks on the library itself.
        JaroWinkler jw = new JaroWinkler();
        // transposition of s and t
        System.out.println(jw.similarity("My string", "My tsring"));
        // transposition of s and n
        System.out.println(jw.similarity("My string", "My ntrisg"));

        List<Row> data = Arrays.asList(
                RowFactory.create(0, "Hi I heard about Spark"),
                RowFactory.create(1, "I wish Java could use case classes"),
                RowFactory.create(2, "Logistic,regression,models,are,neat"));
        StructType schema = new StructType(new StructField[] {
                new StructField("label", DataTypes.IntegerType, false,
                        Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false,
                        Metadata.empty()) });
        Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

        // This is where we are stuck: foreach visits one row at a time,
        // but we need to compare every row with every other row.
        sentenceDataFrame.foreach((ForeachFunction<Row>) row -> System.out.println(row));
    }
}
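One direction we are considering is a self cross join plus a UDF, so that every unordered pair of rows appears exactly once and the scoring runs on the executors. Below is a minimal, untested sketch assuming Spark 2.x and the same info.debatty library; the class name JaroWinklerPairs and the renamed columns label1/sentence1/label2/sentence2 are only illustrative:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import info.debatty.java.stringsimilarity.JaroWinkler;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class JaroWinklerPairs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("JaroWinklerPairs")
                .master("local[*]")
                .getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create(0, "Hi I heard about Spark"),
                RowFactory.create(1, "I wish Java could use case classes"),
                RowFactory.create(2, "Logistic,regression,models,are,neat"));
        StructType schema = new StructType(new StructField[] {
                new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });
        Dataset<Row> sentences = spark.createDataFrame(data, schema);

        // Wrap the library call in a UDF so Spark can run it on executor JVMs.
        spark.udf().register("jaroWinkler",
                (UDF2<String, String, Double>) (s1, s2) -> new JaroWinkler().similarity(s1, s2),
                DataTypes.DoubleType);

        // Cross join the frame with itself; label1 < label2 keeps each
        // unordered pair once and drops self-comparisons.
        Dataset<Row> scores = sentences.toDF("label1", "sentence1")
                .crossJoin(sentences.toDF("label2", "sentence2"))
                .filter(col("label1").lt(col("label2")))
                .withColumn("score",
                        callUDF("jaroWinkler", col("sentence1"), col("sentence2")));

        scores.select("label1", "label2", "score").show(false);
        spark.stop();
    }
}

Is this the right way to structure an all-pairs comparison in Spark, or is there a more idiomatic approach for large datasets, where a cross join grows quadratically?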