Processing Spark data using grouping
I need to group a set of CSV rows by a particular column and do some processing on each group.
JavaRDD<String> lines = sc.textFile("somefile.csv");
JavaPairRDD<String, String> pairRDD = lines.mapToPair(new SomeParser());
List<String> keys = pairRDD.keys().distinct().collect();
for (String key : keys)
{
    List<String> rows = pairRDD.lookup(key);
    noOfVisits = rows.size();
    country = COMMA.split(rows.get(0))[6];
    accessDuration = getAccessDuration(rows, timeFormat);
    Map<String,Integer> counts = getCounts(rows);
    whitepapers = counts.get("whitepapers");
    tutorials = counts.get("tutorials");
    workshops = counts.get("workshops");
    casestudies = counts.get("casestudies");
    productPages = counts.get("productpages");
}
private static long dateParser(String dateString) throws ParseException {
    SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy HH:mma");
    Date date = format.parse(dateString);
    return date.getTime();
}
dateParser is called for each row. The min and max for the group are then calculated to get the access duration. The other fields are string matches.
pairRDD.lookup is very slow. Is there a better way to do this in Spark?
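I have already tried that, and it was even slower. One of the operations parses the date column of each group to compute the duration. – lochi 2014-10-28 15:40:33
Could you add details to the question about the operations you perform on the values of each key? 'reduceByKey' is more efficient than 'groupByKey' and may be a better choice. – maasg 2014-10-28 15:42:53
See above. Thanks. – lochi 2014-10-28 15:52:51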
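The reduceByKey suggestion in the comments can be illustrated with a minimal sketch. The idea is that the visit count and access duration described in the question only need a running count, minimum and maximum per key, so each group can be folded in one pass instead of one lookup per key. The class GroupStats and its methods are hypothetical names, not part of the original code, and the example runs locally with a HashMap standing in for Spark so that it needs no cluster. It also assumes each row's timestamp has already been parsed to epoch milliseconds, for example by dateParser.

```java
import java.io.Serializable;
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-key summary: visit count plus earliest and latest timestamp.
class GroupStats implements Serializable {
    final long visits;
    final long minTime;
    final long maxTime;

    GroupStats(long visits, long minTime, long maxTime) {
        this.visits = visits;
        this.minTime = minTime;
        this.maxTime = maxTime;
    }

    // Summary for a single row whose timestamp is already in epoch millis.
    static GroupStats ofTimestamp(long t) {
        return new GroupStats(1, t, t);
    }

    // Associative and commutative merge, the shape of function reduceByKey expects.
    static GroupStats merge(GroupStats a, GroupStats b) {
        return new GroupStats(a.visits + b.visits,
                Math.min(a.minTime, b.minTime),
                Math.max(a.maxTime, b.maxTime));
    }

    long accessDuration() {
        return maxTime - minTime;
    }

    // Local stand-in for reduceByKey: fold (key, timestamp) pairs into one summary per key.
    static Map<String, GroupStats> summarize(List<Map.Entry<String, Long>> pairs) {
        Map<String, GroupStats> out = new HashMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            out.merge(p.getKey(), ofTimestamp(p.getValue()), GroupStats::merge);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<String, Long>("u1", 1000L),
                new AbstractMap.SimpleEntry<String, Long>("u2", 2000L),
                new AbstractMap.SimpleEntry<String, Long>("u1", 5000L),
                new AbstractMap.SimpleEntry<String, Long>("u1", 3000L));
        Map<String, GroupStats> stats = summarize(pairs);
        GroupStats u1 = stats.get("u1");
        System.out.println("u1 visits=" + u1.visits + " duration=" + u1.accessDuration());
        // prints: u1 visits=3 duration=4000
    }
}
```

In Spark the same merge function could plausibly be used as pairRDD.mapValues(...).reduceByKey(GroupStats::merge), where the mapValues step (not shown, and dependent on the row layout) would turn each row into a single-row GroupStats. The category counts from getCounts could be carried in the same summary object and added together in merge.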