I am working through this MOOC as a Spark refresher and hit this exercise: "Find the number of unique hosts in the previously created DataFrame" (Apache log analysis). What is the most efficient way to count the unique values in a column of a Spark DataFrame?
The DataFrame looks like this:
+--------------------+--------------------+------+------------+--------------------+
| host| path|status|content_size| time|
+--------------------+--------------------+------+------------+--------------------+
| in24.inetnebr.com |/shuttle/missions...| 200| 1839|1995-08-01 00:00:...|
| uplherc.upl.com | /| 304| 0|1995-08-01 00:00:...|
| uplherc.upl.com |/images/ksclogo-m...| 304| 0|1995-08-01 00:00:...|
| uplherc.upl.com |/images/MOSAIC-lo...| 304| 0|1995-08-01 00:00:...|
| uplherc.upl.com |/images/USA-logos...| 304| 0|1995-08-01 00:00:...|
|ix-esc-ca2-07.ix....|/images/launch-lo...| 200| 1713|1995-08-01 00:00:...|
| uplherc.upl.com |/images/WORLD-log...| 304| 0|1995-08-01 00:00:...|
|slppp6.intermind....|/history/skylab/s...| 200| 1687|1995-08-01 00:00:...|
|piweba4y.prodigy....|/images/launchmed...| 200| 11853|1995-08-01 00:00:...|
|slppp6.intermind....|/history/skylab/s...| 200| 9202|1995-08-01 00:00:...|
|slppp6.intermind....|/images/ksclogosm...| 200| 3635|1995-08-01 00:00:...|
|ix-esc-ca2-07.ix....|/history/apollo/i...| 200| 1173|1995-08-01 00:00:...|
|slppp6.intermind....|/history/apollo/i...| 200| 3047|1995-08-01 00:00:...|
| uplherc.upl.com |/images/NASA-logo...| 304| 0|1995-08-01 00:00:...|
| 133.43.96.45 |/shuttle/missions...| 200| 10566|1995-08-01 00:00:...|
|kgtyk4.kj.yamagat...| /| 200| 7280|1995-08-01 00:00:...|
|kgtyk4.kj.yamagat...|/images/ksclogo-m...| 200| 5866|1995-08-01 00:00:...|
| d0ucr6.fnal.gov |/history/apollo/a...| 200| 2743|1995-08-01 00:00:...|
+--------------------+--------------------+------+------------+--------------------+
So far I have tried three ways of finding the number of unique hosts:
from pyspark.sql import functions as func
unique_host_count = logs_df.agg(func.countDistinct(func.col("host"))).head()[0]
This runs in about 0.72 seconds.
unique_host_count = logs_df.select("host").distinct().count()
This runs in about 0.57 seconds.
unique_host = logs_df.groupBy("host").count()
unique_host_count = unique_host.count()
This runs in about 0.62 seconds.
So my question is: is there any better option than the second one? I assumed distinct would be an expensive operation, but it turned out to be the fastest.
The DataFrame I am using has 1,043,177 rows.
Spark version: 1.6.1
Cluster: 6 GB memory