Spark: counting the lines that contain a specific word


I have a log file in which some lines contain the word "error". How can I count the total number of lines containing this term in Apache Spark?

So far I have been using this approach:

from pyspark import SparkConf, SparkContext 

conf = SparkConf().setMaster("local").setAppName("WordCount") 
sc = SparkContext(conf = conf) 

input = sc.textFile("errors.txt") 
words = input.flatMap(lambda x: x for x if "errors" in input) 
wordCounts = input.countByValue() 

for word, count in wordCounts.items(): 
    print str(count)

But this approach does not work. Can anyone tell me how to get the count?

Edit: the Scala equivalent is

lines = spark.textFile("hdfs://...") 
errors = lines.filter(_.startsWith("ERROR")) 
errors.persist()

What is the Python equivalent of these lines?


2017-07-13 Sid

'rdd.count()' should work. – philantrovert

Please use the following code snippet:

from pyspark import SparkConf, SparkContext 

conf = SparkConf().setMaster("local").setAppName("errors") 
sc = SparkContext(conf = conf) 

lines = sc.textFile("errors.txt") 
rdd = lines.filter(lambda x: "error" in x) 
print(rdd.count())  # count() is an action and must be called to return the number of matching lines


2017-07-13 11:17:04

Thanks for the code. – Sid

You're very welcome. –

input.filter(lambda line: "error" in line).count() should work.
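As a rough sketch of how this lines up with the Scala snippet in the question: filter with a predicate, optionally persist the result, then call count(). The helper names is_error_line and count_error_lines are made up for illustration, and the pyspark import sits inside the function only so that the predicate can be tried without a Spark installation.

```python
def is_error_line(line, prefix="ERROR"):
    # Python counterpart of Scala's _.startsWith("ERROR")
    return line.startswith(prefix)


def count_error_lines(path="errors.txt"):
    # Imported here so is_error_line can be tested without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("errors")
    sc = SparkContext(conf=conf)
    try:
        lines = sc.textFile(path)
        errors = lines.filter(is_error_line)
        errors.persist()       # optional: keep the filtered RDD around for reuse
        return errors.count()  # action: runs the job and returns an int
    finally:
        sc.stop()
```

To match the word anywhere in the line rather than only at the start, the predicate could be swapped for lambda line: "error" in line, as in the one-liner above.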


2017-07-13 10:59:43 neilron

Thanks for your solution. I was able to solve it another way:

input = sc.textFile("errors.txt") 
wordCounts = input.countByValue() 

for word, count in wordCounts.items(): 
    if "error" in word: 
        print(count)
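
Note that this loop prints one count per distinct line rather than a single total. A minimal sketch of adding them up, using collections.Counter as a stand-in for the dictionary that countByValue() returns (the sample lines and the helper name total_matching are invented for illustration):

```python
from collections import Counter


def total_matching(value_counts, word="error"):
    # value_counts maps each distinct line to how many times it occurs,
    # the same shape as the result of rdd.countByValue().
    return sum(count for line, count in value_counts.items() if word in line)


# Stand-in for input.countByValue() on a small, made-up log.
sample = Counter([
    "error: disk full",
    "error: disk full",
    "info: started",
    "error: timeout",
])
print(total_matching(sample))
```

With a real RDD this would be total_matching(input.countByValue()). Since countByValue() brings every distinct line back to the driver, the filter(...).count() version is probably the cheaper option on large logs.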


2017-07-13 11:22:17 Sid
