2017-07-02 72 views

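What are the empty files after RDD.saveAsTextFile?

I'm learning Spark by working through some of the examples in Learning Spark: Lightning Fast Data Analysis and then adding my own developments.

I created this class to look at basic transformations and actions.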

/** 
* Find errors in a log file 
*/ 

package com.oreilly.learningsparkexamples.mini.java; 

import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 

public class FindErrors {
    public static void main(String[] args) {
        String inputFile = args[0];
        String outputFile = args[1];
        // Create a Spark context
        SparkConf conf = new SparkConf().setAppName("findErrors");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load input data
        JavaRDD<String> input = sc.textFile(inputFile);
        // Keep only the lines that contain "error"
        JavaRDD<String> errorsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("error");
                }
            });
        //errorsRDD.saveAsTextFile(outputFile);

        // Keep only the lines that contain "warning"
        JavaRDD<String> warningsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("warning");
                }
            });

        // Combine the error lines and the warning lines
        JavaRDD<String> badLinesRDD = errorsRDD.union(warningsRDD);

        // Write the combined lines to the output directory
        badLinesRDD.saveAsTextFile(outputFile);

        System.out.println("I had " + badLinesRDD.count() + " concerning lines.");
        System.out.println("Here are 10 examples:");
        for (String line : badLinesRDD.take(10)) {
            System.out.println(line);
        }
    }
}

This is the command I use to run it:

$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.FindErrors ./target/learning-spark-mini-example-0.0.1.jar ../files/fake_logs/log1.log ./errorLog 

Here are the contents of the log file:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /favicon.ico HTTP/1.1" 200 1713 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET / HTTP/1.1" 200 18785 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

One thing I noticed is that the output consists of several files rather than the single file I expected.

The files are:

_SUCCESS 


part-00000 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

part-00001 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

part-00002 


part-00003 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

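It looks as though a file is created for each "group" of warnings/errors. What are the blank files, though?

Also, is this something in my code that I haven't spotted yet, or is it a feature of Spark?

Answer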

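This is a feature. With saveAsTextFile, Spark writes one output file per partition, whether or not the partition contains any data. Because you applied a filter, some input partitions that originally contained data can end up empty. Hence the empty files.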
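To make the partition behaviour concrete, here is a minimal sketch in plain Java (no Spark involved) that imitates what happens to the partitions. Everything in it is an assumption for illustration: the class name `PartitionSketch`, the helper methods, the shortened request strings standing in for the log lines, and the split of the ten lines into two input partitions of six and four lines. It only models the idea that filtering keeps the number of partitions unchanged and that a union places the partitions of the second RDD after those of the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PartitionSketch {

    // Filter each partition on its own. The number of partitions never changes,
    // so a partition with no matching lines is kept as an empty list.
    public static List<List<String>> filterPartitions(List<List<String>> partitions,
                                                      Predicate<String> keep) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> partition : partitions) {
            result.add(partition.stream().filter(keep).collect(Collectors.toList()));
        }
        return result;
    }

    // Model of a union: the partitions of b are placed after those of a.
    public static List<List<String>> union(List<List<String>> a, List<List<String>> b) {
        List<List<String>> result = new ArrayList<>(a);
        result.addAll(b);
        return result;
    }

    // Hypothetical stand-in for the ten log lines, assumed to be split 6 + 4.
    public static List<List<String>> sampleInput() {
        return Arrays.asList(
            Arrays.asList("GET /071300/242153", "GET /error", "GET /favicon.ico",
                          "GET /", "GET /jobmineimg.php?q=m", "GET /error"),
            Arrays.asList("GET /error", "GET /jobmineimg.php?q=m",
                          "GET /warning", "GET /warning"));
    }

    public static void main(String[] args) {
        List<List<String>> input = sampleInput();
        List<List<String>> errors = filterPartitions(input, line -> line.contains("error"));
        List<List<String>> warnings = filterPartitions(input, line -> line.contains("warning"));
        List<List<String>> badLines = union(errors, warnings);

        // One "part" entry per partition, even when the partition is empty.
        for (int i = 0; i < badLines.size(); i++) {
            System.out.println(String.format("part-%05d: %d lines", i, badLines.get(i).size()));
        }
    }
}
```

Under these assumptions the sketch yields four partitions holding 2, 1, 0 and 2 lines, which mirrors part-00000 to part-00003 in the question: the empty one is the first input partition after the "warning" filter. If a single output file is preferred in the real job, one common option is to call `coalesce(1)` on the RDD before `saveAsTextFile`, at the cost of writing through a single task.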

Cheers user6910411. – runnerpaul
