
I'm working with a dataset of political campaign contributions that ends up as a roughly 500MB JSON file (originally a 124MB CSV). It's far too big to import through the Firebase web interface (the attempt crashes the tab in Google Chrome). I tried uploading the objects one at a time as they were produced from the CSV (using a CSV-to-JSON converter, each row becomes a JSON object, which I then upload to Firebase). What is the right way to import a large amount of data into a Firebase database?

Here is the code I used.

var firebase = require('firebase');
var Converter = require("csvtojson").Converter;

firebase.initializeApp({
    serviceAccount: "./credentials.json",
    databaseURL: "url went here"
});

// constructResult:false keeps csvtojson from building the whole result set in memory.
var converter = new Converter({
    constructResult: false,
    workerNum: 4
});

var db = firebase.database();
var ref = db.ref("/");

var lastindex = 0;
var count = 0;
var section = 0;
var sectionRef;

converter.on("record_parsed", function (resultRow, rawRow, rowIndex) {
    if (rowIndex >= 0) {
        // Group reports into sections of 1000 under /reports<section>/<Report_ID>.
        sectionRef = ref.child("reports" + section);
        var reportRef = sectionRef.child(resultRow.Report_ID);
        reportRef.set(resultRow);
        console.log("Report uploaded, count at " + count + ", section at " + section);
        count += 1;
        lastindex = rowIndex;
        if (count >= 1000) {
            count = 0;
            section += 1;
        }
        if (section >= 100) {
            console.log("last completed index: " + lastindex);
            process.exit();
        }
    } else {
        console.log("we out of indices");
        process.exit();
    }
});

var readStream = require("fs").createReadStream("./vUPLOAD_MASTER.csv");
readStream.pipe(converter);

However, this runs into memory problems and never finishes the dataset. Doing it in chunks wasn't workable either, because Firebase wasn't showing all of the uploaded data and I couldn't tell where to pick up from. (With the Firebase database open in Chrome I could watch the data come in, but eventually the tab would crash, and after reloading a lot of the later data was missing.)
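
For illustration, one way to keep memory bounded with this kind of script is to stop reading new rows while a batch of writes is still in flight. The sketch below is not the script above: it assumes the v3 Node SDK (where ref.set() returns a promise), reads lines with readline, uses stand-in parsing instead of csvtojson, and the credentials and database URL are placeholders.

var firebase = require('firebase');
var fs = require('fs');
var readline = require('readline');

firebase.initializeApp({
    serviceAccount: "./credentials.json",           // placeholder credentials
    databaseURL: "https://example.firebaseio.com/"  // placeholder URL
});
var ref = firebase.database().ref("/reports");

var rl = readline.createInterface({
    input: fs.createReadStream('./vUPLOAD_MASTER.csv')
});
var batch = [];   // promises for the writes currently in flight

rl.on('line', function (line) {
    // Stand-in parsing: a real version would build the object with csvtojson.
    var record = { Report_ID: line.split(',')[0], raw: line };
    batch.push(ref.child(record.Report_ID).set(record));

    if (batch.length >= 500) {
        rl.pause();                          // stop reading while the batch drains
        Promise.all(batch).then(function () {
            batch = [];
            rl.resume();                     // memory stays bounded by the batch size
        });
    }
});

rl.on('close', function () {
    // Flush whatever is left before exiting.
    Promise.all(batch).then(function () {
        process.exit(0);
    });
});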

I then tried Firebase Streaming Import, but it throws this error:

started at 1469471482.77
Traceback (most recent call last):
  File "import.py", line 90, in <module>
    main(argParser.parse_args())
  File "import.py", line 20, in main
    for prefix, event, value in parser:
  File "R:\Python27\lib\site-packages\ijson\common.py", line 65, in parse
    for event, value in basic_events:
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 185, in basic_parse
    for value in parse_value(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 127, in parse_value
    raise UnexpectedSymbol(symbol, pos)
ijson.backends.python.UnexpectedSymbol: Unexpected symbol u'\ufeff' at 0

Googling that last line (the ijson error), I found this SO thread, but I'm just not sure how to apply it to get Firebase Streaming Import working.
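
The character in question is a UTF-8 byte order mark (the bytes EF BB BF at the very start of the file). As a rough sketch, it can be stripped in Node with a streaming copy so the 500MB file never has to fit in memory (file names here are placeholders):

var fs = require('fs');

var input = fs.createReadStream('./data.json');          // placeholder input path
var output = fs.createWriteStream('./data.nobom.json');  // placeholder output path
var first = true;

input.on('data', function (chunk) {
    if (first) {
        first = false;
        // Drop the three BOM bytes if they lead the first chunk.
        if (chunk.length >= 3 && chunk[0] === 0xEF && chunk[1] === 0xBB && chunk[2] === 0xBF) {
            chunk = chunk.slice(3);
        }
    }
    output.write(chunk);
});

input.on('end', function () {
    output.end();
});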

I removed the byte order mark from the JSON file I'm trying to upload, and now, after the importer runs for about a minute, I get this error:

Traceback (most recent call last):
  File "import.py", line 90, in <module>
    main(argParser.parse_args())
  File "import.py", line 20, in main
    for prefix, event, value in parser:
  File "R:\Python27\lib\site-packages\ijson\common.py", line 65, in parse
    for event, value in basic_events:
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 185, in basic_parse
    for value in parse_value(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 116, in parse_value
    for event in parse_array(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 138, in parse_array
    for event in parse_value(lexer, symbol, pos):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 119, in parse_value
    for event in parse_object(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 170, in parse_object
    pos, symbol = next(lexer)
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 51, in Lexer
    buf += data
MemoryError

Firebase Streaming Import is supposed to handle files of 250MB+, and I'm fairly sure I have enough memory for this file. Any idea why this error appears?

If it would help to see the actual JSON file I'm trying to upload with Firebase Streaming Import, here it is.

Answer


I gave up on Firebase Streaming Import and wrote my own tool, which converts the CSV with csvtojson and then uses the Firebase Node API to upload each object one at a time.

Here's the script:

var firebase = require("firebase"); 
firebase.initializeApp({ 
    serviceAccount: "./credentials.json", 
    databaseURL: "https://necir-hackathon.firebaseio.com/" 
}); 

var db = firebase.database(); 
var ref = db.ref("/reports"); 
var fs = require('fs'); 
var Converter = require("csvtojson").Converter; 
var header = "Report_ID,Status,CPF_ID,Filing_ID,Report_Type_ID,Report_Type_Description,Amendment,Amendment_Reason,Amendment_To_Report_ID,Amended_By_Report_ID,Filing_Date,Reporting_Period,Report_Year,Beginning_Date,Ending_Date,Beginning_Balance,Receipts,Subtotal,Expenditures,Ending_Balance,Inkinds,Receipts_Unitemized,Receipts_Itemized,Expenditures_Unitemized,Expenditures_Itemized,Inkinds_Unitemized,Inkinds_Itemized,Liabilities,Savings_Total,Report_Month,UI,Reimbursee,Candidate_First_Name,Candidate_Last_Name,Full_Name,Full_Name_Reverse,Bank_Name,District_Code,Office,District,Comm_Name,Report_Candidate_First_Name,Report_Candidate_Last_Name,Report_Office_District,Report_Comm_Name,Report_Bank_Name,Report_Candidate_Address,Report_Candidate_City,Report_Candidate_State,Report_Candidate_Zip,Report_Treasurer_First_Name,Report_Treasurer_Last_Name,Report_Comm_Address,Report_Comm_City,Report_Comm_State,Report_Comm_Zip,Category,Candidate_Clarification,Rec_Count,Exp_Count,Inkind_Count,Liab_Count,R1_Count,CPF9_Count,SV1_Count,Asset_Count,Savings_Account_Count,R1_Item_Count,CPF9_Item_Count,SV1_Item_Count,Filing_Mechanism,Also_Dissolution,Segregated_Account_Type,Municipality_Code,Current_Report_ID,Location,Individual_Or_Organization,Notable_Contributor,Currently_Accessed" 
var queue = []; 
var count = 0; 
var upload_lock = false; 
var lineReader = require('readline').createInterface({ 
    input: fs.createReadStream('test.csv') 
}); 

lineReader.on('line', function (line) { 
    var line = line.replace(/'/g, "\\'"); 
    var csvString = header + '\n' + line; 
    var converter = new Converter({}); 
    converter.fromString(csvString, function(err,result){ 
     if (err) { 
      var errstring = err + "\n"; 
      fs.appendFile('converter_error_log.txt', errstring, function(err){ 
       if (err) { 
       console.log("Converter: Append Log File Error Below:"); 
       console.error(err); 
       process.exit(1); 
      } else { 
       console.log("Converter Error Saved"); 
      } 
      }); 
     } else { 
      result[0].Location = ""; 
      result[0].Individual_Or_Organization = ""; 
      result[0].Notable_Contributor = ""; 
      result[0].Currently_Accessed = ""; 
      var reportRef = ref.child(result[0].Report_ID); 
      count += 1; 
      reportRef.set(result[0]); 
      console.log("Sent #" + count); 
     } 
    }); 
}); 

The only caveat is that although the script can send all the objects out quickly, Firebase apparently needs the connection to stay open while it saves them: closing the script as soon as every object had been sent left a lot of objects missing from the database. (I waited 20 minutes to be safe, but it can probably be shorter.)
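
For reference, one way to avoid guessing at a wait time would be to keep the process alive until Firebase acknowledges every write. This is only a sketch, not the script above: it relies on set()'s completion-callback form, uses stand-in parsing instead of csvtojson, and the credentials and database URL are placeholders.

var firebase = require('firebase');
var fs = require('fs');
var readline = require('readline');

firebase.initializeApp({
    serviceAccount: "./credentials.json",           // placeholder credentials
    databaseURL: "https://example.firebaseio.com/"  // placeholder URL
});
var ref = firebase.database().ref("/reports");

var outstanding = 0;     // writes sent but not yet acknowledged by the server
var doneReading = false;

function maybeExit() {
    // Exit only once the whole file has been read and every write has completed.
    if (doneReading && outstanding === 0) {
        console.log("All writes acknowledged, exiting");
        process.exit(0);
    }
}

var lineReader = readline.createInterface({
    input: fs.createReadStream('test.csv')
});

lineReader.on('line', function (line) {
    // Stand-in parsing; the real script builds the object with csvtojson.
    var record = { Report_ID: line.split(',')[0], raw: line };
    outstanding += 1;
    ref.child(record.Report_ID).set(record, function (err) {
        if (err) console.error(err);
        outstanding -= 1;
        maybeExit();
    });
});

lineReader.on('close', function () {
    doneReading = true;
    maybeExit();
});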
