2016-09-13 147 views
5

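Reading n lines of a big text file

The smallest file I have has more than 850k lines, and every line is of unknown length. The goal is to read n lines from this file in the browser. Reading it fully is not going to happen.

Here is the HTML, <input type="file" name="file" id="file">, and the JS I have: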

var n = 10; 
var reader = new FileReader(); 
reader.onload = function(progressEvent) { 
    // Entire file 
    console.log(this.result); 

    // By lines 
    var lines = this.result.split('\n'); 
    for (var line = 0; line < n; line++) { 
        console.log(lines[line]);
    }
}; 

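Obviously, the problem here is that it first tries to read the whole file and only then splits it by newline. So no matter what n is, it tries to read the whole file, and ends up reading nothing when the file is big.

How should I do it?

Note: I am willing to delete the whole function and start from scratch, as long as I can console.log() every line that is read.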


* "every line is of unknown length" -> means that the file looks something like this:

(0, (1, 2)) 
(1, (4, 5, 6)) 
(2, (7)) 
(3, (8)) 

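EDIT:

The way to go would be something like filereader api on big files, but I can't see how to modify that to read n lines of the file...

Using Uint8Array to string in Javascript too, one can continue from there: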

var view = new Uint8Array(fr.result); 
var string = new TextDecoder("utf-8").decode(view); 
console.log("Chunk " + string); 

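but a chunk may end in the middle of a line, so the last line is not read as a whole. How do you determine the lines afterwards? For example, here is what it printed: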

((7202), (u'11330875493', u'2554375661')) 
((1667), (u'9079074735', u'6883914476', 
+0

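*"...but that shouldn't matter"* What in heaven's name makes you think it doesn't matter? Without an index of where the lines start *and* the ability to read the file incrementally at a given index, it absolutely matters. –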

+0

@T.J.Crowder I updated my question with a clarification. Maybe I should delete that statement, you are right! – gsamaras

+0

More context is needed here. You are using HTML and JavaScript. Is this JavaScript running in a web browser? Or is this JavaScript executed in response to something like an HTML POST? – Alan

Answers

7

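The logic is very similar to what I wrote in my answer to filereader api on big files, except that you need to keep track of the number of lines processed so far (and also of the last line read so far, because it may not have ended yet). The next example works for any encoding that is compatible with UTF-8; if you need another encoding, look at the options for the TextDecoder constructor.

If you are certain that the input is ASCII (or any other single-byte encoding), then you can also skip the use of TextDecoder and directly read the input as text using the FileReader's readAsText method.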

// This is just an example of how to use the function below.
document.getElementById('start').onclick = function() {
    var file = document.getElementById('infile').files[0];
    if (!file) {
        console.log('No file selected.');
        return;
    }
    var maxlines = parseInt(document.getElementById('maxlines').value, 10);
    var lineno = 1;
    // readSomeLines is defined below.
    readSomeLines(file, maxlines, function(line) {
        console.log("Line: " + (lineno++) + line);
    }, function onComplete() {
        console.log('Read all lines');
    });
};
 

 
/**
 * Read up to and including |maxlines| lines from |file|.
 *
 * @param {Blob} file - The file to be read.
 * @param {integer} maxlines - The maximum number of lines to read.
 * @param {function(string)} forEachLine - Called for each line.
 * @param {function(error)} onComplete - Called when the end of the file
 *     is reached or when |maxlines| lines have been read.
 */
function readSomeLines(file, maxlines, forEachLine, onComplete) {
    var CHUNK_SIZE = 50000; // 50kb, arbitrarily chosen.
    var decoder = new TextDecoder();
    var offset = 0;
    var linecount = 0;
    var results = '';
    var fr = new FileReader();
    fr.onload = function() {
        // Use stream:true in case we cut the file
        // in the middle of a multi-byte character
        results += decoder.decode(fr.result, {stream: true});
        var lines = results.split('\n');
        results = lines.pop(); // In case the line did not end yet.
        linecount += lines.length;

        if (linecount > maxlines) {
            // Read too many lines? Truncate the results.
            lines.length -= linecount - maxlines;
            linecount = maxlines;
        }

        for (var i = 0; i < lines.length; ++i) {
            forEachLine(lines[i] + '\n');
        }
        offset += CHUNK_SIZE;
        seek();
    };
    fr.onerror = function() {
        onComplete(fr.error);
    };
    seek();

    function seek() {
        if (linecount === maxlines) {
            // We found enough lines.
            onComplete(); // Done.
            return;
        }
        if (offset !== 0 && offset >= file.size) {
            // We did not find all lines, but there are no more lines.
            // |results| is the remainder from lines.pop(); it is empty when
            // the file ends with a newline, so only report it if non-empty.
            if (results) {
                forEachLine(results);
            }
            onComplete(); // Done
            return;
        }
        var slice = file.slice(offset, offset + CHUNK_SIZE);
        fr.readAsArrayBuffer(slice);
    }
}

Read <input type="number" id="maxlines"> lines from
<input type="file" id="infile">.
<input type="button" id="start" value="Print lines to console">
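
The part that is easiest to get wrong is what happens at chunk boundaries, so here is a minimal sketch that isolates just that logic, without FileReader. The names createLineSplitter, push and end are made up for this illustration and are not part of the code above; the optional encoding argument is simply passed to the TextDecoder constructor, as mentioned earlier for inputs that are not UTF-8.

```javascript
// Hypothetical helper (illustration only): collects up to maxLines lines
// from byte chunks that may be cut anywhere, even inside a character.
function createLineSplitter(maxLines, encoding = 'utf-8') {
  const decoder = new TextDecoder(encoding);
  const lines = [];
  let pending = ''; // text after the last '\n' seen so far

  function take(text) {
    const parts = (pending + text).split('\n');
    pending = parts.pop(); // possibly incomplete last line
    for (const part of parts) {
      if (lines.length < maxLines) lines.push(part);
    }
  }

  return {
    // Feed one chunk; returns true while more lines are still wanted.
    push(bytes) {
      // stream: true keeps an incomplete multi-byte sequence in the decoder
      take(decoder.decode(bytes, { stream: true }));
      return lines.length < maxLines;
    },
    // Call once at the end of the input; adds a final line that has no '\n'.
    end() {
      take(decoder.decode()); // flush whatever the decoder still holds
      if (pending !== '' && lines.length < maxLines) lines.push(pending);
      pending = '';
      return lines;
    }
  };
}

// Usage: the first chunk ends between the two bytes of 'é'.
const bytes = new TextEncoder().encode('héllo\nwörld\nlast');
const splitter = createLineSplitter(2);
splitter.push(bytes.subarray(0, 2));
splitter.push(bytes.subarray(2, 9));
splitter.push(bytes.subarray(9));
console.log(splitter.end()); // the result is ['héllo', 'wörld']
```

Unlike readSomeLines, this sketch returns the lines without the trailing '\n' and collects them in an array instead of calling a callback; that is only to keep the example short.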

+0

I didn't really get how this reads the whole file when 'maxlines' is not provided by the user. Other than that, great! – gsamaras

+1

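@gsamaras When maxlines is set to an arbitrarily high value (e.g. 'Infinity'), all conditions that involve maxlines evaluate to false, so you can imagine that the if blocks containing them do not exist. Then it should be easy to see that 'seek' only returns when it has read past the end of the file ('offset >= file.size'). –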

2

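Streams are the feature!
The whatwg team is working out the last details of writable + readable streams, and they will be ready soon. Until then you can use web-streams-polyfill. They are also working on a way to get a ReadableStream from blobs [1]. But I have already created a way to get a blob in a streaming fashion: Screw-FileReader

Yesterday I also created a simple port of node-byline that works with web streams instead

So this could be as simple as: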

// Simulate a file
var csv =
`apple,1,$1.00
banana,4,$0.20
orange,3,$0.79`

var file = new Blob([csv])

var n = 0
var controller
var decoder = new TextDecoder
var stdout = new WritableStream({
    start(c) {
        controller = c
    },
    write(chunk, a) {
        // Calling controller.error will also put the byLine in an errored state
        // Causing the file stream to stop reading more data also
        if (n == 1) controller.error("don't need more lines")
        chunk = decoder.decode(chunk)
        console.log(`chunk[${n++}]: ${chunk}`)
    }
})

file
    .stream()
    .pipeThrough(byLine())
    // .pipeThrough(new TextDecoder) something like this will work eventually
    .pipeTo(stdout)

<script src="https://cdn.rawgit.com/creatorrr/web-streams-polyfill/master/dist/polyfill.min.js"></script>
<script src="https://cdn.rawgit.com/jimmywarting/Screw-FileReader/master/index.js"></script>

<!-- after a year or so you only need byLine -->
<script src="https://cdn.rawgit.com/jimmywarting/web-byline/master/index.js"></script>
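
For comparison, here is a rough sketch of the same idea without any of the scripts above. It assumes an environment where Blob.prototype.stream() and TextDecoderStream are available natively, which was not the case when this answer was written. The function name readFirstLines is made up for this example, and the line splitting is done by hand instead of with byLine.

```javascript
// Hypothetical helper (illustration only): resolves with at most maxLines
// lines of a Blob/File and stops reading as soon as enough lines were found.
async function readFirstLines(blob, maxLines) {
  const reader = blob
    .stream()                              // ReadableStream of Uint8Array chunks
    .pipeThrough(new TextDecoderStream())  // UTF-8 decoding across chunk borders
    .getReader();
  const lines = [];
  let pending = ''; // text after the last '\n' seen so far
  try {
    while (lines.length < maxLines) {
      const { value, done } = await reader.read();
      if (done) {
        // End of input: keep a final line that has no trailing '\n'.
        if (pending !== '') lines.push(pending);
        break;
      }
      const parts = (pending + value).split('\n');
      pending = parts.pop(); // possibly incomplete last line
      for (const part of parts) {
        if (lines.length === maxLines) break;
        lines.push(part);
      }
    }
  } finally {
    // Tell the source that no more data is needed.
    await reader.cancel();
  }
  return lines;
}

// Usage with the simulated file from above:
var sample = new Blob(['apple,1,$1.00\nbanana,4,$0.20\norange,3,$0.79'])
readFirstLines(sample, 2).then(function (lines) {
  console.log(lines) // the result is ['apple,1,$1.00', 'banana,4,$0.20']
})
```

Cancelling the reader plays the role that controller.error plays in the example above: it signals that the rest of the file does not need to be read.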

+1

Interesting approach, needless to say! :) – gsamaras

+0

Thanks, looking forward to the feature :) – Endless

+1

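Please do not encourage the use of 'innerHTML' with external input, since it may introduce security vulnerabilities. Also, 'document.body.innerHTML +=' is bad because it forces the whole document to be parsed and rendered again. Consider using 'element.insertAdjacentText' or 'document.createTextNode' + 'element.appendChild' instead. –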