2016-07-27 68 views

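Node.js fs cheerio: reading and writing multiple files

I have the following code, adapted from here, which uses Node.js and Cheerio to read an HTML file and split a large source file into small pieces. The code works for a single file.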

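Now I need to read several large HTML files, split each of them in turn, and write the resulting files to a folder. How can I read every file in a folder, split it, and write out the pieces?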

Here is the code:

var cheerio = require('cheerio'),
    fs = require('fs');

fs.readFile('./sourceHtml2/testone.html', 'utf8', dataLoaded);

function dataLoaded(err, data) {

    if (err) {
        throw err;
    }

    // declare $ locally so it does not leak into the global scope
    var $ = cheerio.load(data);

    $('#toplevel > div').each(function (i, elem) {

        var id = $(elem).attr('id'),
            filename = id + '.html',
            content = $.html(elem);

        fs.writeFile('./output2/' + filename, content, function (err) {
            if (err) {
                throw err;
            }
            console.log('Written html to ' + filename);
        });
    });
}

Here is my sample source file:

<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
    <head> 
    <title>Lorem Ipsum</title> 
    </head> 
    <body> 
    <div id="toplevel"> 
     <div id="1-1"> 
     <h1>HTML Ipsum Presents One</h1> 
     <p>
     <strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.
     </p>

     <h2>Header Level 2</h2> 
     <ol> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ol> 
     <h3>Header Level 3</h3> 
     <ul> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ul> 
     </div> 
     <div id="1-2"> 
     <h1>HTML Ipsum Presents Two</h1> 
     <p>
     <strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.
     </p>

     <h2>Header Level 2</h2> 
     <ol> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ol> 
     <blockquote> 
      <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus magna. Cras in mi at felis aliquet congue. Ut a est eget ligula molestie gravida. Curabitur massa. Donec eleifend, libero at sagittis mollis, tellus est malesuada tellus, 
      at luctus turpis elit sit amet quam. Vivamus pretium ornare est.</p> 
     </blockquote> 
     <h3>Header Level 3</h3> 
     <ul> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ul> 
     </div> 
     <div id="1-3"> 
     <h1>HTML Ipsum Presents Three</h1> 
     <p>
     <strong>Pellentesque habitant morbi tristique</strong> senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.
     </p>

     <h2>Header Level 2</h2> 
     <ol> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ol> 
     <blockquote> 
      <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus magna. Cras in mi at felis aliquet congue. Ut a est eget ligula molestie gravida. Curabitur massa. Donec eleifend, libero at sagittis mollis, tellus est malesuada tellus, 
      at luctus turpis elit sit amet quam. Vivamus pretium ornare est.</p> 
     </blockquote> 
     <h3>Header Level 3</h3> 
     <ul> 
      <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li> 
      <li>Aliquam tincidunt mauris eu risus.</li> 
     </ul> 
     </div> 
    </div> 
    </body> 
</html> 

Any help would be greatly appreciated.


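Take a look at [`fs.readdir`](https://nodejs.org/api/fs.html#fs_fs_readdir_path_options_callback). It gives you an array of all the files in a folder; you should be able to iterate over that array and pass each entry to your function.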
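
As a rough illustration of that suggestion, here is a minimal sketch that lists a folder with `fs.readdir` and keeps only the HTML files. The helper names `filterHtmlFiles` and `listHtmlFiles` are made up for this example, and the folder name is taken from the question; treat it as a starting point rather than a finished solution.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: keep only names ending in .htm or .html.
function filterHtmlFiles(names) {
    return names.filter(function (name) {
        // path.extname returns '' for names without an extension
        var ext = path.extname(name).toLowerCase();
        return ext === '.htm' || ext === '.html';
    });
}

// Hypothetical helper: list the HTML files in a folder, then call back.
function listHtmlFiles(dir, callback) {
    fs.readdir(dir, function (err, names) {
        if (err) {
            return callback(err);
        }
        callback(null, filterHtmlFiles(names));
    });
}

// Usage: report the HTML files found in the question's source folder.
listHtmlFiles('./sourceHtml2', function (err, files) {
    if (err) {
        console.error('Could not read folder: ' + err.message);
        return;
    }
    files.forEach(function (name) {
        console.log('Found ' + name);
    });
});
```

Each name passed to the callback is relative to the folder, so it has to be joined with the folder path before it is handed to `fs.readFile`.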

Answer


You need to process the files in the input directory as an array, and you also need to prevent filename collisions in the output folder.
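
To make the collision problem concrete: two source files can both contain a `div` with the id `1-1`, so naming the output after the id alone would overwrite one piece with the other. One way around it, sketched here with a hypothetical helper called `outputName`, is to prefix the id with the source file's base name.

```javascript
var path = require('path');

// Hypothetical helper: build an output name from the source file and the div id.
function outputName(sourceFile, id) {
    // strip the extension, whatever it is, to get the base name
    var base = path.basename(sourceFile, path.extname(sourceFile));
    return base + '_' + id + '.html';
}

console.log(outputName('testone.html', '1-1')); // testone_1-1.html
console.log(outputName('testtwo.html', '1-1')); // testtwo_1-1.html
```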

The code below addresses both issues. It reads HTML files (.htm and .html) from the 'input' subfolder and writes the generated files to the 'output' subfolder.

var cheerio = require('cheerio'),
    fs = require('fs'),
    path = require('path');

// process files found in the 'input' folder
fs.readdir('./input', 'utf8', findHtmlFiles);

function findHtmlFiles(err, files) {

    if (err) {
        throw err;
    }

    files.forEach(function (fullFilename) {
        // path.extname returns '' when there is no extension,
        // so such names are skipped instead of causing an error
        var ext = path.extname(fullFilename);
        // only process '.htm' and '.html' files
        if (ext === '.htm' || ext === '.html') {
            fs.readFile('./input/' + fullFilename, 'utf8', function (err, data) {
                if (err) {
                    throw err;
                }
                // add the file name to prevent collisions
                // in the output folder
                var fileData = {
                    file: path.basename(fullFilename, ext),
                    data: data
                };
                dataLoaded(null, fileData);
            });
        }
    });
}

function dataLoaded(err, fd) {

    // declare $ locally so it does not leak into the global scope
    var $ = cheerio.load(fd.data);

    $('#toplevel > div').each(function (i, elem) {

        var id = $(elem).attr('id'),
            filename = fd.file + '_' + id + '.html',
            content = $.html(elem);

        fs.writeFile('./output/' + filename, content, function (err) {
            if (err) {
                throw err;
            }
            console.log('Written html to ' + filename);
        });
    });
}

Sample console output (the order may vary, because the reads and writes are asynchronous):

Written html to testone_1-1.html 
Written html to testone_1-2.html 
Written html to testone_1-3.html 
Written html to testtwo_1-1.html 
Written html to testtwo_1-2.html 
Written html to testtwo_1-3.html 
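
The code above starts reading every file at once. If the files really have to be handled one after another, as the question puts it, a small series helper is one option. The sketch below is an assumption about how that could look; `processInSeries` and its `worker` callback are hypothetical names, not part of Node.js or Cheerio.

```javascript
// Hypothetical helper: run worker(item, next) for each item, one at a time.
// Stops at the first error and reports it to done.
function processInSeries(items, worker, done) {
    var index = 0;

    function next(err) {
        if (err || index >= items.length) {
            return done(err || null);
        }
        var item = items[index];
        index += 1;
        worker(item, next);
    }

    next();
}

// Example: handle two file names in order. A real worker would read the
// file, split the HTML, and call next() only after its writes have finished.
processInSeries(['testone.html', 'testtwo.html'], function (name, next) {
    console.log('Processing ' + name);
    next();
}, function (err) {
    if (err) {
        throw err;
    }
    console.log('All files processed');
});
```

Processing in series is slower than starting everything at once, but it keeps only one large file in memory at a time.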

Thank you very much, Dan Nagle. It works great. – EBamba