Merge N log files, keeping chronological order

I have N different log files coming from N different services running on our device. I want to merge the N files into a single file, keeping chronological order. The file sizes can range from a few KB to a few GB.

The N log files all share the same format, which looks like this:

********** LOGGING SESSION STARTED ************ 
* Hmsoa Version: 2.4.0.12 
* Exe Path: c:\program files (x86)\silicon biosystems\deparray300a_driver\deparray300a_driver.exe 
* Exe Version: 1.6.0.154 
************************************************ 


TIME = 2017/02/01 11:12:12,180 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'Connect'->Enter; 
TIME = 2017/02/01 11:12:12,196 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'Connect'->Exit=0; 
TIME = 2017/02/01 11:12:12,196 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'CCisProxyLocal CONNECT - ok'->Enter; 
TIME = 2017/02/01 11:12:12,196 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'CRecoveryAxesProxyLocal CONNECT - ok'->Enter; 
TIME = 2017/02/01 11:12:12,196 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'CAmplifierProxyLocalV3 CONNECT - ok'->Enter; 
TIME = 2017/02/01 11:12:12,196 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'SYSTEM_DIAGNOSIS_GET'->Enter; 
TIME = 2017/02/01 11:12:12,211 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'SYSTEM_DIAGNOSIS_GET'->Exit=0; 
TIME = 2017/02/01 11:12:12,211 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'LBL_SQUARE_SET'->Enter; 
TIME = 2017/02/01 11:12:12,219 ; THID = 4924; CAT = ; LVL = 1000; LOG = API 'LBL_SQUARE_SET'->Exit=0; 

Since I already have N separate files, what I have done so far is an external-merge approach, reading one line at a time from each file and comparing timestamps (e.g. 11:12:12,180 becomes 11*3600000 + 12*60000 + 12*1000 + 180 = 40,332,180 ms):

#include "stdafx.h" 
#include "boost/regex.hpp" 
#include "boost/lexical_cast.hpp" 
#include "boost\filesystem.hpp" 
#include <string> 
#include <fstream> 
#include <iostream> 
#include <algorithm> 
#include <sstream> 
#include <climits> 
#include <ctime> 
namespace fs = boost::filesystem; 

static const boost::regex expression(R"(^(?:(?:TIME\s=\s\d{4}\/\d{2}\/\d{2}\s)|(?:@))([0-9:.,]+))"); 
static const boost::regex nameFileEx(R"(^[\d\-\_]+(\w+\s?\w+|\w+))"); 
static const std::string path("E:\\2017-02-01"); 
//static const std::string path("E:\\TestLog"); 

unsigned long time2Milleseconds(const std::string & time) 
{ 
    // Parse "hh:mm:ss,ms", e.g. "11:12:12,180" -> 40332180. 
    int a = 0, b = 0, c = 0, d = 0; 
    if (sscanf_s(time.c_str(), "%d:%d:%d,%d", &a, &b, &c, &d) >= 3) 
     return a * 3600000 + b * 60000 + c * 1000 + d; 
    return ULONG_MAX; // no timestamp found 
} 

// Skip the 7-line session header at the top of every log file. 
void readAllFilesUntilLine7(std::vector<std::pair<std::ifstream, std::string>> & vifs) 
{ 
    std::string line; 
    for (std::size_t i = 0; i < vifs.size(); ++i) 
    { 
     int lineNumber = 0; 
     while (lineNumber != 7 && std::getline(vifs[i].first, line)) 
     { 
      ++lineNumber; 
     } 
    } 
} 

// Read the next line from file 'index' and update its buffered line and 
// timestamp. Lines without a timestamp (continuations) keep the previous 
// time. On EOF, mark the file as exhausted and decrement the active count. 
void checkRegex(std::vector<std::pair<std::ifstream, std::string>> & vifs, std::vector<unsigned long> & logTime, std::vector<std::string> & lines, int index, int & counter) 
{ 
    std::string line; 
    boost::smatch what; 
    if (std::getline(vifs[index].first, line)) 
    { 
     if (boost::regex_search(line, what, expression)) 
     { 
      logTime[index] = time2Milleseconds(what[1]); 
     } 
     lines[index] = line; 
    } 
    else 
    { 
     --counter; 
     logTime[index] = ULONG_MAX; 
    } 
} 

// K-way merge by linear scan: hold one line per file, repeatedly write the 
// line with the smallest timestamp and refill from the file it came from. 
void mergeFiles(std::vector<std::pair<std::ifstream, std::string>> & vifs, std::vector<unsigned long> & logTime, std::vector<std::string> & lines, std::ofstream & file, int & counter) 
{ 
    int index = 0; 
    for (std::size_t i = 0; i < vifs.size(); ++i) 
    { 
     checkRegex(vifs, logTime, lines, static_cast<int>(i), counter); 
    } 
    index = static_cast<int>(std::min_element(logTime.begin(), logTime.end()) - logTime.begin()); 
    file << lines[index] << " --> " << vifs[index].second << "\n"; 
    while (true) 
    { 
     checkRegex(vifs, logTime, lines, index, counter); 
     index = static_cast<int>(std::min_element(logTime.begin(), logTime.end()) - logTime.begin()); 
     if (0 == counter) 
      break; 
     file << lines[index] << " --> " << vifs[index].second << "\n"; 
    } 
} 

int main() 
{ 
    clock_t begin = clock(); 
    // Count the regular files in the directory so the vectors can be pre-sized. 
    int cnt = std::count_if(fs::directory_iterator(path), fs::directory_iterator(), static_cast<bool(*)(const fs::path&)>(fs::is_regular_file)); 
    std::vector<std::pair<std::ifstream, std::string>> vifs(cnt); 
    int index = 0; 
    boost::smatch what; 
    std::string file; 
    // Open every log file, pairing each stream with the service name 
    // extracted from its filename. 
    for (fs::directory_iterator d(path); d != fs::directory_iterator(); ++d) 
    { 
     if (fs::is_regular_file(d->path())) 
     { 
      file = d->path().filename().string(); 
      if (boost::regex_search(file, what, nameFileEx)) 
      { 
       vifs[index++] = std::make_pair(std::ifstream(d->path().string()), what[1]); 
      } 
     } 
    } 
    std::vector<unsigned long> logTime(cnt, ULONG_MAX); 
    std::vector<std::string> lines(cnt); 
    std::ofstream outFile(path + "\\TestLog.txt"); 
    readAllFilesUntilLine7(vifs); 
    mergeFiles(vifs, logTime, lines, outFile, cnt); 
    outFile.close(); 
    clock_t end = clock(); 
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC; 
    std::cout << "Elapsed time = " << elapsed_secs << "\n"; 
    return 0; 
} 

It does what it is supposed to do, but it is slow: merging 82 files ranging in size from 1 KB to 250 MB into a final file of more than 6,000,000 lines takes 70 minutes.

How can I speed up the algorithm? Any help is greatly appreciated!

UPDATE

I have also implemented a heap-based version:

Data.h:

#pragma once 

#include <string> 
#include <windows.h> // for DWORD / ULONG 

// One buffered log line: its source file index, its text and its timestamp key. 
class Data 
{ 
public: 
    Data(DWORD index, 
     const std::string & line, 
     ULONG time); 
    ~Data(); 
    inline ULONG getTime() const { return time; } 
    inline DWORD getIndex() const { return index; } 
    inline const std::string & getLine() const { return line; } 
private: 
    DWORD index; 
    std::string line; 
    ULONG time; 
}; 

// Orders Data by timestamp so that std::priority_queue acts as a min-heap. 
class Compare 
{ 
public: 
    bool operator()(const Data & lhs, const Data & rhs) const { return lhs.getTime() > rhs.getTime(); } 
}; 

Data.cpp:

#include "stdafx.h" 
#include "Data.h" 


Data::Data(DWORD i_index, 
      const std::string & i_line, 
      ULONG i_time) 
    : index(i_index) 
    , line(i_line) 
    , time(i_time) 
{ 
} 


Data::~Data() 
{ 
} 

Main.cpp:

#include "stdafx.h" 
#include "boost/regex.hpp" 
#include "boost/lexical_cast.hpp" 
#include "boost\filesystem.hpp" 
#include <string> 
#include <fstream> 
#include <iostream> 
#include <algorithm> 
#include <sstream> 
#include <climits> 
#include <ctime> 
#include <queue> 
#include "Data.h" 
namespace fs = boost::filesystem; 

static const boost::regex expression(R"(^(?:(?:TIME\s=\s\d{4}\/\d{2}\/\d{2}\s)|(?:@))([0-9:.,]+))"); 
static const boost::regex nameFileEx(R"(^[\d\-\_]+(\w+\s?\w+|\w+))"); 
static const std::string path("E:\\2017-02-01"); 
//static const std::string path("E:\\TestLog"); 

unsigned long time2Milleseconds(const std::string & time) 
{ 
    // Parse "hh:mm:ss,ms", e.g. "11:12:12,180" -> 40332180. 
    int a = 0, b = 0, c = 0, d = 0; 
    if (sscanf_s(time.c_str(), "%d:%d:%d,%d", &a, &b, &c, &d) >= 3) 
     return a * 3600000 + b * 60000 + c * 1000 + d; 
    return ULONG_MAX; // no timestamp found 
} 

// Skip a file's header and push its first timestamped line onto the heap. 
void initializeHeap(std::ifstream & ifs, std::priority_queue<Data, std::vector<Data>, Compare> & myHeap, const int index) 
{ 
    ULONG time; 
    std::string line; 
    boost::smatch what; 
    bool match = false; 
    while (!match && std::getline(ifs, line)) 
    { 
     if (boost::regex_search(line, what, expression)) 
     { 
      time = time2Milleseconds(what[1]); 
      myHeap.push(Data(index, line, time)); 
      match = true; 
     } 
    } 
} 

// Read the next line from file 'index' and push it with its timestamp; 
// lines without a timestamp (continuations) inherit the caller's 'time'. 
void checkRegex(std::vector<std::pair<std::ifstream, std::string>> & vifs, std::priority_queue<Data, std::vector<Data>, Compare> & myHeap, ULONG time, const int index) 
{ 
    std::string line; 
    boost::smatch what; 
    if (std::getline(vifs[index].first, line)) 
    { 
     if (boost::regex_search(line, what, expression)) 
     { 
      time = time2Milleseconds(what[1]); 
     } 
     myHeap.push(Data(index, line, time)); 
    } 
} 

// K-way merge: repeatedly pop the earliest line from the heap, write it out 
// and refill the heap from the file that line came from. 
void mergeFiles(std::vector<std::pair<std::ifstream, std::string>> & vifs, std::priority_queue<Data, std::vector<Data>, Compare> & myHeap, std::ofstream & file) 
{ 
    int index = 0; 
    ULONG time = 0; 
    while (!myHeap.empty()) 
    { 
     index = myHeap.top().getIndex(); 
     time = myHeap.top().getTime(); 
     file << myHeap.top().getLine() << " --> " << vifs[index].second << "\n"; 
     myHeap.pop(); 
     checkRegex(vifs, myHeap, time, index); 
    } 
} 

int main() 
{ 
    clock_t begin = clock(); 
    // Count the regular files in the directory so the vectors can be pre-sized. 
    int cnt = std::count_if(fs::directory_iterator(path), fs::directory_iterator(), static_cast<bool(*)(const fs::path&)>(fs::is_regular_file)); 
    std::priority_queue<Data, std::vector<Data>, Compare> myHeap; 
    std::vector<std::pair<std::ifstream, std::string>> vifs(cnt); 
    int index = 0; 
    boost::smatch what; 
    std::string file; 
    // Open every log file, keep the service name from its filename and 
    // seed the heap with that file's first timestamped line. 
    for (fs::directory_iterator d(path); d != fs::directory_iterator(); ++d) 
    { 
     if (fs::is_regular_file(d->path())) 
     { 
      file = d->path().filename().string(); 
      if (boost::regex_search(file, what, nameFileEx)) 
      { 
       vifs[index] = std::make_pair(std::ifstream(d->path().string()), what[1]); 
       initializeHeap(vifs[index].first, myHeap, index); 
       ++index; 
      } 
     } 
    } 
    std::ofstream outFile(path + "\\TestLog.txt"); 
    mergeFiles(vifs, myHeap, outFile); 
    outFile.close(); 
    clock_t end = clock(); 
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC; 
    std::cout << "Elapsed time = " << elapsed_secs << "\n"; 
    return 0; 
} 

After all this work, I realized that yesterday I had been running the program in Debug. Launching both implementations in Release, I got the following results:

  • Vector implementation: ~25 seconds
  • Heap implementation: ~27 seconds

So either my heap implementation is not optimized, or the two implementations are simply equal in running time.
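One idea I have not measured yet is giving each input stream a larger buffer, in case the merge is I/O-bound. Below is a minimal sketch of what I mean; the 1 MiB size is an arbitrary guess, pubsetbuf must be called before open() to take effect on common implementations, and the buffer has to outlive the stream:

#include <fstream> 
#include <string> 
#include <vector> 

// Sketch: open an input file with a caller-provided stream buffer. 
std::ifstream openBuffered(const std::string & filePath, std::vector<char> & buf) 
{ 
    std::ifstream ifs; 
    ifs.rdbuf()->pubsetbuf(buf.data(), buf.size()); // must precede open() 
    ifs.open(filePath); 
    return ifs; 
} 

// Hypothetical usage: 
// std::vector<char> buf(1 << 20); // 1 MiB 
// std::ifstream in = openBuffered(path + "\\SomeService.log", buf); 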

Is there anything else I can do to speed up execution?

+2

It's worth mentioning that [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html) has a '-m'/'--merge' option that merges already-sorted files. Using it might be easier than writing a new program. – ephemient

+1

I think, as a first step, you should try to determine whether I/O or processing (CPU) is the bottleneck here. (iotop?) –

+0

'file << lines[index] << std::endl;' could be a problem. std::endl flushes the internal buffer (at least in the C++ standard library). Better to use '\n'. –
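For illustration, the difference the comment describes (here 'out' stands for any std::ostream and 'line' for a std::string):

out << line << std::endl; // writes '\n', then flushes the stream on every call 
out << line << '\n';      // writes '\n' only; the buffer is flushed when it fills 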

Answers

2

This can be done faster and with little memory. First consider:

  • Read one line from each file (so only N lines are in memory at any time).
  • Find the smallest of the N lines and output it.
  • In memory, replace the value just output with the next line from the file the output came from (watch out for the EOF case).

If M is the length of your output file (i.e. the length of all the logs merged), then the simple implementation is O(N * M): for the 82 files and 6,000,000+ output lines in the question, that is roughly 82 * 6,000,000, on the order of 5 * 10^8 comparisons.

However, the above can be improved by using a heap, which reduces the time to O(M log N). That is, keep the N in-memory elements on a heap. Pop from the top to output the smallest element. Then, when you read a new line, simply push that line onto the heap.
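A minimal sketch of the heap-based merge described above. All names here ('Entry', 'ByTime', 'parseTime', 'kWayMerge') are illustrative, not the question's code; 'parseTime' stands in for the question's time2Milleseconds, and lines without a timestamp get key 0 in this sketch, so the session headers would have to be skipped first (as the question's readAllFilesUntilLine7 does):

#include <cstdio> 
#include <fstream> 
#include <queue> 
#include <string> 
#include <vector> 

// One buffered line: its precomputed timestamp key and its source file. 
struct Entry 
{ 
    unsigned long time; 
    std::size_t source; 
    std::string line; 
}; 

// Orders entries so the priority_queue yields the smallest timestamp first. 
struct ByTime 
{ 
    bool operator()(const Entry & a, const Entry & b) const 
    { 
     return a.time > b.time; 
    } 
}; 

// Stand-in for the question's timestamp extraction ("TIME = ... hh:mm:ss,ms"). 
unsigned long parseTime(const std::string & line) 
{ 
    int h = 0, m = 0, s = 0, ms = 0; 
    std::sscanf(line.c_str(), "TIME = %*d/%*d/%*d %d:%d:%d,%d", &h, &m, &s, &ms); 
    return ((h * 60UL + m) * 60UL + s) * 1000UL + ms; 
} 

void kWayMerge(std::vector<std::ifstream> & inputs, std::ostream & out) 
{ 
    std::priority_queue<Entry, std::vector<Entry>, ByTime> heap; 
    std::string line; 
    // Seed the heap with the first line of every file: N elements in total. 
    for (std::size_t i = 0; i < inputs.size(); ++i) 
     if (std::getline(inputs[i], line)) 
      heap.push(Entry{parseTime(line), i, line}); 
    // Each of the M output lines costs one pop and at most one push: O(M log N). 
    while (!heap.empty()) 
    { 
     Entry e = heap.top(); 
     heap.pop(); 
     out << e.line << '\n'; 
     if (std::getline(inputs[e.source], line)) 
      heap.push(Entry{parseTime(line), e.source, line}); 
    } 
} 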

+0

Apart from the heap, that is exactly what the code already does, isn't it? –

+0

@DanielJour The speed optimization is the heap. The memory optimization is having only 'N' lines in memory, right? – TheGreatContini

+0

I doubt the heap will bring a real improvement... the example in this question has 82 files. A linear search for the smallest of 82 lines will not be much slower than building a heap and managing its insertions and removals. The code in the question already holds only 'N' lines: 'std::vector<std::string> lines(cnt);' where 'cnt' is the number of files, 'N'. –
