String manipulation performance problem

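We recently ran into a performance problem in a piece of code that generates XML, and I would like to share the experience here. It is a bit long, so please bear with me.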

We are writing a simple XML document containing a number of items. Each item can have 5-10 elements. The structure looks like this:

<Root> 
    <Item> 
     <Element1Key>Element1Val</Element1Key> 
     <Element2Key>Element2Val</Element2Key> 
     <Element3Key>Element3Val</Element3Key> 
     <Element4Key>Element4Val</Element4Key> 
     <Element5Key>Element5Val</Element5Key> 
    </Item> 
    <Item> 
     <Element1Key>Element1Val</Element1Key> 
     <Element2Key>Element2Val</Element2Key> 
     <Element3Key>Element3Val</Element3Key> 
     <Element4Key>Element4Val</Element4Key> 
     <Element5Key>Element5Val</Element5Key> 
    </Item> 
</Root>

The code that generates the XML (in simplified form, as global functions):

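#include <ctime> 
#include <iostream> 
#include <string> 

// Defined further down; replaces every occurrence of `from` in `str` with `to`. 
void replaceAll(std::string& str, const std::string& from, const std::string& to); 

// Appends <key>value</key> to aStr_inout. 
void addElement(std::string& aStr_inout, const std::string& aKey_in, const std::string& aValue_in) 
{ 
    aStr_inout += "<"; 
    aStr_inout += aKey_in; 
    aStr_inout += ">"; 
    aStr_inout += aValue_in; 
    aStr_inout += "</"; 
    aStr_inout += aKey_in; 
    aStr_inout += ">"; 
} 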

void PrepareXML_Original() 
{ 
    clock_t commence,complete; 
    commence=clock(); 

    std::string anXMLString; 
    anXMLString += "<Root>"; 

    for(int i = 0; i < 200; i++) 
    { 
     anXMLString += "<Item>"; 
     addElement(anXMLString, "Elemem1Key", "Elemem1Value"); 
     addElement(anXMLString, "Elemem2Key", "Elemem2Value"); 
     addElement(anXMLString, "Elemem3Key", "Elemem3Value"); 
     addElement(anXMLString, "Elemem4Key", "Elemem4Value"); 
     addElement(anXMLString, "Elemem5Key", "Elemem5Value"); 
     anXMLString += "</Item>"; 


     replaceAll(anXMLString, "&", "&amp;"); 
     replaceAll(anXMLString, "'", "&apos;"); 
     replaceAll(anXMLString, "\"", "&quot;"); 
     replaceAll(anXMLString, "<", "&lt;"); 
     replaceAll(anXMLString, ">", "&gt;"); 
    } 
    anXMLString += "</Root>"; 

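    complete=clock(); 
    // clock() counts in ticks; convert to milliseconds before printing. 
    double lTime = 1000.0 * (complete - commence) / CLOCKS_PER_SEC; 
    std::cout << "Time taken for the operation (ms): " << lTime << std::endl; 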
}

The replaceAll() function replaces special characters with their encoded (entity) form. It is given below.

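// Replaces every occurrence of `from` in `str` with `to`, scanning left to right. 
void replaceAll(std::string& str, const std::string& from, const std::string& to) 
{ 
    if (from.empty()) 
        return; // an empty pattern matches at every position; bail out instead of looping 
    size_t start_pos = 0; 
    while((start_pos = str.find(from, start_pos)) != std::string::npos) 
    { 
        str.replace(start_pos, from.length(), to); 
        start_pos += to.length(); // continue after the replacement so it is not rescanned 
    } 
}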

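In this minimal example I encode 200 items, but in a real scenario there could be many more. The code above took about 20 seconds to create the XML, which is far beyond any acceptable limit. What could be the problem, and how can the performance be improved?

Note: the choice of string class does not make much difference. I tested the same logic with another string implementation, MFC CString, and got similar (even worse) results. Also, I do not want to use a DOM XML parser here to build the XML in a better way. The question is not specific to XML.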

Source

2012-07-09 PermanentGuest

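What was the output when you ran a profiler, and what exactly did it point to as the bottleneck? Allocations? Copying of data? – PlasmaHH 2012-07-09 11:23:55

@PlasmaHH: I did not use a profiler. Just from the timing at function entry I was able to conclude that adding each item is what takes the time. Please see my answer below. With the modification described there, I was able to improve performance substantially. – PermanentGuest 2012-07-09 13:14:41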

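If you can estimate the length of the result string (anXMLString) before creating the content, you can allocate enough buffer space for the string up front. When the buffer is large enough, reallocation and copying (of the destination string) will not happen.

Like this: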

std::string anXMLString; 
anXMLString.reserve(size);

I don't know whether std::string has to search for the append position or keeps the string's length in memory.

Source

2012-07-09 11:52:40 SKi

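I realised that the problem was probably due to the fact that the same string keeps getting longer, which leads to the following:

1. String concatenation becomes more expensive as the string grows.
2. Character replacement happens on an ever larger string as the loop proceeds, and gets slower and slower.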

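To solve this, I used a temporary string to hold the encoded XML of each individual item, and at the end of each loop iteration I append this small piece of XML to the main string. The modified loop is as follows.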

for(int i = 0; i < 200; i++) 
{ 
    std::string anItemString; // Create a new string for the individual Item entry 
    anItemString += "<Item>"; 
    addElement(anItemString, "Elemem1Key", "Elemem1Value"); 
    addElement(anItemString, "Elemem2Key", "Elemem2Value"); 
    addElement(anItemString, "Elemem3Key", "Elemem3Value"); 
    addElement(anItemString, "Elemem4Key", "Elemem4Value"); 
    addElement(anItemString, "Elemem5Key", "Elemem5Value"); 
    anItemString += "</Item>"; 


    replaceAll(anItemString, "&", "&amp;"); 
    replaceAll(anItemString, "'", "&apos;"); 
    replaceAll(anItemString, "\"", "&quot;"); 
    replaceAll(anItemString, "<", "&lt;"); 
    replaceAll(anItemString, ">", "&gt;"); 

    anXMLString += anItemString; // Do all the operations on the new string and finally append to the main string. 
}

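This improved the performance of the XML creation: the time required dropped to just 17 milliseconds!

So the lesson I learned is: when building a larger result, split the work into sub-operations, collect the result of each sub-operation in a new string, and append it once to the overall result. I am not sure whether this is already a known pattern or has a name.

Since this site offers the option of sharing experience in Q&A form, I wanted to make use of it. Any comments or improvements are welcome.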

Source

2012-07-09 11:16:03 PermanentGuest

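I'm not sure, but I think that when you are going to do a lot of string concatenation, using stringstream instead of the string + operator would be more efficient. But I have never run a test to check the difference. – 2012-07-09 11:21:05

@W.Goeman It obviously depends on each implementation, but the implementations of 'stringstream' I am familiar with actually insert by appending to a 'std::string', so they would not improve performance. – 2012-07-09 12:36:04

Another thing that _might_ improve things is using 'std::vector<char>' rather than 'std::string'. 'std::vector' is required to use an exponential growth pattern so that 'push_back' has amortized constant complexity; a lot of early implementations of 'std::string' used linear growth and became very slow once the string got very large. (As I said, this _might_ make a difference, or it might not: I don't know much about what goes on in current string implementations.) – 2012-07-09 12:39:46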
