散列表的性能，爲什麼C++最慢？

對散列進行了簡單的性能測試，看來C++版本比perl版本和golang版本都慢。散列表的性能，爲什麼C++最慢？

perl的版本耗時約200毫秒，
C++版本了280毫秒。
golang版本耗時56毫秒。

我與酷睿（TM）i7-2670QM CPU @ 2.20GHz，Ubuntu的14.04.3LTS PC，

任何想法？

perl的版本

use Time::HiRes qw(usleep ualarm gettimeofday tv_interval nanosleep 
         clock_gettime clock_getres clock_nanosleep clock 
         stat); 
sub getTS { 
    my ($seconds, $microseconds) = gettimeofday; 
    return $seconds + (0.0+ $microseconds)/1000000.0; 
} 
my %mymap; 
$mymap{"U.S."} = "Washington"; 
$mymap{"U.K."} = "London"; 
$mymap{"France"} = "Paris"; 
$mymap{"Russia"} = "Moscow"; 
$mymap{"China"} = "Beijing"; 
$mymap{"Germany"} = "Berlin"; 
$mymap{"Japan"} = "Tokyo"; 
$mymap{"China"} = "Beijing"; 
$mymap{"Italy"} = "Rome"; 
$mymap{"Spain"} = "Madrad"; 
$x = ""; 
$start = getTS(); 
for ($i=0; $i<1000000; $i++) { 
    $x = $mymap{"China"}; 
} 
printf "took %f sec\n", getTS() - $start;

C++版本

#include <iostream> 
#include <string> 
#include <unordered_map> 
#include <sys/time.h> 

double getTS() { 
    struct timeval tv; 
    gettimeofday(&tv, NULL); 
    return tv.tv_sec + tv.tv_usec/1000000.0; 
} 
using namespace std; 
int main() { 
    std::unordered_map<std::string,std::string> mymap; 

    // populating container: 
    mymap["U.S."] = "Washington"; 
    mymap["U.K."] = "London"; 
    mymap["France"] = "Paris"; 
    mymap["Russia"] = "Moscow"; 
    mymap["China"] = "Beijing"; 
    mymap["Germany"] = "Berlin"; 
    mymap["Japan"] = "Tokyo"; 
    mymap["China"] = "Beijing"; 
    mymap["Italy"] = "Rome"; 
    mymap["Spain"] = "Madrad"; 

    double start = getTS(); 
    string x; 
    for (int i=0; i<1000000; i++) { 
     mymap["China"]; 
    } 
    printf("took %f sec\n", getTS() - start); 
    return 0; 
}

Golang版本

package main 

import "fmt" 
import "time" 

func main() { 
    var x string 
    mymap := make(map[string]string) 
    mymap["U.S."] = "Washington"; 
    mymap["U.K."] = "London"; 
    mymap["France"] = "Paris"; 
    mymap["Russia"] = "Moscow"; 
    mymap["China"] = "Beijing"; 
    mymap["Germany"] = "Berlin"; 
    mymap["Japan"] = "Tokyo"; 
    mymap["China"] = "Beijing"; 
    mymap["Italy"] = "Rome"; 
    mymap["Spain"] = "Madrad"; 
    t0 := time.Now() 
    sum := 1 
    for sum < 1000000 { 
     x = mymap["China"] 
     sum += 1 
    } 
    t1 := time.Now() 
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0)) 
    fmt.Println(x) 
}

更新1

要IMPRO ve C++版本，將x = mymap["China"];更改爲mymap["China"];，但性能差異很小。 g++ -std=c++11 unorderedMap.cc：

更新2

我編譯沒有任何優化時，得到了原來的結果。隨着「-02」的優化，這3

要刪除的可能char*到string構造函數調用，我創建了一個字符串常量成本只有大約一半的時間（150毫秒）

更新。時間降低到約220毫秒（編譯時沒有優化）。感謝@ neil-kirk的建議，優化（-O2標誌），時間約爲80ms。

double start = getTS(); string x = "China"; for (int i=0; i<1000000; i++) { mymap[x]; }

更新4

感謝@斯蒂芬 - 烏爾裏希誰指出存在的Perl版本語法錯誤。我改變了它。表演編號大約是150ms。

更新5

看來執行的指令的數量至關重要。使用命令valgrind --tool=cachegrind <cmd>

對於圍棋版本

$ valgrind --tool=cachegrind ./te1 ==2103== Cachegrind, a cache and branch-prediction profiler ==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al. ==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info ==2103== Command: ./te1 ==2103== --2103-- warning: L3 cache found, using its data for the LL simulation. The call took 1.647099s to run. Beijing ==2103== ==2103== I refs: 255,763,381 ==2103== I1 misses: 3,709 ==2103== LLi misses: 2,743 ==2103== I1 miss rate: 0.00% ==2103== LLi miss rate: 0.00% ==2103== ==2103== D refs: 109,437,132 (77,838,331 rd + 31,598,801 wr) ==2103== D1 misses: 352,474 ( 254,714 rd + 97,760 wr) ==2103== LLd misses: 149,260 ( 96,250 rd + 53,010 wr) ==2103== D1 miss rate: 0.3% ( 0.3% + 0.3% ) ==2103== LLd miss rate: 0.1% ( 0.1% + 0.1% ) ==2103== ==2103== LL refs: 356,183 ( 258,423 rd + 97,760 wr) ==2103== LL misses: 152,003 ( 98,993 rd + 53,010 wr) ==2103== LL miss rate: 0.0% ( 0.0% + 0.1% )

對於C++優化版本（無優化標誌）

$ valgrind --tool=cachegrind ./a.out ==2180== Cachegrind, a cache and branch-prediction profiler ==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al. ==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info ==2180== Command: ./a.out ==2180== --2180-- warning: L3 cache found, using its data for the LL simulation. took 64.657681 sec ==2180== ==2180== I refs: 5,281,474,482 ==2180== I1 misses: 1,710 ==2180== LLi misses: 1,651 ==2180== I1 miss rate: 0.00% ==2180== LLi miss rate: 0.00% ==2180== ==2180== D refs: 3,170,495,683 (1,840,363,429 rd + 1,330,132,254 wr) ==2180== D1 misses: 12,055 ( 10,374 rd + 1,681 wr) ==2180== LLd misses: 7,383 ( 6,132 rd + 1,251 wr) ==2180== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==2180== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==2180== ==2180== LL refs: 13,765 ( 12,084 rd + 1,681 wr) ==2180== LL misses: 9,034 ( 7,783 rd + 1,251 wr) ==2180== LL miss rate: 0.0% ( 0.0% + 0.0% )

對於C++優化版本

$ valgrind --tool=cachegrind ./a.out ==2157== Cachegrind, a cache and branch-prediction profiler ==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al. ==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info ==2157== Command: ./a.out ==2157== --2157-- warning: L3 cache found, using its data for the LL simulation. took 9.419447 sec ==2157== ==2157== I refs: 1,451,459,660 ==2157== I1 misses: 1,599 ==2157== LLi misses: 1,549 ==2157== I1 miss rate: 0.00% ==2157== LLi miss rate: 0.00% ==2157== ==2157== D refs: 430,486,197 (340,358,108 rd + 90,128,089 wr) ==2157== D1 misses: 12,008 ( 10,337 rd + 1,671 wr) ==2157== LLd misses: 7,372 ( 6,120 rd + 1,252 wr) ==2157== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==2157== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==2157== ==2157== LL refs: 13,607 ( 11,936 rd + 1,671 wr) ==2157== LL misses: 8,921 ( 7,669 rd + 1,252 wr) ==2157== LL miss rate: 0.0% ( 0.0% + 0.0% )

來源

2015-11-27 packetie

C++實現是否有可能在每次查找時都構造一個新的「std :: string」？ – dreamlax

是將密鑰緩存在for循環之外的本地字符串變量中。 –

您可以檢查C++和Go版本的彙編輸出嗎？編譯器完全有可能是聰明的，並且做一些類似於消除循環的東西，因爲它沒有副作用。 –

這是一個猜測：

unordered_map :: operator []需要一個字符串參數。你正在提供一個char *。如果不進行優化，C++版本可能會調用std :: string（char *）構造函數一百萬次，以便將「中國」變成std :: string。 Go的語言規範可能會將「字符串文字」與字符串的類型相同，因此不需要調用構造函數。

通過優化，字符串構造函數將被掛起並且不會看到相同的問題。或者很可能不會生成任何代碼，除了兩次系統調用來獲得時間和系統調用來打印差異之外，因爲最終它們都不起作用。

要確認這一點，您必須真正查看正在生成的組件。這將是一個編譯器選項。請參閱this問題以瞭解GCC所需的標誌。

來源

2015-11-27 05:22:45

刪除額外的構造函數調用幫助，但似乎C++版本仍然相當緩慢。謝謝！ – packetie

我仍然很想知道Go程序集和C++程序集有什麼不同。 –

從你的Perl碼（前你試圖修復它）：

@mymap =(); 
$mymap["U.S."] = "Washington"; 
$mymap["U.K."] = "London";

這不是一個地圖，一個數組。哈希映射的語法是：

%mymap; 
    $mymap{"U.S."} = ....

因此您有效要做的就是創建一個數組，而不是一個哈希表和訪問元素0所有的時間。請使用use strict;和use warnings;所有的時間用Perl和甚至警告簡單的語法檢查會顯示你的問題：

perl -cw x.pl 
Argument "U.S." isn't numeric in array element at x.pl line 9. 
Argument "U.K." isn't numeric in array element at x.pl line 10.

除此之外基準的主要部分有效地沒有任何用處（分配變量和從不使用它），一些編譯器可以檢測到它並簡單地優化它。

如果你想查詢你的Perl程序生成的代碼，你會看到：

$ perl -MO=Deparse x.pl 
@mymap =(); 
$mymap[0] = 'Washington'; 
$mymap[0] = 'London'; 
... 
for ($i = 0; $i < 1000000; ++$i) { 
    $x = $mymap[0]; 
}

也就是說它檢測到的問題在編譯時間和到數組索引訪問來替換它0

因此，無論何時你需要做基準測試：

檢查你的程序是否確實按照它應該做的。
檢查編譯器是否不會優化它，並且在編譯時不執行其他語言在運行時會執行的操作。沒有或可預測結果的任何類型的陳述都傾向於這種優化。
確認您確實衡量了您打算測量的內容，而不是其他內容。即使程序的小改動也會影響運行時間，因爲有一個內存分配需要不在之前等等，這些改變可能與您打算測量的內容不相關。在您的基準測試中，您可以一次又一次測量對相同散列元素的訪問權限，而無需訪問其他元素。這是一個可以與處理器緩存非常好的活動，但可能並不反映真實世界的使用。

，並使用一個簡單的計時器是不現實的基準。系統上還有其他進程，調度程序，高速緩衝存儲器......以及今天的CPU，它高度依賴於系統的負載，因爲CPU可能會以低功耗模式運行一個基準測試，而不是其他基準測試即用不同的CPU時鐘。例如，在我的系統上，在100ms到150ms之間的測量時間內，相同「基準」的多次運行會有所不同。

基準是謊言和像你這樣的微基準。

來源

2015-11-27 06:21:45

謝謝@ steffen-ullrich指出語法。現在它已經修復，但性能數量沒有增加。 – packetie

@codingFun：不要指望這個數字上升，因爲在代碼修復之後，它應該完成更復雜的操作。就像我說的：基準是像你這樣的謊言和微觀基準。閱讀有關CPU，電源模式，調度程序的部分....並根據生成的代碼而不是您的輸入驗證其他代碼的功能。就像我說的，一個聰明的編譯器可能會意識到它沒有任何用處，並將其優化。 –

@codingFun：除此之外，你的代碼仍然是錯誤的，如果你真的會使用像我推薦的警告，你會看到它。 –

這只是表明對於這個特定的用例，Go哈希映射實現進行了優化。

mymap["China"]調用mapaccess1_faststr專門爲字符串鍵優化。特別是對於小型單桶映射，對於短（小於32字節）的字符串甚至不計算哈希碼。

來源

2015-11-27 07:14:05 kostya

我已經修改了你的例子一點點地得到關於哈希表的結構的一些細節：

#include <iostream> 
#include <string> 
#include <unordered_map> 
#include <sys/time.h> 
#include <chrono> 

using namespace std; 
int main() 
{ 
    std::unordered_map<std::string, std::string> mymap; 

    // populating container: 
    mymap["U.S."] = "Washington"; 
    mymap["U.K."] = "London"; 
    mymap["France"] = "Paris"; 
    mymap["Russia"] = "Moscow"; 
    mymap["China"] = "Beijing"; 
    mymap["Germany"] = "Berlin"; 
    mymap["Japan"] = "Tokyo"; 
    mymap["China"] = "Beijing"; 
    mymap["Italy"] = "Rome"; 
    mymap["Spain"] = "Madrad"; 

    std::hash<std::string> h; 
    for (auto const& i : mymap) 
    { 

     printf("hash(%s) = %ud\n", i.first.c_str(), h(i.first)); 
    } 

    for (int i = 0; i != mymap.bucket_count(); ++i) 
    { 
     auto const bucketsize = mymap.bucket_size(i); 
     if (0 != bucketsize) 
     { 
      printf("size of bucket %d = %d\n", i, bucketsize); 
     } 
    } 

    auto const start = std::chrono::steady_clock::now(); 

    string const x = "China"; 
    std::string res; 

    for (int i = 0; i < 1000000; i++) 
    { 
     mymap.find(x); 
    } 

    auto const elapsed = std::chrono::steady_clock::now() - start; 
    printf("%s\n", res); 
    printf("took %d ms\n", 
      std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()); 
    return 0; 
}

我的系統上運行，我得到〜68ms與以下輸出的運行時間：

hash(Japan) = 3611029618d 
hash(Spain) = 749986602d 
hash(China) = 3154384700d 
hash(U.S.) = 2546447179d 
hash(Italy) = 2246786301d 
hash(Germany) = 2319993784d 
hash(U.K.) = 2699630607d 
hash(France) = 3266727934d 
hash(Russia) = 3992029278d 
size of bucket 0 = 0 
size of bucket 1 = 0 
size of bucket 2 = 1 
size of bucket 3 = 1 
size of bucket 4 = 1 
size of bucket 5 = 0 
size of bucket 6 = 1 
size of bucket 7 = 0 
size of bucket 8 = 0 
size of bucket 9 = 2 
size of bucket 10 = 3

事實證明，哈希表沒有很好地優化並且包含一些衝突。進一步打印存儲桶中的元素表明，西班牙和中國位於第9桶。存儲桶可能是一個鏈接列表，節點分佈在內存中，解釋了性能下降。

如果您選擇另一個哈希表大小，使得沒有衝突，您可以獲得加速。我通過加入mymap.rehash(1001)進行了測試，並在44-52ms之間獲得了20-30％的加速。

現在，另一點將是計算「中國」的哈希值。該函數在每次迭代中執行。我們可以讓這個消失，當我們切換到恆普通的C字符串：

#include <iostream> 
#include <string> 
#include <unordered_map> 
#include <sys/time.h> 
#include <chrono> 

static auto constexpr us = "U.S."; 
static auto constexpr uk = "U.K."; 
static auto constexpr fr = "France"; 
static auto constexpr ru = "Russia"; 
static auto constexpr cn = "China"; 
static auto constexpr ge = "Germany"; 
static auto constexpr jp = "Japan"; 
static auto constexpr it = "Italy"; 
static auto constexpr sp = "Spain"; 

using namespace std; 
int main() 
{ 
    std::unordered_map<const char*, std::string> mymap; 

    // populating container: 
    mymap[us] = "Washington"; 
    mymap[uk] = "London"; 
    mymap[fr] = "Paris"; 
    mymap[ru] = "Moscow"; 
    mymap[cn] = "Beijing"; 
    mymap[ge] = "Berlin"; 
    mymap[jp] = "Tokyo"; 
    mymap[it] = "Rome"; 
    mymap[sp] = "Madrad"; 

    string const x = "China"; 
    char const* res = nullptr; 
    auto const start = std::chrono::steady_clock::now(); 
    for (int i = 0; i < 1000000; i++) 
    { 
     res = mymap[cn].c_str(); 
    } 

    auto const elapsed = std::chrono::steady_clock::now() - start; 
    printf("%s\n", res); 
    printf("took %d ms\n", 
      std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()); 
    return 0; 
}

在我的機器，這50％〜20毫秒降低了運行時間。不同之處在於，不是從字符串內容計算散列值，而是將地址轉換爲更快的散列值，因爲它只是將地址值作爲size_t返回。我們也不需要重新粉刷，因爲與cn之間沒有碰撞。

來源

2015-11-27 10:44:24 Jens

切換到'char *'意味着你使用引用相等，所以如果你傳遞來自不同源的字符串，代碼將不起作用。 – CodesInChaos

@CodesInChaos你是對的，但我不認爲這是微觀基準的重點。 – Jens

感謝@Jens的詳細解釋。碰撞確實是一個大問題。使用不同的關鍵字「法國」，將性能提高了約25％。比較指針確實提高了性能，但這不是我們在這裏需要的。正如james-picone在生成的程序集中指出的那樣，我相信性能與執行的指令數量有很大關係。在這個問題上創建了一個更新5。謝謝。 – packetie

散列表的性能，爲什麼C++最慢？

回答

相關問題