
Efficient parallel strategies

I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try to use them with more than one core, the program slows down considerably.

For example, a while ago I tried to compute histograms (and, from those, the set of unique words) over ~700 documents. I figured that file-level granularity would be fine. With -N4 I get a work balance of 1.70, yet with -N1 it runs in half the time it takes with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize, and to gain some understanding of it. How would this be parallelized so that performance increases with the number of cores instead of decreasing?

import Data.Map (Map) 
import qualified Data.Map as M 
import System.Directory 
import Control.Applicative 
import Data.Vector (Vector) 
import qualified Data.Vector as V 
import qualified Data.Text as T 
import qualified Data.Text.IO as TI 
import Data.Text (Text) 
import System.FilePath ((</>)) 
import Control.Parallel.Strategies 
import qualified Data.Set as S 
import Data.Set (Set) 
import GHC.Conc (pseq, numCapabilities) 
import Data.List (foldl') 

mapReduce stratm m stratr r xs = let 
    mapped = parMap stratm m xs 
    reduced = r mapped `using` stratr 
    in mapped `pseq` reduced 

type Histogram = Map Text Int 

rootDir = "/home/masse/Documents/text_conversion/" 

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"] 
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"] 
isStopWord :: Text -> Bool 
isStopWord x = x `elem` (finnishStop ++ englishStop) 

textFiles :: IO [FilePath] 
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir 
    where meta "."  = True 
          meta ".." = True 
          meta _    = False 

histogram :: Text -> Histogram 
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words 

wordList = do 
    files <- mapM TI.readFile =<< textFiles 
    return $ mapReduce rseq histogram rseq reduce files 
    where 
    reduce = M.unions 

main = do 
    list <- wordList 
    print $ M.size list 

For the text files, I'm using PDF files converted to plain text, so I can't provide them here, but for this purpose just about any book from Project Gutenberg should do.

Edit: added the imports to the script

+1

'histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words' should use a 'foldl''. 'foldr' builds up a deep stack of thunks long before it starts constructing the 'Map'. –
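For what it's worth, a minimal sketch of the change that comment suggests could look like this (my own illustration, reusing the question's imports; untested here):

histogram :: Text -> Histogram 
histogram = foldl' step M.empty . filter (not . isStopWord) . T.words 
    where 
    -- foldl' forces the accumulator at each step, so no chain of 
    -- suspended insertions builds up before the Map is demanded 
    step m k = M.insertWith' (+) k 1 m 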

+3

It would be a lot easier to answer a question like this if you provided a small but complete example. Without looking at it in detail: are you sure that 'rseq' as the first argument of 'mapReduce' is enough to force each piece of work to actually be done in parallel? Is the amount of work per list element in 'parMap' large enough to ensure a good granularity for the parallel tasks? Have you tried running threadscope on your program to see what is happening on each core? Have you tried running with '+RTS -s' to see how much time is spent in garbage collection? – kosmikus
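To illustrate the forcing question (my own sketch, not something posted in the thread): one way to make sure each per-file result is evaluated fully inside its spark is to pass 'rdeepseq' instead of 'rseq' as the mapping strategy, reusing the definitions from the question:

wordList :: IO Histogram 
wordList = do 
    files <- mapM TI.readFile =<< textFiles 
    -- rdeepseq evaluates each Histogram to normal form inside its spark, 
    -- rather than only to weak head normal form as rseq does 
    return $ mapReduce rdeepseq histogram rseq M.unions files 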

+0

kosmikus, what do you mean by a complete example? Apart from the imports, the script is fully runnable. For rseq/rdeepseq I tried other combinations, with no luck. As for parMap, I also tried parListChunk and parListN for the map. As for threadscope, there seems to be a steady alternation of work and GC. -s says it's 60% productive time, which is better than with -N1. – Masse

Answers

4

Getting the parallel combinators to scale well can indeed be difficult. Others have already mentioned making your code more strict to make sure you are actually doing the work in parallel, which is definitely important.

Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversals put more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons, and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type or, even better, the HashSet type from the unordered-containers package (since repeated string comparisons can be expensive, especially when the strings share common prefixes).

import   Data.HashSet    (HashSet) 
import qualified Data.HashSet    as S 

... 

finnishStop :: [Text] 
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"] 
englishStop :: [Text] 
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"] 

stopWord :: HashSet Text 
stopWord = S.fromList (finnishStop ++ englishStop) 

isStopWord :: Text -> Bool 
isStopWord x = x `S.member` stopWord 

Replacing your isStopWord function with this version performs much better and scales much better (though definitely not 1-1). You could also consider using a HashMap (from the same package) rather than a Map for the same reason, but I did not get a noticeable change from doing so.
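A sketch of what that HashMap variant could look like (my own illustration of the suggestion; the answer does not show this code, and the choice of the strict Data.HashMap.Strict module is an assumption):

import           Data.HashMap.Strict (HashMap) 
import qualified Data.HashMap.Strict as HM 

type Histogram = HashMap Text Int 

histogram :: Text -> Histogram 
histogram = foldl' (\m k -> HM.insertWith (+) k 1 m) HM.empty 
          . filter (not . isStopWord) . T.words 

-- the reduce step then becomes HM.unions instead of M.unions 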

Another option is to increase the default heap size, which takes some of the pressure off the GC and gives it more room to move things around. Giving the compiled code a default heap size of 1GB (the -H1G flag), I get a GC balance of about 50% on 4 cores, whereas I only get ~25% without it (and it also runs ~30% faster).

With these two alterations, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably there is room for improvement based on the GC statistics (it still only spends 58% of the time doing productive work), but doing significantly better might require a much more drastic change to your algorithm.

+3

I'm happy to make changes. It's a learning experience for me, after all :) – Masse

4

I think Daniel got it right -- Data.Map and lists are lazy data structures; you should use both foldl' and insertWith' to make sure the work for each chunk is done eagerly -- otherwise all of the work is delayed until the sequential part (reduce).

Also, it is not obvious that making one spark per file is the right granularity, particularly if the file sizes differ substantially. If that can be the case, it would be preferable to concatenate the word lists and split them into even-sized chunks (see the parListChunk combinator; a sketch of the idea follows below). While you're at it, I would also look out for some of the pitfalls of opening many files with lazy IO (readFile): the runtime system can run out of file handles because it holds onto them for too long.
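Here is a rough sketch of that chunking idea (my own illustration, reusing isStopWord and the question's imports; the chunk size of 500 is an arbitrary guess to be tuned, not a value from the answer). It uses an explicit split plus parList rather than parListChunk, because each chunk here is reduced to one partial Map before merging:

histogramAll :: [Text] -> Histogram 
-- unionsWith (+) adds the per-chunk counts together when merging 
histogramAll files = M.unionsWith (+) partials 
    where 
    -- one flat word list over all files, split into even-sized pieces 
    ws = concatMap (filter (not . isStopWord) . T.words) files 
    chunks = chunksOf 500 ws 
    -- each chunk is counted in its own spark and forced with rdeepseq 
    partials = map count chunks `using` parList rdeepseq 
    count = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty 
    chunksOf _ [] = [] 
    chunksOf n xs = let (a, rest) = splitAt n xs in a : chunksOf n rest 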

+0

As you can see from my comment, I have tried parMap, parListN and parListChunk. All have similar performance characteristics. Changing the foldr to foldl' made the balance go up to >2, but the total running time nearly doubled. – Masse

+0

I changed the code so that foldr -> foldl' and moved the mapReduce from wordList into histogram. I split the file into lines and used parListChunk stratm 100 xs in the mapReduce. I succeeded in doubling the running time (from about 70 seconds to 300 seconds). – Masse