Efficient parallel strategies

I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try to use them with more than one core, the program slows down considerably.
For example, a while ago I tried to calculate histograms (and, from them, the unique words) from ~700 documents. I figured that file-level granularity would be fine. With -N4 I get a work balance of 1.70, yet with -N1 it runs in half the time it does with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize and to gain some understanding of it. How would this be parallelized so that the speed increases with the number of cores rather than decreasing?
import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')
-- Map m over xs in parallel using strategy stratm, then apply the
-- reduction r under strategy stratr.  Note that `pseq` only forces
-- `mapped` to WHNF (the first cons cell) before returning `reduced`.
mapReduce stratm m stratr r xs =
  let mapped  = parMap stratm m xs
      reduced = r mapped `using` stratr
  in mapped `pseq` reduced
type Histogram = Map Text Int
rootDir = "/home/masse/Documents/text_conversion/"
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)
textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "."  = True
        meta ".." = True
        meta _    = False
histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words
wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where reduce = M.unions

main = do
  list <- wordList
  print $ M.size list
As for the text files, I'm using PDFs converted to text files, so I can't provide them; but for this purpose, almost any book(s) from Project Gutenberg should do.

Edit: added the imports to the script
`histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words` should use `foldl'`. `foldr` builds a structure as deep as the list before it can even start constructing the `Map`. –
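A minimal sketch of that suggestion, dropping into the question's module as-is (it reuses `Histogram`, `isStopWord`, and the question's imports, including `Data.List (foldl')`, which is already imported but unused):

-- Strict left fold: the Map itself is the accumulator, updated once
-- per word, so no deep chain of pending applications builds up the
-- way it does with foldr.
histogram :: Text -> Histogram
histogram = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty
          . filter (not . isStopWord) . T.words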
It would be a lot easier to answer a question like this if you provided a small but complete example. Without looking at it in detail: are you sure that `rseq` as the first argument of `mapReduce` is enough to force each chunk of work to really happen in parallel? Is the amount of work per list element in `parMap` large enough to ensure good granularity of the parallel tasks? Have you tried running ThreadScope on your program to see what happens on each core? Have you tried running with `+RTS -s` to see how much time is spent in garbage collection? – kosmikus
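On the first of those points: `rseq` evaluates each spark's result only to weak head normal form, so each `Map` produced by `histogram` is still a thunk when its spark finishes, and the actual counting happens later, sequentially, inside `M.unions`. A sketch of the fix, assuming the intent is to force each histogram in full inside its own spark (the `Map` and `Text` types both have the required `NFData` instances):

-- rdeepseq forces each histogram to normal form in parallel;
-- the reduce then only has to merge already-evaluated Maps.
wordList :: IO Histogram
wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rdeepseq histogram rseq M.unions files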
kosmikus, what do you mean by a complete example? Apart from the imports, the script was fully runnable. As for rseq/rdeepseq, I tried other combinations with no luck. As for parMap, I also tried parListChunk and parListN with a plain map. As for ThreadScope, there seemed to be steady activity and GC throughout. `+RTS -s` said 60% of the time was spent doing actual work, which is better than with -N1. – Masse
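For reference, a minimal sketch of the chunked variant mentioned here, with an arbitrarily chosen chunk size of 50 that would need tuning: `parListChunk` creates one spark per chunk rather than one per file, so each spark carries enough work to outweigh its scheduling overhead.

-- One spark per chunk of 50 files; each chunk is forced in full.
histograms :: [Text] -> [Histogram]
histograms files = map histogram files `using` parListChunk 50 rdeepseq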