優化Haskell代碼

我正在努力學習Haskell，並且在關於馬爾可夫文本鏈的reddit中的一篇文章中，我決定首先在Python中實現Markov文本生成，現在在Haskell中。但是我注意到我的python實現比Haskell版本快，即使Haskell編譯爲本地代碼。我想知道我應該怎麼做才能讓Haskell代碼運行得更快，現在我相信由於使用Data.Map而不是hashmaps，所以速度非常慢，但我不確定優化Haskell代碼

我將發佈Python代碼還有Haskell。使用相同的數據，Python需要大約3秒，而Haskell接近16秒。

不言而喻，我會採取任何建設性的批評:)。

import random 
import re 
import cPickle 
class Markov: 
    def __init__(self, filenames): 
     self.filenames = filenames 
     self.cache = self.train(self.readfiles()) 
     picklefd = open("dump", "w") 
     cPickle.dump(self.cache, picklefd) 
     picklefd.close() 

    def train(self, text): 
     splitted = re.findall(r"(\w+|[.!?',])", text) 
     print "Total of %d splitted words" % (len(splitted)) 
     cache = {} 
     for i in xrange(len(splitted)-2): 
      pair = (splitted[i], splitted[i+1]) 
      followup = splitted[i+2] 
      if pair in cache: 
       if followup not in cache[pair]: 
        cache[pair][followup] = 1 
       else: 
        cache[pair][followup] += 1 
      else: 
       cache[pair] = {followup: 1} 
     return cache 

    def readfiles(self): 
     data = "" 
     for filename in self.filenames: 
      fd = open(filename) 
      data += fd.read() 
      fd.close() 
     return data 

    def concat(self, words): 
     sentence = "" 
     for word in words: 
      if word in "'\",?!:;.": 
       sentence = sentence[0:-1] + word + " " 
      else: 
       sentence += word + " " 
     return sentence 

    def pickword(self, words): 
     temp = [(k, words[k]) for k in words] 
     results = [] 
     for (word, n) in temp: 
      results.append(word) 
      if n > 1: 
       for i in xrange(n-1): 
        results.append(word) 
     return random.choice(results) 

    def gentext(self, words): 
     allwords = [k for k in self.cache] 
     (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache])) 
     sentence = [first, second] 
     while len(sentence) < words or sentence[-1] is not ".": 
      current = (sentence[-2], sentence[-1]) 
      if current in self.cache: 
       followup = self.pickword(self.cache[current]) 
       sentence.append(followup) 
      else: 
       print "Wasn't able to. Breaking" 
       break 
     print self.concat(sentence) 

Markov(["76.txt"])

module Markov 
(train 
, fox 
) where 

import Debug.Trace 
import qualified Data.Map as M 
import qualified System.Random as R 
import qualified Data.ByteString.Char8 as B 


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train :: [B.ByteString] -> Database 
train (x:y:[]) = M.empty 
train (x:y:z:xs) = 
    let l = train (y:z:xs) 
    in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l 

main = do 
    contents <- B.readFile "76.txt" 
    print $ train $ B.words contents 

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

來源

2010-05-26 Masse

有趣的是，也在尋找答案。 16秒對3秒是一個很大的區別。 – wvd 2010-05-26 17:32:09

順便說一下，縮進似乎已經被Python代碼弄壞了...... – 2010-05-26 17:54:53

我不認爲你的Haskell代碼能夠完成你想要的東西。如果你檢查輸出，你會發現'M.Map String Int'映射中沒有大於2的值。你的意思是'n + o'還是'o + 1'而不是'n + 1'？ – 2010-05-26 18:18:56

我試圖避免做任何奇特或微妙的事情。這只是兩種方法來進行分組;第一個強調模式匹配，第二個不強調。

import Data.List (foldl') 
import qualified Data.Map as M 
import qualified Data.ByteString.Char8 as B 

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train2 :: [B.ByteString] -> Database2 
train2 words = go words M.empty 
    where go (x:y:[]) m = m 
      go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1 
           addWord (Just m') = Just $ M.alter inc z m' 
           inc Nothing = Just 1 
           inc (Just cnt) = Just $ cnt + 1 
          in go (y:z:xs) $ M.alter addWord (x,y) m 

train3 :: [B.ByteString] -> Database2 
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) 
    where update m (x,y,z) = M.alter (addWord z) (x,y) m 
      addWord word = Just . maybe (M.singleton word 1) (M.alter inc word) 
      inc = Just . maybe 1 (+1) 

main = do contents <- B.readFile "76.txt" 
      let db = train3 $ B.words contents 
      print $ "Built a DB of " ++ show (M.size db) ++ " words"

我認爲它們都比原始版本更快，但是我承認我只是試圖對付我發現的第一個合理的語料庫。

編輯按特拉維斯布朗的非常有效的一點，

train4 :: [B.ByteString] -> Database2 
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words)) 
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m 
      inc k _ = M.insertWith (+) k 1

來源

2010-05-26 18:45:20 Anthony

就我而言，我認爲最好在這裏使用比'alter'更具體的內容。我們知道在這種情況下我們永遠不需要刪除，而且必須像這樣添加'Just'會削弱可讀性。 – 2010-05-26 19:25:52

對不起，以延遲迴復。你能否解釋_爲什麼這是一個更快的解決方案？基本上都是一樣的，除了壓縮和下降。 – Masse 2010-08-21 14:55:06

一）你是如何編制的？（ghc -O2？）

b）哪個版本的GHC？ c）Data.Map是非常高效的，但你可以被誘騙到懶惰的更新 - 使用insertWith'，而不是insertWithKey。

d）不要將字符串轉換爲字符串。按照字符串的形式保存它們，並將它們存儲在地圖

來源

2010-05-26 17:26:29

版本是6.12.1。在你的幫助下，我能夠從運行時中擠出1秒，但仍然遠離python版本。 – Masse 2010-05-26 17:38:26

按照Don的建議，使用函數的更嚴格的版本：insertWithKey'（和M.insertWith'，因爲您無論如何都會忽略關鍵字參數）。

它看起來像你的代碼可能會建立很多thunk直到它到達[String]的末尾。

退房：http://book.realworldhaskell.org/read/profiling-and-optimization.html

...尤其是試圖（通過章大約一半）作圖堆。有興趣看看你想要什麼。

來源

2010-05-26 17:40:39 jberryman

我做了Don Stewart建議的更改。以前代碼花費了41-44兆字節的內存，現在只花費了29分鐘。對內存進行圖形顯示，TSO佔用了大部分內存，然後是GHC.types，然後是代碼中使用的其他數據類型。所有部分的記憶在一秒鐘內迅速增加。之後，一秒鐘TSO和GHC.types不斷增加，所有其他開始緩慢後退。（如果我正在讀圖） – Masse 2010-05-26 18:09:06

1）我不清楚你的代碼。 a）你定義「狐狸」，但不要使用它。你是否意味着我們試圖幫助你使用「狐狸」而不是閱讀文件？ b）你聲明這是「模塊馬爾科夫」，然後在模塊中有一個「主」。 c）System.Random不是必需的。如果您在發佈之前清理一下代碼，它的確幫助我們提供幫助。

2）使用ByteStrings和一些嚴格的操作，如唐說的。

3）使用-O2編譯並使用-fforce-recomp確保您實際重新編譯了代碼。

4）嘗試這個輕微的轉換，它的工作速度非常快（0.005秒）。顯然，輸入是荒謬的小，所以你需要提供你的文件或只是自己測試它。

{-# LANGUAGE OverloadedStrings, BangPatterns #-} 
module Main where 

import qualified Data.Map as M 
import qualified Data.ByteString.Lazy.Char8 as B 


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int) 

train :: [B.ByteString] -> Database 
train xs = go xs M.empty 
    where 
    go :: [B.ByteString] -> Database -> Database 
    go (x:y:[]) !m = m 
    go (x:y:z:xs) !m = 
    let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + 1) z 1 old) (x, y) (M.singleton z 1) m 
    in go (y:z:xs) m' 

main = print $ train $ B.words fox 

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

來源

2010-05-26 18:17:58

嗯，我是一個初學者，喜歡標籤說：P。我沒有意識到將模塊命名爲Main之外的結果。而狐狸是我用來測試算法。檢查小型輸入比輸入整本書更容易 – Masse 2010-05-26 19:06:41

這裏有一個foldl'基礎的版本，這似乎是約快兩倍，你train：

train' :: [B.ByteString] -> Database 
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs) 
    where 
    f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)

我試了一下Gutenberg項目Huckleberry Finn（我認爲是你的76.txt），它產生與你的函數相同的輸出。我的時間比較是非常不科學的，但這種方法可能值得一看。

來源

2010-05-26 18:55:42

Data.Map是根據類別Ord比較需要一定時間的假設而設計的。對於字符串鍵，這可能不是—的情況，並且當字符串相等時，它從來就不是這種情況。您可能會也可能不會碰到這個問題，具體取決於您的語料庫有多大以及有多少詞語具有共同前綴。

我會試圖嘗試一個數據結構，該數據結構被設計爲與序列鍵一起操作，例如Don Stewart友好建議的bytestring-trie包。

來源

2010-05-26 22:30:21

一個字節串trie？ http://hackage.haskell.org/package/bytestring-trie – 2010-05-27 02:31:04

@don：感謝您的更新。我相信你知道至少有60％的名字是hackage的內容:-) – 2010-05-27 15:11:26

優化Haskell代碼

回答

相關問題