我正在努力學習Haskell,並且在關於馬爾可夫文本鏈的reddit中的一篇文章中,我決定首先在Python中實現Markov文本生成,現在在Haskell中。但是我注意到我的python實現比Haskell版本快,即使Haskell編譯爲本地代碼。我想知道我應該怎麼做才能讓Haskell代碼運行得更快,現在我相信由於使用Data.Map而不是hashmaps,所以速度非常慢,但我不確定優化Haskell代碼
我將發佈Python代碼還有Haskell。使用相同的數據,Python需要大約3秒,而Haskell接近16秒。
不言而喻,我會採取任何建設性的批評:)。
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
while len(sentence) < words or sentence[-1] is not ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
-
module Markov
(train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
let l = train (y:z:xs)
in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
有趣的是,也在尋找答案。 16秒對3秒是一個很大的區別。 – wvd 2010-05-26 17:32:09
順便說一下,縮進似乎已經被Python代碼弄壞了...... – 2010-05-26 17:54:53
我不認爲你的Haskell代碼能夠完成你想要的東西。如果你檢查輸出,你會發現'M.Map String Int'映射中沒有大於2的值。你的意思是'n + o'還是'o + 1'而不是'n + 1'? – 2010-05-26 18:18:56