Optimizing a loop that reads a dictionary

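Hello everyone, this is my first question here. I am using an open-source program called MElt, which lemmatizes words (it gives the lemma, for example: give -> give). MElt runs on Linux and is written in Perl and Python. So far it works well, but it takes too much time to give results. I looked into the code and located the loop responsible for this: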

while (<LEFFF>) { 
    chomp; 
    s/ /_/g; 
# s/(\S)-(\S)/\1_-_\2/g; 
    /^(.*?)\t(.*?)\t(.*?)(\t|$)/ || next; 
    $form = $1; $cats = $2; $lemma = $3; 
    #print "$form \n"; 
    #print "$cats \n"; 
    #print "$lemma \n"; 
    if ($lower_case_lemmas) {
        $lemma = lc($lemma);
    }
    if ($it_mapping) {
        next if ($form =~ /^.+'$/);
        next if ($form eq "dato" && $lemma eq "datare"); # bourrin
        next if ($form eq "stato" && $lemma eq "stare"); # bourrin
        next if ($form eq "stata" && $lemma eq "stare"); # bourrin
        next if ($form eq "parti" && $lemma eq "parto"); # bourrin
        if ($cats =~ /^(parentf|parento|poncts|ponctw)$/) {$cats = "PUNCT"}
        if ($cats =~ /^(PRO)$/) {$cats = "PRON"}
        if ($cats =~ /^(ARTPRE)$/) {$cats = "PREDET"}
        if ($cats =~ /^(VER|ASP|AUX|CAU)$/) {$cats = "VERB"}
        if ($cats =~ /^(CON)$/) {$cats = "CONJ"}
        if ($cats =~ /^(PRE)$/) {$cats = "PREP"}
        if ($cats =~ /^(DET)$/) {$cats = "ADJ"}
        if ($cats =~ /^(WH)$/) {$cats = "PRON|CONJ"}
        next if ($form =~ /^(una|la|le|gli|agli|ai|al|alla|alle|col|dagli|dai|dal|dalla|dalle|degli|dei|del|della|delle|dello|nei|nel|nella|nelle|nello|sul|sulla)$/ && $cats eq "ART");
        next if ($form =~ /^quest[aei]$/ && $cats eq "ADJ");
        next if ($form =~ /^quest[aei]$/ && $cats eq "PRON");
        next if ($form =~ /^quell[aei]$/ && $cats eq "ADJ");
        next if ($form =~ /^quell[aei]$/ && $cats eq "PRON");
        next if ($form =~ /^ad$/ && $cats eq "PREP");
        next if ($form =~ /^[oe]d$/ && $cats eq "CONJ");
    }
    $qmlemma = quotemeta ($lemma);
    for $cat (split /\|/, $cats) {
        if (defined ($cat_form2lemma{$cat}) && defined ($cat_form2lemma{$cat}{$form}) && $cat_form2lemma{$cat}{$form} !~ /(^|\|)$qmlemma(\||$)/) {
            $cat_form2lemma{$cat}{$form} .= "|$lemma";
        } else {
            $cat_form2lemma{$cat}{$form} = "$lemma";
            $form_lemma_suffs = "@".$form."###@".$lemma;
            while ($form_lemma_suffs =~ s/^(.)(.+)###\1(.+)/\2###\3/) {
                if (length($2) <= 8) {
                    $cat_formsuff_lemmasuff2count{$cat}{$2}{$3}++;
                    if ($multiple_lemmas) {
                        $cat_formsuff_lemmasuff2count{$cat}{$2}{__ALL__}++;
                    }
                }
            }
        }
    }
}

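The filehandle LEFFF is a dictionary made up of 490489 lines, so the loop compares the word against every line of the dictionary. That is really a lot. Any ideas on how to optimize it? Thank you. Med.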

2013-08-27 MEd

You would be better off posting this here: http://codereview.stackexchange.com/ – Toto

The OP has now posted this on Code Review (http://codereview.stackexchange.com/questions/30312/). – amon

Try changing this line, /^(.*?)\t(.*?)\t(.*?)(\t|$)/ || next;, to:

/^([^\t]++)\t([^\t]++)\t([^\t]++)(\t|$)/ || next;
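Note that the two patterns are not strictly equivalent: `(.*?)` accepts an empty field, while `[^\t]++` requires at least one non-tab character (possessive quantifiers also need Perl 5.10 or later). The sketch below compares them on a few made-up lines; the sample data and the helper name `fields` are assumptions for illustration, not taken from MElt or the Lefff file.

```perl
use strict;
use warnings;

# Hypothetical sample lines in a form<TAB>category<TAB>lemma layout.
my @lines = (
    "parti\tVER\tpartire\textra",   # four fields, only the first three are used
    "dato\tVER\tdare",              # exactly three fields
    "broken\t\tlemma",              # empty category field
);

my $lazy       = qr/^(.*?)\t(.*?)\t(.*?)(\t|$)/;
my $possessive = qr/^([^\t]++)\t([^\t]++)\t([^\t]++)(\t|$)/;

# Return "form/cats/lemma" on a match, or "no match".
sub fields {
    my ($re, $line) = @_;
    return $line =~ $re ? "$1/$2/$3" : "no match";
}

my @lazy_result       = map { fields($lazy, $_) } @lines;
my @possessive_result = map { fields($possessive, $_) } @lines;

print "lazy:       $_\n" for @lazy_result;
print "possessive: $_\n" for @possessive_result;
```

On these samples the two patterns agree on the well-formed lines and differ only on the line with an empty category, which the possessive version skips.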

For the following regexes, remove all unneeded capturing parentheses.

/^(parentf|parento|poncts|ponctw)$/ to

/^(?:parent[fo]|ponct[sw])$/ or why not /^p(?>arent[fo]|onct[sw])$/

/^(una|la|le|gli|agli|ai|al|alla|alle|col|dagli|dai|dal|dalla|dalle|degli|dei|del|della|delle|dello|nei|nel|nella|nelle|nello|sul|sulla)$/ to

/^(?>una|l[ae]|a?gli|a(?>i|l(?>l[ae])?)|col|d(?>ello|[ae](?>i|l(?>l[ae])?|gli))|ne(?>i|l(?>l[aeo])?)|sul(?>la)?)$/

(Note: you can improve this line further by reordering it so that the most frequent determiners/articles come first.)
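A hand-factored pattern like this is easy to get subtly wrong, so it is worth checking it against the original flat word list before relying on it. The following is a minimal sketch of such a check; the factored pattern in the code is one possible factoring that covers all 28 forms (including gli and agli), and the extra negative candidates are made up for the test.

```perl
use strict;
use warnings;

# The 28 forms from the original alternation.
my @articles = qw(una la le gli agli ai al alla alle col dagli dai dal dalla dalle
                  degli dei del della delle dello nei nel nella nelle nello sul sulla);

# Flat, non-capturing alternation built from the list.
my $alternatives = join '|', @articles;
my $flat = qr/^(?:$alternatives)$/;

# One possible factored form using atomic groups.
my $factored = qr/^(?>una|l[ae]|a?gli|a(?>i|l(?>l[ae])?)|col|d(?>ello|[ae](?>i|l(?>l[ae])?|gli))|ne(?>i|l(?>l[aeo])?)|sul(?>la)?)$/;

# Test every listed form plus a few strings that must NOT match.
my @candidates = (@articles, qw(un lo il dallo nellla sull a d ne agl));

my @mismatches = grep {
    my $in_flat     = ($_ =~ $flat)     ? 1 : 0;
    my $in_factored = ($_ =~ $factored) ? 1 : 0;
    $in_flat != $in_factored;
} @candidates;

print @mismatches
    ? "patterns differ on: @mismatches\n"
    : "both patterns agree on all candidates\n";
```

If the two patterns disagree on any candidate, the factored one should be fixed or dropped in favour of the flat list.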

Try changing this line:

while ($form_lemma_suffs =~ s/^(.)(.+)###\1(.+)/\2###\3/)

to

while ($form_lemma_suffs =~ s/^(.)([^#]++)###\1(.++)/\2###\3/)
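To see what this loop does, it helps to trace it on one pair. Each successful substitution strips one shared leading character from both the form and the lemma, so the loop walks through ever shorter suffix pairs until the first characters differ. A small sketch, assuming the illustrative pair parti/parto (the variable names are made up, and `${2}`/`${3}` are used in the replacement instead of `\2`/`\3`, which is the form Perl recommends there):

```perl
use strict;
use warnings;

my ($form, $lemma) = ("parti", "parto");   # illustrative pair only

# Same layout as in the question: "@form###@lemma".
my $suffixes = '@' . $form . '###@' . $lemma;

my @pairs;
while ($suffixes =~ s/^(.)([^#]++)###\1(.++)/${2}###${3}/) {
    # $2 and $3 hold the form suffix and lemma suffix of the last match.
    push @pairs, "$2 -> $3";
}

print "$_\n" for @pairs;
# parti -> parto
# arti -> arto
# rti -> rto
# ti -> to
# i -> o
```

The loop stops at "i###o" because the pattern needs a shared first character followed by at least one more character in the form.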

You can reverse the order of the conditions, so that the cheap string comparison is tested first:

next if ($form =~ /^quest[aei]$/ && $cats eq "ADJ");

to

next if ($cats eq "ADJ" && $form =~ /^quest[aei]$/);

(To be tested) you can replace the following two lines:

next if ($form eq "stato" && $lemma eq "stare"); # bourrin 
next if ($form eq "stata" && $lemma eq "stare"); # bourrin

with

next if ($lemma eq "stare" && ($form eq "stato" || $form eq "stata"));

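Important: with Perl you can compile your regexes, which can be useful in your case because you use the same regexes on every pass through the while loop. If you do this, don't forget to put the regex definition outside the loop! For example: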

my $regex = qr/^(?:parent[fo]|ponct[sw])$/; 
while (<LEFFF>) { 
    ... 
    if ($cats =~ $regex) {$cats = "PUNCT"}
}
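As a fuller illustration of the same idea, the category rules can be kept in a table of precompiled patterns built once before the loop. This is only a sketch covering three of the rules from the question; the names `@cat_map` and `map_cat` are made up here. Keep in mind that Perl already compiles a constant regex literal only once, so `qr//` mostly pays off for patterns that interpolate variables, and the gain for these fixed patterns may be small.

```perl
use strict;
use warnings;

# Patterns compiled once, outside any loop. Only a subset of the
# category rules from the question is shown.
my @cat_map = (
    [ qr/^(?:parentf|parento|poncts|ponctw)$/, "PUNCT" ],
    [ qr/^PRO$/,                               "PRON"  ],
    [ qr/^(?:VER|ASP|AUX|CAU)$/,               "VERB"  ],
);

# Return the mapped category, or the input unchanged if no rule applies.
sub map_cat {
    my ($cats) = @_;
    for my $rule (@cat_map) {
        my ($re, $target) = @$rule;
        return $target if $cats =~ $re;
    }
    return $cats;
}

my @mapped = map { map_cat($_) } qw(poncts PRO AUX NOUN);
print "@mapped\n";   # PUNCT PRON VERB NOUN
```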

2013-08-27 15:17:14

Thanks, will do. I'll let you know. – MEd

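I tried it, but it is still the same (I mean the running time; the regexes no longer recognize the words). I think the problem comes from the fact that the program compares the whole 490489 words with each word of the sentence (490489 * 5 words = about 2.5 million iterations). The longer the sentence, the more time it takes. – MEd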

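Re: '/^una|l[ae]|a(?>i|l(?>l[a...': this will actually slow the matching down. A simple alternation of constant strings is optimized into a trie data structure, which allows very fast lookups. The result is similar to the regex you wrote, but with much less runtime overhead. The only correct optimization is to remove the '(...)' capture group (by changing it to a non-capturing '(?:...)'). – amon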
