2013-03-02 141 views
1

我有大量的單詞。 我想要計算兩個特定單詞出現次數少於某個給定距離的次數。計算兩個數組元素出現在一起的次數

例如,如果「時間」和「遲到」之間的距離不超過三個字,那麼我想增加一個計數器。單詞「time」和「late」可以在數組中出現數百次。我怎樣才能找到他們彼此靠近的時間?

+1

的可能重複的[在Perl中,如何可以找到在陣列中一個給定的值的索引?](http://stackoverflow.com/questions/1915746/in-perl-how-can-i查找數組中的給定值的索引) – Quentin 2013-03-02 09:43:30

+0

編輯該問題以匹配下面評論中提出的問題。 – 2013-12-28 17:47:31

回答

3

你沒有問一個問題,所以我認爲你有一個算法。

  1. 遍歷索引。
    1. 如果第一個字是該索引處,
      1. 需要注意的是指數。
    2. 如果第二個字是該索引處,
      1. 需要注意的是指數。
  2. 從另一個減去一個索引。

注:

  • 您可能要添加檢查,以確保每個單詞被發現。
  • 您沒有指定當其中一個單詞出現多次時應該發生的情況。

對於在評論中提出的問題:

  1. 迭代通過索引。
    1. 如果第一個字是該索引處,
      1. 需要注意的是指數。
    2. 如果第二字被該索引處發現,
      1. 如果當前索引和所提到的折射率之間的差是3 ≤,
        1. 遞增該計數器。

注:

  • 假設你只關心第二個字和第一個字的以前實例之間的距離。
+0

此外,如果數組排序,這可以用bsearch完成,所以更快 – PSIAlt 2013-03-02 10:12:43

+0

嗨。如果「時間」和「遲到」之間的距離<= 3,則計數++ 「時間」和「遲到」可以在陣列中出現100次。 我如何添加檢查以便不重述單詞? – user2126344 2013-03-02 10:30:54

+0

@ user2126344,已更新。 – ikegami 2013-03-02 10:40:02

0

使用索引散列將是非常有效的解決方案:腳本

my @words = qw(word1 word2 word3 word4 word5 word6); 

# That can be expensive, but you do it only once 
my %index; 
@index{@words} = (0..$#words); 

# That will be real quick 
my $distance = $index{"word6"} - $index{"word2"} 
print "Distance: $distance \n"; 

輸出上面會:

Distance: 4 

注:在創建索引散列可以是昂貴的。但是如果你打算做很多距離檢查,這可能是值得的,因爲任何查找都很快(恆定時間,而不是事件日誌(n))。

+1

缺點:即使您想要的兩個單詞是前兩個單詞,您也必須處理整個列表。 – ikegami 2013-03-02 10:04:31

+0

的確如此。另一方面,一旦創建了索引散列,它就可以用於快速查找。所以這是一些折衷。我會在答案中註明這一點。 – kamituel 2013-03-02 10:08:01

+0

不需要預留必要數量的散列桶,它可以是O(n log n),所以添加回顯到answ – PSIAlt 2013-03-02 10:16:36

0

是否需要支持重複單詞?

#! /usr/bin/perl 
use strict; 
use warnings; 
use constant DEBUG => 0; 

my @words; 
if($ARGV[0] && -f $ARGV[0]) { 
    open my $fh, "<", $ARGV[0] or die "Could not read $ARGV[0], because: $!\n"; 
    my $hughTestFile = do { local $/; <$fh> }; 
    @words = split /[\s\n]/, $hughTestFile; # $#words == 10M words with my test.log 
    # Test words (below) were manually placed at equal distances (~every 900K words) in test.log 
    # With above, TESTS ran in avg of 15 seconds. Likely test.log was in buffers/cache. 
} else { 
    @words = qw(word1 word2 word3 word4 word5 word6 word7 word8 word4 word9 word0); 
} 

sub IndexOf { 
    my $searchFor = shift; 
    return undef if(!$searchFor); 
    my $Nth = shift || 1; 

    my $length = $#words; 
    my $cntr = 0; 
    for my $word (@words) { 
     if($word eq $searchFor) { 
      $Nth--; 
      return $cntr if($Nth == 0); 
     } 
     $cntr++; 
    } 
    return undef; 
} 

sub Distance { 
# args: <1st word>, <2nd word>, [occurrence_of_1st_word], [occurrence_of_2nd_word] 
# for occurrence counts: 0, 1 & undef - all have the same effect (1st occurrence) 
    my($w1, $w2) = ($_[0], $_[1]); 
    my($n1, $n2) = ($_[2] || undef, $_[3] || undef); 
    die "Missing words\n" if(!$w1); 
    $w2 = $w1 if(!$w2); 

    my($i1, $i2) = (IndexOf($w1, $n1), IndexOf($w2, $n2)); 
    if(defined($i1) && defined($i2)) { 
     my $offset = $i1-$i2; 
     print " Distance (offset) = $offset\n"; 
     return undef; 
    } elsif(!defined($i1) && !defined($i2)) { 
     print " Neither words were "; 
    } elsif(!defined($i1)) { 
     print " First word was not "; 
    } else { 
     print " Second word was not "; 
    } 
    print "found in list\n"; 

    return undef; 
} 

# TESTS 
print "Your array has ".$#words." words\n"; 
print "When 1st word is AFTER 2nd word:\n"; 
Distance("word7", "word3"); 
print "When 1st word is BEFORE 2nd word:\n"; 
Distance("word2", "word5"); 
print "When 1st word == 2nd word:\n"; 
Distance("word4", "word4"); 
print "When 1st word doesn't exist:\n"; 
Distance("word00", "word6"); 
print "When 2nd word doesn't exist:\n"; 
Distance("word1", "word99"); 
print "When neither 1st or 2nd words exist:\n"; 
Distance("word00", "word99"); 
print "When the 1st word is AFTER the 2nd OCCURRENCE of 2nd word:\n"; 
Distance("word9", "word4", 0, 2); 
print "When the 1st word is BEFORE the 2nd OCCURRENCE of the 2nd word:\n"; 
Distance("word7", "word4", 1, 2); 
print "When the 2nd OCCURRENCE of the 2nd word doesn't exist:\n"; 
Distance("word7", "word99", 0, 2); 
print "When the 2nd OCCURRENCE of the 1st word is AFTER the 2nd word:\n"; 
Distance("word4", "word2", 2, 0); 
print "When the 2nd OCCURRENCE of the 1st word is BEFORE the 2nd word:\n"; 
Distance("word4", "word0", 2, 0); 
print "When the 2nd OCCURRENCE of the 1st word exists, but 2nd doesn't:\n"; 
Distance("word4", "word99", 2, 0); 
print "When neither of the 2nd OCCURRENCES of the words exist:\n"; 
Distance("word00", "word99", 2, 2); 
print "Distance between 2nd and 1st OCCURRENCES of the same word:\n"; 
Distance("word4", "", 2, 1); 
相關問題