PHP: find n-grams in an array

-2

$excerpts = array(
    'I love cheap red apples', 
    'Cheap red apples are what I love', 
    'Do you sell cheap red apples?', 
    'I want red apples', 
    'Give me my red apples', 
    'OK now where are my apples?' 
);

I want to find all the n-grams in these lines and get a result like this:

cheap red apples: 3
red apples: 5
apples: 6

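I tried to implode the array and then parse the result, but that is a poor approach: nothing marks where one string ends and the next begins, so the joined text yields spurious n-grams that span two strings.

How would you proceed?

Source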

2014-10-19 mattspain

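To proceed, I would look up n-gram algorithms and then decide which one is suitable to implement on this dataset. First port of call: [Wikipedia on N-grams](http://en.wikipedia.org/wiki/N-gram). – 2014-10-19 22:14:58

Thanks for the advice, that is what I did, but I need a solution, or at least a concrete example, that gives me the final output I described. – mattspain 2014-10-20 11:42:22

Hello, this library may do the job for you: https://packagist.org/packages/drupol/phpngrams Let me know how it goes! – 2018-02-05 20:53:04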

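"I want to find groups of words without knowing them beforehand, whereas with your function I need to provide them before anything else."

Try this: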

mb_internal_encoding('UTF-8');

// Join with ".\n" so that no word sequence can span two excerpts.
$joinedExcerpts = implode(".\n", $excerpts);
// Split into sentences on any run of characters that is neither whitespace nor a letter.
$sentences = preg_split('/[^\s\pL]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

$wordsSequencesCount = array();
foreach ($sentences as $sentence) {
    // Lowercase the words so that "Cheap" and "cheap" are counted together.
    $words = array_map('mb_strtolower',
        preg_split('/[^\pL]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $index => $word) {
        // Count every word sequence that starts at $index.
        $wordsSequence = '';
        foreach (array_slice($words, $index) as $nextWord) {
            $wordsSequence .= ($wordsSequence !== '') ? (' ' . $nextWord) : $nextWord;
            if (!isset($wordsSequencesCount[$wordsSequence])) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}

// Keep only the sequences that occur more than once.
$ngramsCount = array_filter($wordsSequencesCount,
    function ($count) { return $count > 1; });

I assumed you only want groups of words that are repeated. The output of var_dump($ngramsCount); is:

array (size=11) 
    'i' => int 3 
    'i love' => int 2 
    'love' => int 2 
    'cheap' => int 3 
    'cheap red' => int 3 
    'cheap red apples' => int 3 
    'red' => int 5 
    'red apples' => int 5 
    'apples' => int 6 
    'are' => int 2 
    'my' => int 2

The code could be optimized, for example to use less memory.
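One possible way to cut memory use, sketched here under the assumption that only n-grams up to some maximum length are of interest: cap the inner loop so that sequences longer than that are never stored. The function name countNgrams and the parameter $maxN are hypothetical and not part of the answer above; the excerpts are processed one at a time instead of being imploded first.

```php
<?php
// Hypothetical helper: counts repeated word n-grams of at most $maxN words.
function countNgrams(array $excerpts, int $maxN = 3): array
{
    $counts = [];
    foreach ($excerpts as $excerpt) {
        // Sentences are runs of letters and whitespace; anything else separates them.
        $sentences = preg_split('/[^\s\pL]+/u', $excerpt, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($sentences as $sentence) {
            $words = preg_split('/[^\pL]+/u', mb_strtolower($sentence, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
            $total = count($words);
            for ($start = 0; $start < $total; $start++) {
                $sequence = '';
                // Stop after $maxN words instead of running to the end of the sentence.
                $end = min($total, $start + $maxN);
                for ($i = $start; $i < $end; $i++) {
                    $sequence = ($sequence === '') ? $words[$i] : $sequence . ' ' . $words[$i];
                    $counts[$sequence] = ($counts[$sequence] ?? 0) + 1;
                }
            }
        }
    }
    // Keep only the sequences that occur more than once.
    return array_filter($counts, function ($count) { return $count > 1; });
}

// Usage with the excerpts from the question.
$excerpts = [
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?',
];
$result = countNgrams($excerpts, 3);
```

For the six excerpts in the question, no repeated sequence is longer than three words, so a cap of 3 should lose nothing here; on other data the cap is a trade-off between memory and the longest n-gram that can be reported.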

Source

2014-10-20 13:38:08

This is perfect, exactly what I asked for. Thank you very much! – mattspain 2014-10-20 18:07:04

-1

Assuming you just want to count the number of times a string occurs:

$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
foreach ($excerpts as $excerpt) {
    // PCRE patterns need delimiters; the i flag also matches "Cheap red apples".
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpt);
    $redAppleCount += preg_match_all('/red apples/i', $excerpt);
    $appleCount += preg_match_all('/apples/i', $excerpt);
}

preg_match_all returns the number of matches found in the given string, so you can simply add that number to the counter.
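A small sketch of what that return value looks like in practice. The sample string is made up for illustration; it also shows that a plain pattern counts substrings, so a \b word boundary may be wanted if "pineapples" should not count as "apples".

```php
<?php
// Hypothetical sample text, not taken from the question.
$text = 'Apples, pineapples and apples';

// Counts every occurrence of the substring, including the one inside "pineapples".
$substringCount = preg_match_all('/apples/i', $text);

// \b restricts the match to whole words.
$wholeWordCount = preg_match_all('/\bapples\b/i', $text);

echo $substringCount, ' ', $wholeWordCount, "\n";
```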

See preg_match_all for more information.

Apologies if I have misunderstood.

Source

2014-10-19 22:24:18 user1849060

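I think the OP probably wants to find all the n-grams in any set of strings, not just those three in those particular strings. :\ – 2014-10-19 22:27:13

I want to find groups of words without knowing them beforehand, so unfortunately this doesn't meet my needs. Thanks for your help anyway. – mattspain 2014-10-20 11:41:16

Try this (using implode, since that is the attempt you mentioned):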

$ngrams = array(
    'cheap red apples',
    'red apples',
    'apples',
);

$joinedExcerpts = implode("\n", $excerpts);
$nGramsCount = array_fill_keys($ngrams, 0);
foreach ($ngrams as $ngram) {
    // Lookarounds require a non-letter (or the string edge) on both sides without
    // consuming it, so two adjacent occurrences are both counted.
    $regex = '/(?<!\pL)' . preg_quote($ngram, '/') . '(?!\pL)/ui';
    $nGramsCount[$ngram] = preg_match_all($regex, $joinedExcerpts);
}

Source

2014-10-19 23:06:51

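The point is: I want to find groups of words without knowing them beforehand, whereas with your function I need to provide them before anything else. Thanks for your help anyway. – mattspain 2014-10-20 11:44:15

Sorry, I misunderstood the question. Should word groups such as "I", "I love" and "are" be considered n-grams, and should word groups that are not repeated ("Do", "Do you", etc.) be ignored? – 2014-10-20 12:05:46

The code provided by Pedro Amaral Couto above is very good. Since I use it for French, I modified the regular expression as follows: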

// The hyphen goes last in the class so that it is a literal, not an (invalid) range.
$sentences = preg_split('/[^\s\pL\'’-]/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

This way we can analyze words that contain hyphens and apostrophes ("est-ce que", "j'ai", etc.).
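A minimal sketch of how such a character class behaves when applied to word splitting. The sample sentence and variable names are made up for illustration. Note that the word-level preg_split in the answer above splits on every non-letter, so it would presumably need the same hyphen and apostrophe allowance for words like "est-ce" to stay in one piece.

```php
<?php
// Hypothetical sample text, chosen only to exercise hyphens and apostrophes.
$text = "Est-ce que j'ai raison ? J'ai faim.";

// Letters, hyphens and both apostrophe forms (' and ’) count as word characters.
$words = preg_split(
    '/[^\pL\'’-]+/u',
    mb_strtolower($text, 'UTF-8'),
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($words);
```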

Source

2016-04-07 19:49:36 easypronunciation
