How to extract keywords from Bengali text using PHP

I want to automatically extract keywords from a Bengali text file using PHP. I already have code that reads the Bengali text file:

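<?php
// An uploaded file is read from its temporary path, not from the client-side file name.
$target_path = $_FILES['uploadedfile']['tmp_name'];
header('Content-Type: text/plain;charset=utf-8');
$fp = fopen($target_path, 'r') or die("Can't open file.");
$i = 0; // number of lines read
// Compare against false so that a line containing only "0" does not end the loop.
while (($line = fgets($fp)) !== false)
    {
     print $line;
     $i++;
    }
fclose($fp) or die("Can't close file.");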

I found the code below, which extracts the 10 most common keywords, but it does not work for Bengali text. What should I change?

function extractCommonWords($string){ 
     $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'); 

     $string = preg_replace('/\s\s+/i', '', $string); // replace whitespace 
     $string = trim($string); // trim the string 
     $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too… 
     $string = strtolower($string); // make it lowercase 

     preg_match_all('/\b.*?\b/i', $string, $matchWords); 
     $matchWords = $matchWords[0]; 

     foreach ($matchWords as $key=>$item) { 
      if ($item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3) { 
       unset($matchWords[$key]); 
      } 
     } 
     $wordCountArr = array(); 
     if (is_array($matchWords)) { 
      foreach ($matchWords as $key => $val) { 
       $val = strtolower($val); 
       if (isset($wordCountArr[$val])) { 
        $wordCountArr[$val]++; 
       } else { 
        $wordCountArr[$val] = 1; 
       } 
      } 
     } 
     arsort($wordCountArr); 
     $wordCountArr = array_slice($wordCountArr, 0, 10); 
     return $wordCountArr; 
}

Please help :(

Source

2016-04-24 N.N.Ontika

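Can you explain "but it does not work for Bengali texts" in more detail? What exactly is the problem (you get fewer than 10 words, the wrong 10 words, or something else)?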

@alexander.polomodov Bengali is a language, and the asker cannot get keywords from text written in Bengali.

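@alexander.polomodov For an English sample text such as "This is some text, this is some text, vending machines are great." it gives the output: some, text, machines, vending. But for Bengali text such as "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক" the output page is blank.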

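You should make two simple changes:

Replace the stop words in the $stopWords array with proper Bengali stop words.
Remove the line $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); because Bengali symbols do not match this pattern, so it strips every Bengali character from the text.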
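The two changes above can be sketched as follows. This is only an illustration: the stop word list is a handful of common Bengali function words chosen for the example, not a complete list, and the Unicode-aware pattern is an assumed alternative to simply deleting the filter, not something the answer itself proposes.

```php
<?php
// Sample sentence taken from the question.
$text = 'টিপ বোঝে না, টোপ বোঝে না';

// The ASCII-only filter from the question removes every Bengali byte and the
// comma, so only spaces survive and trim() leaves an empty string.
$asciiOnly = trim(preg_replace('/[^a-zA-Z0-9 -]/', '', $text));

// Assumed alternative: keep letters (\p{L}), combining marks (\p{M}, which
// covers Bengali vowel signs), digits (\p{N}), spaces and dashes.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$unicodeAware = preg_replace('/[^\p{L}\p{M}\p{N} -]+/u', '', $text);

// Illustrative stop words only; a real list would be much longer.
$stopWords = ['এবং', 'ও', 'না', 'এই', 'যে', 'থেকে', 'জন্য'];

// Split on spaces and drop empty items and stop words.
$keywords = array_values(array_filter(
    explode(' ', $unicodeAware),
    function ($word) use ($stopWords) {
        return $word !== '' && !in_array($word, $stopWords, true);
    }
));

print implode(', ', $keywords) . "\n";
```

With the sample sentence, $asciiOnly ends up empty, which is consistent with the blank page described in the comments, while the Unicode-aware filter only removes the comma.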

The full code looks like this:

<?php 

function extractCommonWords($string){ 
    // replace array below with proper Bengali stopwords 
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www'); 

    $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
    $string = trim($string); // trim the string 
    // the preg_replace below is removed because Bengali symbols don't match its pattern 
    // $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too… 
    $string = strtolower($string); // make it lowercase 

    preg_match_all('/\s.*?\s/i', $string, $matchWords); 
    $matchWords = $matchWords[0]; 

    foreach ($matchWords as $key=>$item) { 
     if ($item == '' || in_array(strtolower(trim($item)), $stopWords) || strlen($item) <= 3) { 
      unset($matchWords[$key]); 
     } 
    } 
    $wordCountArr = array(); 
    if (is_array($matchWords)) { 
     foreach ($matchWords as $key => $val) { 
      $val = trim(strtolower($val)); 
      if (isset($wordCountArr[$val])) { 
       $wordCountArr[$val]++; 
      } else { 
       $wordCountArr[$val] = 1; 
      } 
     } 
    } 
    arsort($wordCountArr); 
    $wordCountArr = array_slice($wordCountArr, 0, 10); 
    return $wordCountArr; 
} 

$string = <<<EOF
টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক
EOF;
var_dump(extractCommonWords($string), $string);

The output will be:

array(4) { 
    ["বোঝে"]=> 
    int(2) 
    ["টোপ"]=> 
    int(1) 
    ["না"]=> 
    int(1) 
    ["কেমন"]=> 
    int(1) 
} 
string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক"
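Two details of the output above are worth noting. The pattern /\s.*?\s/ consumes the whitespace on both sides of each match, so two neighbouring words cannot both be matched, which is why টিপ is missing and না is counted once although it occurs three times. In addition, strlen() counts bytes, and each Bengali code point takes three bytes in UTF-8. A minimal Unicode-aware sketch, assuming UTF-8 input and using a hypothetical function name (extractCommonWordsUnicode is not from the answer), could split on non-word characters instead:

```php
<?php
// Hypothetical helper, not part of the original answer: counts words in UTF-8 text.
function extractCommonWordsUnicode(string $text, array $stopWords = [], int $limit = 10): array
{
    // Split on every run of characters that is not a letter, combining mark or digit.
    // /u switches PCRE to UTF-8 mode; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // preg_split fails when the input is not valid UTF-8
    }

    $counts = [];
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue; // skip stop words
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts);                               // most frequent first
    return array_slice($counts, 0, $limit, true);  // keep the words as keys
}

$text = 'টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক';
print_r(extractCommonWordsUnicode($text, ['না']));
```

For the sample sentence with না as the only stop word, this counts বোঝে three times and টিপ twice, and the remaining four words once each. Bengali has no letter case, so the sketch skips lowercasing; for mixed Bengali and Latin text, mb_strtolower() from the mbstring extension could be applied to each word.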

Source

2016-04-24 10:21:48

I tried this answer earlier, but it does not give the right output even though I included header('Content-Type: text/plain;charset=utf-8'); if I encode the output with utf8_encode(string), it gives ? ??

Try the new version of the code. I changed the pattern so that it splits the text into words on whitespace.

But I get array(1) { [""]=> int(2) } string(127) "টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক". I don't know whether there is a configuration problem. How do you get that answer when I don't? :(
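The thread does not resolve this. The 127-byte length reported in the comment is what this sentence occupies in UTF-8, so the input itself may well be intact. A quick check such as the sketch below (describeText is a hypothetical helper, not from the thread) can confirm that before looking elsewhere, for example at byte-oriented functions like strtolower(), whose handling of non-ASCII bytes depended on the locale before PHP 8.2.

```php
<?php
// Hypothetical diagnostic helper, not part of the thread's code.
function describeText(string $text): array
{
    // preg_match_all with /u returns false when the subject is not valid UTF-8.
    $codepoints = preg_match_all('/./us', $text);

    return [
        'bytes'      => strlen($text),                            // raw byte count, as var_dump reports it
        'valid_utf8' => preg_match('//u', $text) === 1,           // /u rejects malformed UTF-8
        'has_bom'    => strncmp($text, "\xEF\xBB\xBF", 3) === 0,  // UTF-8 byte order mark at the start
        'codepoints' => $codepoints === false ? null : $codepoints,
    ];
}

print_r(describeText('টিপ বোঝে না, টোপ বোঝে না টিপ বোঝে না, কেমন বাপু লোক'));
```

If valid_utf8 is true and bytes is 127 for this sentence, the file and the upload are probably not the cause, and the string functions applied afterwards are the next place to look.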
