2010-11-03 93 views

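I have been set the challenge of creating a basic text file search engine in PHP in a very limited time, which is quite a task with little to no previous programming knowledge! How can we return the word count of the result documents in order to calculate TF?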

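This is what we have so far. It manages to return the document (or documents, if more than one has the same count) with the highest number of occurrences of a word.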

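The problem is that the way we have done it does not (at least not easily) let us calculate the TF-IDF score. IDF is done, but to calculate TF we need the total number of words in each returned document, and that is where we are stuck. The other problem is that it only returns the highest-scoring document; we cannot get it to return a list of documents, each with its own score. For example, if one file contains the word "airline" 3 times and two other files contain it once, those two are dropped and only the first is returned...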

(We are also having some trouble stripping symbols, but we are working on that, albeit with a long-winded method...)

Here is what we have:

<?php 
$starttime = microtime(); 
$startarray = explode(" ", $starttime); 
$starttime = $startarray[1] + $startarray[0]; 

if(isset($_GET['search'])) 
{ 
    $searchWord = $_GET['search']; 
} 
else 
{ 
    $searchWord = null; 
} 

?> 
<html> 
<link href="style.css" rel="stylesheet" type="text/css"> 
<body> 
<div id="wrapper"> 
    <div id="searchbar"> 
     <h1>PHP Search</h1> 
     <form name='searchform' id='searchform' action='<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES); ?>' method='get'> 
      <input type='text' name='search' id='search' value='<?php echo htmlspecialchars((string) $searchWord, ENT_QUOTES); ?>' /> 
      <input type='submit' value='Search' /> 
     </form> 
     <br /> 
     <br /> 
    </div><!-- close searchbar --> 
    <?php 


//path to directory to scan 
$directory = "./files/"; 

//get all text files with a .txt extension. 
$files = glob("" . $directory . "*.txt"); 
$fileList = array(); 
//print each file name 
foreach($files as $file) 
{ 
$fileList[] = $file; 
} 
//$fileList; 


     function indexFile($file){ 
      $filename = $file; 
      $fp = fopen($filename, 'r'); 
      $file_contents = fread($fp, filesize($filename)); 
      fclose($fp); 

      $pat[0] = "/^\s+/"; 
      $pat[1] = "/\s{2,}/"; 
      $pat[2] = "/\s+\$/"; 
      $rep[0] = ""; 
      $rep[1] = " "; 
      $rep[2] = ""; 

      $new_contents = preg_replace("/[^A-Za-z0-9\s\s+]/", "", $file_contents); 
      $new_contents = preg_replace($pat, $rep, $new_contents); 

      //COMMON WORDS WERE HERE 
      include "commonwords.php"; 

      $lines = explode("\n", $new_contents); 
      $lines2 = implode(" ", $lines); //string 
      $lines2 = strtolower($lines2); 

      //echo $lines2 . "<br><br>"; 

      $words = explode(" ", $lines2); //array 
      //$words = $lines; 
      $useful_words = array_diff($words, $commonWords); 
      $useful_words = array_values($useful_words); 
      print_r(count($useful_words)); 

      //echo '<pre>'; 
      $index = array_count_values($useful_words); 
      arsort($index, SORT_NUMERIC); 
      //print_r($index); 
      //echo '</pre>'; 

      return $index; 
     } 
     // $file1 = indexFile ('airlines.txt'); //array 
     // $file2 = indexFile ('africa.txt'); //array 

     function merge_common_keys(){ 
      $arr = func_get_args(); 
      $num = func_num_args(); 

      $keys = array(); 
      $i = 0; 
      for($i=0;$i<$num;++$i){ 
       $keys = array_merge($keys, array_keys($arr[$i])); 
      } 
      $keys = array_unique($keys); 

      $merged = array(); 

      foreach($keys as $key){ 
       $merged[$key] = array(); 
       for($i=0;$i<$num;++$i){ 
        $merged[$key][] = isset($arr[$i][$key])?$arr[$i][$key]:null; 
       } 
      } 
      return $merged; 
     } 


    for ($i = 0; $i < count($fileList); $i++) { 
     $fileArray[$i] = indexFile($fileList[$i]); 
    } 

     $merged = call_user_func_array('merge_common_keys',$fileArray); 

     $searchQ = $merged[$searchWord]; 
     echo '<pre>'; 
     print_r($searchQ); 
     echo '</pre>'; 


     //echo "hello2"; 
    $maxValue = 0; 
    $num_docs = 0; 
    $docID = array(); 
    $n = count($searchQ); 
    for ($i=0 ; $i < $n ; $i++) { 
     if ($searchQ[$i] > $maxValue) { 
      $maxValue = $searchQ[$i]; 
      unset($docID); 
      $docID[] = $i; 
      //print_r(count($fileArray[$i])); 
     } 
     else if($searchQ[$i] == $maxValue){ 
      $docID[] = $i; 
     } 
     if (!empty($searchQ[$i])) { 
      $num_docs++; 
     } 
    } 
    print_r($n); 
    print_r($num_docs); 
     print_r($docID); 
     if(is_array($docID)){ 
     for ($i = 0; $i < count($docID); $i++) { 
      if ($maxValue == 1){$plural = '';}else{$plural = 's';} 
      print_r ('<p><b>'.$searchWord . '</b> found in document <a href="'.$fileList[$docID[$i]].'">'.$fileList[$docID[$i]].'</a> '.$maxValue.' time'.$plural.'.</p>'); 
      $TF = $maxValue; 
      $TF2 = 1 + log($TF); // log-scaled term frequency; defined here because it is used below 
      echo "<br>$TF2<br>"; 
      $DF = $num_docs; 
      $Non = $n/$num_docs; 
      //echo "$Non"; 
      $IDF = (float) log10($Non); 
      $TFxIDF = $TF2 * $IDF; 
      //echo "$TFxIDF"; 
     } 
     } 


//1,2 

//file_put_contents("demo2.txt", implode(" ", $useful_words)); 
if(isset($_GET['search'])) 
{ 
    $endtime = microtime(); 
    $endarray = explode(" ", $endtime); 
    $endtime = $endarray[1] + $endarray[0]; 
    $totaltime = $endtime - $starttime; 
    $totaltime = round($totaltime,5); 
    echo "<div id='timetaken'><p>This page loaded in $totaltime seconds.</p></div>"; 
} 
?> 
    </div><!-- close wrapper --> 
</body> 
</html> 
Can you clean up your formatting? – 2010-11-03 15:02:04

Answers

Use str_word_count to count the number of words.
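
A minimal sketch of that suggestion, assuming the document text is already in a string. The sample text and the `$commonWords` list below are made up for illustration (the question loads its own list from `commonwords.php`).

```php
<?php
// Hypothetical sample text and stop-word list, for illustration only.
$text = "The airline said the airline industry is growing";
$commonWords = array('the', 'is', 'said');

// Format 0 (the default) returns the number of words as an integer.
$total = str_word_count($text);

// Format 1 returns the words themselves as an array.
$words = str_word_count(strtolower($text), 1);

// Drop the common words, then count what is left.
$usefulWords = array_values(array_diff($words, $commonWords));
$usefulTotal = count($usefulWords);

echo "$total words in total, $usefulTotal after removing common words\n";
```

By default str_word_count only treats letters, apostrophes and hyphens as word characters, so a token containing digits would be split unless extra characters are passed as the third argument.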

OK, but I can't access the array containing the words ($useful_words), because it is inside a function that returns a different variable... – 2010-11-03 15:53:53

Use `$words = str_word_count($text, 1);`. It will contain all the words in $text. Then you can `array_diff` and `count` as you wish. – netcoder 2010-11-03 16:04:01
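
One way around the "function returns another variable" problem raised in the comments is to have the function return both values in a single array. The sketch below is only an illustration under assumptions: `indexText` and `rankDocuments` are hypothetical names, the documents are made-up strings rather than files, TF is taken as count divided by total useful words, and IDF as log10(N / df) as in the question.

```php
<?php
// Hypothetical helper: returns BOTH the per-word counts and the total number
// of useful words, so the caller can compute TF afterwards.
function indexText($text, array $commonWords)
{
    $words  = str_word_count(strtolower($text), 1);
    $useful = array_values(array_diff($words, $commonWords));

    return array(
        'counts' => array_count_values($useful),
        'total'  => count($useful),
    );
}

// Hypothetical helper: scores every document containing $term, best first.
function rankDocuments(array $docs, $term, array $commonWords)
{
    $indexes = array();
    foreach ($docs as $name => $text) {
        $indexes[$name] = indexText($text, $commonWords);
    }

    // Document frequency: how many documents contain the term at all.
    $df = 0;
    foreach ($indexes as $index) {
        if (isset($index['counts'][$term])) {
            $df++;
        }
    }
    if ($df === 0) {
        return array(); // term not found anywhere
    }

    $idf = log10(count($docs) / $df);

    $scores = array();
    foreach ($indexes as $name => $index) {
        if (isset($index['counts'][$term])) {
            $tf = $index['counts'][$term] / $index['total'];
            $scores[$name] = $tf * $idf;
        }
    }
    arsort($scores, SORT_NUMERIC); // highest score first, keys preserved
    return $scores;
}

// Made-up documents, for illustration only.
$docs = array(
    'a.txt' => 'airline airline airline news',
    'b.txt' => 'airline travel guide today',
    'c.txt' => 'africa safari guide today',
);
$ranked = rankDocuments($docs, 'airline', array('the', 'a', 'of'));
foreach ($ranked as $name => $score) {
    printf("%s: %.4f\n", $name, $score);
}
```

The question's `indexFile` could return the same two-key array after reading the file. Note that with this IDF, a term that appears in every document gets a score of 0 everywhere, so ties would need a secondary sort (for example on the raw count).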