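How do I return the word count of the result documents to calculate TF?
I've been set the challenge of creating a basic text-file search engine in PHP in a very limited time, which is quite a task with almost no previous programming knowledge!
Here's what we have so far. It manages to return the document(s) (more than one if several have the same count) with the highest number of occurrences of a word.
The problem is that the way we've done it doesn't (at least not easily) let us calculate a TF-IDF score. The IDF is done, but to calculate the TF we need the total number of words in each returned document, and that is where we're stuck. The other problem is that it only returns the top document(s); we can't get it to return a list of files, each with its own score. For example, if one file contains the word "airline" 3 times and two other files contain it once each, those two are dropped and only the first is returned.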
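To make the goal concrete, here is a minimal sketch of scoring every matching document rather than only the top one. It assumes each document's index has the shape that indexFile() returns (word => count), so that array_sum() of an index gives the number of indexed words (after stop-word removal) in that document. The function name rankByTfIdf and the sample file names are hypothetical, not part of our script.

```php
<?php
// Hypothetical helper: return a TF-IDF score for every document containing $term.
// $indexes maps a document name to its word => count index.
function rankByTfIdf(array $indexes, $term)
{
    $totalDocs = count($indexes);

    // Keep only the documents that actually contain the term.
    $matching = array();
    foreach ($indexes as $doc => $index) {
        if (isset($index[$term])) {
            $matching[$doc] = $index;
        }
    }
    if (count($matching) === 0) {
        return array();
    }

    // IDF in the same form as the script: log10(N / df).
    $idf = log10($totalDocs / count($matching));

    $scores = array();
    foreach ($matching as $doc => $index) {
        // TF = occurrences of the term / total indexed words in the document.
        $tf = $index[$term] / array_sum($index);
        $scores[$doc] = $tf * $idf;
    }

    // Highest score first; arsort() keeps the document names as keys.
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

// Made-up sample data, for illustration only.
$indexes = array(
    'a.txt' => array('airline' => 3, 'pilot' => 1),
    'b.txt' => array('airline' => 1, 'africa' => 3),
    'c.txt' => array('africa' => 2, 'safari' => 2),
);
print_r(rankByTfIdf($indexes, 'airline'));
```

With this sample data both a.txt and b.txt are returned, a.txt first. Note that when every document contains the term, log10(N / df) is 0 and all scores collapse to 0, which is a property of this IDF form rather than of the sketch.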
(There are also some problems with stripping symbols, but we're working on that, albeit with a fairly crude method...)
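For the symbol stripping, one possible approach is sketched below, under the assumption that only ASCII letters and digits matter. The function name tokenize is hypothetical. Unlike our current code, it replaces symbols with a space instead of deleting them, so "air-line" becomes two words rather than "airline"; which behaviour is better depends on the documents.

```php
<?php
// Hypothetical helper: lower-case the text, turn every character that is not
// a letter, digit or whitespace into a space, then split on whitespace runs.
function tokenize($text)
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9\s]/', ' ', $text);
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
    // whitespace would otherwise produce.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenize("Hello,   World!\n42"));
```

Because the split is on any run of whitespace, tabs and Windows line endings need no separate handling.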
Here's what we have:
<?php
// Record the start time (float seconds) so the page can report how long the search took.
$starttime = microtime(true);
// Read the search term from the query string (null when no search was submitted).
$searchWord = isset($_GET['search']) ? $_GET['search'] : null;
?>
<html>
<link href="style.css" rel="stylesheet" type="text/css">
<body>
<div id="wrapper">
<div id="searchbar">
<h1>PHP Search</h1>
<form name='searchform' id='searchform' action='<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES); ?>' method='get'>
<input type='text' name='search' id='search' value='<?php echo htmlspecialchars((string) $searchWord, ENT_QUOTES); ?>' />
<input type='submit' value='Search' />
</form>
<br />
<br />
</div><!-- close searchbar -->
<?php
// Path to the directory to scan.
$directory = "./files/";
// Collect every file in that directory with a .txt extension.
$fileList = glob($directory . "*.txt");
if ($fileList === false) {
    // glob() returns false on error; treat that as "no files".
    $fileList = array();
}
// Build a word => occurrence-count index for one text file, most frequent word first.
function indexFile($file)
{
    $file_contents = file_get_contents($file);
    // Lower-case, then strip everything except letters, digits and whitespace.
    $new_contents = strtolower($file_contents);
    $new_contents = preg_replace('/[^a-z0-9\s]/', '', $new_contents);
    // Split on any run of whitespace; this also discards empty strings.
    $words = preg_split('/\s+/', $new_contents, -1, PREG_SPLIT_NO_EMPTY);
    // commonwords.php is expected to define $commonWords (the stop-word list).
    include "commonwords.php";
    $useful_words = array_values(array_diff($words, $commonWords));
    // Debug: count($useful_words) is the number of indexed words in this file.
    //print_r(count($useful_words));
    $index = array_count_values($useful_words);
    arsort($index, SORT_NUMERIC);
    return $index;
}
// $file1 = indexFile ('airlines.txt'); //array
// $file2 = indexFile ('africa.txt'); //array
// Merge several word => count indexes into one array keyed by word.
// $merged[$word][$i] is the count of $word in the i-th index, or null if absent.
function merge_common_keys()
{
    $arr = func_get_args();
    $num = func_num_args();
    $keys = array();
    for ($i = 0; $i < $num; ++$i) {
        $keys = array_merge($keys, array_keys($arr[$i]));
    }
    $keys = array_unique($keys);
    $merged = array();
    foreach ($keys as $key) {
        $merged[$key] = array();
        for ($i = 0; $i < $num; ++$i) {
            $merged[$key][] = isset($arr[$i][$key]) ? $arr[$i][$key] : null;
        }
    }
    return $merged;
}
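// Index every file; $fileArray[$i] is the index for $fileList[$i].
$fileArray = array();
foreach ($fileList as $i => $file) {
    $fileArray[$i] = indexFile($file);
}
$merged = count($fileArray) > 0 ? call_user_func_array('merge_common_keys', $fileArray) : array();
// Per-document counts for the search word (empty if the word is not indexed).
$searchQ = (is_string($searchWord) && isset($merged[$searchWord])) ? $merged[$searchWord] : array();
echo '<pre>';
print_r($searchQ);
echo '</pre>';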
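// Find the highest count and the document(s) that reach it,
// and count how many documents contain the word at all.
$maxValue = 0;
$num_docs = 0;
$docID = array();
$n = is_array($searchQ) ? count($searchQ) : 0;
for ($i = 0; $i < $n; $i++) {
    if ($searchQ[$i] > $maxValue) {
        // New maximum: forget the previous leaders.
        $maxValue = $searchQ[$i];
        $docID = array($i);
    } else if ($maxValue > 0 && $searchQ[$i] == $maxValue) {
        // Tie with the current maximum.
        $docID[] = $i;
    }
    if (!empty($searchQ[$i])) {
        $num_docs++;
    }
}
print_r($n);
print_r($num_docs);
print_r($docID);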
if ($num_docs > 0) {
    // IDF = log10(N / df): N documents searched, df of them contain the word.
    $DF = $num_docs;
    $IDF = (float) log10($n / $DF);
    foreach ($docID as $id) {
        $plural = ($maxValue == 1) ? '' : 's';
        echo '<p><b>' . htmlspecialchars($searchWord) . '</b> found in document <a href="' . $fileList[$id] . '">' . $fileList[$id] . '</a> ' . $maxValue . ' time' . $plural . '.</p>';
        // Raw count for now; a length-normalised TF still needs this document's total word count.
        $TF = $maxValue;
        // Log-scaled TF, restored so that $TF2 is defined before it is used below.
        $TF2 = 1 + log($TF);
        echo "<br>$TF2<br>";
        $TFxIDF = $TF2 * $IDF;
        //echo "$TFxIDF";
    }
}
//1,2
//file_put_contents("demo2.txt", implode(" ", $useful_words));
if (isset($_GET['search'])) {
    // Elapsed wall-clock time since $starttime, in seconds.
    $totaltime = round(microtime(true) - $starttime, 5);
    echo "<div id='timetaken'><p>This page loaded in $totaltime seconds.</p></div>";
}
?>
</div><!-- close wrapper -->
</body>
</html>
Can you clean up your formatting? – 2010-11-03 15:02:04