PHP - slow text preprocessing for text mining. I am doing text preprocessing for text mining on a large database. I want to build a corpus from all the articles in the database into an array, but it takes a very long time to process.
$multiMem = memory_get_usage();
$xstart = microtime(TRUE);
$word = "";
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
$word = $word." ".$data['article']; // join with a space, or the last/first words of adjacent articles merge
}
$preprocess = new preprocess($word); // note: the PHP4-style constructor already runs preprocess() once here
$word = $preprocess->preprocess($word);
print_r($word); // was print_r($kata); $kata is never defined
$xfinish = microtime(TRUE);
Here is my preprocess class:
class preprocess {
var $teks;
function preprocess($teks){
/*start process segmentation*/
$teks = trim($teks);
// remove HTML tags first, so punctuation stripping cannot mangle them;
// the original also ran a <tag>...</tag> regex after strip_tags, but by
// then no tags remain, so that dead pass has been dropped
$teks = strip_tags($teks);
// remove punctuation in a single pass instead of eleven str_replace calls
$teks = str_replace(array("'", "-", ")", "(", "=", ".", ",", ":", ";", "!", "?"), "", $teks);
/*end process segmentation*/
/*start case folding*/
$teks = strtolower($teks);
$teks = preg_replace('/[0-9]+/', '', $teks);
/*end case folding*/
/*start of tokenizing*/
$teks = explode(" ", $teks);
/*end of tokenizing*/
/*start of filtering*/
//stopword
$file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH);
$stopword = array_filter(array_map('trim', explode("\n", $file))); // trim CR/LF, drop blank lines
//remove stopwords (preg_replace maps over the token array element by element)
$teks = preg_replace('/\b('.implode('|', $stopword).')\b/', '', $teks);
/*end of filtering*/
/*start of stemming*/
require_once('stemming.php');
foreach($teks as $t => $value){
$teks[$t] = stemming($value);
}
/*end of stemming*/
$teks = array_filter($teks);
$teks = array_values($teks);
return $teks;
}
}
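As a possible speed-up, the punctuation pass and the stopword pass can each be collapsed. A minimal sketch (the class and method names are illustrative; it assumes the same stopword.txt file and the stemming() function from stemming.php as in the original):

```php
// Sketch only: one regex pass for punctuation/digits, and stopwords loaded
// once into a hash set (array_flip) so filtering is an isset() lookup per
// token instead of one giant alternation regex over every token.
class FastPreprocess {
    private $stopwords;

    public function __construct($stopwordFile = 'stopword.txt') {
        $lines = file($stopwordFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $this->stopwords = array_flip(array_map('trim', $lines)); // hash set
    }

    public function run($teks) {
        $teks = strip_tags(trim($teks));                    // drop HTML tags
        $teks = strtolower($teks);                          // case folding
        $teks = preg_replace('/[0-9\'\-()=.,:;!?]+/', ' ', $teks); // punctuation + digits
        $tokens = preg_split('/\s+/', $teks, -1, PREG_SPLIT_NO_EMPTY); // tokenize
        $out = array();
        foreach ($tokens as $t) {
            if (!isset($this->stopwords[$t])) {             // stopword filter
                $out[] = stemming($t);                      // stemming.php, as before
            }
        }
        return $out;
    }
}
```

Construct it once and call run() per text; the stopword file is then read a single time instead of on every call.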
Does anyone have any ideas on how to make my program run faster? Please help.
Thanks in advance.
First compare the processing times of the 'PHP' code and the 'mysql' query, then optimize whichever takes long. –
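That advice can be sketched as below, timing the fetch and the preprocessing separately (it mirrors the question's code, including the mysql_* API, and only reads the one column it needs):

```php
$t0 = microtime(TRUE);
$word = '';
$sql = mysql_query("SELECT article FROM tbl_content"); // fetch only the needed column
while ($data = mysql_fetch_assoc($sql)) {
    $word .= ' ' . $data['article'];
}
$t1 = microtime(TRUE); // query + fetch time so far

$preprocess = new preprocess($word);
$word = $preprocess->preprocess($word);
$t2 = microtime(TRUE); // preprocessing time so far

echo 'fetch: ' . ($t1 - $t0) . " s, preprocess: " . ($t2 - $t1) . " s";
```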
The whole preprocessing takes '659.52643299103' measured with microtime, and the returned array has length '2210'. –
Probably excessive memory usage. I would do it inside the while loop for each article, and avoid the repeated 'file_get_contents' etc. – Deadooshka
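Deadooshka's suggestion might look like this sketch. It assumes the class is reworked so stopword.txt and stemming.php are loaded once (e.g. in the constructor) rather than on every call; with that in place, peak memory stays at one article instead of the whole table:

```php
$tokens = array();
$pre = new preprocess('');                 // assumed: constructor loads stopwords once
$sql = mysql_query("SELECT article FROM tbl_content");
while ($data = mysql_fetch_assoc($sql)) {
    // preprocess one article at a time and merge the token arrays as we go,
    // instead of first concatenating every article into one huge string
    $tokens = array_merge($tokens, $pre->preprocess($data['article']));
}
print_r($tokens);
```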