2017-04-03

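PHP - text preprocessing for text mining is slow

I am doing text preprocessing for text mining on a large database. I want to build a corpus array from all the articles in the database, but the processing takes a long time.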

$multiMem = memory_get_usage(); 
$xstart = microtime(TRUE); 
$word = ""; 
$sql = mysql_query("SELECT * FROM tbl_content"); 
while($data = mysql_fetch_assoc($sql)){ 
    $word = $word."".$data['article']; 
} 

$preprocess = new preprocess($word); 
$word= $preprocess->preprocess($word); 
print_r($word); 

$xfinish = microtime(TRUE); 

Here is my preprocess class:

class preprocess { 

    var $teks; 

    function preprocess($teks){ 
    /*start process segmentation*/ 
    $teks = trim($teks); 

    //remove punctuation 
    $teks = str_replace("'", "", $teks); 
    $teks = str_replace("-", "", $teks); 
    $teks = str_replace(")", "", $teks); 
    $teks = str_replace("(", "", $teks); 
    $teks = str_replace("=", "", $teks); 
    $teks = str_replace(".", "", $teks); 
    $teks = str_replace(",", "", $teks); 
    $teks = str_replace(":", "", $teks); 
    $teks = str_replace(";", "", $teks); 
    $teks = str_replace("!", "", $teks); 
    $teks = str_replace("?", "", $teks); 

    //remove HTML tags 
    $teks = strip_tags($teks); 
    $teks = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $teks); 
    /*end process segmentation*/ 

    /*start case folding*/ 
    $teks = strtolower($teks); 

    $teks = preg_replace('/[0-9]+/', '', $teks); 
    /*end case folding*/ 

    /*start of tokenizing*/ 
    $teks = explode(" ", $teks); 

    /*end of tokenizing*/ 

    /*start of filtering*/ 
    //stopword 
    $file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH); 
    $stopword = explode("\n", $file); 

    //remove stopword 
    $teks = preg_replace('/\b('.implode('|',$stopword).')\b/','',$teks); 

    /*end of filtering*/ 

    /*start of stemming*/ 
    require_once('stemming.php'); 
    foreach($teks as $t => $value){ 
    $teks[$t] = stemming($value); 
    } 
    /*end of stemming*/ 

    $teks = array_filter($teks); 
    $teks = array_values($teks); 

    return $teks; 
} 
} 

Does anyone have an idea how to make my program run faster? Please help.
Thanks in advance.

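First compare the processing time of the `PHP` code and the `mysql` query, then optimize whichever takes longer. –

For the whole preprocessing, microtime reports `659.52643299103`, and the array returned by the class has a length of `2210`. –

Probably excessive memory usage. I would do this for each article inside the while loop, and avoid repeating `file_get_contents` and so on. – Deadooshka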
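
One way to follow the first comment's advice is to time the database fetch and each preprocessing stage separately instead of timing the whole script. A minimal sketch, assuming a hypothetical helper `timeStage()` (not part of the original code) and stand-in closures in place of the real query and preprocessing:

```php
<?php
// Hypothetical helper: runs $stage, stores its wall-clock duration in
// seconds under $label in $timings, and returns the stage's result.
function timeStage(string $label, callable $stage, array &$timings)
{
    $start = microtime(true);
    $result = $stage();
    $timings[$label] = microtime(true) - $start;
    return $result;
}

$timings = [];

// Stand-in for the database fetch; the real code would run the query here.
$text = timeStage('fetch', function () {
    return str_repeat('Some Article text, with punctuation! ', 1000);
}, $timings);

// Stand-in for one preprocessing stage (case folding and tokenizing).
$tokens = timeStage('tokenize', function () use ($text) {
    return preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}, $timings);

// Report the slowest stage first.
arsort($timings);
foreach ($timings as $label => $seconds) {
    printf("%-10s %.4f s\n", $label, $seconds);
}
```

Wrapping the stopword filtering and the stemming loop in the same way would show which stage accounts for most of the run time.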

Answers

1

There are a couple of things that could be improved...

  1. After building `$word`, you can free the query result `$sql` and `$data`:

    $word = ''; 
    $sql = mysql_query("SELECT * FROM tbl_content"); 
    while($data = mysql_fetch_assoc($sql)){ 
        $word = $word . $data['article']; 
    } 
    mysql_free_result($sql); 
    unset($sql, $data); 
    
  2. This block:

    $teks = str_replace("'", "", $teks); 
    $teks = str_replace("-", "", $teks); 
    $teks = str_replace(")", "", $teks); 
    $teks = str_replace("(", "", $teks); 
    $teks = str_replace("=", "", $teks); 
    $teks = str_replace(".", "", $teks); 
    $teks = str_replace(",", "", $teks); 
    $teks = str_replace(":", "", $teks); 
    $teks = str_replace(";", "", $teks); 
    $teks = str_replace("!", "", $teks); 
    $teks = str_replace("?", "", $teks); 
    

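can be written as:

    $teks = str_replace(array("'",'-',')','(','=','.',',',':',';','!','?'), '', $teks); 

  • Since you strip the digits with a regular expression later in the code, you can either add the digits to the `str_replace` call above or add the punctuation characters to the `preg_replace`: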

    $teks = str_replace(array('0','1','2','3','4','5','6','7','8','9',"'",'-',')','(','=','.',',',':',';','!','?'), '', $teks); 

    or

    $teks = preg_replace('/[0-9\'()=.,:;!?-]+/', '', $teks); 
    
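  • `$teks = strip_tags($teks);` should be enough on its own. If it is not, use only the `preg_replace` that follows it; running both is redundant, because after `strip_tags` there are no tags left for the pattern to match.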

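  • Use `file` instead of `file_get_contents` followed by `explode`, since `file` returns an array directly. Also, there is no need to explode `$teks` before removing the stopwords, because `preg_replace` accepts a string as well as an array: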

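    // FILE_IGNORE_NEW_LINES drops the trailing newline that would otherwise stop every pattern from matching
    $stopword = file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); 
    array_walk($stopword, function(&$item1){ 
        // escape regex metacharacters in case a stopword contains any
        $item1 = '/\b' . preg_quote(trim($item1), '/') . '\b/'; 
    }); 
    $teks = preg_replace($stopword, '', $teks); 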
    
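  • In general, avoid double-quoted strings (`""`) where no interpolation is needed, since the parser tries to evaluate their contents, which takes slightly longer.

  • If the stopword.txt list does not change, store it directly in the code as an array rather than reading it from the file system.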
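
The suggestions above can be combined into one cleaning pass followed by a hash lookup for the stopwords, which avoids running one regular expression per stopword over the text. This is only a sketch: `cleanAndFilter()` is a hypothetical name, the inline stopword list stands in for stopword.txt, and tags are stripped before punctuation, which differs from the order in the question.

```php
<?php
// Hypothetical helper: strips tags, punctuation and digits, lowercases,
// tokenizes on whitespace, and drops stopwords via an isset() lookup.
function cleanAndFilter(string $text, array $stopwordSet): array
{
    $text = strip_tags($text);
    // Same characters as the str_replace() block in the question, plus digits.
    $text = preg_replace('/[0-9\'()=.,:;!?-]+/', '', $text);
    $text = strtolower($text);

    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $kept = [];
    foreach ($tokens as $token) {
        // Array key lookup instead of a regex match per stopword.
        if (!isset($stopwordSet[$token])) {
            $kept[] = $token;
        }
    }
    return $kept;
}

// Stand-in for the contents of stopword.txt; array_flip() turns the words
// into keys so membership can be tested with isset().
$stopwordSet = array_flip(['the', 'is', 'a', 'of']);

$tokens = cleanAndFilter('<p>The price of a Book is 25 dollars!</p>', $stopwordSet);
print_r($tokens); // price, book, dollars
```

Building `$stopwordSet` once and calling the function per article inside the fetch loop would also keep memory use lower than concatenating every article into one string first.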

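I changed my code following your suggestions, but it still takes a long time to respond. Please check and help me: [link](https://pastebin.com/Q7XsrXqM) –

I also changed **stopword.txt** into a single array. –

@RizaldySetiawanH Also show us stemming.php ;) – bluehipy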
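
If the remaining time turns out to be spent in the stemmer (an assumption, since stemming.php is not shown), one option is to cache results so that each distinct word is stemmed only once. In the sketch below `stemTokens()` and `$toyStem` are hypothetical names, and the toy stemmer only stands in for the real `stemming()` function:

```php
<?php
// Toy stand-in for the real stemming() function, which is not shown in the
// question: it strips one trailing "s" and counts how often it is called.
$stemCalls = 0;
$toyStem = function (string $word) use (&$stemCalls): string {
    $stemCalls++;
    return (strlen($word) > 1 && substr($word, -1) === 's')
        ? substr($word, 0, -1)
        : $word;
};

// Hypothetical helper: stems every token, consulting $cache first so the
// stemmer runs at most once per distinct word.
function stemTokens(array $tokens, callable $stem, array &$cache): array
{
    $stems = [];
    foreach ($tokens as $token) {
        if (!isset($cache[$token])) {
            $cache[$token] = $stem($token);
        }
        $stems[] = $cache[$token];
    }
    return $stems;
}

$cache = [];
$stems = stemTokens(['cats', 'dog', 'cats', 'dogs', 'cats'], $toyStem, $cache);

print_r($stems);
printf("tokens: %d, stemmer calls: %d\n", count($stems), $stemCalls);
```

Passing the same `$cache` array to every call lets the savings accumulate across articles, at the cost of holding one entry per distinct word in memory.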