2012-09-20 25 views

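Cleaning up a word array

I have a function that strips out the HTML, puts the words into an array, and then runs array_count_values on it. I'm trying to report the number of occurrences of each word. The array output is very messy. I've tried to clean it up and I'm getting nowhere. I want to remove phone numbers, and for some reason phrases are being pushed together. The first array element also appears to be empty, but neither isset() nor empty() seems to unset it.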

$body = $this->get_response($domain);
$body = preg_replace('/<body(.*?)>/i', '<body>', $body);
$body = preg_replace('#</body>#i', '</body>', $body);

$openTag = '<body>';
$start = strpos($body, $openTag);
$start += strlen($openTag);

$closeTag = '</body>';
$end = strpos($body, $closeTag);

// Return if cannot cut-out the body
if ($end <= $start || $start === false || $end === false) {
    $this->setValue('');
    return;
}

$body = substr($body, $start, $end - $start);
$body = preg_replace(array(
    '@<script[^>]*?>.*?</script>@si',       // Strip out javascript
    '@<style[^>]*?>.*?</style>@siU',        // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@',            // Strip multi-line comments including CDATA
    '/style=([\"\']??)([^\">]*?)\\1/siU',   // Strip inline style attribute
), '', $body);

$body = strip_tags($body);
$body = array_filter(explode(' ', $body), create_function('$str', 'return strlen($str) > 2;'));
$body = array_map('trim', $body);
$words = $body;

$i = 0;

$words = array_count_values($words);

foreach ($words as $word) {

    if (empty($word)) unset($words[$i]);
    $i++;

}

echo "<pre>";
print_r($words);
echo "</pre>";

Output

Array 
(
    [] => 28 
    [333.444.5555] => 1 
    [facebook] => 2 
    [twitter] => 2 
    [linkedin] => 2 
    [youtube 

       googleplus] => 1 
    [About 

    History 
    Our] => 1 
    [Mission 
    Who] => 1 
    [This 
    That 
    Other] => 1 
    [Us 


English 

    FA 
    Football] => 1 
    [Media 
    Pay] => 2 
    [Per] => 4 
    [Think 
    Fast] => 2 
    [Marketing 
    Design] => 1 
    [Consulting 


Case] => 2 

Answers

1

I'm afraid explode(' ', $body) is not enough, because the space is not the only whitespace character. Try preg_split instead:

$body = array_filter(preg_split('/\s+/', $body), 
      create_function('$str', 'return strlen($str) > 2;')); 
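
To make the difference concrete, here is a minimal sketch of the same idea. Several things in it are assumptions rather than part of the original code: the sample `$body` string is made up, the phone-number pattern only covers the `333.444.5555` style seen in the output, and a closure is swapped in for `create_function` (which is deprecated as of PHP 7.2 and removed in PHP 8.0). The `PREG_SPLIT_NO_EMPTY` flag is also an addition, used so that no empty tokens reach the counting step.

```php
<?php
// Hypothetical sample text standing in for the stripped page body.
$body = "333.444.5555 facebook twitter\nyoutube\n\n   googleplus About\n\tHistory\n Our facebook";

// explode(' ') splits only on the space character, so newlines and tabs
// stay inside the tokens and later end up inside the array keys.
$naive = explode(' ', $body);

// preg_split on \s+ splits on any run of whitespace; PREG_SPLIT_NO_EMPTY
// drops the empty strings produced by leading or trailing whitespace.
$tokens = preg_split('/\s+/', $body, -1, PREG_SPLIT_NO_EMPTY);

// Keep tokens longer than two characters and skip phone-like tokens.
// The phone pattern is an assumption (three digit groups separated by
// dots or dashes); adjust it to the formats you expect.
$words = array_filter($tokens, function ($token) {
    return strlen($token) > 2
        && !preg_match('/^\d{3}[.\-]\d{3}[.\-]\d{4}$/', $token);
});

// Count occurrences; the result is keyed by word, not by numeric index.
$counts = array_count_values($words);

print_r($counts);
```

Filtering before counting also removes the need to unset entries afterwards. array_count_values keys its result by the word itself, so a loop that unsets `$words[$i]` with a numeric counter never touches those entries. In the question's code, trimming after the length filter also lets whitespace-only tokens through, which would explain the empty key.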

That did it. Awesome. Thanks! – madphp