2012-08-02 26 views
5

Can't store document_id

I have a tb_sentence table:

=====================================================================
| id_row | document_id | sentence_id | sentence_content             |
=====================================================================
| 1      | 1           | 0           | Introduction to Data Mining. |
| 2      | 1           | 1           | Describe how data mining.    |
| 3      | 2           | 0           | The boss is right.           |
=====================================================================

I want to tokenize sentence_content, so that the tb_tokens table will contain:

======================================================================
| tokens_id | tokens_word  | tokens_freq | sentence_id | document_id |
======================================================================
| 1         | Introduction | 1           | 0           | 1           |
| 2         | to           | 1           | 0           | 1           |
| 3         | Data         | 1           | 0           | 1           |
| 4         | Mining       | 1           | 0           | 1           |
| 5         | Describe     | 1           | 1           | 1           |
etc...

Here is my code:

$sentence_clean = array(); 
$q1 = mysql_query("SELECT document_id FROM tb_sentence ORDER BY document_id ") or die(mysql_error()); 
while ($row1 = mysql_fetch_array($q1)) { 
    $doc_id[] = $row1['document_id']; 
} 
$q2 = mysql_query('SELECT sentence_content, sentence_id, document_id FROM tb_sentence ') or die(mysql_error()); 
while ($row2 = mysql_fetch_array($q2)) { 
    $sentence_clean[$row2['document_id']][] = $row2['sentence_content']; 
} 
foreach ($sentence_clean as $kal) { 
    if (trim($kal) === '') 
     continue; 
    tokenizing($kal); 
} 

The tokenizing function is:

function tokenizing($sentence) { 
    foreach ($sentence as $sentence_id => $sentences) { 
     $symbol = array(".", ",", "\\", "-", "\"", "(", ")", "<", ">", "?", ";", ":", "+", "%", "\r", "\t", "\0", "\x0B"); 
     $spasi = array("\n", "/", "\r"); 
     $replace = str_replace($spasi, " ", $sentences); 
     $cleanSymbol = str_replace($symbol, "", $replace); 
     $quote = str_replace("'", "\'", $cleanSymbol); 
     $element = explode(" ", trim($quote)); 
     $elementNCount = array_count_values($element); 

     foreach ($elementNCount as $word => $freq) { 
      if (ereg("([a-z,A-Z])", $word)) { 
       $query = mysql_query(" INSERT INTO tb_tokens VALUES ('','$word','$freq','$sentence_id', '$doc_id')"); 
      } 
     } 
    } 
} 

The problem is that document_id can't be read, so it can't be inserted into the tb_tokens table. How do I get hold of the document_id? Thanks :)

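Edited question: every word (the result of tokenizing) has a document_id and a sentence_id. My problem is that I can't get the document_id. How can I get both the sentence_id and the document_id for each word?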

+0

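Nice job presenting the question. Except... "The problem is that document_id can't be read and can't be inserted into the tb_tokens table" - can you be more precise? What goes wrong? – Smandoli 2012-08-02 12:25:28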

+0

@Smandoli Sorry if my English is bad. Each 'sentence_content' has a 'document_id'. I need to insert the document_id along with the tokenized words, but I can't read the document_id. – bruine 2012-08-02 12:32:56

+0

There is no '$row['document_id']' because you didn't include 'document_id' in the select list of your second query. – 2012-08-02 12:36:03

Answers

1

I don't think you need this code:

$q1 = mysql_query("SELECT document_id FROM tb_sentence ORDER BY document_id ") or die(mysql_error()); 
while ($row1 = mysql_fetch_array($q1)) { 
    $doc_id[] = $row1['document_id']; 
} 

The array $doc_id is never used.

if (trim($kal) === '') 
     continue; 

$kal is an array, so there is no need to trim it.

$sentence_clean[$row2['document_id']][] = $row2['sentence_content']; 

Because you want to record the sentence_id, the key should be $row2['sentence_id'], not [].

(Of course, you should make sure the same document_id never has the same sentence_id twice; otherwise you should concatenate the contents.)
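As a rough illustration of that keying, here is a minimal sketch that groups rows by document_id and then sentence_id, concatenating the contents when a pair repeats. The helper name group_sentences and the hard-coded $rows (including the extra 'Always.' row) are assumptions made up for this example; in the real script the rows would come from the tb_sentence query.

```php
<?php
// Hypothetical rows, shaped like the result of
// SELECT sentence_content, sentence_id, document_id FROM tb_sentence
$rows = array(
    array('document_id' => 1, 'sentence_id' => 0, 'sentence_content' => 'Introduction to Data Mining.'),
    array('document_id' => 1, 'sentence_id' => 1, 'sentence_content' => 'Describe how data mining.'),
    array('document_id' => 2, 'sentence_id' => 0, 'sentence_content' => 'The boss is right.'),
    // Made-up duplicate (document_id, sentence_id) pair, only to show concatenation
    array('document_id' => 2, 'sentence_id' => 0, 'sentence_content' => 'Always.'),
);

// Hypothetical helper: builds $grouped[document_id][sentence_id] = content
function group_sentences(array $rows) {
    $grouped = array();
    foreach ($rows as $row) {
        $doc = $row['document_id'];
        $sid = $row['sentence_id'];
        if (isset($grouped[$doc][$sid])) {
            // Same pair seen before: append instead of overwriting
            $grouped[$doc][$sid] .= ' ' . $row['sentence_content'];
        } else {
            $grouped[$doc][$sid] = $row['sentence_content'];
        }
    }
    return $grouped;
}

$sentence_clean = group_sentences($rows);
```

Without the isset() branch, a repeated pair would silently overwrite the earlier sentence.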

Here are my corrections:

$sentence_clean = array(); 
$q2 = mysql_query('SELECT sentence_content, sentence_id, document_id FROM tb_sentence ') or die(mysql_error()); 
while ($row2 = mysql_fetch_array($q2)) { 
    $sentence_clean[$row2['document_id']][$row2['sentence_id']] = $row2['sentence_content']; 
} 

foreach ($sentence_clean as $doc_id => $kal) { 
    tokenizing($kal, $doc_id); 
} 

function tokenizing($sentence, $doc_id) { 
    foreach ($sentence as $sentence_id => $sentences) { 
     $symbol = array(".", ",", "\\", "-", "\"", "(", ")", "<", ">", "?", ";", ":", "+", "%", "\r", "\t", "\0", "\x0B"); 
     $spasi = array("\n", "/", "\r"); 
     $replace = str_replace($spasi, " ", $sentences); 
     $cleanSymbol = str_replace($symbol, "", $replace); 
     $quote = str_replace("'", "\'", $cleanSymbol); 
     $element = explode(" ", trim($quote)); 
     $elementNCount = array_count_values($element); 

     foreach ($elementNCount as $word => $freq) { 
      if (ereg("([a-z,A-Z])", $word)) { 
       $query = mysql_query(" INSERT INTO tb_tokens VALUES ('','$word','$freq','$sentence_id', '$doc_id')"); 
      } 
     } 
    } 
} 

I pass the document_id into the function as a second argument.
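For anyone running this on a current PHP: ereg() and the mysql_* functions were removed in PHP 7, so the code above only works on older versions. Below is a minimal sketch of the same idea (the document_id travels into the tokenizer as an argument) using preg_split and preg_match. The function name tokenize_document is an assumption for this example, the splitting rule is a simplification of the original symbol lists (it splits on hyphens instead of deleting them), and the function returns rows instead of inserting them, so the database step is left out; a prepared statement via mysqli or PDO would be the usual way to insert each row.

```php
<?php
// Hypothetical helper: returns one row per distinct word in each sentence.
// $sentences maps sentence_id => sentence_content for a single document.
function tokenize_document(array $sentences, $doc_id) {
    $rows = array();
    foreach ($sentences as $sentence_id => $content) {
        // Split on anything that is not a letter, digit or apostrophe
        // (a simplification of the original $symbol / $spasi lists).
        $words = preg_split("/[^A-Za-z0-9']+/", $content, -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_count_values($words) as $word => $freq) {
            // Keep only tokens that contain a letter; preg_match replaces ereg.
            if (preg_match('/[A-Za-z]/', (string) $word)) {
                $rows[] = array(
                    'tokens_word' => (string) $word,
                    'tokens_freq' => $freq,
                    'sentence_id' => $sentence_id,
                    'document_id' => $doc_id,
                );
            }
        }
    }
    return $rows;
}

// Sample data from the question, keyed by document_id then sentence_id
$sentence_clean = array(
    1 => array(0 => 'Introduction to Data Mining.', 1 => 'Describe how data mining.'),
    2 => array(0 => 'The boss is right.'),
);

$tokens = array();
foreach ($sentence_clean as $doc_id => $sentences) {
    $tokens = array_merge($tokens, tokenize_document($sentences, $doc_id));
}
```

Each element of $tokens then carries its own sentence_id and document_id, which is what tb_tokens needs. Counting is case-sensitive here, as in the original, so "Data" and "data" are separate tokens.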

+0

Oh, silly me... yes, you're right! Awesome! Thank you so much @ivantedja. I learned a lot from you :) – bruine 2012-08-02 22:48:21