2012-10-08 69 views
0

I process some text files with the code shown below, and the result of implode is not what I expect.

Code:

$file = file($files);
$lines = str_replace("'", '', $file);
$noMultipleSpace = removeMultipleSpaces($lines);
$fileContents = array();
foreach($noMultipleSpace as $line) {
    if (isLatin($line) && count(preg_split('/\s+/', $line)) > 25) {
        $newContent = preg_split('/\\.\\s*/', $line);
        foreach($newContent as $newsContent) {
            $pos1 = stripos($newsContent, ':');
            if ($pos1 == false && count(preg_split('/\s+/', $newsContent) > 3) && isLatin($newsContent)) {
                $fileContents[] = $newsContent;
            }
        }
        $content = implode('.', $fileContents);
    }
}

With the functions:

function isLatin($string) {
    return preg_match('/^\\s*[a-z,A-Z]/', $string) > 0;
}

function removeMultipleSpaces($string) {
    return preg_replace('/\s+/', ' ', $string);
}

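However, after the implode, the period ends up attached to the next sentence, for example: sentence1 .Sentence2. What I expect is: sentence1. Sentence2. What is wrong? Thanks :)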

The input is a text file, for example:

ChengXiang Zhai 
Department of Computer Science University of Illinois at Urbana Champaign 

ABSTRACT 
Temporal Text Mining (TTM) is concerned with discovering temporal patterns in text 
information collected over time. Since most text information bears some time stamps, TTM has many applications in multiple domains, such as summarizing events in news articles and 
revealing research trends in scientific literature. In this paper, we study a particular TTM 
task ­ discovering and summarizing the evolutionary patterns of themes in a text stream. We 
define this new text mining problem and present general probabilistic methods for solving 
this problem through (1) discovering latent themes from text; (2) constructing an evolution 
graph of themes; and (3) analyzing life cycles of themes. Evaluation of the proposed methods 
on two different domains (i.e., news articles and literature) shows that the proposed 
methods can discover interesting evolutionary theme patterns effectively. Categories and 
Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering General Terms: 
Algorithms Keywords: Temporal text mining, evolutionary theme patterns, theme threads, 
clustering 

1. 

INTRODUCTION 

I only want to get the important sentences, from "Temporal Text Mining (TTM)..." up to "effectively".

+1

What exactly are you trying to achieve? Please also provide example input. – clentfort

+0

@clentfort I have added it, thanks :) – bruine

Answers

2

The sentences in the middle seem to have a trailing space, which ends up directly in front of the implode separator.
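To see where such a trailing space can come from, here is a minimal sketch. It assumes the cause is the newline that file() keeps at the end of each line; the input string is made up for illustration and is not taken from the asker's file.

```php
<?php
// Hypothetical line as file() would return it: it still ends with "\n".
$line = "One sentence here. Another sentence that runs on\n";

// Same step as removeMultipleSpaces(): the newline turns into a single space.
$line = preg_replace('/\s+/', ' ', $line);

// Same split as in the question: a period followed by optional whitespace.
$parts = preg_split('/\.\s*/', $line);

var_dump($parts);
// $parts[1] is "Another sentence that runs on " (note the trailing space),
// so a later implode('.', ...) places that space right before the period.
```

If this is indeed the cause, trimming each fragment before joining removes the stray space.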

Try this:

$file = file($files);
$lines = str_replace("'", '', $file);
$noMultipleSpace = removeMultipleSpaces($lines);
$fileContents = array();
foreach ($noMultipleSpace as $line) {
    if (isLatin($line) && count(preg_split('/\s+/', $line)) > 25) {
        $newContent = preg_split('/\\.\\s*/', $line);
        foreach ($newContent as $newsContent) {
            $pos1 = stripos($newsContent, ':');
            // Use === false so a colon at position 0 is not treated as "not found";
            // the "> 3" comparison belongs outside count().
            if ($pos1 === false && count(preg_split('/\s+/', $newsContent)) > 3 && isLatin($newsContent)) {
                $fileContents[] = $newsContent;
            }
        }
        // Strip leading and trailing whitespace from every collected sentence.
        $fileContents = array_map('trim', $fileContents);
        $content = implode('.', $fileContents);
    }
}
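Note that trimming alone joins the sentences as sentence1.Sentence2. To get the sentence1. Sentence2 form asked for in the question, the separator itself needs a space. A minimal sketch follows; joinSentences is a hypothetical helper name, not part of the original code.

```php
<?php
/**
 * Hypothetical helper (not from the original post): trim each fragment,
 * drop empty ones, and join with ". " so the result reads "a. b".
 */
function joinSentences(array $fragments): string
{
    $trimmed = array_map('trim', $fragments);
    // 'strlen' as the callback removes empty strings but keeps "0".
    $nonEmpty = array_filter($trimmed, 'strlen');
    return implode('. ', $nonEmpty);
}

echo joinSentences(["sentence1 ", " Sentence2", ""]);
// prints: sentence1. Sentence2
```

implode only puts the separator between elements, so the last sentence gets no closing period; append one separately if needed.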
+0

Awesome! Thank you very much :) – bruine