2015-01-06 155 views
2

Create GATE document files from CSV

I need to convert a CSV-like file such as:

i love iphone \t positive 
i hate iphone \t negative 

into GATE documents that include the corresponding class:

[screenshot of the target GATE document]

What is the best way to do this? JAPE, Groovy?

Answers

2

Basically, you have to handle two formats: CSV and GATE documents. If you search CPAN, you will find modules that handle both kinds of document easily.

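So you can use Text::CSV to read the text from the CSV file, then use the setText and setAnnotationSet methods of the NLP::GATE::Document module to create a GATE document and set its text and annotations.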
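A minimal sketch of the CSV half of this suggestion, assuming Text::CSV is installed from CPAN and that the input is tab-separated as in the question. The helper name read_labeled_rows is hypothetical. The NLP::GATE::Document calls (setText, setAnnotationSet) are left out on purpose, since their exact signatures are not given here.

```perl
use strict;
use warnings;
use Text::CSV;   # CPAN module named in the answer; assumed to be installed

# Parse "sentence<TAB>class" rows from an open filehandle into
# [ sentence, class ] pairs. read_labeled_rows is a hypothetical helper name.
sub read_labeled_rows {
    my ($fh) = @_;
    my $csv = Text::CSV->new({
        binary           => 1,
        sep_char         => "\t",
        allow_whitespace => 1,   # ignore spaces around the separator
        auto_diag        => 1,   # report parse errors
    });
    my @rows;
    while (my $fields = $csv->getline($fh)) {
        next unless @$fields == 2;   # skip blank or malformed lines
        push @rows, [ @$fields ];
    }
    return \@rows;
}

# Usage with an in-memory copy of the sample from the question.
my $sample = "i love iphone \t positive \ni hate iphone \t negative \n";
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";
my $rows = read_labeled_rows($fh);
close $fh;
printf "%s => %s\n", @$_ for @$rows;
```

Each returned pair can then be handed to whatever builds the GATE document, one document per row.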

Give it a try, and if you run into problems, ask again and include the code you tried.

-1

Maybe not the easiest answer, but it works with this Perl script:

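use strict;
use warnings;
use locale;
use HTML::Entities;

# Input: one "sentence<TAB>class" pair per line (assumed to be UTF-8).
my $input = $ARGV[0];
die "usage: $0 input-file\n" unless defined $input;
open(my $in, '<:encoding(UTF-8)', $input)
    or die "Cannot open $input: $!\n";

my $i = 0;

while (my $line = <$in>) {

    # Skip lines that are not "sentence<TAB>class".
    next unless $line =~ /^(.+)\t(.+)$/;
    my ($sentence, $class) = ($1, $2);

    # Trim surrounding whitespace, including a trailing \r.
    s/^\s+|\s+$//g for $sentence, $class;
    next unless length $sentence and length $class;

    # One GATE XML document per input line.
    my $file = "tweet_" . $i . ".xml";
    open(my $out, '>:encoding(UTF-8)', $file)
        or die "Unable to create $file: $!\n";

    # Escape only XML-special characters. By default encode_entities()
    # also emits HTML-only entities such as &eacute;, which are not
    # defined in XML.
    my $encoded_sent  = encode_entities($sentence, '<>&"');
    my $encoded_class = encode_entities($class, '<>&"');

    # End offset, counted in characters of the unescaped sentence.
    my $length_sent = length($sentence);

    ## XML header
    print $out "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
    print $out '<GateDocument version="3">' . "\n";
    print $out '<GateDocumentFeatures>' . "\n";
    print $out '<Feature>' . "\n";
    print $out '<Name className="java.lang.String">gate.SourceURL</Name>' . "\n";
    print $out '<Value className="java.lang.String">created from String</Value>' . "\n";
    print $out '</Feature>' . "\n";
    print $out '</GateDocumentFeatures>' . "\n";

    ## Document text, delimited by the start and end nodes
    print $out '<TextWithNodes><Node id="0"/>' . $encoded_sent . '<Node id="' . $length_sent . '"/></TextWithNodes>' . "\n";

    ## One "Tweet" annotation over the whole text, carrying the class
    print $out '<AnnotationSet Name="Key">' . "\n";
    print $out '<Annotation Id="1" Type="Tweet" StartNode="0" EndNode="' . $length_sent . '">' . "\n";
    print $out '<Feature>' . "\n";
    print $out '<Name className="java.lang.String">class</Name>' . "\n";
    print $out '<Value className="java.lang.String">' . $encoded_class . '</Value>' . "\n";
    print $out '</Feature>' . "\n";
    print $out '</Annotation>' . "\n";
    print $out '</AnnotationSet>' . "\n";

    ## End of the document
    print $out '</GateDocument>' . "\n";

    close $out or die "Unable to write $file: $!\n";
    $i++;
}
close $in;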
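The step in this script that is easiest to get wrong is the offset bookkeeping: the text written to the file is escaped, while the end node id is the length of the unescaped sentence. A minimal sketch isolating that step with core Perl only; xml_escape and text_with_nodes are hypothetical helper names, not part of the script or of any GATE library.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Build the <TextWithNodes> element for one sentence, following the script:
# the end node id is the length of the unescaped sentence.
sub text_with_nodes {
    my ($sentence) = @_;
    my $end = length $sentence;
    return '<TextWithNodes><Node id="0"/>' . xml_escape($sentence)
         . '<Node id="' . $end . '"/></TextWithNodes>';
}

print text_with_nodes('AT&T > iphone'), "\n";
# prints <TextWithNodes><Node id="0"/>AT&amp;T &gt; iphone<Node id="13"/></TextWithNodes>
```

The sentence has 13 characters before escaping, so the end node id is 13 even though the escaped text in the file is longer.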
+0

There are several problems in this code, but just one comment: have you checked https://metacpan.org/pod/NLP::GATE, which is described as "Handle GATE documents and annotations"? – kobame