2011-04-29 63 views

How do I use OpenNLP with Java?

I want to POS-tag English sentences and do some processing on them. I would like to use OpenNLP, and I have it installed.

When I run the command

I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt 

it prints the POS-tagged version of the input in Text.txt:

Loading POS Tagger model ... done (4.009s) 
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._. 


Average: 66.7 sent/s 
Total: 1 sent 
Runtime: 0.015s 

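I hope this means it is installed correctly?

Now, how do I do this POS tagging from inside a Java application? I have added the opennlp-tools, jwnl, and maxent JARs to the project, but how do I invoke the POS tagger?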

Answer


Here is some (old) sample code I threw together, with modernized code to follow:

package opennlp; 

import opennlp.tools.cmdline.PerformanceMonitor; 
import opennlp.tools.cmdline.postag.POSModelLoader; 
import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSSample; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer; 
import opennlp.tools.util.ObjectStream; 
import opennlp.tools.util.PlainTextByLineStream; 

import java.io.File; 
import java.io.IOException; 
import java.io.StringReader; 

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        // Load the pre-trained English maxent POS model from the working directory.
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
            new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            // Split on whitespace, then tag; tags[i] belongs to token i.
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);

            // POSSample.toString() renders each pair as token_TAG.
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());

            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}

The output is:

Loading POS Tagger model ... done (2.045s) 
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN 

Average: 76.9 sent/s 
Total: 1 sent 
Runtime: 0.013s 

This is basically adapted from the POSTaggerTool class that ships as part of OpenNLP. sample.getTags() returns a String array holding the tags themselves, one per token.
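Since the tokens and the tags come back as two parallel String arrays, any follow-up processing is mostly a matter of walking them together. Here is a minimal sketch that keeps only the tokens whose tag starts with a given prefix. The class name TagFilter and the method tokensWithTagPrefix are made up for illustration, and the arrays in main are a hard-coded copy of the output shown above rather than a live call into OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

public class TagFilter {
    /**
     * Returns the tokens whose tag starts with the given prefix,
     * e.g. "NN" matches NN, NNS and NNP. (Hypothetical helper, not part of OpenNLP.)
     */
    public static List<String> tokensWithTagPrefix(String[] tokens, String[] tags, String prefix) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith(prefix)) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: in real use these arrays would come from the tokenizer
        // and from tagger.tag(tokens); here they are copied from the sample output.
        String[] tokens = {"Can", "anyone", "help", "me", "dig", "through",
                "OpenNLP's", "horrible", "documentation?"};
        String[] tags = {"MD", "NN", "VB", "PRP", "VB", "IN", "NNP", "JJ", "NN"};

        // prints [anyone, OpenNLP's, documentation?]
        System.out.println(tokensWithTagPrefix(tokens, tags, "NN"));
    }
}
```

Note that the whitespace tokenizer leaves punctuation glued to the word ("documentation?"), so you may want to clean tokens up before or after filtering.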

This requires direct file access to the trained model data (the en-pos-maxent.bin file), which is really quite lame.

The updated codebase for this is a little different (and probably more useful).

First, a Maven POM:

<?xml version="1.0" encoding="UTF-8"?> 
<project xmlns="http://maven.apache.org/POM/4.0.0" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 
    <modelVersion>4.0.0</modelVersion> 

    <groupId>org.javachannel</groupId> 
    <artifactId>opennlp-example</artifactId> 
    <version>1.0-SNAPSHOT</version> 
    <dependencies> 
     <dependency> 
      <groupId>org.apache.opennlp</groupId> 
      <artifactId>opennlp-tools</artifactId> 
      <version>1.6.0</version> 
     </dependency> 
     <dependency> 
      <groupId>org.testng</groupId> 
      <artifactId>testng</artifactId> 
      <version>[6.8.21,)</version> 
      <scope>test</scope> 
     </dependency> 
    </dependencies> 
    <build> 
     <plugins> 
      <plugin> 
       <groupId>org.apache.maven.plugins</groupId> 
       <artifactId>maven-compiler-plugin</artifactId> 
       <version>3.1</version> 
       <configuration> 
        <source>1.8</source> 
        <target>1.8</target> 
       </configuration> 
      </plugin> 
     </plugins> 
    </build> 
</project> 

And here is the code, written as a test, so it lives in ./src/test/java/org/javachannel/opennlp/example:

package org.javachannel.opennlp.example; 

import opennlp.tools.cmdline.PerformanceMonitor; 
import opennlp.tools.postag.POSModel; 
import opennlp.tools.postag.POSSample; 
import opennlp.tools.postag.POSTaggerME; 
import opennlp.tools.tokenize.WhitespaceTokenizer; 
import org.testng.annotations.DataProvider; 
import org.testng.annotations.Test; 

import java.io.File; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.net.URL; 
import java.nio.channels.Channels; 
import java.nio.channels.ReadableByteChannel; 
import java.util.stream.Stream; 

public class POSTest { 
    private void download(String url, File destination) throws IOException {
     URL website = new URL(url);
     // try-with-resources closes the channel and the file stream, even on failure
     try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
       FileOutputStream fos = new FileOutputStream(destination)) {
      fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
     }
    }

    @DataProvider 
    Object[][] getCorpusData() { 
     return new Object[][][]{{{ 
       "Can anyone help me dig through OpenNLP's horrible documentation?" 
     }}}; 
    } 

    @Test(dataProvider = "getCorpusData") 
    public void showPOS(Object[] input) throws IOException { 
     File modelFile = new File("en-pos-maxent.bin"); 
     if (!modelFile.exists()) { 
      System.out.println("Downloading model."); 
      download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile); 
     } 
     POSModel model = new POSModel(modelFile); 
     PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent"); 
     POSTaggerME tagger = new POSTaggerME(model); 

     perfMon.start(); 
     Stream.of(input).map(line -> { 
      String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
      String[] tags = tagger.tag(whitespaceTokenizerLine); 

      POSSample sample = new POSSample(whitespaceTokenizerLine, tags); 

      perfMon.incrementCounter(); 
      return sample.toString(); 
     }).forEach(System.out::println); 
     perfMon.stopAndPrintFinalResult(); 
    } 
} 

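This code doesn't actually test anything (it's a smoke test, if anything), but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you haven't downloaded one already.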
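One thing to watch with the download-if-missing approach: if the download is interrupted, a partial file is left behind and the exists() check will treat it as a valid model on the next run. A rough sketch of one way around that, using only the JDK, is to download into a temporary file and move it into place once the copy has finished. ModelFetcher and fetchIfMissing are names invented for this sketch, not anything from OpenNLP.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ModelFetcher {
    /**
     * Copies the content at source to destination unless destination already exists.
     * Returns true if a download happened, false if the file was already there.
     * (Hypothetical helper for illustration.)
     */
    public static boolean fetchIfMissing(URL source, Path destination) throws IOException {
        if (Files.exists(destination)) {
            return false;
        }
        // Download next to the destination so the final move stays on one file system.
        Path dir = destination.toAbsolutePath().getParent();
        Path partial = Files.createTempFile(dir, "model-", ".part");
        try (InputStream in = source.openStream()) {
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            // Only a fully copied file ever appears under the destination name.
            Files.move(partial, destination, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failed download.
            Files.deleteIfExists(partial);
        }
        return true;
    }
}
```

In the test above you would call it with something like fetchIfMissing(new URL(...), modelFile.toPath()) in place of the exists() check and the download(...) call.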

+0

Thank you very, very, very much. I'm finally on track! Can you tell me where I can find the meanings of NN, MD, VB, ... and all those tags? – shababhsiddique 2011-04-30 09:55:44

+0

I have no idea! I'm researching this right now, because I just realized (thanks to your question) how useful OpenNLP would be for my own tasks. :) – 2011-04-30 11:13:55

+2

I think this should help you: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html – shababhsiddique 2011-04-30 14:29:01
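For quick reference, the tags that show up in the two sample outputs in this thread are Penn Treebank tags. Below is a small lookup sketch covering just those; it is not a complete tag set, and PennTags and describe are names made up for this example. The linked page has the full list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PennTags {
    // Only the tags that appear in the sample outputs in this thread.
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();

    static {
        DESCRIPTIONS.put("CD", "cardinal number");
        DESCRIPTIONS.put("FW", "foreign word");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("MD", "modal");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("PRP", "personal pronoun");
        DESCRIPTIONS.put("PRP$", "possessive pronoun");
        DESCRIPTIONS.put("VB", "verb, base form");
        DESCRIPTIONS.put("VBP", "verb, non-3rd person singular present");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put(".", "sentence-final punctuation");
    }

    /** Returns a short description of the tag, or "unknown tag" if it is not in the table. */
    public static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        for (String tag : new String[]{"MD", "NN", "VB", "PRP$"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```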