Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"

Thanks, Bala


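You might want to take the various typographical conventions into account in your semantic analysis rather than stripping the markup. Is it correct to infer that you have explicitly quoted the phrases you want to associate in text that is otherwise unmarked? – trashgod 2010-09-04 15:18:01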





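I think using co-occurrence could work (though I'm not sure where the bigrams/trigrams come in). But you should treat the WordNet lookup as part of a hybrid system, not as the be-all and end-all of finding named entities. Then:

  • start by applying some simple common-sense criteria (sequences of capitalized words; try fitting frequently used lowercase function words such as 'of' into these; sequences consisting of a "known title" plus capitalized words);
  • look for sequences of words that statistically you would not expect to appear together by chance, as candidates for entities;
  • can you build in a dynamic web lookup? (Your system spots the capitalized sequence "IBM" and checks whether it finds, e.g., a Wikipedia entry with the text pattern "IBM is ... [organization | company | ...]".)
  • see whether anything here, and in the "information extraction" literature in general, gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html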
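
The first criterion in the list above (capitalized sequences, with lowercase function words such as 'of' allowed inside them) can be sketched with a regular expression. This is only a minimal illustration, not a full recognizer: the class name `CapitalizedCandidates` and the tiny function-word list are assumptions made for the example, and sentence-initial words such as "Vulcans" or "They" will show up as false positives that a later stage would have to filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CapitalizedCandidates {
    // One capitalized word, e.g. "United" or "IBM".
    private static final String CAP = "\\p{Lu}[\\p{L}\\p{Nd}]*";
    // Hypothetical, deliberately tiny list of lowercase function words allowed inside a name.
    private static final String FUNC = "(?:of|the|and|for)";
    // A capitalized word, then any number of further capitalized words,
    // each optionally preceded by one or more function words.
    private static final Pattern CANDIDATE = Pattern.compile(
            "\\b" + CAP + "(?:(?:\\s+" + FUNC + ")*\\s+" + CAP + ")*");

    // Returns every capitalized sequence found in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            candidates.add(m.group());
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(find("one of the founding members of the United Federation of Planets"));
    }
}
```

Because a function word is only accepted when another capitalized word follows it, a phrase like "Humans and later" yields just "Humans".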



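The Stanford NLP Named Entity Recognizer should be your first pass. It will give you a lot of value on the first run, and you can look at the code to see how to improve on it from there. – 2017-08-02 11:02:25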



// Requires java.util.regex.Pattern and java.util.regex.Matcher
String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
    " universe who evolved on the planet Vulcan and are noted for their " +
    "attempt to live by reason and logic with no interference from emotion" +
    " They were the first extraterrestrial species officially to make first" +
    " contact with Humans and later became one of the founding members of the" +
    " \"United Federation of Planets\"";
String[] entities = new String[10];                  // An array to hold matched substrings (room for at most 10)
Pattern pattern = Pattern.compile("[\"](.*?)[\"]");  // The regex pattern to use: the shortest text between two " characters
Matcher matcher = pattern.matcher(text);             // The matcher that runs the regex on our text
int startFrom = text.indexOf('"');                   // The index position of the first " character (-1 if there is none)
int endAt = text.lastIndexOf('"');                   // The index position of the last " character
int count = 0;                                       // An index for the array of matches
// Stop if there is no " at all, once we are past the last ", or when no further match is found
while (startFrom >= 0 && startFrom <= endAt && matcher.find(startFrom)) {
    entities[count++] = matcher.group(1);            // Add the match to the array, without its " marks
    startFrom = matcher.end();                       // Move startFrom to the end of the matched region
}
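
As a variation on the regex approach above, the same extraction can be written without manual index bookkeeping or a fixed-size array. This is a sketch, not the answer's original code: the class name `QuotedPhrases` and the method `extract` are made up for the example, and it relies on `Matcher.find()` continuing from the end of the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedPhrases {
    // Reluctant match: the shortest run of characters between two " characters.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Collects the text between each pair of " characters, without the quotes.
    static List<String> extract(String text) {
        List<String> phrases = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(text);
        while (matcher.find()) {             // each call resumes after the previous match
            phrases.add(matcher.group(1));   // group 1 is the text inside the quotes
        }
        return phrases;
    }

    public static void main(String[] args) {
        String text = "the fictional \"Star Trek\" universe and the \"United Federation of Planets\"";
        System.out.println(extract(text));
    }
}
```

An unmatched trailing " is simply ignored here, since the pattern needs both a starting and a closing quote to match.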


int startFrom = text.indexOf('"');                    // The index position of the first " character
int nextQuote = text.indexOf('"', startFrom + 1);     // The index position of the next " character
int count = 0;                                        // An index for the array of matches
// Keep looping as long as there is another complete pair of " characters
// (indexOf returns -1 when no further " character is found)
while (startFrom > -1 && nextQuote > -1) {
    entities[count++] = text.substring(startFrom + 1, nextQuote); // Retrieve the substring and add it to the array
    startFrom = text.indexOf('"', nextQuote + 1);     // Find the next " character after nextQuote
    nextQuote = text.indexOf('"', startFrom + 1);     // Find the next " character after that
}



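int i = 0;
while (i < count) {                       // Walk through the matches that were collected
    System.out.println(entities[i++]);    // Print each one, then move to the next
}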


static int countQuoteChars(String text) {
    int nextQuote = text.indexOf('"');                 // Find the first " character
    int count = 0;                                     // A counter for " characters found
    while (nextQuote != -1) {                          // While there is another " character ahead
        count++;                                       // Increase the count by 1
        nextQuote = text.indexOf('"', nextQuote + 1);  // Find the next " character
    }
    return count;                                      // Return the result
}

static boolean quoteCharacterParity(int numQuotes) {
    if (numQuotes % 2 == 0) { // If the number of " characters modulo 2 is 0
        return true;          // Return true for even
    }
    return false;             // Otherwise return false
}

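Note that if numQuotes happens to be 0 this method still returns true (because 0 modulo any nonzero number is 0, so (numQuotes % 2 == 0) is true), although you wouldn't want to go ahead with the parsing if there are no " characters at all, so you will want to check for that condition somewhere.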
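
One place to put that check is a single guard that is called before parsing. The sketch below is an assumption about how the pieces might be combined: `QuoteCheck` and `readyToParse` are hypothetical names, and `countQuoteChars` is repeated only so that the example compiles on its own.

```java
final class QuoteCheck {
    // Counts the " characters in the text.
    static int countQuoteChars(String text) {
        int count = 0;
        int nextQuote = text.indexOf('"');
        while (nextQuote != -1) {
            count++;
            nextQuote = text.indexOf('"', nextQuote + 1);
        }
        return count;
    }

    // Hypothetical guard: true only if there is at least one " character
    // and the " characters come in pairs (an even count).
    static boolean readyToParse(String text) {
        int numQuotes = countQuoteChars(text);
        return numQuotes > 0 && numQuotes % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println(readyToParse("the \"Star Trek\" universe"));
        System.out.println(readyToParse("no quoted phrases here"));
    }
}
```

Testing `numQuotes > 0` first keeps the zero case from passing the parity test by accident.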



That's interesting... I put the double quotes around the entities myself. – Boolean 2010-09-04 14:47:06


@Algorist: Since I had a similar misunderstanding, it might be useful to clarify in your question how the quotation marks are being used. – trashgod 2010-09-04 15:14:13


