2017-04-24 78 views

Parsing a text file with citations

I'm trying to split a text file into sentences with PHP. I'm currently using the following function:

$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY); 

The problem is that there are sentences like this:

In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. in Isay 11).

It gets split like this:

[0] In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd.
[1] in Isay 11). 

Another example is:

Dr. Evelyn Hooker, a heterosexual psychologist... 

The "Dr." part would be a problem.

All of these texts come from the MASC corpus for NLP.

What is your question? –

@JayBlanchard: I guess the OP wants to split on punctuation, but since periods also appear in other places, that causes trouble. – Rahul

I don't think regex is a good tool for this. – jrook

Answers


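You can extend @ndn's solution to get the behavior you need. Note that $before_regexes contains a list of known abbreviations; add the abbreviations that occur in your corpus. I added qtd there.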

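Then, note that $before_regexes and $after_regexes are paired: the regex at index i in one array is checked together with the regex at index i in the other. I added the '/(?:[”’"\'»])\s*\Z/u' / '/\A(?:\(\p{L})/u' pair and marked it as a non-sentence boundary (the first false in the $is_sentence_boundary array). The regex pair means: find a closing quote (”’"'»), then 0+ whitespace characters, followed by ( (matched with \() and any Unicode letter (\p{L}); in that case there should be no split.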
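To see what that added pair does in isolation, before reading the full function, here is a minimal sketch. The helper name blocks_split is hypothetical and only for illustration; the two patterns are the ones described above, and the sample strings are cut from the sentence in the question.

```php
<?php
// Hypothetical helper for illustration only; the name is not part of the answer.
function blocks_split(string $before, string $after): bool
{
    // $before must end with a closing quote plus optional whitespace...
    $before_regex = '/(?:[”’"\'»])\s*\Z/u';
    // ...and $after must start with "(" followed by any Unicode letter.
    $after_regex = '/\A(?:\(\p{L})/u';

    return preg_match($before_regex, $before) === 1
        && preg_match($after_regex, $after) === 1;
}

// Text before the position ends with a closing quote, text after starts with "(q".
var_dump(blocks_split('physical contact with men” ', '(qtd. in I')); // bool(true)
// Same left context, but no opening parenthesis follows: the pair does not apply.
var_dump(blocks_split('physical contact with men” ', 'Another se')); // bool(false)
```

When both halves match, the full function below treats the position as "not a sentence boundary" and stops checking further pairs.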

function sentence_split($text) { 
    $before_regexes = array('/(?:[”’"\'»])\s*\Z/u', 
     '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe}))\Z/su', 
     '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su', 
     '/(?:(?:[\[\(]*\.\.\.[\]\)]*))\Z/su', 
     '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su', 
     '/(?:(?:\b[Ee]tc\.\s))\Z/su', 
     '/(?:(?:[\.!?…]+\p{Pe})|(?:[\[\(]*…[\]\)]*))\Z/su', 
     '/(?:(?:\b\p{L}\.))\Z/su', 
     '/(?:(?:\b\p{L}\.\s))\Z/su', 
     '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su', 
     '/(?:(?:[\"”\']\s*))\Z/su', 
     '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su', 
     '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su', 
     '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su'); 
    $after_regexes = array('/\A(?:\(\p{L})/u', 
     '/\A(?:)/su', 
     '/\A(?:[\p{N}\p{Ll}])/su', 
     '/\A(?:[^\p{Lu}])/su', 
     '/\A(?:[^\p{Lu}]|I)/su', 
     '/\A(?:[^\p{Lu}])/su', 
     '/\A(?:\p{Ll})/su', 
     '/\A(?:\p{L}\.)/su', 
     '/\A(?:\p{L}\.\s)/su', 
     '/\A(?:\p{N})/su', 
     '/\A(?:\s*\p{Ll})/su', 
     '/\A(?:)/su', 
     '/\A(?:\p{Lu}[^\p{Lu}])/su', 
     '/\A(?:\p{Lu}\p{Ll})/su'); 
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true); 
    $count = count($before_regexes); // 14 pairs now; a hard-coded 13 would skip the last one

    $sentences = array(); 
    $sentence = ''; 
    $before = ''; 
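    // Slide a 10-character look-ahead window over the text. mb_substr() keeps
    // multibyte characters such as ” intact; cutting one in half would make
    // the /u patterns fail on invalid UTF-8.
    $after = mb_substr($text, 0, 10, 'UTF-8'); 
    $text = mb_substr($text, 10, null, 'UTF-8'); 

    while($text !== '') { 
     // The first pair whose two regexes both match decides this position.
     for($i = 0; $i < $count; $i++) { 
      if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) { 
       if($is_sentence_boundary[$i]) { 
        array_push($sentences, $sentence); 
        $sentence = ''; 
       } 
       break; 
      } 
     } 

     // Shift one character from $text into $after, and one from $after into $before.
     $first_from_text = mb_substr($text, 0, 1, 'UTF-8'); 
     $text = mb_substr($text, 1, null, 'UTF-8'); 
     $first_from_after = mb_substr($after, 0, 1, 'UTF-8'); 
     $after = mb_substr($after, 1, null, 'UTF-8'); 
     $before .= $first_from_after; 
     $sentence .= $first_from_after; 
     $after .= $first_from_text; 
    } 

    // Flush the remainder, including inputs shorter than the look-ahead window.
    if($sentence.$after !== '') { 
     array_push($sentences, $sentence.$after); 
    } 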

    return $sentences; 
} 
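
If only a handful of abbreviations cause trouble in your corpus, the same idea (a list of known abbreviations that must not end a sentence) can also be sketched directly on top of the preg_split call from the question. This is a lighter alternative, not the approach used in the function above: the helper name split_sentences_simple and the abbreviation list are assumptions for illustration, and the sample text is shortened from the question.

```php
<?php
// Hypothetical helper: a rough sketch, not part of the function above.
function split_sentences_simple(string $text, array $abbreviations): array
{
    // Build one fixed-length negative lookbehind per abbreviation,
    // e.g. (?<!\bDr\.), so that "Dr. " is not treated as a sentence end.
    $guards = '';
    foreach ($abbreviations as $abbr) {
        $guards .= '(?<!\b' . preg_quote($abbr, '/') . '\.)';
    }

    // Same split as in the question, plus the abbreviation guards.
    $pattern = '/(?<=[.?!])' . $guards . '\s+/u';

    // preg_split() returns false on error; fall back to an empty array.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

$text = 'Plato wrote “take pleasure” (qtd. in Isay 11). '
      . 'Dr. Evelyn Hooker was a psychologist. This is a third sentence.';

print_r(split_sentences_simple($text, ['Dr', 'qtd']));
```

With the list ['Dr', 'qtd'] the sample is split into three sentences, and neither "qtd." nor "Dr." ends one. This only guards against abbreviations you list explicitly; it has none of the context rules (initials, ellipses, quotes) that the paired regexes above encode.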
