2015-03-13 34 views
1

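Regular expression to extract paragraphs from the following text

I have this text, which I extracted from a PDF using iText and placed in a String variable: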

(1) A a, — al'-fah; of Hebrew origin; the first letter of the alphabet; 
figurative only (from its use as a numeral) the first: — Alpha. 
Often used (usually ajn an, before a vowel) also in composition 
(as a contraction from (427) (a]neu,)) in the sense of privation; 
so in many words beginning with this letter; occasionally in the 
sense of union (as a contraction of (260) (a[ma)). 
(2) ÆAarw>n, — ah-ar-ohn'; of Hebrew origin [Hebrew {175} 
('Aharown)]; Aaron, the brother of Moses: — Aaron. 
(3) ÆAbaddw>n, — ab-ad-dohn'; of Hebrew origin [Hebrew {11} 
('abaddown)]; a destroying angel: — Abaddon. 
(4) ajbarh>v, — ab-ar-ace'; from (1) (a) (as a negative particle) and (922) 
(ba>rov); weightless, i.e. (figurative) not burdensome: — from 
being burdensome. 
(5) ÆAbba~, — ab-bah'; of Chaldee origin [Hebrew {2} ('ab (Chaldee))]; 
father (as a vocative): — Abba. 
(6) &Abel, — ab'-el; of Hebrew origin [Hebrew {1893} (Hebel)]; Abel, 
the son of Adam: — Abel. 
(7) ÆAbia>, — ab-ee-ah'; of Hebrew origin [Hebrew {29} ('Abiyah)]; 
Abijah, the name of two Israelites: — Abia. 
(8) ÆAbia>qar, — ab-ee-ath'-ar; of Hebrew origin [Hebrew {54} 
('Ebyathar)]; Abiathar, an Israelite: — Abiathar. 
(9) ÆAbilhnh>, — ab-ee-lay-nay'; of foreign origin [compare Hebrew {58} 
('abel)]; Abilene, a region of Syria: — Abilene. 
(10) ÆAbiou>d, — ab-ee-ood'; of Hebrew origin [Hebrew {31} 
('Abiyhuwd)]; Abihud, an Israelite: — Abiud. 

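Each paragraph in the string begins with a number in parentheses, ([0-9]), such as (9) or (5). I want to use pagestring.split("regex") to extract every paragraph that begins with this character sequence. Can anyone help?

Answers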

0

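This avoids splitting on a "(999)" that is embedded in the text. It relies on the assumption that a line terminator always precedes the parenthesized number that starts a paragraph. Also note that the sample text produces an empty "paragraph", because there is no text before the first parenthesized number; hence the if statement. Because split() consumes whatever the regex matches, the "(n)" markers themselves are not part of the resulting paragraphs.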

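    String text = ...;
    // Split on "(digits)" only when it follows the start of the input or a newline.
    String[] paras = text.split("(?<=(^|\\n))\\(\\d+\\)");
    for (String para : paras) {
        // Skip the empty string produced before the first "(1)".
        if (para.length() > 0) {
            System.out.println("Para: " + para);
        }
    }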
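
For anyone who wants to try this on a small input, here is a minimal, self-contained sketch. It is not the code from the answer above: it swaps the lookbehind for the MULTILINE flag `(?m)` with `^`, which should behave the same way for this kind of text, and it adds a lookahead variant that keeps the "(n)" marker on each paragraph. The class name `ParagraphSplitter`, the method names, and the shortened sample string are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSplitter {

    // Drops blank pieces (such as the one before the first marker) and trims the rest.
    private static List<String> nonEmpty(String[] parts) {
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Splits on "(digits)" at the start of a line; the marker is consumed by split().
    static List<String> splitDroppingNumbers(String text) {
        return nonEmpty(text.split("(?m)^\\(\\d+\\)"));
    }

    // Splits at the position just before "(digits)" at the start of a line,
    // so every paragraph keeps its own number.
    static List<String> splitKeepingNumbers(String text) {
        return nonEmpty(text.split("(?m)(?=^\\(\\d+\\))"));
    }

    public static void main(String[] args) {
        // Hypothetical, shortened sample in the same shape as the text in the question.
        String sample = "(1) A a, first\n(as a contraction from (427) x)\n(2) Aaron\n";
        for (String para : splitKeepingNumbers(sample)) {
            System.out.println("Para: " + para);
        }
    }
}
```

The "(427)" in the middle of a line is left alone in both variants, because `^` only matches at the start of a line when `(?m)` is set.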
+0

Great! Is there a tutorial or guide you can recommend? Regular expressions really trip me up. – Lema 2015-03-13 09:01:14

+0

I learned regular expressions a long time ago, so I can't really recommend a tutorial. But http://regexcrossword.com/ offers a fun way to learn. – laune 2015-03-13 09:21:59

0

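You can use the following regular expression, "[\n|.]\\([0-9]{1,2}\\)", with the split method. It will extract all the paragraphs from your text (for markers numbered from 0 to 99):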

String[] parts=st.split("[\n|.]\\([0-9]{1,2}\\)"); 

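[\n|.]: only treats a marker as the start of a new paragraph, ignoring any (n) inside the paragraph text. Note that inside a character class | and . are literal characters, so this class matches a newline, a "|", or a period immediately before the marker.

\\([0-9]{1,2}\\): matches any group of one or two digits inside ().

Here is the working DEMO, which gives an array containing all the paragraphs.

For more information on using regular expressions, see Java Regex Pattern.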
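
To see what this pattern does in practice, a small sketch may help. The class name `CharClassDemo` and the short sample string are hypothetical; the regex is the one quoted in this answer. Two things are worth checking when you run it: the first paragraph keeps its "(1)" because no character precedes it, and a three-digit reference such as "(427)" in the middle of a line is not split on.

```java
import java.util.Arrays;

class CharClassDemo {

    // The regex from this answer, unchanged.
    static final String REGEX = "[\n|.]\\([0-9]{1,2}\\)";

    static String[] split(String text) {
        return text.split(REGEX);
    }

    public static void main(String[] args) {
        // Inside [...] the characters | and . are literals,
        // not alternation or "any character".
        System.out.println("|".matches("[\n|.]")); // true
        System.out.println("x".matches("[\n|.]")); // false

        // Hypothetical, shortened sample in the same shape as the question's text.
        String sample = "(1) Alpha (as in (427) x)\n(2) Aaron\n(10) Abiud";
        System.out.println(Arrays.toString(split(sample)));
    }
}
```

If the leading "(1)" should be removed as well, or markers above 99 are expected, the pattern would need to be adjusted accordingly.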