2017-08-14

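Extract a numbered list (spanning multiple lines) using regex and Python

Suppose I have a plain-text file that contains the following ordered list, whose items can span multiple lines.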

This is a text\n 
that contains an ordered/numbered list\n 
appearing on multiple lines in a plain-text file.\n 
\n 
Item 1. This is a list where each item can span over\n 
multiple lines\n 
Item 2. that I want to extract each separate item from but ONLY in series (order)\n 
Item 3. non-blank text\n 
Item 4. non-blank text\n 
Item 5. non-blank text\n 
Item 6. non-blank text\n 
Item 7. non-blank text\n 
Item 8. non-blank text\n 
Item 9. non-blank text\n 
Item 10. non-blank text\n 
Item 11. The items are in an ordered list, but digits may repeat (11, 22)\n 
or they may be preceded or followed by another digit (20, 35, 300) with\n 
... 
Item 999. Up to 999 items\n 
in each ordered list\n 
\n 
But, (most annoyingly), any Item n (with up to 3 digits) or Items may be repeated\n 
or back-referenced later in text but not\n 
again as an ordered list (or in series) as the first\n 
instance of each item in the list above. 

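Desired capture/output:

Return the text of each item (potentially spanning multiple lines) as it appears in the ordered list.

Item 1. [text]\n

Item 2. [text]\n

[text may span multiple lines]

Item N (up to 999). [text]\n

My current best regex construct is as follows: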

(Item\s[\d]+\.)(.*?)(?=(Item\s[\d]+\.)|($)) 

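The regex construct above does not include the newlines, and therefore the continuation lines, of each "Item" captured from the ordered list above.

My question: is it possible to use a regex in Python to extract only the items that are in the ordered list? If a regex cannot do it, how would I most efficiently use Python to locate the ordered list in such text and extract it?

Answer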

For a Python regex, use the re.DOTALL flag, which makes . match newline characters as well:

re.compile(r'(Item\s[\d]+\.)(.*?)(?=(Item\s[\d]+\.)|($))', re.DOTALL)
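
DOTALL lets each item span several lines, but on its own it does not address the "only in series" requirement: a later back-reference such as "Item 2." would still be matched, and would also cut the preceding item short. The following is a minimal sketch of one possible way to handle that, not a definitive solution. Instead of a single lookahead pattern, it locates every "Item n." marker, keeps only the markers whose number continues the sequence 1, 2, 3, ..., and slices the text between them. The names MARKER_RE and extract_ordered_items are hypothetical, and the sketch assumes that the first "Item 1." in the text starts the list and that the list ends at the first blank line after its last item.

```python
import re

# Matches a marker such as "Item 7." and captures its number (1 to 3 digits).
MARKER_RE = re.compile(r"Item\s(\d{1,3})\.")


def extract_ordered_items(text):
    """Return [(number, item_text), ...] for the items numbered 1, 2, 3, ... in order."""
    # Keep only the markers that continue the series; anything else
    # (for example a later back-reference to "Item 2.") is ignored.
    series = []
    for marker in MARKER_RE.finditer(text):
        if int(marker.group(1)) == len(series) + 1:
            series.append(marker)

    items = []
    for index, marker in enumerate(series):
        is_last = index + 1 == len(series)
        end = len(text) if is_last else series[index + 1].start()
        body = text[marker.end():end]
        if is_last:
            # Assumption: the list is terminated by the first blank line.
            body = re.split(r"\n\s*\n", body, maxsplit=1)[0]
        items.append((int(marker.group(1)), body.strip()))
    return items
```

Calling extract_ordered_items(text) on the contents of the file returns a list of (number, text) pairs. A back-reference that happens to carry exactly the next expected number would still be accepted, so the result is worth checking against real data.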