How to extract the content between a prefix and a suffix?

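I want to extract the text inside curly braces, {like this}. The texts differ by their prefix, for example \section{ or \subsection{, which I want to use to categorize everything accordingly. Each piece of text ends at the next closing brace }. How can I extract the content between the prefix and the suffix?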

file = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# These are some Latex commands to be considered: 

heading_1 = "\\\\section{" 
heading_2 = "\\\\subsection{" 

# This is my attempt. 

for letter in file: 
    print("The current letter: " + letter + "\n")

I want to process LaTeX files with Python in order to convert them for my database.


2016-08-28 Liam

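1) What if you have something like '\section{Equation $x_{1+2} = 3$}'? Here the end of the name is **not** the next '}'. Or '\section{Name \label{label}}', which appears quite often in some documents? *Any* regex solution is fragile; look for a proper LaTeX parser. 2) It is not clear what you want to do. Do you only care about the titles of sections/subsections etc., and want to collect them together with their level? – Bakuriu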

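In my case, I am sure that '{}' is only used to close a section/subsection. I need to process the content in order to convert LaTeX files into Cypher code for my existing Neo4j graph database. – Liam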

If all you want is the (section-level, title) pairs for the whole file, you can use a simple regular expression:

import re 

codewords = [ 
    'section', 
    'subsection', 
    # add others here if you want to
] 

regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))

Example usage:

In [15]: text = ''' 
    ...: \section{First section} 
    ...: 
    ...: \subsection{Subsection one} 
    ...: 
    ...: Some text 
    ...: 
    ...: \subsection{Subsection two} 
    ...: 
    ...: Other text 
    ...: 
    ...: \subsection{Subsection three} 
    ...: 
    ...: Some other text 
    ...: 
    ...: 
    ...: Also some more text \texttt{other stuff} 
    ...: 
    ...: \section{Second section} 
    ...: 
    ...: \section{Third section} 
    ...: 
    ...: \subsection{Last subsection} 
    ...: ''' 

In [16]: regex.findall(text) 
Out[16]: 
[('section', 'First section'), 
('subsection', 'Subsection one'), 
('subsection', 'Subsection two'), 
('subsection', 'Subsection three'), 
('section', 'Second section'), 
('section', 'Third section'), 
('subsection', 'Last subsection')]

By changing the contents of the codewords list you can match more kinds of commands.

To apply this to a file, just read() it first:

with open('myfile.tex') as f: 
    regex.findall(f.read())

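If you have a guarantee that each of these commands sits on a single line, you can be more memory-efficient and process the file line by line: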

with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))

Or, if you want to be a bit more fancy:

from itertools import chain 

with open('myfile.tex') as f: 
    # list() consumes the lazy iterator while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))

Note, however, that if you have something like this:

\section{A very 
    long title}

the line-by-line versions will fail, whereas the solution using read() will pick up that section too.
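A title captured across several lines keeps its newline and indentation. As a minimal sketch, assuming you want such titles on one line, the whitespace can be collapsed after matching. The helper name find_headings is made up for this illustration, and the pattern is the one from above written out for section and subsection only.

```python
import re

# Same pattern as above, written out for section/subsection only.
regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

def find_headings(text):
    """Return (command, title) pairs with whitespace runs in titles collapsed.

    find_headings is a hypothetical helper, not part of the original answer.
    """
    return [(cmd, ' '.join(title.split())) for cmd, title in regex.findall(text)]

sample = "\\section{A very\n    long title}\n\\subsection{Short}"
print(find_headings(sample))
# [('section', 'A very long title'), ('subsection', 'Short')]
```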

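In any case, be aware that the slightest change in format will break these kinds of solutions. For a safer alternative you will have to look for a proper LaTeX parser.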
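Short of a real parser, one middle ground is to use the regex only to find the start of a heading and then count braces by hand, so that titles such as '\section{Name \label{label}}' are captured whole. This is only a sketch under stated assumptions: the names HEADING_START and extract_headings are invented here, it handles nothing but brace nesting and backslash escapes, and it ignores comments, optional arguments and starred variants.

```python
import re

# Matches only the opening part, e.g. "\section{"; the argument is read by hand.
HEADING_START = re.compile(r'\\(section|subsection)\{')

def extract_headings(text):
    """Return (command, title) pairs, allowing nested {...} inside the title.

    Hypothetical helper: brace counting only, not a LaTeX parser.
    """
    results = []
    pos = 0
    while True:
        match = HEADING_START.search(text, pos)
        if match is None:
            break
        depth = 1                  # we are just past the opening brace
        i = match.end()
        while i < len(text) and depth > 0:
            ch = text[i]
            if ch == '\\':
                i += 2             # skip the backslash and the character after it
                continue
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break                  # unbalanced braces: stop rather than guess
        # text[i - 1] is the closing brace that brought depth back to 0
        results.append((match.group(1), text[match.end():i - 1]))
        pos = i
    return results

sample = r"\section{Equation $x_{1+2} = 3$} text \subsection{Name \label{sec:name}}"
print(extract_headings(sample))
```

With the sample above, the first title comes back as 'Equation $x_{1+2} = 3$' and the second as 'Name \label{sec:name}', whereas the '[^}]+' pattern would cut both at the first '}'.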

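If you want to group together the subsections "contained" in a given section, you can do that after obtaining the results with the solution above. You have to use something like itertools.groupby.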

from itertools import groupby, count, chain

results = regex.findall(text) 

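def make_key(counter):
    # Every 'section' advances the counter; a subsection reuses the current
    # value, so groupby puts a section and its subsections in one group.
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            val = next(counter)
        # Push the value back so the next call sees it again.
        counter = chain([val], counter)
        return val
    return key

organized_result = {}

for key, group in groupby(results, key=make_key(count())):
    # The first item of each group is the section itself.
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)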

And the final result will be:

In [12]: organized_result 
Out[12]: 
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'], 
'Second section': [], 
'Third section': ['Last subsection']}

which matches the structure of the text at the beginning of the post.

If you want to extend this to more levels through the codewords list, things get quite a bit more complicated.
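One possible way to handle more levels is to replace groupby with a stack that remembers the currently open heading at each depth. The sketch below is only an illustration: the LEVELS ordering, the build_tree name and the dict layout ('title', 'level', 'children') are all assumptions made here, and the regex has the same limitations as before.

```python
import re

# Assumed ordering of heading commands, outermost first.
LEVELS = ['chapter', 'section', 'subsection', 'subsubsection']

regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(map(re.escape, LEVELS))))

def build_tree(matches):
    """Nest (command, title) pairs into dicts with a 'children' list.

    Hypothetical helper; the node layout is an assumption for this sketch.
    """
    root = {'title': None, 'level': -1, 'children': []}
    stack = [root]                 # open headings, outermost first
    for command, title in matches:
        level = LEVELS.index(command)
        # Close every open heading at the same or a deeper level.
        while stack[-1]['level'] >= level:
            stack.pop()
        node = {'title': title, 'level': level, 'children': []}
        stack[-1]['children'].append(node)
        stack.append(node)
    return root['children']

text = r'''
\chapter{Intro}
\section{First section}
\subsection{Subsection one}
\section{Second section}
'''

tree = build_tree(regex.findall(text))
print(tree[0]['title'])                             # Intro
print([c['title'] for c in tree[0]['children']])    # ['First section', 'Second section']
```

A heading that skips a level (a subsection directly under a chapter, say) is simply attached to the nearest open heading above it, which may or may not be what you want.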


2016-08-28 18:56:57 Bakuriu

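Wow. Excellent answer. The thing is: we still want to maintain the structure. If there is a heading_1, which heading_2 entries belong to it? If there is a heading_2, which heading_3 entries are associated with it? And so on. How can the regex be changed to return a nested dictionary? – Liam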

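@Liam With a regex alone you can't, but you can iterate over the results and group consecutive subsections together. If you only have sections and subsections this is quite simple; if you want an extensible solution (i.e. you also want to keep track of chapters), it gets a bit more complicated. I am editing my answer now. – Bakuriu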

I think you want to use the regular expression module.

import re 

s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

pattern = re.compile(r'\\(?:sub)?section\{(.*?)\}') 
re.findall(pattern, s) 

#output: 
['example file', 'Latex']


2016-08-28 17:23:46 James
