Using regular expressions in Python to parse LaTeX code

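I am trying to write a Python script to tidy up my LaTeX code. I want to find instances where an environment is started but the declaration is followed by non-whitespace characters before the next newline. For example, I want to match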

\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}

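My goal is to insert (using re.sub) a newline between the end of the declaration and the first non-whitespace character. Loosely speaking, I want to find something like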

(\begin{env}) ({text}|[text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)

and do a substitution. I tried

expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$',re.M)

but this isn't quite working: as the last group, it only matches the last pairing of {, } or [, ].

Source

2015-08-25 user193070

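A less complicated solution may well be to write a tokenizer/lexer for LaTeX that splits the input into tokens and copies them one by one into a second buffer. While copying them, you can decide whether to insert extra spaces or newlines. As you walk over the tokens, when you encounter a '\\begin{(\w+)}' token, enter a state that ensures a newline is inserted before the next non-whitespace token is copied. Trying to do whole-document analysis of a LaTeX document with regular expressions is likely to be fragile. –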

As always, don't use regular expressions to parse structured languages. – tripleee
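
The tokenizer approach suggested in the comments could be sketched roughly as follows. This is only a minimal illustration, not a full LaTeX lexer: the token classes, the TOKEN pattern and the break_after_declarations function are hypothetical names, and the sketch assumes that arguments are not nested and follow the declaration without intervening whitespace.

```python
import re

# Hypothetical token classes for this sketch (not a complete LaTeX lexer).
TOKEN = re.compile(r'''
    (?P<env>\\(?:begin|end)\{[^}]*\})    # environment declaration
  | (?P<arg>\{[^}]*\}|\[[^\]]*\])        # one braced or bracketed argument, no nesting
  | (?P<newline>\n)
  | (?P<space>[^\S\n]+)                  # horizontal whitespace
  | (?P<other>\\[A-Za-z]+|.)             # any other command or single character
''', re.VERBOSE)


def break_after_declarations(text):
    """Copy tokens to a second buffer, breaking the line after a declaration."""
    out = []
    in_decl = False   # True right after a declaration and its arguments
    held = ''         # horizontal whitespace seen after the declaration
    for m in TOKEN.finditer(text):
        kind, tok = m.lastgroup, m.group()
        if in_decl:
            if kind == 'arg' and not held:
                out.append(tok)          # argument still belongs to the declaration
                continue
            if kind == 'space':
                held += tok              # hold it until we know what follows
                continue
            if kind == 'newline':
                out.append(held + tok)   # line already ends here: keep it unchanged
            else:
                out.append('\n' + tok)   # drop held spaces and insert the newline
            held = ''
            in_decl = (kind == 'env')
            continue
        out.append(tok)
        in_decl = (kind == 'env')
    out.append(held)                     # trailing whitespace at end of input, if any
    return ''.join(out)


print(break_after_declarations(
    r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'))
```

The state is kept in two variables instead of one large pattern, which makes it easier to extend later (for example to handle nested braces), at the cost of more code than the regex-only answer below.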

You can do it like this:

import re 

s = r'''\begin{theorem}[Weierstrass Approximation] \label{wapprox} 

but not match 

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}''' 

p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)') 

print(p.sub(r'\1\n', s))

Pattern details:

( # capture group 1 
    \\ 
    (?:begin|end) 
    # trick to emulate an atomic group 
    (?=( # the subpattern is enclosed in a lookahead and a capture group (2) 
     (?:{[^}]*}|\[[^]]*])* 
    )) # the lookahead is naturally atomic 
    \2 # backreference to the capture group 2 
) 
[^\S\n]* # optional horizontal whitespace
(?=\S) # followed by a non-whitespace character
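
To keep this commented layout in real code, the same pattern can be compiled with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. A small sketch; the names DECLARATION and break_lines are illustrative only and not part of the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so the comments stay inline.
DECLARATION = re.compile(r'''
    (                                # capture group 1
        \\ (?:begin|end)
        (?=(                         # lookahead wrapping capture group 2
            (?: {[^}]*} | \[[^]]*] )*
        ))                           # the lookahead is atomic
        \2                           # consume what group 2 captured
    )
    [^\S\n]*                         # optional horizontal whitespace
    (?=\S)                           # followed by a non-whitespace character
''', re.VERBOSE)


def break_lines(text):
    """Insert a newline after each declaration that is followed by more text."""
    return DECLARATION.sub(r'\1\n', text)


print(break_lines(r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'))
```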

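Explanation: if you write a pattern like (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you cannot prevent a match in the case where there is a newline before the next token. Consider the following scenario: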

(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S) first consumes all of \begin{theorem}[Weierstrass Approximation] in:

\begin{theorem}[Weierstrass Approximation]
\label{wapprox}

But since (?=\S) fails (because the next character is a newline), the backtracking mechanism kicks in:

(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S) then matches only \begin{theorem} in:

\begin{theorem}[Weierstrass Approximation]
\label{wapprox}

and (?=\S) now succeeds, because it matches the [ character.
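
The backtracking described above can be observed directly. The following sketch compares the naive pattern with the one that uses the lookahead and backreference trick, on a sample where the declaration is already followed by a newline. The variable names are illustrative.

```python
import re

# The declaration is already on its own line, so nothing should match.
sample = "\\begin{theorem}[Weierstrass Approximation]\n\\label{wapprox}"

# Naive pattern: the argument loop may backtrack and give up "[...]".
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')

# Pattern with the emulated atomic group: the arguments cannot be given back.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

naive_match = naive.search(sample)
atomic_match = atomic.search(sample)

print(repr(naive_match.group(1)))  # only \begin{theorem} was kept
print(atomic_match)                # None: no match, as intended
```

With the naive pattern, re.sub would therefore insert an unwanted newline between {theorem} and [Weierstrass Approximation].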

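An atomic group is a non-capturing group that forbids backtracking into the subpattern it contains. The notation is (?>subpattern). Unfortunately, the re module does not have this feature, but you can emulate it with the trick (?=(subpattern))\1.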
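
The emulation trick is easier to see on a toy pattern. This is a minimal sketch that is unrelated to LaTeX, and the pattern names are illustrative. Note that the statement about re reflects the time of the answer: Python 3.11 and later accept (?>...) and possessive quantifiers in re directly, while the trick below also works on older versions.

```python
import re

# Greedy \d+ can give one digit back so that the trailing \d still matches.
plain = re.compile(r'\d+\d')

# Emulated atomic group: the lookahead captures all digits, \1 consumes them,
# and the engine cannot backtrack into the lookahead, so the trailing \d fails.
emulated = re.compile(r'(?=(\d+))\1\d')

print(plain.search('1234').group())  # 1234
print(emulated.search('1234'))       # None
```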

Note that you can use the regex module (which has this feature) instead of re:

import regex 

p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')

or

p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')

Source

2015-08-25 11:14:18
