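Python: BeautifulSoup modifying text

I need to post-process a large number of XHTML files. I did not generate them, so I cannot fix the code that produced them. I cannot run a regex over the whole file, only over highly selective parts, because the links and IDs contain digits that I must not change globally.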

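I have simplified the example a lot, because the original files contain RTL text. I only want to modify the digits in the visible text, not in the markup. There appear to be 3 different cases.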

Case 1: a snippet from bk1.xhtml with a linked cross-reference. The xt span has an embedded bookref link, and the digits are in the link text:

<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a> 
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

Case 2: an unlinked cross-reference. The xt span has no embedded bookref link, and the digits are directly in the span text:

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a> 
<span class="xt">some text with these digits: 26:118</span></p></aside>

Case 3: a footnote with no link, but with digits in the ft text:

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a> 
<span class="ft">some text with these digits: 22</span></p></aside>

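I am trying to work out how to identify the text strings inside the part that is visible to the user, so that I can modify only the relevant digits:

Case 1: I need to capture just <a class='bookref' href='bk1.xhtml#bk1_118_26'>some text 26:118</a>, assign the "some text 26:118" substring to a variable, run a regex against that variable, and then put the substring back into the original file.

Case 2: I need to capture just <span class="xt">some text 26:118</span>, change the digits in the "some text 26:118" substring by running a regex against it, and then put the substring back into the original file.

Case 3: I need to capture just <span class="ft">some text 22</span>, change the digits in the "some text 22" substring by running a regex against it, and then put the substring back into the original file.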

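I have thousands of these to do across many files. I know how to iterate over the files.

After processing all the patterns in one file, I need to write out the changed tree.

I just need to post-process it to fix the text.

I have been googling, reading, and watching a lot of tutorials, and I am confused.

Thanks for any help.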

Source

2017-08-09 rmcape

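It looks like you want the .replaceWith() method (spelled replace_with() in current Beautiful Soup). You first have to find all occurrences of the text you want to match: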

from bs4 import BeautifulSoup 

cases = ''' 
<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a> 
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside> 

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a> 
<span class="xt">some text with these digits: 26:118</span></p></aside> 

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a> 
<span class="ft">some text with these digits: 22</span></p></aside> 
''' 

soup = BeautifulSoup(cases, 'lxml') 

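# find_all is the current name for findAll
case1 = soup.find_all('a', {'class': 'bookref'})
case2 = soup.find_all('span', {'class': 'xt'})
case3 = soup.find_all('span', {'class': 'ft'})

for match in case1 + case2 + case3:
    # .string is None when the tag has more than one child,
    # as with the case 1 span that wraps whitespace plus an <a>
    text = match.string
    print(text)
    if text:
        newText = text.replace('some text', 'modified!')  # put your regex substitution here
        text.replace_with(newText)  # current name for replaceWith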

The print(text) in the loop prints:

some text with these digits: 26:118 
None 
some text with these digits: 26:118 
some text with these digits: 22

If we run the loop again now, it prints:

modified! with these digits: 26:118 
None 
modified! with these digits: 26:118 
modified! with these digits: 22
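
The None in the output comes from the case 1 span, which holds more than one child, so .string has nothing to return. The following is a minimal sketch of a variation that visits every text node under the xt and ft spans and runs a regex on each one, which covers all three cases in one loop and never touches attributes such as href and id. The html.parser backend, the names fix_digits and bracket_digits, and the bracket-wrapping substitution are assumptions made for illustration; the real digit rule would go in their place.

```python
import re
from bs4 import BeautifulSoup

sample = """<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>
<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>
<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>"""


def bracket_digits(match):
    # Hypothetical placeholder rule: wrap each run of ASCII digits in brackets.
    return '[' + match.group(0) + ']'


def fix_digits(soup):
    # A list passed to class_ matches tags carrying any of the listed classes.
    for span in soup.find_all('span', class_=['xt', 'ft']):
        # find_all(string=True) returns every text node below the span,
        # including text nested inside an embedded <a class="bookref">.
        for node in span.find_all(string=True):
            new_text = re.sub(r'[0-9]+', bracket_digits, node)
            if new_text != node:
                node.replace_with(new_text)
    return soup


# html.parser is an assumption here; it ships with Python and adds no wrappers.
result = str(fix_digits(BeautifulSoup(sample, 'html.parser')))
print(result)
```

Because only text nodes are passed to re.sub, the digits in id="FN96" and in the href values stay as they were.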

Source

2017-08-09 23:31:22

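Does this address the requirement "after I have processed all the patterns in one file, I need to write out the changed tree"? – LarsH

@LarsH I missed that requirement, but I think it can be done easily by just writing 'text' to the file. –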
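
On the write-out requirement: text in the answer's loop is a single string, so writing it would save only that fragment. Serializing the soup object with str(soup) gives the whole modified tree. Below is a minimal sketch of a read, edit, write round trip; rewrite_file and demo_edit are hypothetical helper names, and the html.parser backend and UTF-8 encoding are assumptions.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup


def rewrite_file(path, edit):
    # Parse the file, let the caller's function edit the tree in place,
    # then write the serialized tree back over the same file.
    path = Path(path)
    soup = BeautifulSoup(path.read_text(encoding='utf-8'), 'html.parser')
    edit(soup)
    path.write_text(str(soup), encoding='utf-8')


def demo_edit(soup):
    # Stand-in edit, same idea as the loop in the answer.
    for span in soup.find_all('span', class_='ft'):
        if span.string:
            span.string.replace_with(span.string.replace('some text', 'modified!'))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'bk1.xhtml'
    target.write_text(
        '<p class="f"><span class="ft">some text with these digits: 22</span></p>',
        encoding='utf-8',
    )
    rewrite_file(target, demo_edit)
    written = target.read_text(encoding='utf-8')

print(written)
```

Different parsers can serialize XHTML differently, so it is worth diffing the output of one real file against its input before running this over thousands of files.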
