2016-11-06 55 views

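Regex | Remove words spanning multiple lines before a given word

I scraped several articles from a website, and I am now trying to make the corpus more readable by removing the first part of each text. The span to remove runs from the <p>Advertisement tag, just before the article begins, through the final </time> tag. As you can see, the regex has to remove text that spans multiple lines. I tried the DOTALL flag, but without success.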

This is my first attempt:

import re 

text=''' 
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" 
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline" 
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html" 
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html" 
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00" 
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time> 
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content" 
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p> 
''' 
my_pattern=("(.*)</time>") 
results= re.sub(my_pattern," ", text) 
print(results) 

Answer


Try this:

my_pattern = r"[\s\S]+</time>"
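To see why the first attempt falls short, here is a minimal sketch on a made-up three-line sample (the `sample` string is hypothetical, not the real article markup). Without flags, `.` does not match a newline, so `(.*)</time>` only consumes the single line that contains `</time>`. Both `[\s\S]` and `.` combined with `re.DOTALL` cross line breaks. We cannot tell how the DOTALL flag was passed in the question, but one common pitfall is handing it to `re.sub` as the fourth positional argument, which is `count`, not `flags`.

```python
import re

# Hypothetical three-line sample standing in for the scraped markup.
sample = "<p>Advertisement</p>,\n<time>OCT. 5, 2016</time>\n</p>, <p>BRUSSELS</p>"

# Without flags, "." stops at newlines: only the line holding </time> is replaced.
no_flag = re.sub(r"(.*)</time>", " ", sample)

# [\s\S] matches any character, newlines included.
char_class = re.sub(r"[\s\S]+</time>", " ", sample)

# re.DOTALL makes "." match newlines too; pass it by keyword, not positionally.
dotall = re.sub(r".*</time>", " ", sample, flags=re.DOTALL)

print(repr(no_flag))
print(repr(char_class))
print(repr(dotall))
```

On this sample, the last two calls give the same result, while the first one leaves the `<p>Advertisement</p>,` line in place.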

If you also want to remove the </p> tag that follows, along with the comma and the space, you can use this:

my_pattern = r"[\s\S]+</time>\s*</p>,\s"
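A quick sketch of the extended pattern on a hypothetical sample, assuming there may be more than one whitespace character between `</time>` and `</p>` (for example a trailing space plus a newline). The `\s*` here is my assumption for tolerating that. A single `[\s\S]` matches exactly one character, so it fails on such input.

```python
import re

# Hypothetical sample: a space AND a newline sit between </time> and </p>.
sample = (
    '<p>Advertisement</p>, <p class="byline">By A\n'
    '<time>OCT. 5, 2016</time> \n'
    '</p>, <p class="story">BRUSSELS</p>'
)

# \s* absorbs any run of whitespace before the closing </p>.
tolerant = re.compile(r"[\s\S]+</time>\s*</p>,\s")
cleaned = tolerant.sub("", sample)

# A single [\s\S] allows exactly one character there, so this finds no match.
strict = re.search(r"[\s\S]+</time>[\s\S]</p>,\s", sample)

print(cleaned)
print(strict)
```

If the real text always has exactly one newline after `</time>`, the single-character version works as well.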

What if I want to omit the first tag <p>Advertisement</p> and its attributes?


@M.Huntz Try something like this: '(?<=\<\/p\>)[\s\S]+\<\/time\>' or this: '(?<=\<\/p\>)[\s\S]+\<\/time\>[\s\S]\<\/p\>\,\s' Demo: https://regex101.com/r/79n80Z/3 – Ibrahim
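For reference, a small sketch of how the lookbehind version behaves in Python, again on a hypothetical sample. `(?<=</p>)` only asserts that the match starts right after a closing `</p>` and does not consume it, so `<p>Advertisement</p>` stays in the output. The `\s*` in the second pattern is my own assumption for tolerating extra whitespace, not part of the comment's regex.

```python
import re

# Hypothetical sample standing in for the scraped markup.
sample = "<p>Advertisement</p>, <p>By A\n<time>OCT. 5, 2016</time>\n</p>, <p>BRUSSELS</p>"

# The lookbehind anchors the match just after the first </p> without consuming it.
byline_removed = re.sub(r"(?<=</p>)[\s\S]+</time>", "", sample)

# Same idea, also dropping the trailing "</p>, " that closed the byline paragraph.
fully_trimmed = re.sub(r"(?<=</p>)[\s\S]+</time>\s*</p>,\s", "", sample)

print(repr(byline_removed))
print(repr(fully_trimmed))
```

Python's `re` requires lookbehind patterns to have a fixed width, which `</p>` satisfies.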