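How do I strip the text between two delimiters when it spans blank lines?

I am trying to remove the text between these two delimiters: '<' and '>'. I read the content of an e-mail and then write that content to a .txt file. I get a lot of junk between these two delimiters, as well as blank lines between the lines of my .txt file. How can I get rid of it? Below is the data my script reads back from the .txt file it wrote: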

First Name</td> 

       <td bgcolor='white' style='padding:5px 

!important;'>Austin</td> 

       </tr><tr> 

       <td bgcolor='#f9f9f9' style='padding:5px !important;' 

valign='top' width=170>Last Name</td>

Below is my current code, which reads from the .txt file and strips the blank lines:

# Get file contents
with open('emailtext.txt', 'r') as fd:
    contents = fd.readlines()

new_contents = []

# Get rid of empty lines
for line in contents:
    # Strip whitespace; nothing is left if the line was just "\n"
    if not line.strip():
        continue
    # We got something, save it
    else:
        new_contents.append(line)

for element in new_contents:
    print(element)

Here is what I expect:

First Name  Austin  


Last Name  Jones

2016-11-29 E_R

Could you post the expected output for your example?

Same as @Farhan.K, but add some input/expected/actual doohickeys (technical term) – Blacksilver

First Name\t \tAustin\t \t Last Name\t \tJones

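from bs4 import BeautifulSoup

# Triple quotes are needed because the markup contains single quotes and a line break
markup = """<td bgcolor='#f9f9f9' style='padding:5px !important;'

valign='top' width=170>Last Name</td>"""
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())

You can use BeautifulSoup.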

2016-11-29 15:13:12 Backtrack

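You should consider using a regular expression and the re.sub function:

import re
# text holds the e-mail content read from the file.
# flags must be passed by keyword: the fourth positional argument of re.sub is count.
print(re.sub(r'<.*?>', '', text, flags=re.DOTALL))

That said, the advice "don't parse HTML with a custom parser" always holds.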
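As a minimal sketch of the regex approach applied to the data from the question, the snippet below removes the tags (including those broken across lines) and then drops the lines left empty. The function name `strip_tags` and the `sample` string are illustrative assumptions, not part of the original answer, and the pattern would misbehave if an attribute value itself contained a `>`.

```python
import re

def strip_tags(text):
    """Remove every <...> span, including ones broken across lines."""
    # flags=re.DOTALL lets '.' match newlines, so multi-line tags are matched
    without_tags = re.sub(r'<.*?>', '', text, flags=re.DOTALL)
    # Trim each remaining line and drop the ones that end up empty
    lines = [line.strip() for line in without_tags.splitlines()]
    return [line for line in lines if line]

# Hypothetical sample, copied from the data shown in the question
sample = """First Name</td>

       <td bgcolor='white' style='padding:5px

!important;'>Austin</td>

       </tr><tr>

       <td bgcolor='#f9f9f9' style='padding:5px !important;'

valign='top' width=170>Last Name</td>
"""

print(strip_tags(sample))  # ['First Name', 'Austin', 'Last Name']
```

Returning a list rather than a joined string leaves it to the caller to decide how labels and values should be paired up.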

2016-11-29 15:17:16

You need to assign the result of line.strip() to a variable and append that instead. Otherwise you just save the unstripped line.

for line in contents:
    line = line.strip()
    if not line:
        continue
    # We got something, save it
    else:
        new_contents.append(line)
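The same idea can be written more compactly. This is only a sketch: the helper name `non_empty_stripped` and the in-memory `contents` list are assumptions standing in for the lines read from the file.

```python
def non_empty_stripped(lines):
    """Return each line stripped of surrounding whitespace, skipping blanks."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]

# Hypothetical stand-in for the result of fd.readlines()
contents = ["First Name  Austin  \n", "\n", "   \n", "Last Name  Jones\n"]
print(non_empty_stripped(contents))  # ['First Name  Austin', 'Last Name  Jones']
```

Note that this only removes blank lines and surrounding whitespace; it does not touch the HTML tags themselves.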

2016-11-29 15:19:17 MrLeeh

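It looks like you are trying to remove all HTML tags from the text. You could do it by hand, but tags can be complex and can even span multiple lines.

My advice is to use BeautifulSoup, which was written specifically to handle XML and HTML: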

import bs4

# extract content... then
new_content = bs4.BeautifulSoup(content, 'html.parser').text
print(new_content)

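The bs4 module has been extensively tested, copes with many corner cases, and greatly reduces the amount of code you have to write yourself...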
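If installing bs4 is not an option, a similar effect can be sketched with the standard library's `html.parser` module, which is a different tool from the one recommended above. The class name `TextExtractor`, the helper `extract_text`, and the `sample` markup are all hypothetical, and this is far less forgiving of broken HTML than BeautifulSoup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

# Hypothetical sample modelled on the question: tags broken across blank lines
sample = """<tr><td bgcolor='#f9f9f9' style='padding:5px !important;'

valign='top' width=170>First Name</td>

       <td bgcolor='white' style='padding:5px

!important;'>Austin</td></tr>"""

print(extract_text(sample))
```

Because the parser tracks quoted attribute values, a tag split over several lines is handled as one tag, which is the case a line-by-line approach struggles with.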

2016-11-29 15:19:44

I'll try this out. Thanks for your input.
