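Can't remove line breaks from BeautifulSoup text output (Python 2.7.5)

I'm trying to write a program that parses a series of HTML files and stores the resulting data in a .csv spreadsheet, which depends heavily on the line breaks being exactly right. I've tried every method I could find to strip the line breaks from certain pieces of text, to no avail. The relevant code looks like this: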

soup = BeautifulSoup(f) 
ID = soup.td.get_text() 
ID.strip() 
ID.rstrip() 
ID.replace("\t", "").replace("\r", "").replace("\n", "") 
dateCreated = soup.td.find_next("td").get_text() 
dateCreated.replace("\t", "").replace("\r", "").replace("\n", "") 
dateCreated.strip() 
dateCreated.rstrip() 
# debug 
print('ID:' + ID + 'Date Created:' + dateCreated)

and the resulting output looks like this:

ID: 
FOO 
Date Created: 
BAR

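This, along with another problem in the same program, has been driving me crazy. Help would be fantastic. Thanks.

Edit: Figured it out; it was a really silly mistake. Instead of just doing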

ID.replace("\t", "").replace("\r", "").replace("\n", "")

I should have been doing

ID = ID.replace("\t", "").replace("\r", "").replace("\n", "")

Source

2014-07-22 Ben Forde

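Try printing repr(ID) to see which bytes are actually there? Otherwise, maybe try string formatting instead of concatenation?

Printing repr(ID) and repr(dateCreated) gave me u'\nFOO\n' and u'\nBAR\n'. I've tried changing the replace arguments to (u"\n", u""), but that didn't do anything.

Your problem is that you're expecting in-place modification from operations that actually return a new value. Python strings are immutable, so the result has to be assigned back to the name.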

ID.strip() # returns the stripped value, doesn't change ID.
ID = ID.strip() # Would be more appropriate.
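
To make the point concrete, here is a minimal sketch of the difference between discarding and keeping the return value. The name `raw` and its sample value are made up for illustration; the value just mirrors the repr() output reported in the comments.

```python
# Hypothetical sample value, shaped like the repr() output from the comments.
raw = u"\nFOO\n"

raw.strip()            # the stripped copy is thrown away; raw is untouched
unchanged = raw        # still contains both newlines

cleaned = raw.strip()  # binding the return value keeps the result
print(repr(cleaned))
```

The same applies to replace(), rstrip() and every other string method, since none of them can modify the string they are called on.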

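You could use a regex, but a regex is overkill for this. In practice, especially when the unwanted characters are only at the beginning and end of the string, just pass them to strip: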

ID = ID.strip('\t\r\n')
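
One caveat worth sketching: strip() with a character set only trims the two ends of the string, so a line break in the middle survives. The sample value below is invented for illustration and is not from the original HTML files.

```python
# Hypothetical value with a line break in the middle as well as at the ends.
raw = u"\r\n\tFOO\nBAR\t\r\n"

# Trims only leading and trailing tabs, carriage returns and newlines.
ends_only = raw.strip("\t\r\n")

# Removes the characters wherever they occur, as in the question's code.
everywhere = raw.replace("\t", "").replace("\r", "").replace("\n", "")

print(repr(ends_only))
print(repr(everywhere))
```

If the cell text can contain interior line breaks, the chained replace() (or a regex) is probably the safer choice for CSV output.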

Source

2014-07-22 04:52:05

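Although this question has already been more or less answered, I just wanted to point out that there's no strong reason to do the replacement in that long-winded way; you can simply do this: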

import re 

ID = re.sub(r'[\t\r\n]', '', ID)

Even though regex is generally something to avoid.

Source
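
A related variation, offered only as a sketch: if deleting the characters outright would glue words together, collapsing each run of whitespace into a single space may suit a CSV cell better. The sample text is hypothetical, and whether collapsing is acceptable depends on the data.

```python
import re

# Hypothetical cell text with mixed whitespace inside and around it.
raw = u"\n  FOO \t\r\n BAR \n"

# Regex version: replace every whitespace run with one space, then trim.
collapsed_re = re.sub(r"\s+", " ", raw).strip()

# Regex-free version: split() with no arguments splits on whitespace runs.
collapsed_split = " ".join(raw.split())

print(repr(collapsed_re))
print(repr(collapsed_split))
```

Both forms produce the same single-line string here; the split/join form avoids the regex entirely.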

2014-07-22 04:18:48

BeautifulSoup4 has a built-in implementation for stripping strings

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead (see the BS4 documentation on stripped_strings):

html_doc = """<div class="path"> <a href="#"> abc</a> <a href="#"> def</a> <a href="#"> ghi</a> </div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
result_list = []
for s in soup.select("div.path"):
    result_list.extend(s.stripped_strings)
print(" ".join(result_list))

Output: abc def ghi
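
For the table cells in the question, a similar effect can likely be had from get_text(strip=True), which strips each text fragment before joining them. This is a sketch against an invented two-cell table, not the asker's real files, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reproducing the u'\nFOO\n' / u'\nBAR\n' cells.
html_doc = "<table><tr><td>\nFOO\n</td><td>\nBAR\n</td></tr></table>"

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims whitespace from each string inside the tag.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

print("ID:" + cells[0] + " Date Created:" + cells[1])
```

This keeps the cleanup next to the extraction, so there is no separate strip or replace step to forget to assign back.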

Source

2016-05-25 17:59:53 pymen
