2016-10-05

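How to strip newlines from the text returned by BeautifulSoup's get_text method

I have the following output after scraping a web page: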

 text 
Out[50]: 
['\nAbsolute FreeBSD, 2nd Edition\n', 
'\nAbsolute OpenBSD, 2nd Edition\n', 
'\nAndroid Security Internals\n', 
'\nApple Confidential 2.0\n', 
'\nArduino Playground\n', 
'\nArduino Project Handbook\n', 
'\nArduino Workshop\n', 
'\nArt of Assembly Language, 2nd Edition\n', 
'\nArt of Debugging\n', 
'\nArt of Interactive Design\n',] 

I need to strip the \n from each item in the list above while iterating over it. Here is my code:

text = [] 
for name in web_text: 
    a = name.get_text() 
    text.append(a) 

Answers

Strip it just as you would any other string:

text = [] 
for name in web_text: 
    a = name.get_text().strip() 
    text.append(a) 
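
As a minimal sketch of what `str.strip()` does to values like these, the snippet below uses made-up sample strings (`raw` and `multi` are illustrative names, not part of the question). It shows that calling `strip()` with no argument removes all leading and trailing whitespace, not only `\n`, and that it never touches whitespace inside the string.

```python
# Illustrative sample string; not taken from a real page.
raw = "\n  Art of Debugging \t\n"

# With no argument, strip() removes every leading and trailing whitespace
# character: spaces, tabs, and newlines alike.
no_arg = raw.strip()
print(repr(no_arg))         # 'Art of Debugging'

# With an argument, only the listed characters are removed from the ends.
newline_only = raw.strip("\n")
print(repr(newline_only))   # '  Art of Debugging \t'

# Whitespace inside the string is left alone.
multi = "\nArt of\nDebugging\n"
inner_kept = multi.strip()
print(repr(inner_kept))     # 'Art of\nDebugging'
```

If only the newlines should go and surrounding spaces should stay, `strip("\n")` is the narrower choice.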

You can use a list comprehension:

stripedText = [ t.strip() for t in text ] 

Output:

>>> stripedText 
['Absolute FreeBSD, 2nd Edition', 'Absolute OpenBSD, 2nd Edition', 'Android Security Internals', 'Apple Confidential 2.0', 'Arduino Playground', 'Arduino Project Handbook', 'Arduino Workshop', 'Art of Assembly Language, 2nd Edition', 'Art of Debugging', 'Art of Interactive Design'] 

Instead of calling .strip() explicitly, use the strip argument:

a = name.get_text(strip=True) 

This also strips extra whitespace and newline characters from the text of any child elements.
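
The difference between the two approaches can be sketched as follows. The HTML fragment is invented for illustration, and the example assumes the `bs4` package is installed and uses Python's built-in `html.parser`. Because `strip=True` strips each text fragment separately before joining them, whitespace between child elements is lost unless a separator is passed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list item whose link contains a nested <em> tag.
html = "<li>\n<a>Art of <em>Debugging</em></a>\n</li>"
item = BeautifulSoup(html, "html.parser").li

# Plain get_text() keeps the newlines around the link.
plain = item.get_text()
print(repr(plain))            # '\nArt of Debugging\n'

# Stripping afterwards trims only the two ends of the joined text.
stripped_after = item.get_text().strip()
print(repr(stripped_after))   # 'Art of Debugging'

# strip=True strips every text fragment, then joins them with no separator,
# so the space between "Art of" and "Debugging" disappears.
stripped_arg = item.get_text(strip=True)
print(repr(stripped_arg))     # 'Art ofDebugging'

# Passing a separator puts a single space back between the fragments.
with_sep = item.get_text(" ", strip=True)
print(repr(with_sep))         # 'Art of Debugging'
```

For elements that hold a single text node, as in the question's output, `get_text(strip=True)` and `get_text().strip()` give the same result.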

The following approach removes the \n from each item in the list above while iterating over it.

>>> web_text = ['\nAbsolute FreeBSD, 2nd Edition\n', 
... '\nAbsolute OpenBSD, 2nd Edition\n', 
... '\nAndroid Security Internals\n', 
... '\nApple Confidential 2.0\n', 
... '\nArduino Playground\n', 
... '\nArduino Project Handbook\n', 
... '\nArduino Workshop\n', 
... '\nArt of Assembly Language, 2nd Edition\n', 
... '\nArt of Debugging\n', 
... '\nArt of Interactive Design\n',] 

>>> text = [] 
>>> for line in web_text: 
...     a = line.strip() 
...     text.append(a) 
... 
>>> text 
['Absolute FreeBSD, 2nd Edition', 'Absolute OpenBSD, 2nd Edition', 'Android Security Internals', 'Apple Confidential 2.0', 'Arduino Playground', 'Arduino Project Handbook', 'Arduino Workshop', 'Art of Assembly Language, 2nd Edition', 'Art of Debugging', 'Art of Interactive Design']