個人清單:的Python:在列表中替換 n r 噸不包括起始 n n和與 n r n 噸結束
['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
示例代碼:
import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
我颳了一個網頁後,我需要做數據清理。我想,以取代所有的"\r"
,"\n"
,"\t"
爲""
,但我發現我有字幕可以在這一點,如果我這樣做,字幕和句子要一起混合。
每個字幕總是與\n\n
開始,以\n\r\n\t
結束,是有可能,我可以做些什麼來區分它們在此列表中像\aEtymology\a
。如果我將\n\n
和\n\r\n\t
分別替換爲\a
,首先會導致其他部分可能具有相同的元素,例如\n\n\r
,它將變成\a\r
。提前致謝!
這非常整潔!對我來說很容易學習和理解。只是列表中的問題列表= [r.replace('',x,1)],1用於什麼?當我刪除它時,它打印出相同的結果。只是好奇:)謝謝! –
Makiyo
@Makiyo 1是單獨替換第一個出現的。如果刪除1,則輸出中的字幕將相同。 –