2017-10-16 121 views
0

Python: replace \n, \r, \t in a list, excluding runs that start with \n\n and end with \n\r\n\t

My list:

['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']

Sample code:

import requests 
from bs4 import BeautifulSoup 
import re 
re=requests.get('http://www.abcde.com/banana') 
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser") 
title_tag = soup.select_one('.page_article_title') 
print(title_tag.text) 
list=[] 
for tag in soup.select('.page_article_content'): 
    list.append(tag.text) 
#list=([c.replace('\n', '') for c in list]) 
#list=([c.replace('\r', '') for c in list]) 
#list=([c.replace('\t', '') for c in list]) 
print(list) 

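After scraping a web page, I need to clean the data. I want to replace every "\r", "\n" and "\t" with "", but I found that the text contains subtitles, and if I do that, the subtitles and the sentences get mixed together.

Every subtitle always starts with \n\n and ends with \n\r\n\t. Is there a way to mark them in this list, for example as \aEtymology\a? If I replace \n\n and \n\r\n\t separately with \a, the first replacement also hits other parts that contain the same characters, for example \n\n\r, which would become \a\r. Thanks in advance!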

Answers

1

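Approach

  1. Find the list of subtitles to protect
  2. Replace each subtitle with a custom string, <subtitles>
  3. Remove \n, \r, \t etc. from the list
  4. Replace the custom string with the actual subtitles

Code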

l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n'] 

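import re
# Collect every subtitle together with its surrounding whitespace:
# each one starts with '\n\n' and ends with '\n\r\n\t'.
regex = re.findall("\n\n.*.\n\r\n\t", l[0])
print(regex)

# Swap each subtitle for a placeholder so the cleanup below leaves it alone.
for x in regex:
    l = [r.replace(x, "<subtitles>") for r in l]

# Strip the remaining newline, tab and carriage-return characters.
rep = ['\n', '\t', '\r']
for y in rep:
    l = [r.replace(y, '') for r in l]

# Put the subtitles back in order, one placeholder at a time (count=1).
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)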

Output

['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t'] 

['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.'] 
+0

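This is really neat! Very easy for me to learn and understand. Just one question about the list comprehension l = [r.replace('<subtitles>', x, 1) for r in l]: what is the 1 for? When I remove it, it prints the same result. Just curious :) Thanks! – Makiyo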

+0

@Makiyo The 1 makes replace() substitute only the first occurrence. If you remove the 1, every subtitle in the output will be the same one.
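
A minimal sketch of what the count argument of str.replace() does, using a shortened, made-up string instead of the full list from the question (`text`, `subtitles`, `with_count` and `without_count` are illustrative names only):

```python
# Shortened, hypothetical stand-in for the cleaned text with two placeholders.
text = "intro<subtitles>body one<subtitles>body two"
subtitles = ["[Description]", "[Etymology]"]

# With count=1, each pass fills only the first remaining placeholder.
with_count = text
for s in subtitles:
    with_count = with_count.replace("<subtitles>", s, 1)

# Without a count, the first pass fills every placeholder at once,
# so the second subtitle never gets used.
without_count = text
for s in subtitles:
    without_count = without_count.replace("<subtitles>", s)

print(with_count)     # intro[Description]body one[Etymology]body two
print(without_count)  # intro[Description]body one[Description]body two
```

With two or more subtitles the difference shows up in the second heading; with only one subtitle both versions would print the same thing.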

0
import re  

print([re.sub(r'[\n\r\t]', '', c) for c in list]) 

I think you can use a regular expression.

+0

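I don't think this is a correct answer. His "\n\r\t" means '\n' or '\r' or '\t'. If you read it as the literal sequence "\n\r\t", then the following sentence, "starts with \n\n and ends with \n\r\n\t", would be pointless. Check his example: there is no "\n\r\t" in it at all.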

0

You can do this with a regular expression:

import re
# li is assumed to be the list of scraped strings
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]

\g<1> is a backreference to the first group of the regex, (\w+). It lets you reuse whatever was matched there.
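
A minimal, self-contained sketch of that substitution on a shortened sample (`sample`, `marked` and `cleaned` are names made up for this illustration, and the sample text is not the real page). Note that `\w+` only matches single-word subtitles, so a heading containing a space would be left untouched:

```python
import re

# Shortened, hypothetical sample in the same shape as the scraped text.
sample = ['\n\r\n\tIntro sentence.\n\nDescription\n\r\n\tBody sentence.\n']

# A subtitle is a single word between '\n\n' and '\n\r\n\t'.
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')

# Wrap each subtitle in \a markers, then drop the remaining whitespace controls.
marked = [subtitle.sub(r'\a\g<1>\a', s) for s in sample]
cleaned = [re.sub(r'[\n\r\t]', '', s) for s in marked]

print(cleaned)  # ['Intro sentence.\x07Description\x07Body sentence.']
```

The \a in the replacement becomes the BEL character, which is displayed as \x07 when the list is printed.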

+0

Hi! I tried it, but it doesn't work. I'm not sure whether I put it in the wrong place. I just uploaded my whole code above :) – Makiyo

+0

What didn't work? Any errors?

+0

AttributeError: 'Response' object has no attribute 'compile' – Makiyo
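
That message suggests the name `re` no longer refers to the regex module: in the sample code in the question, `re=requests.get(...)` rebinds `re` to the Response object, so `re.compile` is looked up on the response. A minimal sketch of a fix, assuming that is the cause. The `FakeResponse` class is a hypothetical stand-in so the snippet runs without network access, and `response` is just an illustrative name:

```python
import re

class FakeResponse:
    """Hypothetical stand-in for the object returned by requests.get()."""
    text = '\n\nEtymology\n\r\n\tThe word banana ...'

# Bind the response to a name other than `re`, so the module stays reachable.
response = FakeResponse()

# Works, because `re` is still the regex module here.
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
result = subtitle.sub(r'\a\g<1>\a', response.text)
print(repr(result))
```

In the real script the same idea would mean renaming the variable that holds the result of requests.get() and using that name when building the BeautifulSoup object.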