2014-01-14 77 views

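Python: remove all characters inside a tag, except words on an exception list

I want to remove all of the text (mostly letters) inside a tag, but keep the words that appear on an exception list.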

For example, I want to change

<html>VERY RARE CAR WITH NEW TIRES WHITE</html> 

to:

<html>CAR WHITE</html> 

Here the two words CAR and WHITE are the ones on the exception list.

Sooo, what have you tried? – Matt

Have you tried anything? Did it work? – 2014-01-14 08:27:13

'all the characters (mostly letters)'???!? – thefourtheye

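Answer

I'm not sure this is what you are looking for. I'll show how to strip out any text you want using two lists, one of exception words and one of HTML tags: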

# Tags that are passed through unmodified
html_tags = ['<a>', '</a>', '<html>', '</html>']

# Exception list: the words to keep
word_list = ['WORD1', 'CAR', 'WORD2', 'WHITE', 'WORD3', 'WORD4']

# The string you want to filter
string = '<html>VERY RARE CAR WITH NEW TIRES WHITE</html>'

# The result string, where we concatenate the kept words and the tags
final_string = ''

# Put a '#' before each '<' and after each '>' so we can split the text on tags
# (this assumes the text itself never contains '#')
string = string.replace('<', '#<')
string = string.replace('>', '>#')

string_list = string.split('#')  # the tags (<html>, <a>, ...) are now separate, unmodified items

# Now we have:
# string_list = ['', '<html>', 'VERY RARE CAR WITH NEW TIRES WHITE', '</html>', '']

for chunk in string_list:  # go over every item in string_list
    if chunk in html_tags:  # a known tag: add it to final_string as is
        final_string += chunk
    else:  # not a tag, so it is text, here 'VERY RARE CAR WITH NEW TIRES WHITE'
        # split on whitespace and keep only the words that are in word_list
        kept = [word for word in chunk.split() if word in word_list]
        final_string += ' '.join(kept)

# final_string is now '<html>CAR WHITE</html>'
# Note: a tag that is not listed in html_tags is treated as text and dropped.
# The code is a little more complex than strictly needed so that it also
# works with different tags and larger pieces of HTML.
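
The same idea can be sketched without the '#' marker trick by splitting on the tags with a regular expression. This is only a rough alternative, not part of the original answer: the function name `keep_listed_words` is made up for illustration, and the pattern `<[^>]*>` is a simple assumption about what a tag looks like, so it will not cope with every real HTML document (a proper HTML parser would be safer there).

```python
import re


def keep_listed_words(html, keep):
    """Drop every word between tags unless it is in `keep`; tags pass through."""
    # The capturing group makes re.split return the tags along with the text.
    parts = re.split(r'(<[^>]*>)', html)
    out = []
    for part in parts:
        if part.startswith('<') and part.endswith('>'):
            out.append(part)  # looks like a tag: copy it unchanged
        else:
            # plain text: keep only the words found in the exception list
            out.append(' '.join(w for w in part.split() if w in keep))
    return ''.join(out)


print(keep_listed_words('<html>VERY RARE CAR WITH NEW TIRES WHITE</html>',
                        {'CAR', 'WHITE'}))
```

Unlike the version above, this keeps any tag it finds rather than only the ones in a fixed list, and it does not break if the text happens to contain a '#'.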

Hope it helps!
