2017-10-21

I have a list of strings that I need to preprocess. An example string:

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies" 

For each comma-separated term in the string, I want to remove all text from the "/" onward. If a term contains no slash, it should be left unchanged.

For example, I would like the resulting string to look like this:

mesh = "Adrenergic beta-Antagonists, Adult, Aged, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies" 

I would then like to remove any duplicate values from the string (e.g., Aged). The desired string:

mesh = "Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies" 

I wrote this code, which works for a single string, but I am looking for a more efficient way to do this for a list of strings:

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"
newMesh = []
for each in mesh.split(","):
    # Keep only the text before the first "/" and trim surrounding whitespace.
    newMesh.append(each.split('/', 1)[0].strip())
newMesh = list(set(newMesh))  # a set drops duplicates; term order is not preserved
meshString = ", ".join(newMesh)
print(meshString)

Note: the order of the terms in the string is irrelevant.

+0

Please stop adding the wrong tags; what you have is not a dataframe.

+0

@cᴏʟᴅsᴘᴇᴇᴅ Apologies - not sure what I was thinking... – jdoe

+1

Wrap it in a function and apply it with map: 'list_of_strings = list(map(your_function, list_of_strings))'
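A minimal sketch of that suggestion, assuming the per-string logic from the question is moved into a function. The name `clean_mesh` and the two sample strings are hypothetical, chosen only for illustration; the terms are sorted purely so the output is deterministic, since the question says order does not matter.

```python
def clean_mesh(mesh):
    """Strip everything from '/' onward in each comma-separated term and drop duplicates."""
    # A set comprehension removes duplicates as the terms are cleaned.
    terms = {term.split("/", 1)[0].strip() for term in mesh.split(",")}
    # Sorting is optional; it only makes the result reproducible.
    return ", ".join(sorted(terms))


# Hypothetical input list, shortened from the example in the question.
list_of_strings = [
    "Aged, Aged/*effects, Adult",
    "Humans, Hypertension/*drug therapy, Humans",
]
cleaned = list(map(clean_mesh, list_of_strings))
print(cleaned)
```

A list comprehension, `[clean_mesh(s) for s in list_of_strings]`, would do the same job as `map` here.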

Answers

4

You can use re.sub:

import re

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"
# Remove each "/*qualifier" suffix (word characters and whitespace after "/*").
s = re.sub(r"/\*[\w\s]+", '', mesh)
final_string = []
for i in s.split(","):
    i = i.strip()  # drop the space left after each comma
    if i not in final_string:
        final_string.append(i)

new_final_string = ', '.join(final_string)
print(new_final_string)

Output:

Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies
+0

What is the most efficient way to remove duplicate tokens from this string? – jdoe

+0

@jdoe Please see my recent edit. – Ajax1234

+0

Edited to correct a mistake - (e.g., Adrenergic beta-Antagonists is a single token) – jdoe

0

With the re.sub() function and a set object (for faster membership tests):

import re 

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies" 
word_set = set() 
result = [] 

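# Strip everything from "/" up to the next comma, then split into terms.
for w in re.sub(r'/[^,]+', '', mesh).split(','):
    w = w.strip()
    if w not in word_set:  # set lookup is O(1) on average
        result.append(w)
        word_set.add(w)
result = ', '.join(result)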

print(result) 

Output:

Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies
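Since the question asks about a list of strings, here is a hedged sketch of how the same regex could be applied to each string in turn. It is not part of the answer above: the function name `dedupe_terms` and the sample list `meshes` are assumptions for illustration, and `dict.fromkeys` is swapped in for the explicit set-plus-list loop because dictionaries keep insertion order (guaranteed from Python 3.7).

```python
import re


def dedupe_terms(mesh):
    # Remove "/qualifier" suffixes, then split on commas and trim whitespace.
    terms = (t.strip() for t in re.sub(r"/[^,]+", "", mesh).split(","))
    # dict.fromkeys keeps only the first occurrence of each term, in order.
    return ", ".join(dict.fromkeys(terms))


# Hypothetical input list, shortened from the example in the question.
meshes = [
    "Adult, Aged, Aged/*effects, Female",
    "Humans, Male, Humans, Hypertension/*drug therapy",
]
results = [dedupe_terms(m) for m in meshes]
print(results)
```

This keeps the first-seen order of the terms, which the question does not require but which makes the output easier to compare against the input.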