Preprocessing a list of strings

I have a list of strings. Here is an example string:

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies"

For each comma-separated term in the string, I want to remove all text after the "/". If a term has no slash, it should be left unchanged.

For example, I would like the resulting string to look like this:

mesh = "Adrenergic beta-Antagonists, Adult, Aged, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies"

I would then like to remove any duplicate values from the string (e.g., Aged). The desired string:

mesh = "Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies"

I wrote this code, which works for a single string, but I am looking for a more efficient way to do this for a list of strings:

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies" 
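newMesh = []
for each in mesh.split(","):
    # Keep only the text before the first "/" and drop the leading space.
    newMesh.append(each.split('/', 1)[0].lstrip(' '))
# A set removes duplicates; the original order is not preserved.
newMesh = list(set(newMesh))
meshString = ", ".join(newMesh)
print(meshString)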

Note: the order of the terms in the string does not matter.

Source

2017-10-21 jdoe

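Please stop adding incorrect tags; what you have is not a DataFrame. –

@cᴏʟᴅsᴘᴇᴇᴅ Apologies, not sure what I was thinking... – jdoe

Wrap it in a function and apply it with map: 'list_of_strings = list(map(your_function, list_of_strings))' –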
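
The suggestion in the comment above could look roughly like the sketch below. The names clean_mesh and mesh_list are hypothetical, as is the shortened sample data; the function body reuses the split logic from the question. Because the question says term order does not matter, the terms are sorted here only to make the output reproducible.

```python
def clean_mesh(mesh):
    """Drop everything after '/' in each comma-separated term, then de-duplicate."""
    terms = {term.split('/', 1)[0].strip() for term in mesh.split(',')}
    terms.discard('')  # ignore empty terms, e.g. from a trailing comma
    # A set has no defined order, so sort for reproducible output.
    return ', '.join(sorted(terms))

# Hypothetical input: a list of strings shaped like the one in the question.
mesh_list = [
    "Aged, Aged/*effects, Blood Glucose/*drug effects, Adult",
    "Humans, Hypertension/*drug therapy, Humans",
]

cleaned = list(map(clean_mesh, mesh_list))
print(cleaned)
```

A list comprehension, [clean_mesh(m) for m in mesh_list], would do the same job and is often considered more readable than map.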

You can use re.sub:

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies" 
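import re

# Strip each "/*qualifier" suffix, e.g. "/*therapeutic use".
s = re.sub(r"/\*[\w\s]+", '', mesh)
final_string = []
# Split on a comma plus any following whitespace so terms carry no leading spaces.
for i in re.split(r",\s*", s):
    if i not in final_string:
        final_string.append(i)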

new_final_string = ', '.join(final_string) 
print(new_final_string)

Output:

'Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies'
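
One caveat worth checking: the pattern in this answer only removes qualifiers that begin with "/*". If some terms might carry a qualifier without the asterisk (an assumption; the sample string in the question has none), a broader pattern such as "/[^,]*" removes everything from the slash up to the next comma. The sample string below is made up for illustration.

```python
import re

# Hypothetical sample: the first term has a qualifier without '*'.
sample = "Humans/physiology, Aged/*effects, Male"

narrow = re.sub(r"/\*[\w\s]+", "", sample)  # pattern from the answer above
broad = re.sub(r"/[^,]*", "", sample)       # everything from '/' to the next comma

print(narrow)
print(broad)
```

Here the narrow pattern leaves "Humans/physiology" untouched, while the broad one reduces it to "Humans".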

Source

2017-10-21 16:22:38 Ajax1234

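What is the most efficient way to remove duplicate tokens from this string? – jdoe

@jdoe Please see my recent edit. – Ajax1234

Edited to correct an error (e.g., Adrenergic beta-Antagonists is a single token) – jdoe

Using the re.sub() function and a set object (for faster membership checks):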

import re 

mesh = "Adrenergic beta-Antagonists/*therapeutic use, Adult, Aged, Aged/*effects, Antihypertensive Agents/*therapeutic use, Blood Glucose/*drug effects, Celiprolol/*therapeutic use, Female, Glucose Tolerance Test, Humans, Hypertension/*drug therapy, Male, Middle Aged, Prospective Studies" 
word_set = set() 
result = [] 

for w in re.sub(r'/[^,]+', '', mesh).split(','):
    w = w.strip()
    if w not in word_set:
        result.append(w)
        word_set.add(w)
result = ', '.join(result)

print(result)

Output:

Adrenergic beta-Antagonists, Adult, Aged, Antihypertensive Agents, Blood Glucose, Celiprolol, Female, Glucose Tolerance Test, Humans, Hypertension, Male, Middle Aged, Prospective Studies
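
Since Python 3.7 a plain dict keeps insertion order, so the set-plus-list bookkeeping above can be collapsed into dict.fromkeys, which drops repeats while keeping the first occurrence of each term. A minimal sketch, where dedupe_terms and QUALIFIER are hypothetical names and the sample string is shortened:

```python
import re

QUALIFIER = re.compile(r'/[^,]+')  # same pattern as above, compiled once for reuse

def dedupe_terms(mesh):
    """Strip '/qualifier' parts and drop repeated terms, keeping first-seen order."""
    terms = (t.strip() for t in QUALIFIER.sub('', mesh).split(','))
    # dict keys are unique and (Python 3.7+) ordered by insertion.
    return ', '.join(dict.fromkeys(terms))

print(dedupe_terms("Adult, Aged, Aged/*effects, Blood Glucose/*drug effects, Adult"))
```

For a list of strings, [dedupe_terms(m) for m in list_of_strings] applies it to each element in turn.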

Source

2017-10-21 16:31:27 RomanPerekhrest
