2016-04-19 29 views

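Searching titles in the MEDLINE database using Entrez and Biopython

I am trying to search for papers with particular words in the title. More precisely, papers published between 2010 and 2015 with the word viral or virus in the title. This is the code I have: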

import re 
from Bio import Entrez, Medline

handle = Entrez.esearch(db="pubmed", # database to search 
        term="2010[Date - Publication]:2015[Date - Publication]" 
        ) 
record = Entrez.read(handle) 
handle.close() 

pmid_list = record["IdList"] #list of records 

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline",  retmode="text") 
records = Medline.parse(handle) 

titles = [] # start with empty list of titles 
for record in records: 
    ti_list = record['TI'] #titles 
    for title in ti_list: 
        if title == "virus" and title not in titles: #searching viral/virus 
            titles.append(title) 

print('Publications with viral or virus in the title:') 
for record in records: 
    print(" ", title) 

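If I just print(record['TI']) then I get a list of all the titles in my search query, but I cannot search for specific words. I think my error may be in `if title == "virus"` (since obviously no paper will be titled with that single word alone).

I am very stuck. Is there a better way to look for words in the titles of the papers in my query?

Thanks.

EDIT: updated code (still no luck)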

import re 
from Bio import Entrez, Medline

handle = Entrez.esearch(db="pubmed", # database to search 
        term="2010[Date - Publication]:2015[Date - Publication]" 
        ) 
record = Entrez.read(handle) 
handle.close() 

pmid_list = record["IdList"] #list of records 

from Bio import Medline 
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline",  retmode="text") 
records = Medline.parse(handle) 

r = re.compile(r"\bvir(al|us)\b") 
titles = set() # start with empty list of titles 
for record in records: 
    ti_list = record['TI'] # titles 
    for title in ti_list: 
     if r.search(title): # 
      titles.add(title) 

print('Publications with viral or virus in the title:') 
for record in records: 
    print(" ", title) 

New code:

import re 
from Bio import Entrez, Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text", 
         term="2010[Date - Publication]:2015[Date - Publication]") 
records = Medline.parse(handle) 
titles = [] 
for record in records: 
    ti_list = record['TI'] 
    for title in ti_list: 
     titles.append(title) 
handle.close() 
for title in titles: 
    print(title) 

Answers


If you want to match substrings, use any to see whether any of the words are contained in the title:

words = ("viral","virus") 
if any(w in title for w in words) and title not in titles: # 
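A minimal sketch of that substring check on its own, run against made-up titles rather than real PubMed results (the sample strings and the `has_any_word` helper are hypothetical). Because `in` on strings is a substring test, "virus" also matches "viruses"; the lowercasing step is an assumption that case should not matter.

```python
words = ("viral", "virus")

def has_any_word(title, words=words):
    """Return True if any of the words occurs as a substring of title."""
    lowered = title.lower()  # assumption: matching should ignore case
    return any(w in lowered for w in words)

# Hypothetical titles standing in for record['TI'] values
sample_titles = [
    "Viral load dynamics in acute infection",
    "Emerging viruses in bats",
    "Protein folding in yeast",
]

titles = []
for title in sample_titles:
    if has_any_word(title) and title not in titles:
        titles.append(title)

print(titles)
```

The second title is kept only because "viruses" contains "virus"; a stricter match needs a regular expression with word boundaries.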

But you seem to want to filter the records to get any record whose title contains viral or virus:

st = {"viral","virus"} 

filtered_records = [ record for record in records if any(w in st for w in record['TI'])] 

If you want to match strings using a pattern, then you really need to make it a regex; "vir(al|us)" is just a string in your code:

import re 

r = re.compile("vir(al|us)") 
filtered_records = [record for record in records if any(r.search(w) for w in record['TI'])] 

In your own loop, the regex would go where your if is:

import re 

r = re.compile(r"vir(al|us)") 
if r.search(title) and title not in titles: 
     ....... 

If you don't want to match viruses etc., then use a word boundary in your regex:

r = re.compile(r"\bvir(al|us)\b") 
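To make the effect of the word boundary concrete, here is a small comparison of the two patterns on a few invented strings (the sample strings and the `loose`, `strict` and `strict_ci` names are only for illustration). Both patterns are case-sensitive as written, so a title starting with "Virus" is missed unless `re.IGNORECASE` is added; whether that is wanted is an assumption about the goal.

```python
import re

loose = re.compile(r"vir(al|us)")        # matches inside longer words
strict = re.compile(r"\bvir(al|us)\b")   # whole words only

samples = ["virus", "viruses", "antiviral", "a viral protein", "Virus entry"]

# Map each sample to (loose match?, strict match?)
results = {s: (bool(loose.search(s)), bool(strict.search(s))) for s in samples}
for s, (is_loose, is_strict) in results.items():
    print(f"{s!r}: loose={is_loose} strict={is_strict}")

# Optional variant: whole words, ignoring case
strict_ci = re.compile(r"\bvir(?:al|us)\b", re.IGNORECASE)
print(bool(strict_ci.search("Virus entry")))
```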

You should also make titles a set, since with your own code it cannot contain duplicates. A working example:

r = re.compile(r"\bvir(al|us)\b") 
titles = set() # start with empty list of titles 
for record in records: 
    ti_list = record['TI'] # titles 
    for title in ti_list: 
     if r.search(title): # 
      titles.add(title) 

It can become a set comprehension:

r = re.compile(r"\bvir(al|us)\b") 

titles = {title for record in records for title in record['TI'] if r.search(title)} # titles 

Since record['TI'] returns a string rather than a list, test the title directly instead of looping over it:

r = re.compile(r"\bvir(al|us)\b") 
titles = set() 
for record in records: 
    title = record['TI'] # title is a str not a list 
    if r.search(title): # 
      titles.add(title) 

Do the same with the set comprehension or any of the other examples.
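Putting the pieces together, the following is a rough sketch that runs without network access. Plain dicts stand in for the records that `Medline.parse` yields; the sample data and the `matching_titles` helper are hypothetical, and using `.get` is an assumption that some records may have no `TI` field.

```python
import re

# Hypothetical records; real ones would come from Medline.parse(handle)
sample_records = [
    {"PMID": "1", "TI": "A viral protein that blocks interferon signalling"},
    {"PMID": "2", "TI": "Emerging viruses in bat populations"},
    {"PMID": "3", "TI": "Structure of a virus capsid"},
    {"PMID": "4"},  # record without a title
    {"PMID": "5", "TI": "Structure of a virus capsid"},  # duplicate title
]

r = re.compile(r"\bvir(al|us)\b")

def matching_titles(records, pattern):
    """Collect the distinct titles in which pattern matches."""
    titles = set()
    for record in records:
        title = record.get("TI", "")  # the title is a str, not a list
        if pattern.search(title):
            titles.add(title)
    return titles

titles = matching_titles(sample_records, r)

print("Publications with viral or virus in the title:")
for title in sorted(titles):
    print(" ", title)
```

The final loop iterates over `titles` rather than over `records`: the parser returns an iterator, so a second pass over `records` would produce nothing.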


Sorry, I am very new to this. How do I put the regex version of your answer into my code? – jarch


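@user3723011, what are you trying to achieve? You are adding to the titles list but you don't seem to use it. Are you still looking for substrings, or for exact matches? –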


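My goal is to have output that says 'Publications with viral or virus in the title: [list of publications with viral or virus in the title]'. I am trying to get an exact match. – jarch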