2015-12-10 20 views

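Python: nesting loops instead of creating multiple inputs and outputs

I have just started learning Python and programming, so this may be a very naive question, but I would appreciate any help.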

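The code below works, but I have been told that having all these separate input and output files is bad practice and that I should nest the loops instead. However, every time I try to nest anything, I end up with an empty file.

So my question is: how do I nest all of this?

Thanks, and sorry for the long post.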

import subprocess
import sys

# 1) Call a Perl script and run it to produce the input file.
perl = "/usr/bin/perl"
perl_script = "geoFF.pl"
params = " --mount-doom-hot"
pl_script = subprocess.Popen([perl, perl_script, params], stdout=sys.stdout)
pl_script.communicate()

# 2) Read the output of the Perl script, keeping only the wanted data.
# The input is a BIG file and I just want some specific lines from it.
infile1 = "inputperl.txt"
outfile1 = "c1.txt"

f1 = open(infile1, 'r')
o1 = open(outfile1, 'w+')

words = ['Acc', 'title', 'orgn', 'date', 'GP']  # keep the lines of f1 that contain any of these words

for line in f1:
    if any(word in line for word in words):
        o1.write(line)
f1.close()
o1.close()  # flush c1.txt before it is reopened below

# From those lines, delete some symbols/characters/words I don't want.

input1 = open("c1.txt", 'r')
output1 = open("c2.txt", 'w')
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']  # I want to keep the rest of the line but not these words.

for line in input1:
    for word in del_list:
        line = line.replace(word, "")
    output1.write(line)
input1.close()
output1.close()

# One specific word in the lines is AB: the file has lines with AB129, AB8877, AB0997 and so on.
# Here I want to prepend a URL to each one so that it becomes a hyperlink.
inp = open("c2.txt", 'r')
out = open("c3.txt", 'w')
filedata2 = inp.read()
newdata2 = filedata2.replace('AB', '\nhttp://www.whatever.com/g/qu/acc.cgi?acc=AB')
out.write(newdata2)
inp.close()
out.close()
# This outputs the line as http://www.whatever.com/g/qu/acc.cgi?acc=AB(somenumber),
# for example http://www.whatever.com/g/qu/acc.cgi?acc=AB129
# and http://www.whatever.com/g/qu/acc.cgi?acc=AB8877 etc.

### Then I want to take the file with the changes and send it by email.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

fromaddr = "[email protected]"
toaddr = "[email protected]"
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "RESULT"

# Send the txt file in the email body
f6 = open("c3.txt", 'r')
results = MIMEText(f6.read(), 'plain')
f6.close()
msg.attach(results)

# Connect to the server, convert the message to a string, and send it
server = smtplib.SMTP('smtp.gmail.com', 587)
server.ehlo()
server.starttls()
server.ehlo()
server.login("sender email", "password")
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()

The input file looks like this:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE> 
<eSummaryResult> 
<DS> 
    <Id>20006767</Id> 
    <Item Name="Acc" Type="String">AB64767</Item> 
    <Item Name="GDS" Type="String"></Item> 
    <Item Name="title" Type="String">word word title of this word...</Item> 
    <Item Name="summary" Type="String">word word word..word word word..</Item> 
    <Item Name="GP" Type="String">11002;13112</Item> 
    <Item Name="AB" Type="String">64767</Item> 
    <Item Name="orgn" Type="String">Mus musculus</Item> 
    <Item Name="entryType" Type="String">AB</Item> 
    <Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item> 
    <Item Name="ptechType" Type="String"></Item> 
    <Item Name="valType" Type="String"></Item> 
    <Item Name="SSInfo" Type="String"></Item> 
    <Item Name="subsetInfo" Type="String"></Item> 
    <Item Name="date" Type="String">2015/12/09</Item> 
    <Item Name="suppFile" Type="String">WIG</Item> 
    <Item Name="Samples" Type="List"> 
    </Item> 
    <Item Name="n_samples" Type="Integer">12</Item> 
    <Item Name="SeriesTitle" Type="String"></Item> 
    <Item Name="PlatformTitle" Type="String"></Item> 
    <Item Name="PlatformTaxa" Type="String"></Item> 
    <Item Name="SamplesTaxa" Type="String"></Item> 
    <Item Name="Ids" Type="List"> 
</Item> 
    <Id>200098567</Id> 
    <Item Name="Acc" Type="String">AB64789</Item> 
    <Item Name="GDS" Type="String"></Item> 
    <Item Name="title" Type="String">word word word...</Item> 
    <Item Name="summary" Type="String">word word word..word word word..</Item> 
    <Item Name="GP" Type="String">11002;13112</Item> 
    <Item Name="AB" Type="String">AB64789</Item> 
    <Item Name="orgn" Type="String">Mus musculus</Item> 
    <Item Name="entryType" Type="String">AB</Item> 
    <Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item> 
    <Item Name="ptechType" Type="String"></Item> 
    <Item Name="valType" Type="String"></Item> 
    <Item Name="SSInfo" Type="String"></Item> 
    <Item Name="subsetInfo" Type="String"></Item> 
    <Item Name="date" Type="String">2015/12/09</Item> 
    <Item Name="suppFile" Type="String">WIG</Item> 
    <Item Name="Samples" Type="List"> 
</Item> 
    </Item>  
    <Id>200064997</Id> 
    <Item Name="Acc" Type="String">AB69957</Item> 
    <Item Name="GDS" Type="String"></Item> 
    <Item Name="title" Type="String">word word word...</Item> 
    <Item Name="summary" Type="String">word word word..word word word..</Item> 
    <Item Name="GP" Type="String">1100</Item> 
    <Item Name="AB" Type="String">69957</Item> 
    <Item Name="orgn" Type="String">Mus musculus</Item> 
    <Item Name="entryType" Type="String">AB</Item> 
    <Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item> 
    <Item Name="ptechType" Type="String"></Item> 
    <Item Name="valType" Type="String"></Item> 
    <Item Name="SSInfo" Type="String"></Item> 
    <Item Name="subsetInfo" Type="String"></Item> 
    <Item Name="date" Type="String">2015/12/09</Item> 
    <Item Name="suppFile" Type="String">WIG</Item> 
    <Item Name="Samples" Type="List"> 
    </Item> 
    <Item Name="n_samples" Type="Integer">12</Item> 
    <Item Name="SeriesTitle" Type="String"></Item> 
    <Item Name="PlatformTitle" Type="String"></Item> 
    <Item Name="PlatformTaxa" Type="String"></Item> 
    <Item Name="SamplesTaxa" Type="String"></Item> 
    <Item Name="Ids" Type="List"> 
    <Item Name="int" Type="Integer">26476451</Item> 
    </Item> 
    <Item Name="Projects" Type="List"></Item> 
    <Item Name="G2R" Type="String">no</Item> 

I only want the following data:

<Item Name="Acc" Type="String">AB64767</Item> 
<Item Name="title" Type="String">word word title of this word...</Item> 
<Item Name="AB" Type="String">64767</Item> 
<Item Name="orgn" Type="String">Mus musculus</Item> 
<Item Name="date" Type="String">2015/12/09</Item> 

But displayed like this:

http://www.whatever.com/g/qu/acc.cgi?acc=AB64767 
word word title of this word... 
Mus musculus 
2015/12/09 

http://www.whatever.com/g/qu/acc.cgi?acc=AB64789 
word word title of this word... 
Mus musculus 
2015/12/09 

http://www.whatever.com/g/qu/acc.cgi?acc=AB69957 
word word title of this word... 
Mus musculus 
2015/12/09 

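Not sure if it's relevant, but you probably need to remove the leading space from ' --mount-doom-hot'; the argument the Perl script receives starts with a space rather than "-", so it may not be recognized as an option. – chepner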


You're absolutely right. Thank you.

Answers

1

Reading the file once and using regular expressions would be a better approach:

import re

del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS',
            'Acc']  # I want to keep the rest of the line but not these words.
words = ['Acc', 'title', 'orgn', 'date', 'GP']

# re.escape keeps every entry literal even if one contains a regex metacharacter
rep = re.compile("|".join(map(re.escape, del_list)))
keep = re.compile("|".join(map(re.escape, words)))
r3 = re.compile(r"AB(?=\d)")  # raw string avoids the invalid-escape warning for \d

with open("test.txt") as f, open("out.txt", "w") as out:
    for line in f:
        # if the line contains a match from words
        if keep.search(line):
            # remove all unwanted substrings
            line = rep.sub("", line.lstrip())
            line = r3.sub("\nhttp://www.whatever.com/g/qu/acc.cgi?acc=AB", line)
            out.write(line)

out.txt:

Item Type=String 
http://www.whatever.com/g/qu/acc.cgi?acc=AB64767 
Item Type=Stringword word of this word... 
Item Type=String11002;13112 
Item Type=StringMus musculus 
Item Type=String2015/12/09 
Item Type=String 
http://www.whatever.com/g/qu/acc.cgi?acc=AB64789 
Item Type=Stringword word word... 
Item Type=String11002;13112 
Item Type=StringMus musculus 
Item Type=String2015/12/09 
Item Type=String 
http://www.whatever.com/g/qu/acc.cgi?acc=AB69957 
Item Type=Stringword word word... 
Item Type=String1100 
Item Type=StringMus musculus 
Item Type=String2015/12/09 

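If you want to match some of the words exactly, you will need word boundaries in the regex; otherwise you end up matching "foo" in "foobar". Also, if you only want to send the result by email, you don't have to write it to disk at all.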
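To illustrate the word-boundary point, here is a minimal sketch. The `sample` line is hypothetical (it is not taken from the input file); it only shows how a plain alternation can match inside a longer word, while `\b` on both sides does not.

```python
import re

words = ['Acc', 'title', 'orgn', 'date', 'GP']

# Plain alternation: matches the substring anywhere, even inside a longer word.
loose = re.compile("|".join(map(re.escape, words)))
# \b on both sides: matches only when the word stands on its own.
strict = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

# Hypothetical line whose Name attribute merely contains 'date'.
sample = '<Item Name="update" Type="String">x</Item>'
wanted = '<Item Name="date" Type="String">2015/12/09</Item>'

print(bool(loose.search(sample)))   # 'date' is found inside 'update'
print(bool(strict.search(sample)))  # no standalone word from the list
print(bool(strict.search(wanted)))  # "date" is delimited by the quotes
```

Whether you want the strict or the loose behaviour depends on the data; for the attribute names in this file, the quotes around each name already act as boundaries.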


Wow, this is amazingly versatile. Thank you very much.


No problem, you're welcome.

1

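While this is close to done, here are a few tips:

Disk I/O is slow, so you will get better performance if you read the input once, do all the processing, and then write the output, rather than going through a file for every filtering step.

For example, let's examine this: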

for line in f1:
    if any(word in line for word in words):
        o1.write(line)

# From those lines, delete some symbols/characters/words I don't want.

input1 = open("c1.txt", 'r')
output1 = open("c2.txt", 'w')
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']  # I want to keep the rest of the line but not these words.

for line in input1:
    for word in del_list:
        line = line.replace(word, "")
    output1.write(line)

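In the first loop you select only a few lines of the input file. In the second loop you delete some words from the selected lines. In between, you write all of your data to disk.

A fairly simple optimization is to do the word replacement directly, before writing anything back to disk, i.e.: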

del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']

for line in f1:
    if any(word in line for word in words):
        for word in del_list:
            line = line.replace(word, "")
        o1.write(line)

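Can you see how this saves a round trip to disk? An alternative technique is to keep the data in memory: read the file into a list and then operate on that list, instead of going back and forth to disk each time.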
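The list-based alternative could look roughly like this. It is only a sketch: `filter_and_clean` is a helper name made up for illustration, and the two sample lines stand in for what `readlines()` would return from the real input file.

```python
def filter_and_clean(lines, words, del_list):
    """Keep lines containing any of `words`, then strip every entry of `del_list`."""
    # Stage 1: select the wanted lines, entirely in memory.
    kept = [line for line in lines if any(word in line for word in words)]
    # Stage 2: remove the unwanted substrings from each selected line.
    cleaned = []
    for line in kept:
        for token in del_list:
            line = line.replace(token, "")
        cleaned.append(line)
    return cleaned

words = ['Acc', 'title', 'orgn', 'date', 'GP']
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"',
            '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']

# In the real script the list would come from the file, for example:
#   with open("inputperl.txt") as f:
#       lines = f.readlines()
lines = [
    '<Item Name="orgn" Type="String">Mus musculus</Item>\n',
    '<Item Name="summary" Type="String">word word</Item>\n',
]
result = filter_and_clean(lines, words, del_list)
print(result)
```

Each stage still exists as a separate step, but the intermediate results live in lists rather than in c1.txt and c2.txt. For a very large input, the line-by-line loop above uses less memory.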

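I hope this points you in the right direction, and that you can now figure out how to get rid of the third set of files so that you end up with just one input file and one output file.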
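For reference, one possible end state is sketched below, assuming the result is only needed as the email body. `build_body` is a hypothetical helper, the sample lines are shortened copies of the input shown in the question, and the sketch stops before the SMTP step. It performs the same three transformations as the question (select, delete, prepend the URL) in a single pass and never touches an intermediate file.

```python
import re
from email.mime.text import MIMEText

words = ['Acc', 'title', 'orgn', 'date', 'GP']
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"',
            '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']
URL_PREFIX = 'http://www.whatever.com/g/qu/acc.cgi?acc=AB'
ab_number = re.compile(r"AB(?=\d)")  # 'AB' only when a digit follows

def build_body(lines):
    """Select, clean and link the wanted lines in one pass; return one string."""
    parts = []
    for line in lines:
        if any(word in line for word in words):
            for token in del_list:
                line = line.replace(token, "")
            # Put the URL on its own line, as in the question.
            parts.append(ab_number.sub("\n" + URL_PREFIX, line.lstrip()))
    return "".join(parts)

# Stand-in for iterating over the open input file.
lines = [
    '    <Item Name="Acc" Type="String">AB64767</Item>\n',
    '    <Item Name="summary" Type="String">word word</Item>\n',
    '    <Item Name="date" Type="String">2015/12/09</Item>\n',
]
body = build_body(lines)
message = MIMEText(body, 'plain')  # can be attached to the MIMEMultipart message
print(body)
```

Like the original code, this leaves fragments such as "Item Type=String" in the text; cleaning those up would mean adding them to `del_list` or parsing the XML properly.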


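This is one of the things I tried, but it replaced all of the data with "" and I ended up with an empty file. I will try to figure out what I am doing wrong here. Thanks.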