2017-08-30

Python: loop through a .csv of URLs and save each file named from another column

Python newbie here; I've read a bunch and watched a lot of videos, but I can't get this to work and I'm getting frustrated.

I have a list of links that looks like this:

"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL" 
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip" 
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip" 

I am trying to get Python to go to each "URL" and save it in a folder named from "Location", with "API".las as the file name.

Ex) ...... "Location"/group/"API".las → C://.../T32S R29W/Sec.27/15-119-00164.las

The file has hundreds of rows with links to download. I would also like to implement a sleep function so I don't bomb the server.

What are the different ways to do this? I've tried pandas and some other methods... any ideas?
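A minimal sketch of one way to derive the target path for each row, using the sample row above as a stand-in for the real file; the rule that the first two comma-separated parts of `Location` become nested folders is an assumption read off the example path:

```python
import csv
import io
import os

# the sample row from the question, used as a stand-in for the real file
sample = '''"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
'''

targets = []
for row in csv.DictReader(io.StringIO(sample)):
    # assumption: the first two parts of Location become nested folders
    township, section = [p.strip() for p in row['Location'].split(',')[:2]]
    targets.append((row['URL'], os.path.join(township, section, row['API'] + '.las')))

print(targets[0][1])  # 'T32S R29W/Sec. 27/15-119-00164.las' on POSIX
```

Each `(url, path)` pair could then be fed to whatever download loop you settle on.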


What have you tried so far? – Mortz


`import pandas as pd data = pd.read_csv('MeadeLAS.csv') links = data.URL file_names = data.API for link, file_name in zip(links, file_names): file = pd.read_csv(link).to_csv(file_name+'.las', index=False)` – gdink1020


@Mortz forgot to tag – gdink1020

Answers


You would have to do something like this –

import urllib 

for link, file_name in zip(links, file_names): 
    u = urllib.urlopen(link) 
    udata = u.read() 
    f = open(file_name + ".las", "wb")  # quote the mode; "wb" since the linked files are binary zips 
    f.write(udata) 
    f.close() 
    u.close() 

Something like that. If the contents of the files are not what you want, you may want to look at a scraping and parsing library like BeautifulSoup.
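The sleep the question asks for can be wrapped around any per-link fetch; a sketch, where `polite_download` and the 2-second default are my own invention rather than anything from the answer above:

```python
import time

def polite_download(links, fetch, delay=2.0):
    """Call fetch(link) for each link, sleeping between requests.

    fetch is any callable that downloads one link; the delay keeps
    the loop from bombing the server.
    """
    results = []
    for link in links:
        results.append(fetch(link))
        time.sleep(delay)
    return results

# usage with a dummy fetcher in place of a real download function
print(polite_download(['a', 'b'], fetch=str.upper, delay=0.01))  # ['A', 'B']
```

For the loop above, `fetch` would be a function wrapping the urlopen/read/write body.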


I need it to follow the link and download, and I want it saved as the value in the API column. The way I read your suggestion, that wouldn't be the case. – gdink1020


That is exactly what the code is trying to do. `udata` holds the contents of `link`, and that gets written to `file_name`, which comes from the API column. Try it out. If you are trying to achieve something entirely different, please edit your question accordingly – Mortz


Sorry, I was confused by how I read the code. I ran it and was able to make it work. The only edit was to put quotes around the w in the open function when opening the file for writing. Also, the links download a .zip that has to be unzipped before saving as .las, which was a problem, but using 7-Zip I was able to batch-extract the files – gdink1020
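The 7-Zip step can also be done from Python with the standard-library zipfile module; a sketch using an in-memory archive as a stand-in for a downloaded WellLogs zip (the member and output file names here are made up for illustration):

```python
import io
import zipfile

# build a tiny in-memory zip as a stand-in for a downloaded archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('1043696830.las', 'log data')  # hypothetical member name

# pull out the .las member and save it under the API number
with zipfile.ZipFile(buf) as zf:
    las_members = [n for n in zf.namelist() if n.lower().endswith('.las')]
    data = zf.read(las_members[0])
with open('15-119-00164.las', 'wb') as f:
    f.write(data)
```

In a real run, `buf` would be the bytes returned by the download, so no temporary .zip ever needs to touch disk.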


Method 1:

Assume your file has 1000 rows. Create a masterlist which has the data stored in this form:

[row1, row2, row3, ...]

Once this is done, loop through the masterlist. On each iteration you get a row in string format; split it to make a list and splice out the last column, the url, i.e. row[-1],

and append it to an empty list called result_url. Once it has run through all the rows, save it to a file; you can easily create a directory using the os module and move your file over there.

Method 2:

If the file is too huge, read it line by line inside a try block and process your data (using the csv module you can get each row as a list, splice out the url and write it to the file API.las each time).

Once your program moves past row 1001, it will move to the except block, where you can 'pass' or print something to get notified.
With method 2 you are not saving all the data in any data structure; you only store a single row at execution time, so it is faster.

import csv, os 

os.mkdir('Locations')                     # create the output directory 
fme = open('./Locations/API.las', 'w+') 
with open('data.csv', 'r') as csvfile: 
    spamreader = csv.reader(csvfile, delimiter=',') 
    print spamreader.next()               # skip (and show) the header row 
    while True: 
        try: 
            row = spamreader.next() 
            get_url = row[-1] 
            to_write = get_url + '\n' 
            fme.write(to_write) 
        except StopIteration:             # end of file reached 
            print "Program has run. Check output." 
            exit(1) 

This code should do everything you mentioned, efficiently and in less time.


The file has 240,163 rows and needs the unique url addresses saved – gdink1020


Please use method 2 (I have re-edited it). Do not accumulate all the data in a list. –


@gdink1020 is my code not working? –


This might be a little dirty, but it is a first pass at solving the problem. It all hinges on every value in the CSV being wrapped in double quotes. If that is not true, this solution will need heavy tweaking.

Code:

import os 

csv = """ 
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL" 
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip" 
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip" 
""".strip() # trim excess space at top and bottom 

root_dir = '/tmp/so_test' 

lines = csv.split('\n') # break CSV on newlines 
header = lines[0].strip('"').split('","') # grab first line and consider it the header 

lines_d = [] # we're about to perform the core actions, and we're going to store it in this variable 
for l in lines[1:]: # we want all lines except the top line, which is a header 
    line_broken = l.strip('"').split('","') # strip off leading and trailing double-quote 
    line_assoc = zip(header, line_broken) # creates a tuple of tuples out of the line with the header at matching position as key 
    line_dict = dict(line_assoc) # turn this into a dict 
    lines_d.append(line_dict) 

    section_parts = [s.strip() for s in line_dict['Location'].split(',')] # break Section value to get pieces we need 

    file_out = os.path.join(root_dir, '%s%s%s%sAPI.las'%(section_parts[0], os.path.sep, section_parts[1], os.path.sep)) # format output filename the way I think is requested 

    # stuff to show what's actually put in the files 
    print file_out, ':' 
    print ' ', '"%s"'%('","'.join(header),) 
    print ' ', '"%s"'%('","'.join(line_dict[h] for h in header)) 

Output:

~/so_test $ python so_test.py 
/tmp/so_test/T32S R29W/Sec. 27/API.las : 
    "KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL" 
    "1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip" 
/tmp/so_test/T34S R26W/Sec. 2/API.las : 
    "KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL" 
    "1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip" 
~/so_test $ 

Here is a sample of the linked output [link](http://www.kgs.ku.edu/oracle/las40033.txt) – gdink1020


The part of the code that creates the directory path is correct, but I can't get the code to actually create the folder path. Also, I need the code to follow the url, start the download, and save the file as API.las like @Mortz's above. So with the directory you provided it would look like /tmp/so_test/T32S R29W/Sec. 27/15-119-00164.las. – gdink1020
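For the folder-creation issue in this comment, `os.makedirs` creates the whole nested path in one call; a sketch using a temporary directory in place of /tmp/so_test, where the empty write stands in for the downloaded bytes:

```python
import os
import tempfile

root_dir = tempfile.mkdtemp()                 # stand-in for /tmp/so_test
section_parts = ['T32S R29W', 'Sec. 27']      # pieces split out of the Location column
api = '15-119-00164'

out_dir = os.path.join(root_dir, *section_parts)
os.makedirs(out_dir)                          # creates every intermediate folder too
file_out = os.path.join(out_dir, api + '.las')

with open(file_out, 'wb') as f:               # a real run would write the downloaded bytes here
    f.write(b'')

print(os.path.isdir(out_dir))  # True
```

Unlike `os.mkdir`, which fails if the parent folder is missing, `os.makedirs` builds the full chain, which is what the nested "T32S R29W/Sec. 27" layout needs.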