2017-01-21

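Download and rename attachments from a CSV using wget or Python, with basic authentication required

I scraped a ticketing website we were using, and I now have a CSV file with these columns: ID, Attachment_URL, Ticket_URL. What I need to do now is download each attachment and rename the file using the Ticket_URL. The main problem is that navigating to the Attachment_URL requires basic authentication, after which you are redirected to an AWS S3 link. I have been able to download a single file with wget, but I have not been able to iterate over the whole list (about 35k rows), and I don't know how to name each file after its ticket_id. Any advice would be appreciated.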

Try using requests: http://docs.python-requests.org/en/master/user/advanced/ – wu4m4n
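A minimal sketch of what this comment suggests, assuming the attachment endpoint accepts HTTP basic auth, the CSV has no header row, and its columns are ID, Attachment_URL, Ticket_URL in that order. The helper names `ticket_filename` and `download_all`, the credentials, and the paths are placeholders, not anything from the question. requests follows redirects by default and does not forward the `Authorization` header when a redirect leads to a different host, so the S3 link is fetched without the credentials.

```python
import csv
import os

import requests


def ticket_filename(ticket_url, attachment_id):
    """Build a file name from the last path segment of the ticket URL.

    The attachment ID is appended so that several attachments on the
    same ticket do not overwrite each other (an assumption, not a
    requirement stated in the question).
    """
    ticket_id = ticket_url.rstrip("/").split("/")[-1]
    return "{}_{}".format(ticket_id, attachment_id)


def download_all(session, csv_path, dest_dir):
    """Download every attachment listed in the CSV into dest_dir."""
    with open(csv_path, newline="") as f:
        for attachment_id, attachment_url, ticket_url in csv.reader(f):
            # The session sends the basic-auth credentials; the redirect
            # to S3 is followed automatically.
            response = session.get(attachment_url, timeout=60)
            response.raise_for_status()
            name = ticket_filename(ticket_url, attachment_id)
            with open(os.path.join(dest_dir, name), "wb") as out:
                out.write(response.content)


# Example usage (placeholder credentials and paths):
#   session = requests.Session()
#   session.auth = ("username", "password")
#   download_all(session, "attachments.csv", "downloads")
```

Passing the session in as an argument keeps the credentials in one place and lets the same connection pool be reused for all 35k rows.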

Answer

Figured it out.

To open an authenticated session and collect the links:

# -*- coding: utf-8 -*-
import csv
import re
import time

import requests
from bs4 import BeautifulSoup

s = requests.Session()

# Login form fields; fill in the credentials.
payload = {
    'user': '',
    'pw': ''
}

# 'login.url.here' is a placeholder for the real login URL.
s.post('login.url.here', data=payload)

# Open the output file once and append one CSV row per table row.
with open('fd.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for i in range(1, 6000):
        page = s.get('https://urlhere.com/efw/stuff&page={}'.format(i))

        # Name the parser explicitly to avoid the bs4 "no parser" warning.
        soup = BeautifulSoup(page.content, 'html.parser')
        table = soup.find("table", {"class": "table-striped"})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')[1:]  # skip the first row
        print("The current page is: " + str(i))

        for row in rows:
            # Keep only the anchors whose href starts with /helpdesk/.
            cols = row.find_all('a', attrs={'href': re.compile("^/helpdesk/")})
            # time.sleep(1)
            writer.writerow(cols)
            print(cols)
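Because `writer.writerow(cols)` writes the anchor tags themselves, each cell in the CSV holds an HTML string rather than a bare URL, so a cleanup pass is needed before downloading. That cleanup is not shown in this answer. As a rough sketch of what it could look like in Python, assuming each cell has the form `<a href="/helpdesk/...">...</a>` with a double-quoted `href`; the helper name `extract_hrefs` and the sample paths are made up for illustration:

```python
import re

# Matches the value of a double-quoted href attribute.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_hrefs(row):
    """Return the href value of every anchor tag found in a CSV row."""
    hrefs = []
    for cell in row:
        hrefs.extend(HREF_RE.findall(cell))
    return hrefs


# Example with invented cells:
#   extract_hrefs(['<a href="/helpdesk/attachments/1">file</a>',
#                  '<a href="/helpdesk/tickets/42">#42</a>'])
```

The extracted paths are relative, so they would still need the site's base URL prepended before they can be fetched.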

Then I cleaned up the links in R and downloaded the files.

#! /usr/bin/env python
import csv
import os
import threading
from queue import Empty, Queue
from time import gmtime, strftime

import requests

s = requests.Session()

# Login form fields; fill in the credentials.
payload = {
    'user': '',
    'pw': ''
}
s.post('login', data=payload)


class Log:
    def info(self, message):
        self.__message("info", message)

    def error(self, message):
        self.__message("error", message)

    def debug(self, message):
        self.__message("debug", message)

    def __message(self, log_level, message):
        date = strftime("%Y-%m-%d %H:%M:%S", gmtime())
        print("%s [%s] %s" % (date, log_level, message))


class Fetch:
    def run_fetcher(self, queue):
        while True:
            # get_nowait() avoids blocking forever when another worker
            # takes the last item between an empty() check and get().
            try:
                url, ticketid = queue.get_nowait()
            except Empty:
                break

            try:
                # Name the file after the last segment of the ticket URL;
                # rows without a ticket ("NA") use the attachment URL instead.
                if ticketid.endswith("NA"):
                    fileName = url.split("/")[-1] + 'NoTicket'
                else:
                    fileName = ticketid.split("/")[-1]

                response = s.get(url)

                with open(os.path.join('/Users/Desktop/FolderHere', fileName + '.mp3'), 'wb') as f:
                    f.write(response.content)

                print(fileName)
            finally:
                # Always mark the item done so q.join() cannot hang.
                queue.task_done()


if __name__ == '__main__':
    q = Queue()
    log = Log()
    fe = Fetch()

    # Read the input file: one "id,url,ticket" row per attachment.
    with open('/Users/name/csvfilehere.csv', 'r', newline='') as csvfile:
        for _id, url, ticket in csv.reader(csvfile):
            q.put([url.strip(), ticket.strip()])

    # Spin up fetcher workers.
    threads = []
    for i in range(8):
        t = threading.Thread(target=fe.run_fetcher, args=(q,))
        t.daemon = True
        threads.append(t)
        t.start()

    # Wait for the workers to finish.
    for t in threads:
        t.join()

    # Wait until every queued item has been marked done.
    q.join()
    log.info("End")
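The same worker pattern can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor` instead of a hand-managed queue and thread list. This is an alternative sketch, not the approach used above. `target_name` mirrors the naming rule from the script, while `fetch_all` and its `fetch_one` callback are hypothetical names; `fetch_one(url, name)` is assumed to download `url` and save it under `name`.

```python
from concurrent.futures import ThreadPoolExecutor


def target_name(url, ticket):
    """Ticket ID as the file name, or attachment ID + 'NoTicket' for "NA" rows."""
    if ticket.endswith("NA"):
        return url.split("/")[-1] + "NoTicket"
    return ticket.split("/")[-1]


def fetch_all(jobs, fetch_one, workers=8):
    """Run fetch_one(url, name) for every (url, ticket) pair on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_one, url, target_name(url, ticket))
                   for url, ticket in jobs]
        # result() re-raises any exception raised inside a worker,
        # so a failed download is not silently lost.
        return [future.result() for future in futures]
```

Note that with either version, two attachments on the same ticket map to the same file name, so the later download overwrites the earlier one unless something like the attachment ID is added to the name.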