2016-11-07

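I have been working with a Python scraper for http://ifunny.co, based on the LiteScraper script from GitHub, and I have a question about file naming.

The script scrapes memes and GIFs and saves all the images in a folder named with a timestamp, e.g. "ifunny-(timestamp)".

I am scraping from http://ifunny.co/feeds/shuffle, so each run gives me a random page of 10 images.

The problem is that I need to modify the script so that all images are saved in one folder with a given name.

I tried removing the code that adds the timestamp, but then each time the script fetches 10 images and moves on to the next page, the 10 new images overwrite the old ones.

The script seems to name the images "1, 2, 3, 4", etc.

Here is the code: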

import os
import time
from html.parser import HTMLParser
import urllib.request

#todo: char support for Windows
#deal with triple backslash filter
#recursive parser option


class LiteScraper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.lastStartTag="No-Tag"
        self.lastAttributes=[]
        self.lastImgUrl=""
        self.Data=[]
        self.acceptedTags=["div","p","h","h1","h2","h3","h4","h5","h6","ul","li","a","img"]
        self.counter=0
        self.url=""

        self.SAVE_DIR="" #/Users/stjepanbrkic/Desktop/temp
        self.Headers=["User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"]

    def handle_starttag(self,tag,attrs):
        #print("Encountered a START tag:",tag)
        self.lastStartTag=tag
        self.lastAttributes=attrs #unnecessary, might come in handy

        if self.lastStartTag=="img":
            attrs=self.lastAttributes

            for attribute in attrs:
                if attribute[0]=="src":
                    self.lastImgUrl=attribute[1]
                    print(attribute[1])

                    #Allow GIF from iFunny to download
                    for attribute in attrs:
                        if attribute[0]=="data-gif":
                            self.lastImgUrl=attribute[1]
                            print(attribute[1])
                            #End Gif Code

            self.handle_picture(self.lastImgUrl)

    def handle_endtag(self,tag):
        #print("Encountered a END tag:",tag)
        pass

    def handle_data(self,data):
        data=data.replace("\n"," ")
        data=data.replace("\t"," ")
        data=data.replace("\r"," ")
        if self.lastStartTag in self.acceptedTags:
            if not data.isspace():
                print("Encountered some data:",data)
                self.Data.append(data)

        else:
            print("Encountered filtered data.") #Debug

    def handle_picture(self,url):
        print("Bumped into a picture. Downloading it now.")
        self.counter+=1
        if url[:2]=="//":
            url="http:"+url

        extension=url.split(".")
        extension="."+extension[-1]

        try:
            req=urllib.request.Request(url)
            req.add_header(self.Headers[0],self.Headers[1])
            response=urllib.request.urlopen(req,timeout=10)
            picdata=response.read()
            #the file name is just the counter plus the extension
            with open(self.SAVE_DIR+"/pics/"+str(self.counter)+extension,"wb") as file:
                file.write(picdata)
        except Exception as e:
            print("Something went wrong, sorry.")


    def start(self,url):
        self.url=url
        self.checkSaveDir()

        try: #wrapped in exception - if there is a problem with url/server
            req=urllib.request.Request(url)
            req.add_header(self.Headers[0],self.Headers[1])
            response=urllib.request.urlopen(req,timeout=10)
            siteData=response.read().decode("utf-8")
            self.feed(siteData)
        except Exception as e:
            print(e)

        self.__init__() #resets the parser/scraper for serial parsing/scraping
        print("Done!")

    def checkSaveDir(self):
        #----windows support
        if os.name=="nt":
            container="\\"
            path=os.path.normpath(__file__)
            path=path.split(container[0])
            path=container[0].join(path[:len(path)-1])
            path=path.split(container[0])
            path="/".join(path)
        #no more windows support! :P
        #for some reason, os.normpath returns path with backslashes
        #on windows, so they had to be substituted with forward slashes.

        else:
            path=os.path.normpath(__file__)
            path=path.split("/")
            path="/".join(path[:len(path)-1])

        foldername=self.url[7:]
        foldername=foldername.split("/")[0]

        extension=time.strftime("iFunny")+"-"+time.strftime("%d-%m-%Y") + "-" + time.strftime("%Hh%Mm%Ss")

        self.SAVE_DIR=path+"/"+foldername+"-"+extension


        if not os.path.exists(self.SAVE_DIR):
            os.makedirs(self.SAVE_DIR)

        if not os.path.exists(self.SAVE_DIR+"/pics"):
            os.makedirs(self.SAVE_DIR+"/pics")

        print(self.SAVE_DIR)

This is the script I am using:

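pastebin dot com/PNwJ9wEJ

Sorry about the pastebin link, the site would not let me post my code...

I am very new to Python, so I don't know how to fix this. Is it possible to name the files like this?

Image names for page 1: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). Image names for page 2: (11, 12, 13, ...)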

Answer

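Every time the parser is instantiated (once per new page), counter is set to zero. In the posted code, start() also calls self.__init__() when it finishes, which resets counter even if the same instance is reused. That is why the images keep getting overwritten.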
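
A stripped-down sketch of that reset, with all networking removed. `MiniScraper` and the ".jpg" names are hypothetical stand-ins for LiteScraper; only the handling of `counter` mirrors the posted code.

```python
class MiniScraper:
    """Hypothetical stand-in for LiteScraper; keeps only the counter logic."""

    def __init__(self):
        self.counter = 0  # set back to zero on every (re)initialisation

    def handle_picture(self):
        # Like the posted handle_picture: bump the counter, build the name.
        self.counter += 1
        return str(self.counter) + ".jpg"

    def start(self, pictures_on_page):
        names = [self.handle_picture() for _ in range(pictures_on_page)]
        self.__init__()  # the same reset the posted start() performs at the end
        return names


scraper = MiniScraper()
first_page = scraper.start(3)
second_page = scraper.start(3)
print(first_page)
print(second_page)
```

Both pages come back with the same three names, so the second page would be written over the first.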

One alternative is to work out which file names are already in use.

import os

i = 0
# 'your_filename_logic_' is a placeholder for your real path and name prefix.
while os.path.isfile('your_filename_logic_'+str(i)):
    i += 1
# Now i is the first number which hasn't been used.

However, if you end up with thousands of images, this may not be as fast as you would like.
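
If the probing loop turns out to be slow, one possible variation is to list the folder once and take the highest number found. This is a sketch, not the snippet above: `next_free_index` is a hypothetical helper, it assumes files are named "<number><extension>" as in the posted handle_picture, and it returns highest + 1 rather than the first gap.

```python
import os


def next_free_index(directory):
    """Return one more than the highest numeric file stem in directory."""
    highest = 0
    for name in os.listdir(directory):
        stem, _ext = os.path.splitext(name)  # "12.gif" -> ("12", ".gif")
        if stem.isdecimal():                 # skip files that are not numbered
            highest = max(highest, int(stem))
    return highest + 1
```

Because the posted handle_picture increments counter before building the file name, the counter would be seeded with `next_free_index(...) - 1` so that the next image gets the free number.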

Alternatively, once LiteScraper has finished, you can store the counter in a file and read it back on the next start.

def startMyNewCounter(self):
    if os.path.isfile('your_filename_logic_' + 'count'):
        with open('your_filename_logic_' + 'count', 'r') as f:
            self.counter = int(next(f))
    else:
        self.counter = 0

def saveMyCounter(self):
    with open('your_filename_logic_' + 'count', 'w') as f:
        f.write(str(self.counter) + '\n')
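
The same idea as a standalone round trip, to show when each step would run. The function names `load_counter` and `save_counter`, the file name "count", and the temporary directory are assumptions made for this sketch; a real run would use a fixed path next to the images.

```python
import os
import tempfile


def load_counter(path):
    """Return the saved counter, or 0 if no counter file exists yet."""
    if os.path.isfile(path):
        with open(path, "r") as f:
            return int(f.read().strip())
    return 0


def save_counter(path, value):
    """Write the counter so that a later run can pick it up."""
    with open(path, "w") as f:
        f.write(str(value) + "\n")


# Demo in a throwaway directory.
counter_file = os.path.join(tempfile.mkdtemp(), "count")
first_run = load_counter(counter_file)      # no file yet
save_counter(counter_file, first_run + 10)  # pretend 10 images were saved
second_run = load_counter(counter_file)     # a later run resumes here
```

In the posted script, start() ends with self.__init__(), so the counter would need to be loaded at the beginning of start() and saved before that reset.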

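Or the simplest answer: if you do not need the numbering to carry over after the program closes, you can make the counter a global variable instead of a member of your LiteScraper. That way, every new LiteScraper continues where the last one left off.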
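
A minimal sketch of the global-counter idea, using a module-level itertools.count in place of a plain integer. `MiniScraper`, `scrape_page`, and the ".jpg" names are hypothetical stand-ins; no downloading happens here.

```python
import itertools

# Shared by every scraper instance for as long as the program runs.
_image_numbers = itertools.count(1)


class MiniScraper:
    """Hypothetical stand-in for LiteScraper."""

    def __init__(self):
        self.data = []  # per-instance state can still be reset freely

    def handle_picture(self):
        # The number comes from the shared counter, not from self.
        return str(next(_image_numbers)) + ".jpg"


def scrape_page(pictures_on_page):
    scraper = MiniScraper()  # a fresh instance for each page
    return [scraper.handle_picture() for _ in range(pictures_on_page)]


page_one = scrape_page(3)
page_two = scrape_page(3)
```

Here the second page continues at 4 even though each page used a new instance. The numbering still restarts at 1 the next time the program is launched.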