2015-09-07 45 views
4

I am trying to use Beautiful Soup to open every link whose URL contains a pid, but I run into two cases:

  1. In this version it opens every URL (I mean even the junk URLs).

    def get_links(self):
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme + '://' + host
        pattern = re.compile(r'(/pid/)')

        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = (domain_link + '/' + href).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue

            links.append(self.re_encode(href))

        return links
    
  2. In this version it only opens URLs that contain a pid, but it does not follow any links and stays on the home page. After opening a few pid links, it crashes.

    def get_links(self):
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme + '://' + host
        pattern = re.compile(r'(/pid/)')

        for a in self.soup.find_all(href=True):
            if pattern.search(a['href']) is not None:
                href = a['href']
                if not href or len(href) <= 1:
                    continue
                elif 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue
                else:
                    href = href.strip()
                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = (domain_link + '/' + href).strip()
                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()
                if href in links:
                    continue

                links.append(self.re_encode(href))

        return links
    

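Can someone help me follow all the links, including the internal links found inside each URL, and in the end return only the links that contain a pid?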

Answers

0

Maybe I am missing something, but why not do the pattern check in an if statement? It would look like this:

    def get_links(self):
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme + '://' + host

        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            if href.lower().find("/pid/") != -1:
                if 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue

                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = (domain_link + '/' + href).strip()

                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()

                if href in links:
                    continue

                links.append(self.re_encode(href))

        return links

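Also, I removed the following lines, because I believe that otherwise your code would never reach the lower sections, since every branch ends in continue.

    else:
        continue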
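For what it is worth, the filtering and normalising steps in the answer above can be written more compactly with the standard library. This is only a sketch under a few assumptions: it targets Python 3 (`urllib.parse`), the function name `normalize_pid_links` and the `SKIP_WORDS` tuple are made up for illustration, and it takes a plain list of href strings instead of `self.soup`. Note that `urljoin` resolves relative links against the page URL, which is not quite the same as prepending `domain_link` as the original code does.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical constant: the same words the original code filters on.
SKIP_WORDS = ('javascript:', 'forgotpassword', 'images', 'seller-account', 'review')


def normalize_pid_links(base_url, hrefs):
    """Return absolute, fragment-free, de-duplicated links that contain /pid/."""
    links = []
    for href in hrefs:
        href = href.strip()
        lowered = href.lower()
        # Keep only hrefs that are long enough and contain /pid/.
        if len(href) <= 1 or '/pid/' not in lowered:
            continue
        if any(word in lowered for word in SKIP_WORDS):
            continue
        # urljoin makes the href absolute; urldefrag drops the '#...' part.
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute not in links:
            links.append(absolute)
    return links


sample = ['/pid/1#top', 'shop/pid/2', 'http://example.com/pid/1',
          '/review/pid/3', 'javascript:void(0)', '/about', '#']
result = normalize_pid_links('http://example.com/', sample)
```

With the made-up `sample` list, only the two distinct pid links survive; the duplicate, the review link and the non-pid links are dropped.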
+0

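I tried using only an if instead of the regex, but it started matching every case that contains words like 'id' or 'sid'. I use the regex because it matches the whole word. – joe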

+0

That is strange, but if you have that problem you should try 'find'. I have updated my code. – dstudeba

+0

That is strange, it does not open any URL except the main URL. – joe
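On the substring-versus-regex point raised in the comments above, a small comparison may help. The hrefs below are invented purely for illustration. As far as I can tell, a plain substring test on '/pid/' (with the slashes) behaves the same as the regex r'/pid/', because the pattern has no special characters; a word-boundary pattern is one possible way to also accept ?pid= without matching words such as 'rapid'.

```python
import re

# Hypothetical hrefs, used only for illustration.
hrefs = ['/item/pid/123', '/user?sid=9', '/rapid/offer', '/list?pid=7']

# A bare substring test on 'pid' also hits 'rapid'.
substring_hits = [h for h in hrefs if 'pid' in h]

# The question's pattern requires a slash on both sides of pid.
path_pattern = re.compile(r'/pid/')
path_hits = [h for h in hrefs if path_pattern.search(h)]

# The same check without a regex: identical result for this pattern.
plain_hits = [h for h in hrefs if '/pid/' in h]

# \b is a word boundary, so 'rapid' is rejected but /pid/ and ?pid= both match.
word_pattern = re.compile(r'\bpid\b')
word_hits = [h for h in hrefs if word_pattern.search(h)]
```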

0

I tried something like this. Please comment if I can improve the code structure:

    for a in self.soup.find_all(href=True):
        href = a['href']
        if not href or len(href) <= 1:
            continue
        if href[0] == '/':
            href = (domain_link + href).strip()
            if href.lower().find("?pid=") != -1:
                href = href.strip()
            elif 'javascript:' in href.lower():
                continue
            elif 'reviews' in href.lower():
                continue
        elif href[:4] == 'http':
            if href.lower().find("?pid=") != -1:
                href = href.strip()
        elif href[0] != '/' and href[:4] != 'http':
            href = (domain_link + '/' + href).strip()
            if href.lower().find("?pid=") != -1:
                href = href.strip()
        if '#' in href:
            indx = href.index('#')
            href = href[:indx].strip()
        if href in links:
            continue
        links.append(self.re_encode(href))
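One possible way to structure this, going back to the goal stated in the question (follow every link, but return only the pid links), is to keep two separate lists per page. The sketch below is an assumption-heavy illustration, not a drop-in replacement: it assumes Python 3, takes a plain list of href strings, stays on the same host, and the name `split_links` is hypothetical. It accepts both the /pid/ and the ?pid= forms that appear in this thread.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def split_links(page_url, hrefs):
    """Return (to_follow, pid_links) for one page.

    to_follow: every same-host link, so the crawler can keep going.
    pid_links: the subset whose URL carries a pid, i.e. the links to return.
    """
    host = urlparse(page_url).hostname
    to_follow, pid_links = [], []
    for href in hrefs:
        href = href.strip()
        if len(href) <= 1 or href.lower().startswith('javascript:'):
            continue
        # Make the link absolute and drop any '#fragment'.
        url = urldefrag(urljoin(page_url, href)).url
        if urlparse(url).hostname != host:
            continue  # staying on one site is an assumption of this sketch
        if url not in to_follow:
            to_follow.append(url)
        lowered = url.lower()
        if ('/pid/' in lowered or '?pid=' in lowered) and url not in pid_links:
            pid_links.append(url)
    return to_follow, pid_links


sample = ['/category/shoes', '/item?pid=42#reviews',
          'http://other.com/item?pid=1', 'javascript:void(0)', '/category/shoes']
to_follow, pid_links = split_links('http://example.com/home', sample)
```

The crawler would then visit everything in `to_follow` and collect only `pid_links` as results, so the pid filter no longer stops it from leaving the home page.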