Beautiful Soup: opening all URLs that contain a PID

I am trying to open all links that contain a pid, but I run into two cases.

In the first case, it opens every URL (I mean, even junk URLs):

def get_links(self):
    links = []
    host = urlparse(self.url).hostname
    scheme = urlparse(self.url).scheme
    domain_link = scheme + '://' + host
    pattern = re.compile(r'(/pid/)')

    for a in self.soup.find_all(href=True):
        href = a['href']
        if not href or len(href) <= 1:
            continue
        elif 'javascript:' in href.lower():
            continue
        elif 'forgotpassword' in href.lower():
            continue
        elif 'images' in href.lower():
            continue
        elif 'seller-account' in href.lower():
            continue
        elif 'review' in href.lower():
            continue
        else:
            href = href.strip()
        if href[0] == '/':
            href = (domain_link + href).strip()
        elif href[:4] == 'http':
            href = href.strip()
        elif href[0] != '/' and href[:4] != 'http':
            href = (domain_link + '/' + href).strip()
        if '#' in href:
            indx = href.index('#')
            href = href[:indx].strip()
        if href in links:
            continue

        links.append(self.re_encode(href))

    return links

In the second case, it only opens URLs that contain a pid, but it does not follow links and stays on the home page. After opening a few pid links, it crashes.

def get_links(self):
    links = []
    host = urlparse(self.url).hostname
    scheme = urlparse(self.url).scheme
    domain_link = scheme + '://' + host
    pattern = re.compile(r'(/pid/)')

    for a in self.soup.find_all(href=True):
        if pattern.search(a['href']) is not None:
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = (domain_link + '/' + href).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue

            links.append(self.re_encode(href))

    return links

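Can someone help me follow all links, including the internal links inside those URLs, and in the end return only the links that contain a pid?

Source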

2015-09-07 joe

Maybe I am missing something, but why not use an if statement in place of the regex? It would look something like this:

def get_links(self):
    links = []
    host = urlparse(self.url).hostname
    scheme = urlparse(self.url).scheme
    domain_link = scheme + '://' + host

    for a in self.soup.find_all(href=True):
        href = a['href']
        if not href or len(href) <= 1:
            continue
        if href.lower().find("/pid/") != -1:
            if 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue

            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = (domain_link + '/' + href).strip()

            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()

            if href in links:
                continue

            links.append(self.re_encode(href))

    return links

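Also, I removed the following lines, because I believe that otherwise your code would never reach the lower section, since you 'continue' on everything.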

else: 
     continue
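To make that control flow harder to get wrong, the chain of skip checks could be pulled into one small predicate, so the loop has a single 'continue' and everything else falls through to the normalisation code. This is only a sketch: the name should_skip and the SKIP_SUBSTRINGS tuple are made up for illustration, and the substrings are simply the ones checked in the code above.

```python
# Hypothetical helper (not part of the original code); the substrings are
# the ones the elif chain above checks for.
SKIP_SUBSTRINGS = ('javascript:', 'forgotpassword', 'images',
                   'seller-account', 'review')


def should_skip(href):
    """Return True for hrefs the crawler should ignore."""
    # Empty or one-character hrefs (for example "#") carry no usable link.
    if not href or len(href.strip()) <= 1:
        return True
    lowered = href.lower()
    # Skip if any of the unwanted substrings appears anywhere in the href.
    return any(part in lowered for part in SKIP_SUBSTRINGS)
```

Inside the loop this would read 'if should_skip(href): continue', with no trailing else branch needed.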

Source

2015-09-07 15:00:57 dstudeba

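I tried using a plain 'if' instead of the regex, but it started matching every case that contains words like 'id' or 'sid'. I use the regex because it matches the whole word. – joe

That is odd, but if you have that problem you should try 'find'. I have updated my code. – dstudeba

Strange, it now does not open any URL except the main URL. – joe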
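On the matching problem raised in the comments above: a loose substring test can hit 'id' or 'sid', so one option is a pattern that requires the delimiter right before 'pid'. The sketch below assumes the pid shows up either as a path segment ('/pid/') or as a query parameter ('?pid=' or '&pid='). That is a guess based on the snippets in this thread, and PID_PATTERN and has_pid are illustrative names.

```python
import re

# Assumption: product links carry the pid either as a path segment
# (".../pid/123") or as a query parameter ("...?pid=123" / "...&pid=123").
PID_PATTERN = re.compile(r'(?:/pid/|[?&]pid=)', re.IGNORECASE)


def has_pid(href):
    """Return True if href contains a pid path segment or query parameter."""
    return PID_PATTERN.search(href) is not None
```

Because the slash, '?' or '&' must come directly before 'pid', hrefs such as '/p?sid=9' or '/rapid/transit' are not matched.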

I came up with this. Please comment if I can improve the code structure:

for a in self.soup.find_all(href=True):
    href = a['href']
    if not href or len(href) <= 1:
        continue
    if href[0] == '/':
        href = (domain_link + href).strip()
        if href.lower().find("?pid=") != -1:
            href = href.strip()
        elif 'javascript:' in href.lower():
            continue
        elif 'reviews' in href.lower():
            continue
    elif href[:4] == 'http':
        if href.lower().find("?pid=") != -1:
            href = href.strip()
    elif href[0] != '/' and href[:4] != 'http':
        href = (domain_link + '/' + href).strip()
        if href.lower().find("?pid=") != -1:
            href = href.strip()
    if '#' in href:
        indx = href.index('#')
        href = href[:indx].strip()
    if href in links:
        continue
    links.append(self.re_encode(href))
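As a possible restructuring, the manual prefix checks and the '#' slicing could be handed to urllib.parse, and the two concerns from the question (follow every link, return only pid links) could be kept apart. The following is a rough sketch under assumptions, not a drop-in replacement: normalize, crawl, get_hrefs, is_pid and max_pages are all hypothetical names, and get_hrefs stands in for whatever fetches a page and yields the raw href strings (for example from soup.find_all(href=True)).

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href.strip())
    return urldefrag(absolute).url


def crawl(start_url, get_hrefs, is_pid, max_pages=50):
    """Follow every same-host link, but return only the pid links.

    get_hrefs(url) is a hypothetical callable returning the raw href
    strings found on that page; is_pid(url) decides what gets returned.
    """
    host = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([start_url])
    pid_links = []
    pages = 0
    while queue and pages < max_pages:
        page = queue.popleft()
        pages += 1
        for href in get_hrefs(page):
            url = normalize(page, href)
            # Skip duplicates and anything off-site (this also drops
            # javascript: and mailto: hrefs, which have no hostname).
            if url in seen or urlparse(url).hostname != host:
                continue
            seen.add(url)
            queue.append(url)          # follow everything on the same host
            if is_pid(url):
                pid_links.append(url)  # but report only pid links
    return pid_links
```

The max_pages cap is just a safety limit so the sketch cannot loop over a large site forever. Filtering on pid only when appending to the result, rather than when deciding what to follow, is what lets the crawl get past the home page.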

Source

2015-09-08 02:36:51 joe
