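Beautiful Soup: open all URLs with pid in them
I am trying to open all the links that have pid in them, but there are two scenarios.
In the first, it opens all the URLs (I mean, even junk URLs):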
    def get_links(self):
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme+'://'+host
        pattern = re.compile(r'(/pid/)')
        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = (domain_link + '/' + href).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue
            links.append(self.re_encode(href))
        return links
In the second, it only opens URLs with pid in them, but it does not follow links and stays on the home page. After opening a few links with pid, it crashes.
    def get_links(self):
        links = []
        host = urlparse(self.url).hostname
        scheme = urlparse(self.url).scheme
        domain_link = scheme+'://'+host
        pattern = re.compile(r'(/pid/)')
        for a in self.soup.find_all(href=True):
            if pattern.search(a['href']) is not None:
                href = a['href']
                if not href or len(href) <= 1:
                    continue
                elif 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue
                else:
                    href = href.strip()
                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = (domain_link + '/' + href).strip()
                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()
                if href in links:
                    continue
                links.append(self.re_encode(href))
        return links
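Can someone help me follow all the links, including the internal links inside each URL, and at the end accept only the links with pid as the returned links?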
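One possible way to get both behaviours is to keep two lists per page: every usable link (to keep crawling) and the subset whose path contains /pid/ (to return). The sketch below is only an illustration under assumptions: `split_links`, `SKIP_WORDS` and `PID_PATTERN` are hypothetical names, it takes plain href strings rather than `self.soup`, it restricts crawling to the same host, and it resolves relative links with `urljoin`, which is not identical to the `domain_link + '/' + href` concatenation in the question.

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical names; the skip words are copied from the question's filters.
PID_PATTERN = re.compile(r'/pid/')
SKIP_WORDS = ('javascript:', 'forgotpassword', 'images',
              'seller-account', 'review')


def split_links(base_url, hrefs):
    """Return (to_follow, pid_links) for the raw href values of one page."""
    base_host = urlparse(base_url).hostname
    to_follow, pid_links = [], []
    for href in hrefs:
        href = (href or '').strip()
        if len(href) <= 1:
            continue
        if any(word in href.lower() for word in SKIP_WORDS):
            continue
        # urljoin resolves relative hrefs against the page URL;
        # urldefrag drops any '#fragment' part.
        url, _fragment = urldefrag(urljoin(base_url, href))
        # Assumption: only links on the same host are worth following.
        if urlparse(url).hostname != base_host:
            continue
        if url not in to_follow:
            to_follow.append(url)
        # Only the path is checked, so '/sid/' or '/id/' do not match.
        if PID_PATTERN.search(urlparse(url).path) and url not in pid_links:
            pid_links.append(url)
    return to_follow, pid_links
```

With something like this, the crawl loop would visit everything in `to_follow` (ideally tracking already-visited URLs in a set so it cannot loop forever) and only accumulate `pid_links` as the final result. The hrefs themselves could come from `[a['href'] for a in soup.find_all(href=True)]`.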
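I tried using only `if ... in` instead of the regex, but it started matching every case where there are words like 'id' or 'sid'. I am using the regex because it matches the whole word. – joe
That is weird, but if you have that problem you should try `find`; I updated my code. – dstudeba
That is weird, it is not opening any URL except the main URL. – joe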
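On the substring-versus-regex point in the comments, it may be worth checking exactly which string was searched for. As a small illustration with made-up hrefs (the sample list is an assumption, not data from the site), a plain `in` test for '/pid/' with both slashes behaves the same as the regex here, while a looser test for 'id' matches everything:

```python
import re

# Made-up sample hrefs for illustration only.
hrefs = ['/item/pid/42', '/item/sid/42', '/item/id/42', '/rapid/7']

# Substring test that includes both slashes.
substring_hits = [h for h in hrefs if '/pid/' in h]

# The regex from the question (without the capturing group).
regex_hits = [h for h in hrefs if re.search(r'/pid/', h)]

# A looser substring test, which also matches 'sid', 'id' and 'rapid'.
loose_hits = [h for h in hrefs if 'id' in h]

print(substring_hits)
print(regex_hits)
print(loose_hits)
```

If the two stricter tests agree on the real data as well, the over-matching probably came from the search string rather than from using `in`.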