
Scraping URLs from a webpage with Selenium Python (NSFW)

I'm learning Python and trying to write a script to scrape xHamster. If anyone is familiar with that site, what I'd specifically like to do is write all of the video URLs for a given user to a .txt file.

So far I've managed to scrape the URLs from a single page, but there are multiple pages and I'm struggling to loop through them.

In my attempt below, I've marked with a comment where I'm trying to read the URL of the next page, but it currently prints None. Any idea why, and how to fix it?

Current script:

#!/usr/bin/env python 

from selenium import webdriver 
from selenium.webdriver.common.keys import Keys 

chrome_options = webdriver.ChromeOptions() 
chrome_options.add_argument("--incognito") 

driver = webdriver.Chrome(chrome_options=chrome_options) 

username = **ANY_USERNAME** 
##page = 1 
url = "https://xhams***.com/user/video/" + username + "/new-1.html" 

driver.implicitly_wait(10) 
driver.get(url) 

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('last') 

noOfLinks = len(links) 
count = 0 

file = open('x--' + username + '.txt','w') 
while count < noOfLinks: 
    #print links[count].get_attribute('href') 
    file.write(links[count].get_attribute('href') + '\n')
    count += 1 

file.close() 
driver.close() 
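
As an aside, the write loop can be expressed more idiomatically with a for loop and a context manager, which also closes the file automatically. A minimal sketch, assuming links has been collected as above:

# Equivalent to the while loop above; the with block closes the file for us.
with open('x--' + username + '.txt', 'w') as f:
    for link in links:
        f.write(link.get_attribute('href') + '\n')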

My attempt at looping through the pages:

#!/usr/bin/env python 

from selenium import webdriver 
from selenium.webdriver.common.keys import Keys 

chrome_options = webdriver.ChromeOptions() 
chrome_options.add_argument("--incognito") 

driver = webdriver.Chrome(chrome_options=chrome_options) 

username = **ANY_USERNAME** 
##page = 1 
url = "https://xhams***.com/user/video/" + username + "/new-1.html" 

driver.implicitly_wait(10) 
driver.get(url) 

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('colR') 

## TRYING TO READ THE NEXT PAGE HERE 
print(driver.find_element_by_class_name('last').get_attribute('href'))

noOfLinks = len(links) 
count = 0 

file = open('x--' + username + '.txt','w') 
while count < noOfLinks: 
    #print links[count].get_attribute('href') 
    file.write(links[count].get_attribute('href') + '\n')
    count += 1 

file.close() 
driver.close() 
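
For what it's worth, a likely cause of the None is that the element carrying the class 'last' is a wrapper (a td or div, say) rather than the anchor itself, in which case it has no href attribute of its own. Here is a minimal sketch of how one might probe for an anchor inside it; the '.last a' selector is an assumption about the site's markup, not something verified:

# Hypothetical: if 'last' sits on a wrapper element, the href lives on
# the <a> inside it. '.last a' is a guess at the actual markup.
last_anchor = driver.find_elements_by_css_selector('.last a')
if last_anchor:
    print(last_anchor[0].get_attribute('href'))
else:
    print('no anchor found inside .last')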

UPDATE:

I used Philippe Oger's answer below, but changed the two methods as follows to get it working for single-page results:

def find_max_pagination(self):
    start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
    r = requests.get(start_url)
    tree = html.fromstring(r.content)
    pager_links = tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
    if pager_links:
        self.max_page = max(
            [int(x.text) for x in pager_links if x.text not in [None, '...']]
        )
    else:
        self.max_page = 1

    return self.max_page

def generate_listing_urls(self):
    if self.max_page == 1:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, 1)]
    else:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, self.max_page)]

    return pages

_______ But why? –


You don't appear to use BeautifulSoup anywhere, even though you import it – xbonez


@xbonez Ah, yes. I originally used BeautifulSoup before switching to Selenium. Edited. – kong88

Answer


On the user's page we can actually find out how far the pagination goes, so rather than looping through the pagination, we can generate every one of the user's listing URLs with a list comprehension and then scrape them one by one.

Here are my two cents using LXML. If you just copy/paste this code, it will write every video URL into a TXT file. You only need to change the username.

from lxml import html
import requests


class XXXVideosScraper(object):

    def __init__(self, user):
        self.user = user
        self.max_page = None
        self.video_urls = list()

    def run(self):
        # Work out how many listing pages exist, build their URLs, scrape
        # each one, then dump every collected video URL to a text file.
        self.find_max_pagination()
        pages_to_crawl = self.generate_listing_urls()
        for page in pages_to_crawl:
            self.capture_video_urls(page)
        with open('results.txt', 'w') as f:
            for video in self.video_urls:
                f.write(video)
                f.write('\n')

    def find_max_pagination(self):
        # Read the pager on the first listing page to learn the highest
        # page number; users with a single page have no pager links.
        start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
        r = requests.get(start_url)
        tree = html.fromstring(r.content)

        try:
            self.max_page = max(
                [int(x.text) for x in tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
                 if x.text not in [None, '...']]
            )
        except ValueError:
            self.max_page = 1
        return self.max_page

    def generate_listing_urls(self):
        pages = [self.paginated_listing_page(page) for page in range(1, self.max_page + 1)]
        return pages

    def paginated_listing_page(self, pagination):
        return 'https://www.xhamster.com/user/video/{}/new-{}.html'.format(self.user, str(pagination))

    def capture_video_urls(self, url):
        # Collect the href of every video link on one listing page.
        r = requests.get(url)
        tree = html.fromstring(r.content)
        video_links = tree.xpath('//a[@class="hRotator"]/@href')
        self.video_urls += video_links


if __name__ == '__main__':
    sample_user = 'wearehairy'
    scraper = XXXVideosScraper(sample_user)
    scraper.run()

I haven't checked the case where a user has only one page in total. Let me know whether this works correctly.


Thanks for the example. For a user with only one page in total (e.g. unmasker777), it errors on line 27, in find_max_pagination: [int(x.text) for x in tree.xpath('//div[@class="pager"]/table/tr/td/div/a') if x.text not in [None, '...']] ValueError: max() arg is an empty sequence – kong88


We can handle that with a try/except. Let me edit the code. –
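
For readers following along, the edit being referred to is wrapping the max() call in a try/except so that the empty-pagination case falls back to a single page, as in the find_max_pagination method above. A minimal sketch of the pattern in isolation:

# max() raises ValueError on an empty sequence, which is exactly what the
# pager XPath yields for a user whose videos fit on a single page.
page_numbers = []  # hypothetical result for a single-page user
try:
    max_page = max(page_numbers)
except ValueError:
    max_page = 1
print(max_page)  # prints 1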