2016-03-29

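Scraping images injected by JavaScript with Selenium in Python

I want to use web-scraping tools in Python on Mac OS X. One example I am testing is loading the labels and images from a MyFonts page (for example, here). I started with BeautifulSoup, but noticed that the site first loads a 'blank.png' in place of the font images I am trying to scrape and then swaps in the real font images with JavaScript. I would like to use Selenium. Can I use WebDriverWait to listen for a change in the img src, similar to the example below, but without going through an ID or class?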

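from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ff = webdriver.Firefox()
ff.get("http://www.myfonts.com/fonts/fort-foundry/gin/")
try:
    # block for up to 10 seconds until the element is present in the DOM
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
finally:
    ff.quit()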

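Ideally the wait would key on img src="*/blank.png", since the element does not change class or get a consistent name. Or should I simply wait until the page has finished loading completely? The scraper has to go through a lot of these pages, so I am trying to keep it fairly fast.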

I am very new to Python, so any help would be greatly appreciated.

Answers

First, make sure that what you are doing is legal: see the site's Legal page.

Wait for at least one font sample to be loaded, then extract the sample URLs:

# wait for at least one font sample to be loaded
wait = WebDriverWait(ff, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#overview_samples .search-result-item")))

# get font sample urls (find_elements_by_css_selector was removed in Selenium 4)
for sample in ff.find_elements(By.CSS_SELECTOR, "#overview_samples .search-result-item .sample .fontsample[title]"):
    print(sample.get_attribute("src"))

This prints:

http://samples.myfonts.net/e_91/u/e7/19061adcc0c9ac025d0414e5ff11a1.gif 
http://samples.myfonts.net/a_91/u/e5/4d795cdae0cb99d1424b13020d0f6e.gif 
... 
http://samples.myfonts.net/b_92/u/2c/4c21ddeb53f19f109306746dac6b24.gif 
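
For the original question of waiting on the src itself rather than on an ID or class: WebDriverWait.until() accepts any callable that takes the driver, so a custom condition is possible. The following is a minimal sketch, not something verified against this site. The helper names src_replaced and sample_urls are hypothetical, and it assumes the placeholder's src ends in blank.png, as described in the question.

```python
def src_replaced(css_selector, placeholder="blank.png"):
    """Build a wait condition that passes once no matched image still
    points at the placeholder (hypothetical helper, not part of Selenium)."""
    def condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        images = driver.find_elements("css selector", css_selector)
        srcs = [img.get_attribute("src") or "" for img in images]
        if srcs and not any(src.endswith(placeholder) for src in srcs):
            return srcs  # truthy, so WebDriverWait.until() returns this list
        return False     # falsy, so WebDriverWait keeps polling
    return condition


def sample_urls(url, timeout=10):
    """Open url in Firefox and return the sample image URLs once loaded."""
    # imported here so the condition above can be exercised without a browser
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    ff = webdriver.Firefox()
    try:
        ff.get(url)
        selector = "#overview_samples .search-result-item .sample .fontsample[title]"
        return WebDriverWait(ff, timeout).until(src_replaced(selector))
    finally:
        ff.quit()
```

Calling sample_urls("http://www.myfonts.com/fonts/fort-foundry/gin/") would raise a TimeoutException if the placeholders are not replaced within the timeout.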

I second what Alex said about legality, but you can also get the fonts by mimicking the Ajax request with requests and bs4:

In [16]: import requests 

In [17]: from bs4 import BeautifulSoup 

In [18]: data = { 
    ....:  'seed': '24', 
    ....:  "text": "Pangrams", 
    ....:  "src": "pangram.auto", 
    ....:  "size": "72", 
    ....:  "fg": "000000", 
    ....:  "bg": "ffffff", 
    ....:  "goodies": "_2x:0", 
    ....:  "w": "720", 
    ....:  "i[]": ["fort-foundry/gin/regular,,720", "fort-foundry/gin/oblique,,720", "fort-foundry/gin/rough,,720", 
    ....:    "fort-foundry/gin/rough-oblique,,720", "fort-foundry/gin/round,,720","fort-foundry/gin/round-oblique,,720", 
    ....:    "fort-foundry/gin/lines,,720", "fort-foundry/gin/lines-oblique,,720"], 
    ....:  "showimgs": "true"} 

In [19]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json() 

In [20]: from pprint import pprint as pp

In [21]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").find_all("img")]

In [22]: pp(urls)
['//samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif', 
'//samples.myfonts.net/e_91/u/d6/1d63ad993299d182ae19eddb2c41e1.gif', 
'//samples.myfonts.net/e_92/u/7c/15b8e24e4b077ae3b1c7a614afa8b5.gif', 
'//samples.myfonts.net/b_92/u/ce/63dffdda8581fc83f6fe20874714e7.gif', 
'//samples.myfonts.net/e_91/u/51/e8b7a0b5cccb2abf530b05e1d3fb04.gif', 
'//samples.myfonts.net/b_91/u/6f/a5f870c719dcf9961e753b9f4afd7e.gif', 
'//samples.myfonts.net/b_92/u/7c/94d652e4f146801e3c81f694898e07.gif', 
'//samples.myfonts.net/b_91/u/47/39fa3ab779cabd1068abbca7ce98c5.gif'] 

The only value you have to pass is i[]; the rest can be used to change the size, background color, and so on.
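
As a small illustration of that point, the form data can be built from the i[] list alone, with the display options layered on only when needed. This is a sketch: build_payload is a hypothetical helper, and the option names are simply the keys from the request shown above.

```python
def build_payload(items, **options):
    """Return form data for the Ajax request: i[] is required, the rest is optional."""
    payload = {"i[]": list(items)}
    payload.update(options)  # e.g. size="72", fg="000000", bg="ffffff"
    return payload


minimal = build_payload(["fort-foundry/gin/regular,,720"])
styled = build_payload(["fort-foundry/gin/regular,,720"], size="72", bg="ffffff")
print(minimal)
print(styled)
```

Either dictionary can then be passed as data= to requests.post in the same way as the hand-written one.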

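So if you do not care about changing bg, fg, or size and want to get all the names using bs4 and requests, you can take the font names from the search-result-item class and then build the Ajax request as follows: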

In [1]: import requests 

In [2]: from bs4 import BeautifulSoup 

In [3]: r = requests.get("http://www.myfonts.com/fonts/fort-foundry/gin/") 

In [4]: soup = BeautifulSoup(r.content, "lxml") 

# creates "fort-foundry/gin/regular,,720" etc.
In [5]: fonts = ["{},,720".format(a["href"].strip("/").split("/", 1)[1])
    ...:          for a in soup.select("div .search-result-item h4 a[href]")]

In [6]: data = { 
    ...:  "i[]": fonts 
    ...:  } 

In [7]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json() 

In [8]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").select("img[src]")] 

In [9]: from pprint import pprint as pp 

In [10]: pp(urls) 
['//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif', 
'//samples.myfonts.net/b_92/u/06/b8ad49c563d310a97147d8220f55ab.gif', 
'//samples.myfonts.net/a_91/u/e7/8f84ce98f19e3f91ddc15304d636e7.gif', 
'//samples.myfonts.net/e_91/u/71/9769a1ab626429d63d3c779fcaa3d7.gif', 
'//samples.myfonts.net/b_92/u/65/fe416f15ea94b1f8603ddc675fd638.gif', 
'//samples.myfonts.net/b_91/u/5d/3ced9e71910bc411a0d76316d18df1.gif', 
'//samples.myfonts.net/e_92/u/cd/0df987a72bb0a43cba29b38c16b7a5.gif', 
'//samples.myfonts.net/e_91/u/88/3f80a1108fd0a075c69b09e9c21a8d.gif']
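
The src values returned here are protocol-relative (they start with //), so they need a scheme before they can be fetched. A minimal sketch using only the standard library follows. absolute_urls is a hypothetical helper, and the base URL is assumed to be the page the fonts were scraped from.

```python
from urllib.parse import urljoin

# assumed base: the page the font names were scraped from
BASE = "http://www.myfonts.com/fonts/fort-foundry/gin/"


def absolute_urls(urls, base=BASE):
    """Resolve protocol-relative (or relative) URLs against base."""
    return [urljoin(base, url) for url in urls]


full = absolute_urls(["//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif"])
print(full)
# ['http://samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif']
```

Each resulting URL inherits the scheme of the base page and can be passed to requests.get to download the image.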