2015-06-22

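Adding tasks to python asyncio

I want to write a simple web crawler to test how the new asyncio module works, but I'm getting something wrong. I'm trying to start the crawler with a single URL. The script should download that page, find any <a> tags on it, and schedule those links to be downloaded as well. The output I expect is a line showing that the first page has been downloaded, followed by the subsequent pages finishing in random order (that is, as each download completes), but in practice they are simply downloaded one after another. I'm completely new to async in general and to this module, so I'm sure there is just some basic concept I'm missing.

Here is my code so far: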

import asyncio
import re
import requests
import time
from bs4 import BeautifulSoup
from functools import partial

@asyncio.coroutine
def get_page(url, depth=0):
    print('%s: Getting %s' % (time.time(), url))
    page = requests.get(url)
    print('%s: Got %s' % (time.time(), url))
    soup = BeautifulSoup(page.text)
    if depth < 2:
        for a in soup.find_all('a', href=re.compile(r'\w+\.html'))[:3]:
            u = 'https://docs.python.org/3/' + a['href']
            print('%s: Scheduling %s' % (time.time(), u))
            yield from get_page(u, depth+1)
    if depth == 0:
        loop.stop()
    return soup

root = 'https://docs.python.org/3/'
loop = asyncio.get_event_loop()
loop.create_task(get_page(root))
loop.run_forever()

Here is the output:

1434971882.3458219: Getting https://docs.python.org/3/ 
1434971893.0054126: Got https://docs.python.org/3/ 
1434971893.015218: Scheduling https://docs.python.org/3/genindex.html 
1434971893.0153584: Getting https://docs.python.org/3/genindex.html 
1434971894.464993: Got https://docs.python.org/3/genindex.html 
1434971894.4752269: Scheduling https://docs.python.org/3/py-modindex.html 
1434971894.4753256: Getting https://docs.python.org/3/py-modindex.html 
1434971896.9845033: Got https://docs.python.org/3/py-modindex.html 
1434971897.0756354: Scheduling https://docs.python.org/3/index.html 
1434971897.0757186: Getting https://docs.python.org/3/index.html 
1434971907.451529: Got https://docs.python.org/3/index.html 
1434971907.4600112: Scheduling https://docs.python.org/3/genindex-Symbols.html 
1434971907.4600625: Getting https://docs.python.org/3/genindex-Symbols.html 
1434971917.6517148: Got https://docs.python.org/3/genindex-Symbols.html 
1434971917.6789174: Scheduling https://docs.python.org/3/py-modindex.html 
1434971917.6789672: Getting https://docs.python.org/3/py-modindex.html 
1434971919.454042: Got https://docs.python.org/3/py-modindex.html 
1434971919.574361: Scheduling https://docs.python.org/3/genindex.html 
1434971919.574434: Getting https://docs.python.org/3/genindex.html 
1434971920.5942516: Got https://docs.python.org/3/genindex.html 
1434971920.6020699: Scheduling https://docs.python.org/3/index.html 
1434971920.6021295: Getting https://docs.python.org/3/index.html 
1434971922.1504402: Got https://docs.python.org/3/index.html 
1434971922.1589775: Scheduling https://docs.python.org/3/library/__future__.html#module-__future__ 
1434971922.1590302: Getting https://docs.python.org/3/library/__future__.html#module-__future__ 
1434971923.30988: Got https://docs.python.org/3/library/__future__.html#module-__future__ 
1434971923.3215268: Scheduling https://docs.python.org/3/whatsnew/3.4.html 
1434971923.321574: Getting https://docs.python.org/3/whatsnew/3.4.html 
1434971926.6502898: Got https://docs.python.org/3/whatsnew/3.4.html 
1434971926.89331: Scheduling https://docs.python.org/3/../genindex.html 
1434971926.8934016: Getting https://docs.python.org/3/../genindex.html 
1434971929.0996494: Got https://docs.python.org/3/../genindex.html 
1434971929.1068246: Scheduling https://docs.python.org/3/../py-modindex.html 
1434971929.1068716: Getting https://docs.python.org/3/../py-modindex.html 
1434971932.5949798: Got https://docs.python.org/3/../py-modindex.html 
1434971932.717457: Scheduling https://docs.python.org/3/3.3.html 
1434971932.7175465: Getting https://docs.python.org/3/3.3.html 
1434971934.009238: Got https://docs.python.org/3/3.3.html 

Answer


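Using asyncio does not magically make all of your code asynchronous. In this case, requests is blocking: each requests.get call holds up the event loop until it returns, so all of your coroutines end up waiting on it.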
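The effect can be reproduced without any network access. The following is only a minimal sketch under some assumptions: time.sleep stands in for the blocking requests.get call, asyncio.sleep stands in for a non-blocking request, and the names blocking_fetch, cooperative_fetch, timed and compare are made up for illustration. It also uses the newer async/await syntax and asyncio.run (Python 3.7+) rather than the @asyncio.coroutine style from the question.

```python
import asyncio
import time


async def blocking_fetch(name, delay):
    # time.sleep stands in for requests.get: it never yields to the event
    # loop, so no other coroutine can make progress while it runs.
    time.sleep(delay)
    return name


async def cooperative_fetch(name, delay):
    # asyncio.sleep stands in for a non-blocking HTTP call: awaiting it
    # hands control back to the loop so other coroutines can run.
    await asyncio.sleep(delay)
    return name


async def timed(fetch, count=3, delay=0.2):
    # Start `count` fetches together and measure the total wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(i, delay) for i in range(count)))
    return results, time.perf_counter() - start


def compare():
    # Returns (elapsed with blocking calls, elapsed with awaited calls).
    _, blocking_elapsed = asyncio.run(timed(blocking_fetch))
    _, cooperative_elapsed = asyncio.run(timed(cooperative_fetch))
    return blocking_elapsed, cooperative_elapsed


if __name__ == "__main__":
    blocking_elapsed, cooperative_elapsed = compare()
    print("blocking:    %.2fs" % blocking_elapsed)
    print("cooperative: %.2fs" % cooperative_elapsed)
```

With three fetches of 0.2 seconds each, the blocking version should take roughly the sum of the delays, while the cooperative version should take roughly one delay, because the waits overlap.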

There is an asynchronous library called aiohttp that lets you make HTTP requests asynchronously, although it is not as user-friendly as requests.
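As a rough sketch of how the crawler from the question might look with aiohttp, the code below takes the fetch function as a parameter so the crawling logic can be tried with or without the library. Several things here are assumptions rather than part of the original code: the names fetch_with_aiohttp, crawl and LINK_RE are hypothetical, a simple regular expression replaces BeautifulSoup to keep the example short, urljoin replaces the string concatenation, and the async/await syntax (Python 3.5+) replaces @asyncio.coroutine. Note that a non-blocking fetch alone is not enough: awaiting each child page inside the loop would still download them one at a time, so the sketch starts the child crawls together with asyncio.gather.

```python
import asyncio
import re
from urllib.parse import urljoin

# Simplified stand-in for the BeautifulSoup lookup in the question.
LINK_RE = re.compile(r'href="([\w./-]+\.html)"')


async def fetch_with_aiohttp(url):
    # Imported here so the rest of the sketch runs without aiohttp installed.
    import aiohttp

    # One session per request keeps the sketch short; sharing a single
    # session across requests is generally preferable.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


async def crawl(url, fetch, depth=0, max_depth=2, log=None):
    """Fetch url, then crawl up to three linked pages concurrently.

    Returns the list of URLs in the order their downloads completed.
    """
    log = [] if log is None else log
    text = await fetch(url)
    log.append(url)
    if depth < max_depth:
        links = LINK_RE.findall(text)[:3]
        children = [
            crawl(urljoin(url, link), fetch, depth + 1, max_depth, log)
            for link in links
        ]
        # gather runs all child crawls concurrently instead of awaiting
        # each one before starting the next.
        await asyncio.gather(*children)
    return log
```

Assuming aiohttp is installed and the network is reachable, it could be run with asyncio.run(crawl('https://docs.python.org/3/', fetch_with_aiohttp)). Like the code in the question, this sketch does not keep track of pages it has already visited.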