Web crawler in Python

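I want to write a basic web crawler in Python. The trouble I am having is parsing the page to extract URLs. I have tried both BeautifulSoup and regular expressions, but I have not been able to come up with an efficient solution.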

As an example: I am trying to extract all the member URLs from Facebook's GitHub page (https://github.com/facebook?tab=members). This is the code I wrote to extract the members' URLs:

import urllib2
from bs4 import BeautifulSoup  # assumed import; the original post did not show it

def getMembers(url):
    # Download the page, e.g. url = "https://github.com/facebook?tab=members"
    text = urllib2.urlopen(url).read()
    soup = BeautifulSoup(text)
    memberList = []

    # Retrieve every user from the company
    data = soup.findAll('ul', attrs={'class': 'members-list'})
    for div in data:
        links = div.findAll('li')
        for link in links:
            # Each list item holds a relative link to the member's profile
            memberList.append("https://github.com" + str(link.a['href']))

    return memberList

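However, this takes quite a long time to parse, and I was wondering whether I could do it more efficiently, because the crawling process takes too long.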

2012-11-06 Ali

Have you tried using a different parser? You can use the [lxml](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) parser with Beautiful Soup, which makes it considerably faster. – kreativitea

@kreativitea I am looking into it. Thanks a lot for your help! – Ali
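
For reference, the parser suggestion above might look roughly like the following. This is only a sketch in Python 3 with bs4, not the asker's actual code: the helper names `make_soup` and `get_member_urls` and the `SAMPLE_HTML` snippet are made up for illustration, and the `members-list` class is taken from the question, so it may not match GitHub's current markup. It assumes lxml is installed and falls back to the standard-library parser if it is not.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BASE_URL = "https://github.com"


def make_soup(html, parser="lxml"):
    """Parse html with the requested parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, parser)
    except FeatureNotFound:
        # lxml is an optional third-party package; use the stdlib parser
        # when it is not installed.
        return BeautifulSoup(html, "html.parser")


def get_member_urls(html, parser="lxml"):
    """Return absolute profile URLs found inside <ul class="members-list">."""
    soup = make_soup(html, parser)
    urls = []
    for member_list in soup.find_all("ul", class_="members-list"):
        for item in member_list.find_all("li"):
            # Skip list items that contain no link at all
            if item.a is not None and item.a.get("href"):
                urls.append(BASE_URL + item.a["href"])
    return urls


# Hypothetical markup, shaped like what the question's code expects
SAMPLE_HTML = """
<ul class="members-list">
  <li><a href="/alice">alice</a></li>
  <li><a href="/bob">bob</a></li>
  <li>no link here</li>
</ul>
"""

print(get_member_urls(SAMPLE_HTML))
```

Passing the parser name explicitly also avoids relying on whichever parser Beautiful Soup happens to pick by default on a given machine.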

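Are you sure it is not your internet connection? The processing itself should be fast. My suggestion: write the output to a file and check how long each part takes. – RParadox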
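
One way to check this is to time the download and the parsing separately. The sketch below is only an illustration in Python 3 using the standard library: `timed`, `parse_links`, `LinkCollector` and `fake_download` are hypothetical names, and `fake_download` merely sleeps to imitate network latency instead of calling the real site. In a real run you would replace it with the actual `urlopen(url).read()` call and the actual parsing function.

```python
import os
import tempfile
import time
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stand-in for the real parsing step)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def parse_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fake_download(url):
    # Hypothetical placeholder for the real network request.
    time.sleep(0.05)
    return '<ul><li><a href="/alice">alice</a></li></ul>'


html, download_seconds = timed(fake_download, "https://github.com/facebook?tab=members")

# Save the page to a file so the parsing step can be re-run without the network.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "members.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    with open(path, encoding="utf-8") as handle:
        saved_html = handle.read()

links, parse_seconds = timed(parse_links, saved_html)
print("download: %.4fs, parse: %.4fs" % (download_seconds, parse_seconds))
```

If the download time dominates, a faster parser will not help much, and the network requests are the part worth optimizing.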