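Extracting information from a web page with NLTK in Python

How can I use NLTK in Python to extract information (job offers, in my case) from a web page?

I use this code to extract part of the text: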

import nltk
import urllib2
from cookielib import CookieJar

website = "http://tanitjobs.com/search-results-jobs/"
topSplit = "<div class=\"offre\">"
ButtomSplit = "<div class=\"offre-emploi&nbsp;\">"

# Opener that keeps cookies and sends a browser-like User-Agent header
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        ss = opener.open(website).read()
        # Keep only the markup between the two marker strings
        sourceCodeSplit = ss.split(topSplit)[1].split(ButtomSplit)[0]
        # Strip the HTML tags. nltk.clean_html exists in NLTK 2.x;
        # NLTK 3 dropped it in favour of BeautifulSoup's get_text()
        texte = nltk.clean_html(sourceCodeSplit)
        print texte
    except Exception, e:
        print 'fail in the main loop'
        print str(e)

main()

However, I don't know how to go about it if I want to extract specific paragraphs (the job offers) from a web page in general.

2014-02-21 Athari

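Welcome. Unfortunately, there is no single way to crawl web pages and extract a specific part of them. Have fun crawling/cleaning – alvas

Thanks, alvas. Can you give me some examples to get started? – Athari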

First, you need urllib.request from Python 3; see http://docs.python.org/3.0/library/urllib.request.html

Next, BeautifulSoup is your friend: http://www.crummy.com/software/BeautifulSoup/bs4/doc/. I found this useful for installing BS4 on py3.x: http://annelagang.blogspot.fr/2012/11/beautifulsoup-4-for-python-3x.html

Here is a working example:

import urllib.request
from bs4 import BeautifulSoup as bs

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

headers = {'User-Agent': user_agent}
url = "http://tanitjobs.com/search-results-jobs/"

request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()

# Loop over every element whose class is "offre-emploi vedette"
for i in bs(data, "html.parser").find_all(attrs={"class": "offre-emploi vedette"}):
    # Print the text of the offer's "detail" div with whitespace collapsed
    print(" ".join(i.find("div", attrs={"class": "detail"}).text.split()))
    print()  # blank line between offers

[out]:

Téléopérateurs (trices) Quality Com Center QualityCom sis a Montplaisir recrute des téléopérateurs (trices), en ... Voir toutes les offres de Quality Com Center

Contrôleur de gestion Eureka Développement Humain Pour le compte d'une multinationale dans le domaine de l'industrie, nous recrutons un : ... Voir toutes les offres de Eureka Développement Humain

Responsable ressources humaines (H/F) Eureka Développement Humain Pour son [...] compte Eureka Développement Humain recrute : Responsable ... Voir toutes les offres de Eureka Développement Humain

Contrôleur financier junior [...] Rattaché au directeur administratif et financier pays, votre rôle est de garantir la gestion ... Voir toutes les offres de [...]

Superviseur en prise de RDV (énergie renouvelable) Quality Com Center Quality Com Center sis a Montplaisir recrute 1 superviseur(e) en panneaux ... Voir toutes les offres de Quality Com Center

Téléconseillers (H/F) AXESS GLOBAL SERVICES AXESS GLOBAL SERVICES Recrutement Vous souhaitez travailler dans une entreprise [...] jeune et ... Voir toutes les offres de AXESS GLOBAL SERVICES
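
To try the extraction step without hitting the site, here is a minimal offline sketch. It is not the code from the answer: it swaps BeautifulSoup for the standard-library `html.parser` (Python 3), and it simplifies the selection by matching every `<div>` whose class list contains `detail`, without first filtering on `offre-emploi vedette`. The names `DivTextExtractor` and `extract_div_text` and the `SAMPLE` markup are made up for illustration and only imitate the class names used above.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.chunks = []    # text pieces of the match being read
        self.results = []   # one cleaned string per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a match
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a matching <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse whitespace, like " ".join(text.split()) above
                self.results.append(" ".join("".join(self.chunks).split()))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, css_class):
    """Return the cleaned text of each <div class="... css_class ...">."""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup that only imitates the structure of the real page
SAMPLE = """
<div class="offre-emploi vedette">
  <div class="detail">
    <a href="#">Job title</a>
    Company   name
    <p>Short description ...</p>
  </div>
</div>
<div class="offre-emploi">
  <div class="detail">Another <b>offer</b></div>
</div>
"""

for offer in extract_div_text(SAMPLE, "detail"):
    print(offer)
```

On the sample markup this prints `Job title Company name Short description ...` and `Another offer`. For real pages BeautifulSoup remains the more forgiving choice, since this sketch assumes well-nested `<div>` tags.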

2014-02-25 15:29:59 alvas

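I'm using Python 2.7; there is no "urllib.request" module :( – Athari

Only Python 3 lets you use the user-agent header. If you're just scraping the site, use py3.x for that and then use py2.x to process everything else – alvas

I used "urllib2" and it returns the same result :) – Athari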
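
As the last comment reports, `urllib2` on Python 2.7 can send the same header. Below is a minimal sketch of one way to write the request so that it runs on either version, assuming only the standard library. The helper names `build_request` and `fetch` and the short `USER_AGENT` string are illustrative choices, not part of the original answer.

```python
try:
    # Python 3
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2.7
    from urllib2 import Request, urlopen

URL = "http://tanitjobs.com/search-results-jobs/"
USER_AGENT = "Mozilla/5.0"  # any browser-like string; shortened here


def build_request(url=URL, user_agent=USER_AGENT):
    """Build a Request that carries a User-Agent header (no network access)."""
    return Request(url, None, {"User-Agent": user_agent})


def fetch(url=URL):
    """Download the page body as bytes; this call needs network access."""
    response = urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

The bytes returned by `fetch()` can then be handed to BeautifulSoup exactly as `data` is in the answer's example.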
