2016-02-27 41 views
0

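Difficulty scraping anchor tags from HTML with BeautifulSoup in Python 3

I want to extract the hrefs from an institute's web page, because I need the department codes for further scraping. I have written the following code: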

import re

import requests
from bs4 import BeautifulSoup

codesurl = "http://www.iitkgp.ac.in/academics/?page=acadunits"
response = requests.get(codesurl)
# print(response.content)
soup = BeautifulSoup(response.content, "html.parser")
# print(soup.prettify())
p = re.compile('page=acadunits*')
p1 = re.compile('<a href=.*page=acadunits*')

links = soup.find_all("a")
print(links)
for link in links:
    # if p1.match(link):
    print("%s" % (link))

But I am not getting all of the hrefs, for example:

<a href="?page=acadunits&amp;&amp;dept=ME">Mechanical Engineering</a> 
<a href="?page=acadunits&amp;&amp;dept=MD">Medical Science &amp; Technology</a> 
<a href="?page=acadunits&amp;&amp;dept=MT">Metallurgical &amp; Materials Engineering</a> 

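and many more. Can someone help me with this? This is my first time scraping. You can also take a look at the website. I need to extract the department codes from the URLs: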

dept=ME 
dept=MT 
dept=MD 

The web page contains:

<div class="tab_container"> 
<div id="tab1" class="tab_content" style="display: block;"> 
<h3></h3> 
    <!--Content--> 
    <img src="./Indian Institute of Technology Kharagpur_files/academicunits.jpg"> 
    <br><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=AE">Aerospace Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=AG">Agricultural &amp; Food Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=AR">Architecture &amp; Regional Planning</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=BT">Biotechnology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=CH">Chemical Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=CM">Chemistry</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=CE">Civil Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=CS">Computer Science &amp; Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=CR">Cryogenic Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=ED">Center for Educational Technology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=EE">Electrical Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=EC"> Electronics &amp; Electrical Communication Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=GS">G S Sanyal School of Telecommunications</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MG">Geology &amp; Geophysics</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=HS">Humanities &amp; Social Sciences</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=IM">Industrial &amp; Systems Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=IT">Information Technology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MS">Materials Science</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MM">Mathematics</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=ME">Mechanical Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MD">Medical Science &amp; Technology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MT">Metallurgical &amp; Materials Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MI">Mining Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=NA">Ocean Engineering &amp; Naval Architecture</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=N2">Oceans, Rivers, Atmosphere and Land Sciences</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=MP">Physics</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=PK">P K Sinha Centre for Bio Energy</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=RJ">Rajendra Mishra School of Engineering Entrepreneurship</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=RG">Rajiv Gandhi School of Intellectual Property Law</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=ID">Ranbir and Chitra Gupta School of Infrastructure Design and Management</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=RE">Reliability Engineering Centre</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=RT">Rubber Technology Centre</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=RD">Rural Development Centre</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=BS">School of Bioscience</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=ES">School of Energy Science &amp; Engineering</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=EF">School of Environmental Science and Technology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=NT">School of Nano-Science and Technology</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=WM">School of Water Resources</a><br> 
     <a href="http://www.iitkgp.ac.in/academics/?page=acadunits&amp;&amp;dept=SM">Vinod Gupta School of Management</a><br> 
    <br><br> 

    <!--Content--> 
    </div> 

But when I do:

codesurl="http://www.iitkgp.ac.in/academics/?page=acadunits" 
response = requests.get(codesurl) 
soup=BeautifulSoup(response.text) 

the soup does not contain these hrefs. Can someone suggest how to extract these href values?

Answers

1

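First, the department links are loaded dynamically by a separate GET request to a different URL (http://www.iitkgp.ac.in/academics/academic.php), so they are not part of the HTML returned for the page you requested.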
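One quick way to check whether the links are present in a given response is to look for `dept=` in the raw HTML before parsing it. The following is only a sketch: `find_dept_codes` is a hypothetical helper, and the two sample strings are stand-ins for real responses, not the actual content of the site.

```python
import re

# Hypothetical helper: collect department codes mentioned in raw HTML text.
DEPT_RE = re.compile(r"dept=([A-Z0-9]+)")


def find_dept_codes(html):
    """Return the department codes found in the given HTML text, in order."""
    return DEPT_RE.findall(html)


# Stand-in for a page whose links are filled in later by a separate request.
shell_page = '<div class="tab_container"><div id="tab1"></div></div>'

# Stand-in for the fragment that actually carries the links.
fragment = (
    '<a href="?page=acadunits&amp;&amp;dept=ME">Mechanical Engineering</a>'
    '<a href="?page=acadunits&amp;&amp;dept=N2">Oceans, Rivers, '
    'Atmosphere and Land Sciences</a>'
)

print(find_dept_codes(shell_page))  # []
print(find_dept_codes(fragment))    # ['ME', 'N2']
```

With real pages you would pass `requests.get(url).text` for each URL. An empty list for one URL and a non-empty list for the other suggests that the links come from the second request.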

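Then, the idea is to find all links whose href attribute value matches a specific pattern, and to use that same pattern to extract the department code. Working code: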

import re 

import requests 
from bs4 import BeautifulSoup 

codesurl = "http://www.iitkgp.ac.in/academics/academic.php" 
response = requests.get(codesurl) 
soup = BeautifulSoup(response.content, "lxml") 

pattern = re.compile(r"dept=([A-Z0-9]+)")  # codes can contain digits, e.g. N2
links = soup.find_all("a", href=pattern) 

for link in links: 
    print(pattern.search(link["href"]).group(1)) 

Prints:

AE 
AG 
AR 
... 
NT 
WM 
SM

0

A cleaner way to do this is to use parse_qs from the urllib.parse module:

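from urllib.parse import parse_qs, urlparse

for link in links:
    # Parse only the query string, not the whole URL.
    qs = parse_qs(urlparse(link.get('href')).query)
    print('dept', qs['dept'][0])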
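To see what parse_qs does with these hrefs, here is a small self-contained sketch using one of the URLs from the question. `query_params` is a hypothetical helper name. It passes only the query part to parse_qs, because parse_qs treats everything before the first `=` as a key if it is given the whole URL.

```python
from urllib.parse import parse_qs, urlparse


def query_params(href):
    """Return the query parameters of href as a dict of lists."""
    return parse_qs(urlparse(href).query)


href = "http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=AE"

# Parsing only the query string gives clean keys; the empty field
# produced by the doubled '&&' is skipped by default.
params = query_params(href)
print(params["dept"][0])

# Parsing the whole URL still finds 'dept', but the first key is mangled.
raw = parse_qs(href)
print(sorted(raw))
```

The same helper also works for relative hrefs such as `?page=acadunits&&dept=ME`, since urlparse still separates the query from the (empty) path.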

Or use rpartition:
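A minimal sketch of the rpartition approach, assuming the department code is whatever follows the last `dept=` in the href. `dept_code_from_href` is a hypothetical name used here for illustration.

```python
def dept_code_from_href(href):
    """Return the text after the last 'dept=' in href, or None if absent."""
    _, sep, tail = href.rpartition("dept=")
    if not sep:
        # rpartition found no 'dept=' separator in the string.
        return None
    # Drop any further parameters that might follow the code.
    return tail.split("&", 1)[0]


print(dept_code_from_href("?page=acadunits&&dept=ME"))
print(dept_code_from_href(
    "http://www.iitkgp.ac.in/academics/?page=acadunits&&dept=N2"))
print(dept_code_from_href("?page=acadunits"))
```

Checking the separator matters because rpartition returns the whole input as its last element when the separator is not found, which would otherwise be mistaken for a code.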
