2014-11-21 54 views

Web scraping: I only get 1/10 of the text I want (using BeautifulSoup)

I am trying to scrape data from a web page, and all the text I want sits between <p class="heading2"> and More...

It works for the first block of text, but only for that one.

E.g. I get:

Info about grant 1 

but the website has:

Info about grant 1 
Info about grant 2 
Info about grant 3 
etc. 

Below is the code I am using. I am new to BeautifulSoup, so I hope someone can help!

from bs4 import BeautifulSoup 
import sheetsync 
import urllib2, csv 
url = urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm').read() 
def processData(): 
    url = urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm').read() 
    soup = BeautifulSoup(url) 
    metaData = soup.find_all("div", {"id":"memberscontent"}) 
    authors = [] 
    for html in metaData: 
      text = BeautifulSoup(str(html).strip()).encode("utf-8").replace("Deadline", "DEADLINE").replace('\s+',' ').replace('\n+',' ').replace('\s+',' ') 
      authors.append(text.split('<p class="heading2">')[1].split('More...')[0].strip()) # get Pos 
      txt = 'grants.txt' 
    with open(txt, 'ab') as out: 
     out.writelines(authors) 
processData() 

Answers

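I would rely on the heading2 elements and take the next two p sibling tags after each one: the first is the deadline, the second is the grant text: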

import urllib2 
from bs4 import BeautifulSoup 

soup = BeautifulSoup(urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm')) 

for heading in soup.select('div#memberscontent p.heading2'): 
    deadline = heading.find_next_sibling('p') 
    article = deadline.find_next_sibling('p') 

    print heading.get_text(strip=True) 
    print deadline.get_text(strip=True) 
    print article.get_text(strip=True) 
    print "----" 

Prints:

The Sydney S. Spivack Program in Applied Social Research and Social PolicyASA Congressional Fellowship 
Deadline: February 15 
The ASA encourages applications for its Congressional Fellowship. The Fellowship brings a PhD-level sociologist to Washington, DC, to work as a staff member on a congressional committee, in a congressional member office, or in a congressional agency (e.g., the Government Accountability Office). This intensive six-month experience reveals the intricacies of the policy making process to the sociological fellow, and shows the usefulness of sociological data and concepts to policy issues.  [More...] 
---- 
Community Action Research Initiative (CARI Grants) The Sydney S. Spivack Program in Applied Social Research and Social Policy 
Deadline:  February 15 
To encourage sociologists to undertake community action projects that bring social science knowledge, methods, and expertise to bear in addressing community-identified issues and concerns, ASA administers competitive CARI awards. Grant applications are encouraged from sociologists seeking to work with community organizations, local public interest groups, or community action projects. Appointments will run for the duration of the project, whether the activity is to be undertaken during the year, in the summer, or for other time-spans.   [More...] 
---- 
Fund for the Advancement of the Discipline 
Deadlines:  June 15 | December 15 
The American Sociological Association invites submissions by PhD sociologists for the Fund for the Advancement of the Discipline (FAD) awards. Supported by the American Sociological Association through a matching grant from the National Science Foundation, the goal of this project is to nurture the development of scientific knowledge by funding small, groundbreaking research initiatives and other important scientific research activities such as conferences. FAD awards provide scholars with small grants ($7,000 maximum) for innovative research that has the potential for challenging the discipline, stimulating new lines of research, and creating new networks of scientific collaboration. The award is intended to provide opportunities for substantive and methodological breakthroughs, broaden the dissemination of scientific knowledge, and provide leverage for acquisition of additional research funds.  [More...] 
---- 
... 
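
The same sibling-based approach can be sketched for Python 3 as follows. This is only an illustration under stated assumptions: SAMPLE_HTML is a hypothetical, simplified stand-in for the page's markup (not the real HTML), and extract_grants is a helper name made up for this example. It uses the built-in "html.parser" backend so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's markup (not the real HTML).
SAMPLE_HTML = """
<div id="memberscontent">
  <p class="heading2">Grant A</p>
  <p>Deadline: February 15</p>
  <p>Info about grant A. [More...]</p>
  <p class="heading2">Grant B</p>
  <p>Deadline: June 15</p>
  <p>Info about grant B. [More...]</p>
</div>
"""


def extract_grants(html):
    """Return one (title, deadline, text) tuple per p.heading2 element."""
    soup = BeautifulSoup(html, "html.parser")
    grants = []
    for heading in soup.select("div#memberscontent p.heading2"):
        # The two <p> siblings after each heading: deadline, then grant text.
        deadline = heading.find_next_sibling("p")
        article = deadline.find_next_sibling("p") if deadline else None
        if deadline is None or article is None:
            continue  # skip headings without the two expected siblings
        grants.append((
            heading.get_text(strip=True),
            deadline.get_text(strip=True),
            article.get_text(strip=True),
        ))
    return grants


if __name__ == "__main__":
    for title, deadline, text in extract_grants(SAMPLE_HTML):
        print(title, deadline, text, sep="\n")
        print("----")
```

Iterating over every heading is what makes all grants show up. In the question's code, the loop runs once over the single memberscontent div, and split('<p class="heading2">')[1] keeps only the piece after the first heading, which matches seeing only the first grant.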

Thanks, this seems to work well, but the grant titles are missing? That is the first text after heading2! – Isak 2014-11-21 19:44:09

@user3343907 Sure, I have updated the answer. – alecxe 2014-11-21 19:45:19

Great, this is very useful, and I can learn a lot from it. Thank you! – Isak 2014-11-21 19:49:50
