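Filtering the list returned by a Beautiful Soup web scrape

I'm coding in Python. I've been trying to scrape a website for the names, team images, and colleges of NBA draft prospects. However, when I scrape the college name, I get both the college page link and the college name. How can I get only the college name? I have tried adding .string and .text to the end of the anchor (anchor.string).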

import urllib2
from BeautifulSoup import BeautifulSoup
# or, if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

list = []
soup = BeautifulSoup(
    urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()
)

# Take the first matching table and skip its first two rows.
rows = soup.findAll("table",
                    attrs={'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]

for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        # Skip the first link in the third cell and keep the rest.
        anchor = row.findAll("td")[2].findAll("a")[1:]
        if anchor:
            print anchor

2012-06-26 user1470901

Instead of just:

print anchor

use:

print anchor[0].text
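
To illustrate why indexing helps: findAll returns a list-like result set, so .text has to be read from an individual element rather than from the list itself. The following is a minimal sketch, assuming Python 3 with BeautifulSoup4 (bs4) installed. The SAMPLE_ROW markup, its URLs, and the names in it are hypothetical stand-ins, not the real structure of the CBS page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a single table row; the real page may differ.
SAMPLE_ROW = """
<table><tr>
  <td>1</td>
  <td><a href="/teams/example-team">Example Team</a></td>
  <td><a href="/players/example-player">Example Player</a>,
      <a href="/colleges/example-college">Example College</a></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr")

# Same slicing as in the question: drop the first link in the third cell.
anchors = row.find_all("td")[2].find_all("a")[1:]

# Each element is a Tag; read the displayed text and the link separately.
college_names = [a.get_text(strip=True) for a in anchors]
college_links = [a.get("href") for a in anchors]

print(college_names)
print(college_links)
```

Looping over the slice instead of taking only anchors[0] also covers rows that happen to contain more than one remaining link.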

2012-06-26 14:37:34 miles82

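The format of an HTML anchor is <a href='web_address'>Text-that-is-displayed</a>, so unless there is already a fancy HTML parser library (I bet there is, I just don't know of one), you will probably need to use some regular expressions to parse out the part of the anchor you want.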
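
As a rough illustration of the anchor structure described above, the href attribute and the displayed text can be separated with a parser instead of regular expressions. This is only a sketch using Python 3's standard-library html.parser module; the AnchorCollector class is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchors = []        # finished (href, text) pairs
        self._in_anchor = False  # True while inside an open <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


collector = AnchorCollector()
collector.feed("<a href='web_address'>Text-that-is-displayed</a>")
print(collector.anchors)
```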

2012-06-26 14:32:25 IanVS

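BeautifulSoup is the "fancy HTML parser library". And you can't parse HTML with regular expressions. See http://stackoverflow.com/a/1732454/10077

Thanks, I will have to look into BeautifulSoup. As for regular expressions, I enjoyed reading that post, but the second answer (the one that was awarded the bounty) does say that you can parse a limited, known subset of HTML, which is what I assumed findAll was doing. – IanVS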
