Get class name and contents using Beautiful Soup

Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent"> 
    <div class="feeditembodyandfooter"> 
     <div class="feeditembody"> 
     <span>The actual data is some where here</span> 
     </div> 
    </div> 
</div>

and this is the Python code:

from BeautifulSoup import BeautifulSoup 
html_doc = open('home.jsp.html', 'r') 

soup = BeautifulSoup(html_doc) 
class="feeditemcontent cxfeeditemcontent"


2012-07-04 Rajeev

Try this. It may be overkill for such a simple thing, but it works:

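def match_class(target):
    # Class names to look for, e.g. "a b" -> ["a", "b"]
    target = target.split()
    def do_match(tag):
        # In Beautiful Soup 3, tag.attrs is a list of (name, value) pairs
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        # True only if every requested class is present, in any order
        return all(c in classes for c in target)
    return do_match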

html = """<div class="feeditemcontent cxfeeditemcontent"> 
<div class="feeditembodyandfooter"> 
<div class="feeditembody"> 
<span>The actual data is some where here</span> 
</div> 
</div> 
</div>""" 

from BeautifulSoup import BeautifulSoup 

soup = BeautifulSoup(html) 

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent")) 
for m in matches: 
    print m 
    print "-"*10 

matches = soup.findAll(match_class("feeditembody")) 
for m in matches: 
    print m 
    print "-"*10


2012-07-04 15:16:49 jadkik94

classes = dict(tag.attrs).get("class", "") is much shorter than the try/except block and does the same thing.

@DoronCohen Is the dict() needed? It seems to work without it. – Mark

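@Mark I get an exception without dict(), because it is a list: "TypeError: list indices must be integers, not str". Also, this answer assumes Beautiful Soup 3 (which is probably why you are seeing different results). You should use version 4 and go with the other answers. – jadkik94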
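
To illustrate the TypeError mentioned above, here is a minimal sketch in plain Python. It assumes that Beautiful Soup 3 exposes tag.attrs as a list of (name, value) pairs, which is what the error message suggests. The attrs value is made up for illustration, and no Beautiful Soup import is needed to see the effect.

```python
# Hypothetical stand-in for tag.attrs as Beautiful Soup 3 would expose it
attrs = [("class", "feeditemcontent cxfeeditemcontent")]

try:
    attrs["class"]  # indexing a list with a string raises TypeError
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Converting to a dict first allows lookup by name, with a default value
classes = dict(attrs).get("class", "").split()

# A tag without a class attribute simply yields an empty list
missing = dict([("id", "x")]).get("class", "").split()
```

This is why the dict() call matters in the version 3 answer, and why the one-line .get() form can replace the try/except block.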

soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})


2012-07-04 14:55:52

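Or soup.findAll if you want multiple matches (with the same arguments).

I wouldn't really use that code, for obvious reasons. Check my answer. There is a related bug report. – SuperSaiyan

Can you explain why you downvoted my solution? It works flawlessly.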

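Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

As you can see, Beautiful Soup does not really understand class="a b" as the two classes a and b.

However, as the first comment there points out, a simple regular expression is enough. In your case: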

import re

soup = BeautifulSoup(html_doc)
# \b matches the class name as a whole word inside the attribute string
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x

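Note: this has been fixed in the recent beta release. I have not read through the docs for the recent versions, so maybe you can do this directly there. If you need to make it work with an older version, you can use the code above.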
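
For reference, a minimal sketch of how this looks with Beautiful Soup 4, assuming Python 3 and the bs4 package with the built-in html.parser. There, class is treated as a multi-valued attribute, so searching for a single class name finds the div even though it carries two. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Searching for one class matches a tag that has several classes
outer = soup.find("div", class_="feeditemcontent")

# The class attribute comes back as a list of names
classes = outer["class"]

# Collect the text nested inside the div, with surrounding whitespace stripped
text = outer.get_text(strip=True)
```

No regular expression is needed in this case, since each class name is compared on its own.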


2012-07-04 14:56:05 SuperSaiyan

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:
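
A sketch of what such a simplification might look like, assuming Python 3 and bs4 (this is an illustration, not necessarily the code this answer originally contained). Since tag.get("class") already returns a list of names, the dict() conversion and the split() on the attribute value are no longer needed.

```python
from bs4 import BeautifulSoup

def match_class(target):
    # Class names to look for, e.g. "a b" -> ["a", "b"]
    wanted = target.split()
    def do_match(tag):
        # In bs4 the class attribute is a list; default to [] when absent
        classes = tag.get("class", [])
        return all(c in classes for c in wanted)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Order of the class names does not matter with this matcher
outer_matches = soup.find_all(match_class("cxfeeditemcontent feeditemcontent"))
body_matches = soup.find_all(match_class("feeditembody"))
```

Because membership is checked per class name, "feeditembody" does not accidentally match the div whose class is "feeditembodyandfooter".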


2012-07-05 14:22:08

from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
# Matches on the exact attribute string; here the attribute is id, not class
divs = soup.findAll('div', attrs={'id': 'abc def'})
print divs


2013-02-16 06:26:47

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:

from bs4 import BeautifulSoup as bs 
import requests 

url = "http://stackoverflow.com/" 
html = requests.get(url).text 
soup = bs(html, "html.parser")  # name the parser explicitly

tags = soup.findAll("div", class_="header")

It's already in the BS4 documentation.
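
A small sketch of the behaviour described there, assuming Python 3 and bs4 (including its soupsieve dependency, which select relies on). Passing the full string to class_ matches the attribute value exactly, so the order of the class names matters, whereas a CSS selector matches both classes in any order. The one-line HTML is a shortened stand-in for the page in the question.

```python
from bs4 import BeautifulSoup

html = '<div class="feeditemcontent cxfeeditemcontent"><span>data</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute: matches
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# Same classes in a different order: not matched as a string
reversed_order = soup.find_all("div", class_="cxfeeditemcontent feeditemcontent")

# CSS selector requiring both classes, independent of order
by_selector = soup.select("div.cxfeeditemcontent.feeditemcontent")
```

If the order of class names in the page is not guaranteed, the selector form is probably the safer choice.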


2014-07-24 05:29:55
