Extracting several values with Beautiful Soup

2017-03-07
<p class=""> 
    Teacher: 
<a href="/name/nm12345/?ref_=adv_0" 
>Scott</a> 
      <span class="ghost">|</span> 
    Students: 
<a href="/name/nm12345/?ref_=adv_1" 
>Benedict</a>, 
<a href="/name/nm12345/?ref_=adv_2" 
>Chiwetel</a>, 
<a href="/name/nm12345/?ref_=adv_3" 
>Rachel</a>, 
<a href="/name/nm12345/?ref_=adv_4" 
>Benedict Wong</a> 
    </p> 

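I want to extract the teacher's name, "Scott", from under the "Teacher" label, and all of the students' names from under the "Students" label. I tried soup.find(lambda tag:tag), and it returns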

<a href="/name/nm12345/?ref_=adv_0" 
>Scott</a> 

, which I don't think is the right approach. How should the code actually extract the names under the "Teacher" and "Students" labels?

Answer

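Assuming the HTML block doesn't change much across the other pages you parse, you can find your p tag by class (there is none in your example) and verify that the Teacher text is present.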

If it is, take the first a element inside the p tag via .contents[1] (.contents[0] is the "Teacher:" text node that precedes it).

Next, find all a tags whose href attribute does not match the teacher's.

Example:

from bs4 import BeautifulSoup 

example = """<p class=""> 
Teacher: 
<a href="/name/nm12345/?ref_=adv_0" 
>Scott</a> 
     <span class="ghost">|</span> 
Students: 
<a href="/name/nm12345/?ref_=adv_1" 
>Benedict</a>, 
<a href="/name/nm12345/?ref_=adv_2" 
>Chiwetel</a>, 
<a href="/name/nm12345/?ref_=adv_3" 
>Rachel</a>, 
<a href="/name/nm12345/?ref_=adv_4" 
>Benedict Wong</a> 
</p>""" 

soup = BeautifulSoup(example, "html.parser") 

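# Find the first tag whose text contains "Teacher" (the <p> in this example)
Classroom = soup.find(lambda x: "Teacher" in x.get_text())

if Classroom is not None:

    # contents[0] is the "Teacher:" text node, so contents[1] is the first <a>
    Teacher = Classroom.contents[1]
    TeacherUrl = Teacher["href"]

    # Every other link in the paragraph belongs to a student
    Students = Classroom.find_all(lambda tag: tag.has_attr('href') and TeacherUrl not in tag["href"])

    print(Teacher.text)
    for Student in Students:
        print(Student.text)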

which outputs:

Scott
Benedict
Chiwetel
Rachel
Benedict Wong
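
One caveat with the approach above: on a full page, a lambda that only checks get_text() would match the outermost tag (such as html) first, and .contents[1] depends on the exact node order. As a rough alternative sketch, not part of the original answer, the search can be restricted to p tags and the links grouped under whichever "Label:" text node precedes them. The helper name names_by_label, the html/body wrapper, and the assumption that every label ends with a colon are all illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Same paragraph as in the question, wrapped in <html>/<body> (an assumption)
# to mimic a full page.
html = """<html><body>
<p class="">
    Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
      <span class="ghost">|</span>
    Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>
</body></html>"""


def names_by_label(p_tag):
    """Hypothetical helper: group link texts under the preceding "Label:" text."""
    groups = {}
    current = None
    for child in p_tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            # A text node such as "Teacher:" starts a new group;
            # separators like "," and bare whitespace are ignored.
            if text.endswith(":"):
                current = text[:-1]
                groups[current] = []
        elif isinstance(child, Tag) and child.name == "a" and current is not None:
            groups[current].append(child.get_text(strip=True))
    return groups


soup = BeautifulSoup(html, "html.parser")

# Only consider <p> tags, so an enclosing <html> or <body> is never matched.
p_tag = soup.find(lambda tag: tag.name == "p" and "Teacher:" in tag.get_text())

result = names_by_label(p_tag) if p_tag is not None else {}
print(result)
# {'Teacher': ['Scott'], 'Students': ['Benedict', 'Chiwetel', 'Rachel', 'Benedict Wong']}
```

Because the grouping follows the labels rather than the href values, this variant would also cope with a teacher and a student sharing the same URL, at the cost of relying on the "Label:" text format.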