2017-01-10 72 views
1

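Web scraping with Beautiful Soup gives an empty result

I am experimenting with Beautiful Soup and trying to extract information from an HTML document that contains segments of the following type: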

<div class="entity-body"> 
<h3 class="entity-name with-profile"> 
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&amp;trk=manage_invitations_profile" 
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&amp;trk=manage_invitations_miniprofile" 
class="miniprofile" 
aria-label="View profile for Ivan Grigorov"> 
<span>Ivan Grigorov</span> 
</a> 
</h3> 
<p class="entity-subheader"> 
Teacher 
</p> 
</div> 

I use the following commands:

with open("C:\Users\pv\MyFiles\HTML\Invites.html","r") as Invites: soup = bs(Invites, 'lxml') 
soup.title 
out: <title>Sent Invites\n| LinkedIn\n</title> 
invites = soup.find_all("div", class_ = "entity-body") 
type(invites) 
out: bs4.element.ResultSet 
len(invites) 
out: 0 

Why does find_all return an empty ResultSet object?

Any suggestions would be greatly appreciated.

+0

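Try looking at the page as you actually get it. If you can't see this 'div' tag there, it means that part is generated with 'JS', so you can't scrape it with this method (you would have to use 'selenium'). – Fejs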
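One way to act on this suggestion without opening a browser is to search the raw text of the saved file for the class name before any parser is involved. The sketch below uses only the standard library; the helper name `count_marker`, the default marker, and the temporary sample file are illustrative assumptions, not part of the original question.

```python
import os
import tempfile


def count_marker(path, marker="entity-body", encoding="utf-8"):
    """Count how often `marker` appears in the raw text of the file at `path`."""
    # errors="replace" keeps the check going even if the file contains odd bytes.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read().count(marker)


# Hypothetical stand-in for the saved page, written to a temporary file.
sample = '<html><body><div class="entity-body"><p>Teacher</p></div></body></html>'
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

marker_count = count_marker(sample_path)
print(marker_count)
os.remove(sample_path)
```

If the count is 0 for the real file, the markup is not in the saved HTML at all and no parser setting will find it. If it is greater than 0, the problem lies in how the file is being parsed.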

Answers

0

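The problem is that the file is not being read: what you pass in is just a TextIOWrapper (Python 3) or File (Python 2) object. Read the document and pass the markup, which is essentially a string, to BeautifulSoup.

The correct code would be: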

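from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r"C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
    soup = BeautifulSoup(Invites.read(), "html.parser")

print(soup.title)
invites = soup.find_all("div", class_="entity-body")
print(len(invites))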
+0

I changed the code as you suggested, but I still get len(invites) as 0. – gk7

+0

I get 1. Maybe add a 'print' statement: 'print(len(invites))' (Python 3) or 'print len(invites)' (Python 2). – dasdachs
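For what it is worth, the Beautiful Soup documentation states that the constructor accepts either a string or an open filehandle, which would be consistent with `.read()` making no difference for gk7. A minimal sketch to check this, assuming `bs4` is installed; the temporary file and the variable names are illustrative only:

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = '<div class="entity-body"><p class="entity-subheader">Teacher</p></div>'

# Write the sample markup to a temporary file so both call styles can be compared.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    path = tmp.name

# Style 1: pass the open file object directly.
with open(path, "r", encoding="utf-8") as handle:
    from_handle = BeautifulSoup(handle, "html.parser")

# Style 2: read the file first and pass the resulting string.
with open(path, "r", encoding="utf-8") as handle:
    from_string = BeautifulSoup(handle.read(), "html.parser")

os.remove(path)

count_handle = len(from_handle.find_all("div", class_="entity-body"))
count_string = len(from_string.find_all("div", class_="entity-body"))
print(count_handle, count_string)
```

If both counts agree on a given file, the way the file is handed to BeautifulSoup is probably not the cause of an empty result.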

0
import bs4 

html = '''<div class="entity-body"> 
<h3 class="entity-name with-profile"> 
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&amp;trk=manage_invitations_profile" 
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&amp;trk=manage_invitations_miniprofile" 
class="miniprofile" 
aria-label="View profile for Ivan Grigorov"> 
<span>Ivan Grigorov</span> 
</a> 
</h3> 
<p class="entity-subheader"> 
Teacher 
</p> 
</div>''' 

soup = bs4.BeautifulSoup(html, 'lxml') 
invites = soup.find_all("div", class_ = "entity-body") 
len(invites) 

Out:

1 

This code works fine.

+0

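Then the problem lies in the statement that reads the HTML page and converts it into a soup object. That is strange, because I copied this syntax from a book and have already tested it with another HTML page. The HTML page was produced with Chrome's "Save as..." command, by right-clicking the web page while it was open in the browser. What went wrong? – gk7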
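Since the same `find_all` call works on the snippet, one thing that may be worth ruling out is a difference between parsers: the Beautiful Soup documentation notes that different parsers can build different trees from the same imperfect document. The sketch below is only a diagnostic idea under that assumption, not a confirmed fix; `count_by_parser` is a hypothetical helper, and `lxml` and `html5lib` are tried only if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of div.entity-body matches} per available parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("div", class_="entity-body"))
    return counts


sample = '<div class="entity-body"><p class="entity-subheader">Teacher</p></div>'
results = count_by_parser(sample)
print(results)
```

Running the same helper on the text of the saved page would show whether the parsers disagree. If every parser reports 0, the markup is more likely missing from the saved file than lost during parsing.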

+0

@gk7 Could you provide the full HTML code of the page, or its URL? –

+0

Thanks for your reply. The page is: https://www.linkedin.com/people/invite – gk7