Scraping elements without an id or class from a web page with Python and BeautifulSoup

I know how to scrape data from a web page when the element has an id or class.

For example, here soup is a BeautifulSoup object.

for item in soup.find_all('a', {"class": "class_name"}):
    title = item.get_text()  # .string can be None when the tag has nested children
    print(title + "\n")

What should we do if the element has no id or class? For example, a paragraph element without an id or class.

Or, in a worse case, what if we only need to scrape plain text like the following?

<body> 
<p>YO!</p> 
hello world!! 
</body>

How can I print only hello world!! from the page source above? It has no id or class.

Source

2015-12-19 RaviTej310

I removed your second question because it is off-topic (on SO). But do you mean `soup.find('body')` or `soup.find_all('body')`? –

I don't know what those two statements mean. If you tell me, I can answer your question. :) – RaviTej310

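Ah, fine. However, *what other good scraping packages are there besides BeautifulSoup?* is a **primarily opinion-based** question, and that is off-topic for this site. Please **do not ask them**. If you look at your question, you can see my edit where I removed them. [Here is what is on-topic for this site.](http://stackoverflow.com/help/on-topic) –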

If you want to find elements that specifically have no explicit id and class attributes:

soup.find("p", class_=False, id=False)
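As a minimal sketch of how this call behaves, the snippet below runs it against a small made-up document. The HTML string and the variable names are assumptions for illustration and are not part of the question. Passing False for an attribute matches only tags that do not carry that attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for illustration.
html = """
<body>
<p id="intro">has an id</p>
<p class="note">has a class</p>
<p>YO!</p>
<p>plain again</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# class_=False and id=False match only tags that lack those attributes.
first_plain = soup.find("p", class_=False, id=False)
all_plain = soup.find_all("p", class_=False, id=False)

print(first_plain.get_text())
print([p.get_text() for p in all_plain])
```

find returns only the first match, so find_all is the one to use when several attribute-less paragraphs are expected.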

To find a text node like the hello world!! in your example, you can look it up by the text itself, using a regular expression, an exact match, or a function:

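import re

# In older BeautifulSoup versions the string= argument is spelled text=
soup.find(string=re.compile(r"^\s*hello"))  # text starting with "hello", allowing leading whitespace
soup.find(string="hello world!!")  # text node that is exactly "hello world!!" (surrounding whitespace counts)
soup.find(string=lambda text: text and "!!" in text)  # text containing "!!"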
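The following sketch tries these lookups on the page from the question, reproduced here as a string (the exact whitespace is an assumption). Because the text node includes the newlines around it, the regular expression allows leading whitespace, and the exact-match lookup is expected to find nothing unless the node has no surrounding whitespace.

```python
import re
from bs4 import BeautifulSoup

# The page from the question; the newline placement is assumed.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")

# The text node is "\nhello world!!\n", so allow leading whitespace.
by_regex = soup.find(string=re.compile(r"^\s*hello"))

# A function receives each string and returns True for a match.
by_function = soup.find(string=lambda s: s and "!!" in s)

# Exact matching compares the whole node, whitespace included.
exact = soup.find(string="hello world!!")

print(by_regex.strip())
print(by_function.strip())
print(exact)
```

Calling strip() on the result removes the surrounding newlines before printing.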

Alternatively, you can find the preceding p element and get the next text node:

# In older BeautifulSoup versions the string= argument is spelled text=
soup.find("p", class_=False, id=False).find_next_sibling(string=True)
soup.find("p", string="YO!").find_next_sibling(string=True)
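A short sketch of the sibling approach on the question's page, again with the HTML reproduced as a string and the variable names chosen for illustration:

```python
from bs4 import BeautifulSoup

# The page from the question; the newline placement is assumed.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Locate the paragraph by its text, then take the next string sibling.
p = soup.find("p", string="YO!")
text_node = p.find_next_sibling(string=True)

print(text_node.strip())
```

In this particular document the text node directly follows the paragraph, so p.next_sibling would return the same node; find_next_sibling(string=True) also skips over any tags that might sit in between.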

Source

2015-12-19 12:47:11 alecxe

However, if you only want the text directly inside the body tag, and not the text inside any of its child tags:

You can use tag.find_all() to get all the tags inside it, then remove them with tag.extract(). You are then left with a body tag that contains only text.

For example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <body>
... <p>YO!</p>
... hello world!!
... </body>
... ''', 'html.parser')

>>> print(soup.get_text()) 

YO! 
hello world!! 


>>> print(soup.find('body').get_text()) 

YO! 
hello world!! 

>>> for tag in soup.find('body').find_all(): 
...  tag.extract() 
...  
... 
<p>YO!</p> 
>>> print(soup.find('body').get_text()) 


hello world!! 

>>> print(soup.find('body').get_text(strip=True)) 
hello world!! 
>>>
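Note that extract() modifies the parsed tree. If the tree needs to stay intact, one possible alternative (a sketch, not part of the answer above) is to collect only the strings that are direct children of body by passing recursive=False:

```python
from bs4 import BeautifulSoup

# The page from the question; the newline placement is assumed.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# recursive=False restricts the search to direct children of body,
# so the "YO!" inside <p> is not included.
direct_text = body.find_all(string=True, recursive=False)
own_text = "".join(direct_text).strip()

print(own_text)
```

The p element is still present in soup afterwards, which matters if the same tree is queried again later.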

Source

2015-12-19 12:50:56
