How to find a sentence on a website?

I tried this:

import urllib
from urllib import request

url = "https://fotka.com/profil/k"
word = "Nie ma profilu"


def search_website(url, word):
    page = urllib.request.urlopen(url)
    phrase_present = False

    # The response yields bytes, so the phrase is encoded before comparing.
    for i in page:
        if bytes(word, encoding='utf8') in i:
            phrase_present = True
            print(i)

    return phrase_present


finder = search_website(url, word)
print(finder)

It seems to work, but let me explain the problem with the url. If you open this in a browser:

url = "https://fotka.com/profil/k"

the searched word is indeed on the page, so the function correctly returns True. But if you open:

url = "https://fotka.com/profil/kkkk"

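there is no such word on the page, yet the function still returns True.

I checked the variable page, and its content is the same in both cases even though the url is different...

Does anyone know why this happens, or have an idea for a solution?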
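One way to narrow this down is to look at what the server actually sent back for each url, not only at the body. The sketch below is an assumption about how one might check this, not a diagnosis of this particular site. It records the final URL reported by `geturl()` (which differs from the requested one after a redirect), the HTTP status, and a hash of the body so two responses can be compared. The helper names `summarize_response`, `same_content` and `was_redirected` are hypothetical.

```python
import hashlib
from urllib import request


def summarize_response(url):
    """Fetch url and return facts that help compare two responses."""
    with request.urlopen(url) as response:
        body = response.read()
        return {
            "requested_url": url,
            "final_url": response.geturl(),  # differs from url after a redirect
            "status": response.status,       # HTTP status code of the final response
            "body_sha256": hashlib.sha256(body).hexdigest(),
        }


def same_content(summary_a, summary_b):
    """True when both responses carried byte-identical bodies."""
    return summary_a["body_sha256"] == summary_b["body_sha256"]


def was_redirected(summary):
    """True when the final URL is not the one that was requested."""
    return summary["requested_url"] != summary["final_url"]
```

Calling `summarize_response` on both profile urls and passing the results to `same_content` and `was_redirected` would show whether the two requests end up at the same place. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a missing page may surface as an exception and not as a body.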

Source

2017-09-21 Emejcz

Your question is quite broad, but I think you are looking for the data between paragraph tags <p>:

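import re
from urllib import request

url = "some page"
word = "some word"

# Decode the response bytes to text before searching (assumes the page is UTF-8).
page_data = request.urlopen(url).read().decode("utf-8", errors="replace")
paragraph_data = re.findall("<p>(.*?)</p>", page_data)
final_paragraph_data = [i for i in paragraph_data if word in i]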

final_paragraph_data now stores a list of all the paragraphs whose content contains word.
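To make the behaviour concrete, here is a minimal sketch that runs the same idea on an in-memory HTML string, with no request to a live page. The sample markup and the helper name `paragraphs_containing` are made up for illustration. It also passes `re.DOTALL`, which the snippet above does not, so that a `<p>` element spanning several lines is still matched. A regular expression remains a rough tool for HTML: this pattern will miss tags with attributes such as `<p class="x">`.

```python
import re


def paragraphs_containing(html_text, word):
    """Return the inner text of every <p>...</p> block that contains word."""
    # re.DOTALL lets "." match newlines, so multi-line paragraphs are captured.
    paragraphs = re.findall(r"<p>(.*?)</p>", html_text, flags=re.DOTALL)
    return [p for p in paragraphs if word in p]


# Hypothetical page content used only for this example.
sample_html = """
<html><body>
<p>Nie ma profilu</p>
<p>Another paragraph
that spans two lines</p>
<div>Nie ma profilu outside a paragraph</div>
</body></html>
"""

matches = paragraphs_containing(sample_html, "Nie ma profilu")
print(matches)  # ['Nie ma profilu']
```

The phrase inside the `<div>` is not reported, because only text between `<p>` and `</p>` is considered.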

Source

2017-09-21 17:57:45 Ajax1234

You may also want to look at the 're.MULTILINE' and 're.DOTALL' flags. –

I changed the content of my question to make it easier to understand. – Emejcz

If your question is "how do I check whether a page contains some visible text?", then this may be a solution for you:

from urllib import request
from bs4 import BeautifulSoup

url = "some page"
word = "some word"

page = request.urlopen(url).read()

# Parse the HTML and search its extracted text rather than the raw markup.
html = BeautifulSoup(page, "html.parser")
print(word in html.get_text())
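If installing BeautifulSoup is not an option, a similar check can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what the answer above uses. The class `VisibleTextParser`, the function `visible_text` and the sample markup are hypothetical, and "visible" here only means "not inside `<script>` or `<style>`". It does not account for CSS hiding or for content rendered by JavaScript.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html_text):
    """Return the page text with runs of whitespace collapsed to single spaces."""
    parser = VisibleTextParser()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page content used only for this example.
sample_html = (
    "<html><head><script>var word = 'hidden phrase';</script></head>"
    "<body><p>Nie ma profilu</p></body></html>"
)

print("Nie ma profilu" in visible_text(sample_html))  # True
print("hidden phrase" in visible_text(sample_html))   # False
```

The second check is False because the phrase only occurs inside a `<script>` element, which the parser skips.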

Source

2017-09-21 18:08:28 ruX

I changed the content of my question to make it easier to understand. – Emejcz
