Scraping articles with Python 3.4 and BeautifulSoup

The website I want to scrape:

https://xueqiu.com/yaodewang

I want to scrape all of this user's articles. I am using BeautifulSoup and requests like this:

import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
header = {'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'}
r = requests.get(url, headers=header).content
soup = BeautifulSoup(r, 'lxml')
articles = soup.find_all('ul', {'class': 'status-list'})
print(articles)

But this is all it returns:

[]

So I tried other selectors, like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'}) 
# art = soup.find_all('div',{'class':'status_bd'}) 
# art = soup.find_all('div',{'class':'status_content container active tab-pane'})

However, these return some incorrect text rather than the article content I want.

I need your help. Thank you very much!

Source

2016-05-01 champion Ch

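The desired data is not actually inside the element with the status-list class. If you look at the page source, you will find an empty container instead: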

<div class="status_bd"> 
    <div id="statusLists" class="allStatuses no-head"></div> 
</div>

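Instead, the statuses are located inside a script element. You need to find that script, extract the desired object from it, load it from JSON into a Python dictionary, and pull out the information you need: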

import json 
import re 
import requests 
from bs4 import BeautifulSoup 

url = 'https://xueqiu.com/yaodewang' 
headers = { 
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36' 
} 
r = requests.get(url, headers=headers).content 
soup = BeautifulSoup(r, 'lxml') 

pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL) 
script = soup.find("script", text=pattern) 

data = json.loads(pattern.search(script.text).group(1)) 
for item in data["statuses"]: 
    print(item["description"])

Prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you. 
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week... 
... 
點.點.點... 點到這個，學位、學歷、成績單翻譯一下要50塊、100塊的...
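
The regex and JSON steps can also be tried offline, without hitting the site. The following is only a minimal sketch: SAMPLE_SCRIPT is a made-up, heavily shortened stand-in for the page's inline script, and extract_descriptions is a hypothetical helper name, not something the site or any library provides.

```python
import json
import re

# Hypothetical stand-in for the inline script text. The real page embeds a
# much larger object; only its overall shape is assumed here.
SAMPLE_SCRIPT = """
SNB.data.req_isBrick = 0;
SNB.data.statuses = {"count": 2, "statuses": [
    {"id": 1, "description": "first post"},
    {"id": 2, "description": "second post"}
]};
SNB.data.other = {};
"""

# Same idea as the pattern above: capture the object assigned to
# SNB.data.statuses, up to the first "}" that is followed by ";".
PATTERN = re.compile(r"SNB\.data\.statuses = ({.*?});", re.DOTALL)


def extract_descriptions(script_text):
    """Return the description of every status embedded in script_text."""
    match = PATTERN.search(script_text)
    if match is None:
        # The assignment was not found, e.g. the page layout changed.
        return []
    data = json.loads(match.group(1))
    return [item["description"] for item in data["statuses"]]


descriptions = extract_descriptions(SAMPLE_SCRIPT)
print(descriptions)
```

One caveat with the non-greedy match: it stops at the first `};` it sees, so a status whose text happens to contain `};` would cut the capture short and make json.loads fail. Returning an empty list when nothing matches is a design choice of this sketch; raising an error may suit a real scraper better.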

Source

2016-05-01 02:24:49 alecxe

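Thank you very much. It is a correct method, but I would like to know: if I know the content is located in a script, how do I come up with a regular expression like this one: `pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)`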

Another question: I want to get a list of articles, but right now I get strings printed one by one. I want a result like this: [str01, str02, ...]

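@championCh Sure, just extract the script text and experiment with it, for example on [regex101](https://regex101.com/). As for your second question, I think you are asking how to put the results into a list: `articles = [item["description"] for item in data["statuses"]]`. Hope that helps. – alecxe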
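
One possible way to work out what the pattern should anchor on, sketched under assumptions: take a phrase that is visible in the rendered page, search for it in the raw HTML, and look at the text just before it. RAW_HTML below is a made-up miniature of the page, and show_context is a hypothetical helper, not part of requests or BeautifulSoup.

```python
# Hypothetical raw HTML; in practice this would come from the HTTP response.
RAW_HTML = (
    '<html><body><div id="statusLists"></div>'
    '<script>SNB.data.statuses = {"statuses": '
    '[{"description": "Lighten up!"}]};</script></body></html>'
)


def show_context(raw_html, marker, width=60):
    """Return the text around the first occurrence of marker, or None."""
    pos = raw_html.find(marker)
    if pos == -1:
        return None
    start = max(0, pos - width)
    return raw_html[start:pos + len(marker) + width]


# Search for a phrase known to appear in one of the posts.
context = show_context(RAW_HTML, "Lighten up!")
print(context)
```

The printed context shows the assignment that precedes the data (here `SNB.data.statuses = `), which is the literal prefix the regular expression is built around. The captured group can then be refined interactively, for instance on regex101.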
