How do I extract articles from Hindi web pages with Goose?


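I use Python Goose to extract articles from web pages. It works for many languages, but not for Hindi. I tried adding Hindi stop words as stopwords-hi.txt and setting target_language to hi, without success. Thanks, Eran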

2016-05-17 Eran Ben-Natan

How exactly does it fail?

cleaned_text returns nothing.

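Yes, I have the same problem. I have been working with articles in all the Indian regional languages, and I could not extract the content using Goose alone. If the article description by itself is enough for you, meta_description works perfectly. You can use it in place of cleaned_text, which returns nothing.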
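The fallback described above can be sketched as follows. This is only a minimal sketch under stated assumptions: `best_text` is a hypothetical helper name, and the `SimpleNamespace` object is a stand-in for the article that Goose's `extract()` returns, assumed to expose `cleaned_text` and `meta_description` attributes. The description string is made-up sample data.

```python
from types import SimpleNamespace


def best_text(article):
    """Return the article body, or the meta description if the body is empty.

    `article` is assumed to expose `cleaned_text` and `meta_description`,
    as the objects returned by Goose's extract() do.
    """
    body = (getattr(article, "cleaned_text", "") or "").strip()
    if body:
        return body
    return (getattr(article, "meta_description", "") or "").strip()


# Stand-in for a Hindi page where Goose found no body text (hypothetical data).
hindi_article = SimpleNamespace(
    cleaned_text="",
    meta_description="Sample description of a Hindi news article",
)
print(best_text(hindi_article))

# With Goose installed, the helper would be used roughly like this:
#   from goose import Goose
#   g = Goose({"use_meta_language": False, "target_language": "hi"})
#   print(best_text(g.extract(url=url)))
```

The description is usually only a sentence or two, so this suits summaries or indexing rather than full-text extraction.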

Another option, though it takes more lines of code:

from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

from bs4 import BeautifulSoup

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")  # requires the lxml package

# Remove scripts, styles and links so that only the article content remains
for tag in soup(["script", "style", "a", "formfield"]):
    tag.extract()

text = soup.get_text()

# Strip each line, split on double spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = "\n".join(chunk for chunk in chunks if chunk)

print(text)
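The whitespace clean-up at the end of the script can be isolated into a small helper to see what it does. This is a sketch rather than part of the original answer: `tidy_text` is a hypothetical name, the sample string is invented, and the split on two consecutive spaces is an assumption about the intended behaviour (it separates headlines that run together on one line while keeping single spaces between words).

```python
def tidy_text(text):
    """Collapse the stray whitespace that get_text() leaves behind."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so that headlines sharing one line are
    # separated; single spaces between words are left alone.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and put each remaining chunk on its own line.
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  First headline  Second headline \n\n\n   Article body text  \n"
print(tidy_text(sample))
```

Because the helper only uses string methods, it works the same way on Devanagari text as on the ASCII sample.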

Full disclosure: I actually found the original code somewhere on Stack Overflow and modified it a little.


2016-06-10 04:48:44
