Beautiful Soup for Wikipedia

I can't get this script to work to scrape information from a series of Wikipedia articles.
What I want to do is iterate over a series of wiki URLs and pull out the page links on the wiki portal category pages (for example https://en.wikipedia.org/wiki/Category:Electronic_design).
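For a single category page, what I'm after is roughly this (just a minimal sketch; I'm assuming the links I want are the anchor tags inside the div with id="mw-pages", and I'm using the same urllib/BeautifulSoup setup as in the full script below):

from urllib import urlopen
from bs4 import BeautifulSoup

# Fetch one category page and print the article links listed in its
# "Pages in category" section (the div with id="mw-pages").
url = "https://en.wikipedia.org/wiki/" + "Category:Electronic_design"
soup = BeautifulSoup(urlopen(url).read())
pages_div = soup.find("div", {"id": "mw-pages"})
for link in pages_div.find_all("a"):
    print link.get("title")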
I know that every wiki page I'm cycling through has a page links section, so why am I getting this error when I try to iterate through them:
Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
The first part of the file I'm reading in looks like this:
1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines
4 Category:Activism
5 Category:Activists
6 Category:Actors
7 Category:Aerobics
8 Category:Aerospace_engineering
9 Category:Aesthetics
and it is stored in the port_ID dictionary like this:
{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture'...}
The desired output is:
parent_num, page_ID, page_num
I realize the code is a little hackish, but I'm just trying to get it working:
#!/usr/bin/env python
import os, re, nltk
from bs4 import BeautifulSoup
from urllib import urlopen

url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'
rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'
reg = re.compile('[\w]+:[\w]+')

# Build port_ID by walking the dump directory and numbering every Category:* file.
number = 1
port_ID = {}
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number] = file
            number += 1

test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/" + str(value)
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    # Grab the "Pages in category" section of the category page.
    pages = soup.find("div", {"id": "mw-pages"})
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    # Keep only the lines that are actual page titles.
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0], item, page_ID))
        page_ID += 1
    page_ID = 1
OK, so somewhere in that code, pages is being bound to None. You might want to double-check the way you're using soup.find() – nthall
Sorry, I'm a pretty new coder, but what does it mean in this case that pages is bound to None, and is there an easy way to fix it? – jdv12
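What the comments point at: BeautifulSoup's find() returns None when no matching tag exists, so any category page that has no div with id="mw-pages" (for instance, one that only lists subcategories) leaves pages bound to None, and the following get_text() call raises the AttributeError shown above. One possible guard, sketched against the loop in the question (skipping such pages is an assumption about the desired behaviour, not the only fix):

for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/" + str(value)
    soup = BeautifulSoup(urlopen(url).read())
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        # This page has no "Pages in category" section, so find() returned
        # None; skip it instead of calling get_text() on None.
        continue
    cleaned = pages.get_text().encode('utf-8')
    # ... rest of the loop unchanged ...

If pages without that section should still show up in the CSV, writing a placeholder row instead of continue would work as well.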