
I can't get this script to work to scrape information from a series of Wikipedia articles using Beautiful Soup.

What I want to do is iterate over a series of wiki URLs and pull the page links on Wikipedia portal-category pages (e.g. https://en.wikipedia.org/wiki/Category:Electronic_design).

I know that all of the wiki pages I'm going through have a page-links section. Why am I getting this error

Traceback (most recent call last): 
    File "./wiki_parent.py", line 37, in <module> 
    cleaned = pages.get_text() 
AttributeError: 'NoneType' object has no attribute 'get_text' 


when I try to iterate through them?

The first part of the file I'm reading in looks like this:

1 Category:Abrahamic_mythology 
2 Category:Abstraction 
3 Category:Academic_disciplines 
4 Category:Activism 
5 Category:Activists 
6 Category:Actors 
7 Category:Aerobics 
8 Category:Aerospace_engineering 
9 Category:Aesthetics 

and it is stored in the port_ID dictionary like this:

{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture'...}

The desired output is:

parent_num, page_ID, page_num 

I realize the code is a little hackish, but I'm just trying to get it working:

#!/usr/bin/env python
import os, re, nltk
from bs4 import BeautifulSoup
from urllib import urlopen

url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

reg = re.compile('[\w]+:[\w]+')
number = 1
port_ID = {}
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number] = file
            number += 1


test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():

    url = "https://en.wikipedia.org/wiki/" + str(value)
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    pages = soup.find("div", {"id": "mw-pages"})
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0], item, page_ID))
        page_ID += 1
    page_ID = 1

OK, so somewhere in that code, pages is getting bound to None –


You may want to double-check the way you're using soup.find() – nthall


Sorry, I'm a pretty new coder, but what does it mean in this case for pages to be bound to None, and is there a simple way to fix it? – jdv12

Answer


You are scraping several pages in a loop, but some of those pages may not have a <div id="mw-pages"> tag at all. find() returns None in that case, which is why you get the AttributeError at the line

cleaned = pages.get_text() 
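
For reference, this is easy to reproduce: find() returns None when nothing matches, and calling a method on None raises exactly that error. A minimal sketch, using a made-up HTML snippet rather than a real Wikipedia page:

from bs4 import BeautifulSoup

# A category page with no "Pages in category" section has no <div id="mw-pages">.
html = "<html><body><div id='mw-content-text'>no page links here</div></body></html>"
soup = BeautifulSoup(html)

pages = soup.find("div", {"id": "mw-pages"})
print pages          # None, because nothing matched
pages.get_text()     # AttributeError: 'NoneType' object has no attribute 'get_text'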

You can check for it with an if condition, like:

if pages: 
    # do stuff 
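
Applied to your loop, a rough sketch might look like the following. This is a sketch only: it assumes a small stand-in port_ID dict instead of the one you build from os.walk, and it writes key (the parent number of the current category) instead of test[0], which looks like what you intended; adjust if not.

from urllib import urlopen  # Python 2, as in your script
from bs4 import BeautifulSoup

port_ID = {1: 'Category:Furniture', 2: 'Category:Aesthetics'}  # stand-in for your real dict
test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/" + str(value)
    soup = BeautifulSoup(urlopen(url).read())

    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        continue  # this category has no "Pages in category" section, so skip it

    cleaned = pages.get_text().encode('utf-8')
    for page_ID, item in enumerate(cleaned.split('\n')[4:-2], start=1):
        test_file.write('%s %s %s\n' % (key, item, page_ID))

test_file.close()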

Or you can avoid it with a try-except block, like:

try: 
    cleaned = pages.get_text() 
    # do stuff 
except AttributeError as e: 
    # do something 
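
Either way, prefer catching AttributeError specifically rather than using a bare except, so that unrelated problems (for example a network error from urlopen) still surface instead of being silently swallowed. You could also log the category name in the except branch so you know which pages were skipped.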

Thank you very much! – jdv12