2015-05-01 63 views

I am currently working on a proof of concept (POC) using Python 3.4.3 and MongoDB. How can I read data from random URLs using Python and MongoDB?

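I need to search www.socialmention.com for arbitrary strings such as "finance" or "Apple quarterly results". The results will be multiple URLs, and they will be random. I then need to parse each link and read the article, comments, likes, user details, and so on.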

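So far I have managed to capture the random link URLs from socialmention. My idea is to create a blog dictionary in MongoDB and maintain information like the following: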

> db.blogs_dictionary.find().pretty() 
{ 
    "_id" : ObjectId("55401455a1ce265d58f21049"), 
    "blog_name" : "www.networkcomputing.com", 
    "article" : "yes", 
    "article_tag" : "div", 
    "article_tag_type" : "id", 
    "article_string" : "article-main", 
    "article_multipage" : "yes", 
    "article_multipage_tag" : "span", 
    "article_multipage_tag_type" : "class", 
    "article_multipage_tag_string" : "blue strong allcaps", 
    "article_multipage_query_variable" : "page_number", 
    "comments" : "yes", 
    "comments_multipage" : "no", 
    "comments_multipage_tag" : "", 
    "comments_multipage_tag_type" : "", 
    "comments_multipage_tag_string" : "", 
    "comments_threaded" : "yes", 
    "comments_threaded_query_variable" : "piddl_msgorder", 
    "comments_threaded_query_value" : "thrd#msgs", 
    "comments_main" : "yes", 
    "comments_main_tag" : "div", 
    "comments_main_tag_type" : "class", 
    "comments_main_tag_string" : "comments-main", 
    "user_name" : "yes", 
    "user_name_tag" : "span", 
    "user_name_tag_type" : "class", 
    "user_name_tag_string" : "smaller strong black", 
    "user_rank" : "yes", 
    "user_rank_tag" : "span", 
    "user_rank_tag_type" : "class", 
    "user_rank_tag_string" : "smaller black", 
    "comments_body" : "yes", 
    "comments_body_tag" : "div", 
    "comments_body_tag_type" : "class", 
    "comments_body_tag_string" : "comment-body" 
} 

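Then, in the Python code, I do something like this: if a link from socialmention is in my blog dictionary, check whether the article and comments exist; if they do, open the URL and read the required content. To make all of this work, though, I need to pass the tag and the search string dynamically.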

for i in db.social_mention.find({},{"blog_name":1,"_id":0}):
    for j in db.blogs_dictionary.find({},{"blog_name":1,"_id":0}):
        if i['blog_name']==j['blog_name']:
            link=db.social_mention.find_one({"blog_name":i['blog_name']},{"link":1,"_id":0})
            url=link['link']
            print (url)
            if (db.blogs_dictionary.find({"blog_name":j['blog_name']},{"article":1,"_id":0})) == "yes":
                article_variables=db.blogs_dictionary.find({"blog_name":j['blog_name']},{"article":1,"article_tag":1,"article_tag_type":1,"article_string":1,"article_multi":1,"article_multipage_tag":1,"article_multipage_tag_type":1,"article_multipage_tag_string":1,"article_multipage_query_variable":1,"_id":0}).pretty()
                soup = BeautifulSoup(urllib.request.urlopen(url))
                data=soup.find(article_variables['article_tag'],article_variables['article_tag_type']=article_variables['article_string'])
                print (data.text)

But I get an error like "keyword can't be an expression". Is there another way to do this, or should I change my design?
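The lookup described above can be sketched without a database. This is only a minimal illustration under stated assumptions: the two lists stand in for the `social_mention` and `blogs_dictionary` collections, the link URL is hypothetical, and `article_selectors` is a made-up helper name. It indexes the dictionary by `blog_name` instead of nesting two loops, and it reads the `article` flag from the document itself.

```python
# In-memory stand-ins for the two collections (assumed shapes, hypothetical link).
social_mention = [
    {"blog_name": "www.networkcomputing.com",
     "link": "http://www.networkcomputing.com/hypothetical-article"},
    {"blog_name": "unknown.example.com",
     "link": "http://unknown.example.com/post"},
]
blogs_dictionary = [
    {"blog_name": "www.networkcomputing.com", "article": "yes",
     "article_tag": "div", "article_tag_type": "id",
     "article_string": "article-main"},
]


def article_selectors(mentions, dictionary):
    """Yield (url, tag, attrs) for each mention whose blog has an article rule."""
    # Index the rules once so each mention is a single dict lookup.
    rules = {entry["blog_name"]: entry for entry in dictionary}
    for mention in mentions:
        rule = rules.get(mention["blog_name"])
        # Skip blogs that are unknown or have no article to read.
        if rule is None or rule.get("article") != "yes":
            continue
        # Build the attribute filter from the stored tag type and string.
        attrs = {rule["article_tag_type"]: rule["article_string"]}
        yield mention["link"], rule["article_tag"], attrs


selectors = list(article_selectors(social_mention, blogs_dictionary))
print(selectors)
```

Each resulting tuple holds a tag name and an attribute dictionary built from the stored values. With pymongo, the same idea would likely use `find_one()` for the rule lookup, since `find()` returns a cursor rather than a document, so comparing its result with "yes" is never true.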


What is the exact error? – skyline75489

Answer


I think you want to call find() with a dictionary of attributes, attrs:

data = soup.find(article_variables['article_tag'], 
       attrs={article_variables['article_tag_type']: article_variables['article_string']}) 

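The reason: the name of a keyword argument must be a plain identifier, so you cannot use an expression that evaluates to a string in its place, as in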

article_variables['article_tag_type']=article_variables['article_string'] 

article_variables['article_tag_type'] is not a valid identifier for a keyword argument. The general workaround is to build a dictionary and unpack it, like this:

kwargs = {article_variables['article_tag_type']: article_variables['article_string']} 
data=soup.find(article_variables['article_tag'], **kwargs) 

However, because find() accepts an attrs dictionary, you can pass the dictionary directly.
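The difference between the three call styles can be checked without BeautifulSoup. The sketch below is only an illustration: the `find` function here is a hypothetical stand-in that merges `attrs` and keyword filters, not the library's implementation, and the `article_variables` values are copied from the blog dictionary entry in the question.

```python
def find(name, attrs=None, **kwargs):
    """Hypothetical stand-in for a find()-style API; returns the merged filters."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters


article_variables = {
    "article_tag": "div",
    "article_tag_type": "id",
    "article_string": "article-main",
}

# A subscript expression on the left of '=' inside a call is rejected
# when the code is compiled, before anything runs.
bad_call = "find('div', article_variables['article_tag_type']='article-main')"
try:
    compile(bad_call, "<example>", "eval")
    compiled = True
except SyntaxError:
    compiled = False

# Workaround 1: build the keyword dictionary at run time and unpack it.
kwargs = {article_variables["article_tag_type"]: article_variables["article_string"]}
via_kwargs = find(article_variables["article_tag"], **kwargs)

# Workaround 2: pass the same dictionary through the attrs parameter.
via_attrs = find(article_variables["article_tag"], attrs=kwargs)

print(compiled, via_kwargs, via_attrs)
```

Both workarounds give the stand-in the same filters. Passing attrs is arguably the safer choice for stored values, because the key is never interpreted as a parameter name.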


Works perfectly.. thank you very much :) –
