2014-10-17 110 views
0

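Python BeautifulSoup: parsing <p></p> elements

I am learning how to parse HTML with BeautifulSoup. Can someone explain how to extract the <p></p> elements inside div class="article-content"? After the script runs, I want to see only the article content. Let me show what I want: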

[Screenshot not available]

I can reach the <p></p> elements inside div class="article-content", but I cannot isolate just the information I need. My code looks like this:

import urllib2 
from bs4 import BeautifulSoup 

html = urllib2.urlopen('http://www.engadget.com/2014/10/17/local-multiplayer-is-coming-to-android-games/') 
parsed_html = BeautifulSoup(html) 
print parsed_html.body.find('div', attrs={'class':'article-content'}).text 

But the output also contains a lot of junk:

$ python engadget_parser.py 


Ever wish that you could just whip out your Android device and harass a passer-by to play games with you? It's the sort of thing that Nintendo DS users, for example, have been using thanks to that company's StreetPass feature, but, until now, hasn't been available on Google's smartphones. Now, however, the company has an added an update to its games infrastructure that enables "ambient, real-time" games with more than one user - so long that the game relies upon Google's home-grown multiplayer backend. Still, maybe don't sprint into the street and start challenging people to a dual, because they might get the wrong idea. 





     onBreak({ 
      0: function(){ 
       (function() { 
         var a = { 
           mobilePlacementID: "348-14-15-135b", 
         width: "320", 
         height: "115" 
         }; 
        madserver.requestAd(a); 
       })(); 
      }, 
      768: function(){} 
     }); 






Source: Android Developers (G+) 



Tags: android, AndroidGames, gaming, google, googleplaygames, mobile, mobilepostcross 





 Hide Comments 
0Comments 










      _when_.eng("eng.livefyre.init", { 
       articleId: 20979699 , 
       domain: "engadget.fyre.co" , 
       siteId: "296092" , 
       el: "livefyre_20979699", 
       initialNumVisible: 2 
      }) 



_when_.eng("eng.perm.init"); 



lab.scriptBs('gravity.js') 




onBreak({ 
    0: function(){}, 
    320: function(){}, 
    768: function(){} 
}); 

Thanks!

Answers

1

I like BeautifulSoup's select method for cases like this. Replace this:

print parsed_html.body.find('div', attrs={'class':'article-content'}).text 

with this:

for p in parsed_html.select('div.article-content p'): 
    print p.text 
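
As a rough illustration of why the selector helps, here is a minimal, self-contained sketch. It is written for Python 3 (the code in this thread is Python 2) and runs against an inline HTML string rather than the live page. The `SAMPLE_HTML` markup and the `extract_paragraphs` helper are assumptions made up for the example; the real Engadget page is certainly more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div class="article-content">
  <p>First paragraph of the article.</p>
  <script>madserver.requestAd({width: "320"});</script>
  <p>Second paragraph of the article.</p>
  <div class="tags">Tags: android, gaming</div>
</div>
</body></html>
"""


def extract_paragraphs(html):
    """Return the stripped text of every <p> inside div.article-content."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector matches only <p> descendants of the article div.
    return [p.get_text(strip=True) for p in soup.select("div.article-content p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
for text in paragraphs:
    print(text)
```

Because the selector matches only `<p>` descendants, the `<script>` block and the tags `<div>` in the sample never reach the output.
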
1

Maybe this is very bad code, but I will show it anyway. Please go easy on me, I am just a beginner in Python:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.engadget.com/2014/10/17/castar-augmented-reality/"

html = urllib2.urlopen(url)
parsed_html = BeautifulSoup(html)


def news_parser(parsed_html):
    # Collect the text of every <p> inside div.article-content
    paragraphs = []
    for p in parsed_html.select('div.article-content p'):
        paragraphs.append(p.text)
    return paragraphs


def longest_text_position(paragraphs):
    # Sometimes the article is not at paragraphs[1], so search for the longest element
    a = 0
    longest_text = ""

    for item in paragraphs:
        x = len(item)
        if x > a:
            a = x
            longest_text = item

    position = paragraphs.index(longest_text)
    return position


def print_news(paragraphs, position):
    print "-" * 80
    print parsed_html.title.string
    print "-" * 80
    print paragraphs[position]
    print "-" * 80
    print " "

paragraphs = news_parser(parsed_html)
position = longest_text_position(paragraphs)
print_news(paragraphs, position)

The result is:

$ python engadget_parser_new.py 
-------------------------------------------------------------------------------- 
castAR bets big on its augmented reality hardware with move to Silicon Valley 
-------------------------------------------------------------------------------- 
And they certainly were. From just a brief hands-on with the new hardware, I could tell the make out ....ating that I could look around objects by just walking around the table. Henkel-Wallace mentioned a potential for a holodeck application by blanketing a room with that retroreflective material, and I could certainly see a use case for that. 
-------------------------------------------------------------------------------- 
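
The hand-written search loop in `longest_text_position` can also be expressed with the built-in `max`. The following is only a sketch of that alternative, in Python 3, not what the answer above ran. The empty-list check returning `None` is an added assumption, since the original would raise `ValueError` on an empty list, and the sample strings are invented.

```python
def longest_text_position(paragraphs):
    """Return the index of the longest string, or None for an empty list."""
    if not paragraphs:
        return None
    # max() returns the first index whose string has the greatest length,
    # which matches the original loop's strict `x > a` comparison.
    return max(range(len(paragraphs)), key=lambda i: len(paragraphs[i]))


# Invented sample data standing in for the scraped paragraphs.
sample = ["Source:", "A long paragraph holding the article body.", "Tags: a, b"]
position = longest_text_position(sample)
print(sample[position])
```

Picking the longest paragraph is a heuristic: it works when the article body is one large `<p>`, but it would drop text if the article were split across several paragraphs.
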

Thanks to you, @Vincent Beltman.