Extracting numbers from a website using BeautifulSoup in Python

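I am trying to use urllib to fetch an HTML page and then BeautifulSoup to extract the data. I want to get all the numbers from comments_42.html, print their sum, and then show how many numbers there are. Here is my code. I tried to use a regular expression, but it is not working for me.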

import urllib 
from bs4 import BeautifulSoup 
url = 'http://python-data.dr-chuck.net/comments_42.html' 
html = urllib.urlopen(url).read() 
soup = BeautifulSoup(html,"html.parser") 
tags = soup('span') 
for tag in tags: 
    print tag

2015-12-13 Salosha

1. As far as I can see, you are not using a regular expression; 2. What does "not working" mean? – jonrsharpe

I mean that I got stuck when using regular expressions, which is probably down to my poor programming skills. – Salosha

So? This is not a tutorial service. *Give it a try.* – jonrsharpe

Use BeautifulSoup's findAll() method to extract all the span tags with the class 'comments', since they contain the information you need. You can then do whatever you need with them.

soup = BeautifulSoup(html,"html.parser") 
data = soup.findAll("span", { "class":"comments" }) 
numbers = [d.text for d in data]

Here is the output:

[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72', u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
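
The question also asked for the sum and the count, which this answer stops short of. A minimal sketch of that last step, assuming numbers holds strings like the ones in the output above (only the first three values are used here to keep it short):

```python
# First three values from the output above; the full list works the same way.
numbers = ['100', '97', '87']

# The scraped text is a string, so convert each item before adding.
values = [int(n) for n in numbers]
total = sum(values)
count = len(values)

print('count:', count)  # count: 3
print('sum:', total)    # sum: 284
```

Using the built-in sum() and len() also avoids naming a variable sum, which would hide the built-in.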

2015-12-13 09:14:26 Learner

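Thanks, this works fine for me. Is there a way to get rid of the "u" prefix? Sorry for replying so late; I need a VPN to get through the GFW and reach the site, which is why I could not reply sooner. – Salosha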

Use numbers = [d.text.encode('utf-8') for d in data] – Learner
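
It may help to know that the u is not part of the data: Python 2 adds it when it displays a unicode string inside a list. A small sketch, using three sample values rather than the scraped list, of showing the values without it:

```python
# Sample values; the u'' literal form is also accepted by Python 3.
numbers = [u'100', u'97', u'87']

# Joining the items shows the text itself, with no prefix or quotes.
joined = ', '.join(numbers)
print(joined)  # 100, 97, 87

# Printing items one at a time has the same effect.
for n in numbers:
    print(n)
```

If the values are going to be added up anyway, converting them with int() makes the prefix irrelevant.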

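@Learner's solution is completely correct! But if you want to go further and get both the names and the comment counts, you can do this, which returns a list of names and counts: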

from BeautifulSoup import BeautifulSoup 
import re 
import urllib 
url = 'http://python-data.dr-chuck.net/comments_42.html' 
html = urllib.urlopen(url).read() 
soup = BeautifulSoup(html) 
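# With text= set, BeautifulSoup 3's findAll returns matching text nodes and, as far as I know, ignores the tag name and attrs.
# This pattern also matches the empty string, so the names come back along with the numbers.
texts = soup.findAll('span', {'class': 'comments'}, text=re.compile(r'[0-9]{0,4}'))
cleaned = filter(lambda x: x != u'\n', texts)[4:]  # drop newline-only nodes and the first four entries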
In [18]: cleaned 
Out[18]: 
[u'Leven', 
u'100', 
u'Mahdiya', 
u'97', 
u'Ajayraj', 
u'87', 
u'Lillian', 
u'86', 
u'Aon', 
u'86', 
u'Ruaraidh', 
u'78', 
u'Gursees', 
u'75', 
u'Emmanuel', 
u'74', 
u'Christy', 
u'72', 
u'Annoushka', 
u'72', 
u'Inara', 
u'72', 
u'Caite', 
u'70', 
u'Rosangel', 
u'70', 
u'Iana', 
u'66', 
u'Anise', 
u'66', 
u'Jaosha', 
u'65', 
u'Cadyn', 
u'65', 
u'Edward', 
u'63', 
u'Charlotte', 
u'61', 
u'Sammy', 
u'60', 
u'Zarran', 
u'60',.....] #
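
Why do names show up even though the pattern looks numeric? [0-9]{0,4} allows zero digits, so it matches the empty string and therefore succeeds on any text. A sketch with plain re on a few values from the output above; the stricter pattern is a suggestion, not part of the original answer:

```python
import re

loose = re.compile(r'[0-9]{0,4}')  # pattern from the answer: zero digits is allowed
strict = re.compile(r'^[0-9]+$')   # assumed alternative: the whole string must be digits

samples = ['Leven', '100', 'Mahdiya', '97']

# search() succeeds whenever the pattern matches anywhere, even an empty match.
loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['Leven', '100', 'Mahdiya', '97']
print(strict_hits)  # ['100', '97']
```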

2015-12-13 09:35:03

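Great! You used a regular expression, which is exactly what I wanted, but how can I get rid of the "u" in the list? Sorry for replying so late; there are two internets in the world, China's and everyone else's, and it is hard for me to check the answers through a VPN. – Salosha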

@Saikorin: That is because it is a unicode string! You can use the encode() method to convert it to a regular string. For example, if ustr = u'str' is unicode, then str = ustr.encode() is a regular string. –

I see, but I am still a little confused about unicode output in Python, so I will look into it. Thank you and Learner, you have cleared up 100% of my confusion! – Salosha

Don't forget that you have to import the regular expression module if you want to use it in your code.

import re

2015-12-21 02:35:48 cybernerd

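I am taking the same course on Coursera. Rather than going for the solutions above, would you mind trying this one? Unlike those, I think it stays within the scope of what we have learned so far. It definitely worked for me.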

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://python-data.dr-chuck.net/comments_216543.html'
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
total = 0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Convert the tag to a string and pull out every run of digits
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        total = total + int(i)
print total
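
One thing to watch with this approach: str(tag) includes the tag's attributes, so any digits there get counted too. The sketch below shows the difference on made-up markup. The data-row attribute is invented for illustration and is not claimed to be on the real page, and the narrower pattern is just one possible way around it.

```python
import re

# Hypothetical markup: data-row is invented here to show the pitfall.
sample = ('<span class="comments" data-row="1">97</span>'
          '<span class="comments" data-row="2">87</span>')

# Naive: every run of digits in the tag string, attributes included.
naive = re.findall(r'[0-9]+', sample)

# Narrower: only digits that make up the whole text of a span.
inner = re.findall(r'<span[^>]*>\s*([0-9]+)\s*</span>', sample)

naive_total = sum(int(n) for n in naive)
inner_total = sum(int(n) for n in inner)

print(naive, naive_total)  # ['1', '97', '2', '87'] 187
print(inner, inner_total)  # ['97', '87'] 184
```

When BeautifulSoup is already in use, running the regular expression over the tag's text rather than str(tag) sidesteps the same problem.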

2016-01-14 11:52:50 Tuhin

The basic way to do it...

# Retrieve all of the span tags (soup is built as in the question)
tags = soup('span')
total = 0
count = 0
for tag in tags:
    # The first child of each span is its text, e.g. u'97'
    num = float(tag.contents[0])
    total = total + num
    count = count + 1

print 'count:', count
print 'sum:', total
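
Note that float(tag.contents[0]) raises an error if a span is empty or holds something that is not a number. A defensive sketch, assuming the span texts have already been collected into a list of strings; to_number and the sample cells are hypothetical names and values made up for illustration:

```python
def to_number(text):
    """Return text as a float, or None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


# Made-up cell texts: two numbers, one empty string, one non-numeric value.
cells = ['97', ' 87 ', '', 'n/a']

# Keep only the cells that converted cleanly.
values = [v for v in (to_number(c) for c in cells) if v is not None]

print('count:', len(values))  # count: 2
print('sum:', sum(values))    # sum: 184.0
```

Because float is used, the sum prints as 184.0. Swapping in int would give 184 if the page only ever holds whole numbers.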

2016-01-20 05:36:55 JPAbucay

I did it this way for the course, and it gave me all the right answers. Hope it helps ;)

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import ssl 

# Ignore SSL certificate errors 
ctx = ssl.create_default_context() 
ctx.check_hostname = False 
ctx.verify_mode = ssl.CERT_NONE 

url = input('Enter - ') 
html = urlopen(url, context=ctx).read() 
soup = BeautifulSoup(html,"html.parser") 

# Retrieve all of the span tags
tags = soup('span')
total = 0
count = 0
for tag in tags:
    # The first child of each span is its text, e.g. '97'
    num = float(tag.contents[0])
    total = total + num
    count = count + 1

print('count:', count)
print('sum:', total)

2017-07-27 23:52:43 Anna

import re
import urllib.request

from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

tags = soup('span')
total = 0
for tag in tags:
    # re.findall needs a string, so search the tag's text rather than the Tag object
    for i in re.findall("[0-9]+", tag.text):
        # Add the integer value, not the matched string
        total = total + int(i)

print(total)

2017-09-20 23:18:42

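Welcome to Stack Overflow. Please edit your answer so that the code is formatted, and add an explanation of your code and of why the OP should use it, or why it is a better solution than the accepted answer. – Syfer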
