How do I get the correct output when using re.findall() in Python?

I wrote a Python script to fetch the latest stock price from Yahoo Finance.

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")

htmltext = htmlfile.read()

price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>', htmltext)
print(price)

It runs fine, but when I print the price it comes out like this:
[b'1,217.04']

This may be a trivial question, but I'm new to Python scripting, so please help if you can.

I want to get rid of the 'b'. If I remove the 'b' from b'<span id="yfs_l84_goog">', this error is shown:

File "C:\Python33\lib\re.py", line 201, in findall 
return _compile(pattern, flags).findall(string) 
TypeError: can't use a string pattern on a bytes-like object

The output I want is just

1,217.04


2014-02-25 Saurabh Rana

b'' is the syntax for a bytes literal in Python. It is how you define a sequence of bytes in Python source code.
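To make the distinction concrete, here is a minimal sketch comparing a bytes literal with a str literal. The variable names `raw` and `text` are only for illustration and do not come from the question.

```python
# A bytes literal and a str literal are different types,
# even when they hold the same ASCII characters.
raw = b'1,217.04'    # bytes: a sequence of integers in range(256)
text = '1,217.04'    # str: a sequence of Unicode code points

print(type(raw).__name__, type(text).__name__)  # bytes str
print(repr(raw))      # b'1,217.04' (the b prefix belongs to the repr only)
print(raw[0])         # 49, because indexing bytes yields an int
print(raw.decode() == text)  # True
```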

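What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You can decode it into a string and print it: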

>>> for item in price: 
...  print(item.decode()) # assume utf-8 
... 
1,217.04
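The TypeError from the question should also go away if you decode the whole page once and then search it with a str pattern, because the pattern type and the searched text then match. A sketch, assuming the page is utf-8; the `htmltext` value here is a made-up stand-in for `htmlfile.read()`, not the real page:

```python
import re

# Stand-in for htmlfile.read(); the real page content is not reproduced here.
htmltext = b'<div><span id="yfs_l84_goog">1,217.04</span></div>'

# Decode once (utf-8 is an assumption), then use a str pattern without the b prefix.
html = htmltext.decode('utf-8')
price = re.findall(r'<span id="yfs_l84_goog">(.+?)</span>', html)
print(price[0])  # 1,217.04
```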

You could also write the bytes directly to stdout, e.g., sys.stdout.buffer.write(price[0]).
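A small sketch of that approach, assuming stdout is a regular text stream that exposes a binary `buffer` (some replaced or captured streams do not). The `price` list is hard-coded to mirror the output in the question:

```python
import sys

price = [b'1,217.04']  # same shape as the list re.findall() returned in the question

# sys.stdout.buffer is the underlying binary stream; it accepts bytes, not str.
written = sys.stdout.buffer.write(price[0] + b'\n')
sys.stdout.buffer.flush()
```

If you mix print() with writes to the buffer, call sys.stdout.flush() before the binary write so the output stays in order.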

You could use an HTML parser instead of a regex to parse HTML:

#!/usr/bin/env python3
from html.parser import HTMLParser
from urllib.request import urlopen

url = 'http://finance.yahoo.com/q?s=GOOG'

def is_price_tag(tag, attrs):
    """Return True for the <span> element that holds the price."""
    return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'

class Parser(HTMLParser):
    """Extract tag's text content from html."""
    def __init__(self, html, starttag_callback):
        HTMLParser.__init__(self)
        self.contents = []
        self.intag = None
        self.starttag_callback = starttag_callback
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        # remember whether the tag that just opened is the one we want
        self.intag = self.starttag_callback(tag, attrs)

    def handle_endtag(self, tag):
        self.intag = False

    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)

# download and convert to Unicode
response = urlopen(url)
# the original used cgi.parse_header(); the cgi module was removed in Python 3.13
# falling back to utf-8 is an assumption for when the header names no charset
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)

# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)
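To try the parser idea without network access, here is a reduced sketch that runs on an inline HTML snippet. `SpanText` and `sample` are hypothetical names made up for this illustration, and the snippet is not real Yahoo markup:

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect the text inside <span id=target_id> (hypothetical helper)."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.intag = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # True only while the most recently opened tag is the target span
        self.intag = tag == 'span' and dict(attrs).get('id') == self.target_id

    def handle_endtag(self, tag):
        self.intag = False

    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)

# Made-up sample markup, used only to exercise the parser.
sample = '<p>GOOG <span id="yfs_l84_goog">1,217.04</span> <span id="other">x</span></p>'
parser = SpanText('yfs_l84_goog')
parser.feed(sample)
print(parser.contents)  # ['1,217.04']
```

Note that this simple flag is reset by any nested tag, so text after a child element inside the span would be missed.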

Check whether Yahoo provides an API that doesn't require web scraping.


2014-02-25 16:33:44 jfs

It works nicely with item.decode(). Does that convert the bytes to a string? –

@SaurabhRana: Yes. `some_bytes.decode(character_encoding) == some_unicode_text` – jfs
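A minimal round-trip sketch of that relationship. The sample text and the choice of utf-8 and latin-1 are assumptions picked for illustration:

```python
some_unicode_text = 'price: 1,217.04 \u20ac'       # ends with a euro sign
some_bytes = some_unicode_text.encode('utf-8')     # str -> bytes

print(some_bytes)  # b'price: 1,217.04 \xe2\x82\xac'
print(some_bytes.decode('utf-8') == some_unicode_text)  # True

# Decoding with the wrong encoding does not always raise an error;
# it can silently produce different text.
wrong = some_bytes.decode('latin-1')
print(wrong == some_unicode_text)  # False
```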

OK, after searching for a while I found a solution. It works fine for me.

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")

htmltext = htmlfile.read()

pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>')

price = pattern.findall(str(htmltext))
print(price)


2014-02-25 15:52:49

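Don't call str(some_bytes_object_that_might_be_text) unconditionally. If the page is not encoded in `sys.getdefaultencoding()`, then `str()` may produce wrong results (sometimes silently). You could use the `Content-Type` header to get the character encoding (http://stackoverflow.com/a/22020377/4279) – jfs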
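A sketch of the difference, for Python 3, where str() on a bytes object returns its repr rather than decoded text. The `body` and `headers` values are made-up stand-ins for `response.read()` and `response.headers`, and the utf-8 fallback is an assumption:

```python
from email.message import Message

# Stand-in for response.read(): the text 'café' encoded as ISO-8859-1.
body = 'caf\u00e9'.encode('iso-8859-1')

# str() gives the repr; the non-ASCII byte shows up as an escape sequence.
print(str(body))  # b'caf\xe9'

# Stand-in for response.headers (urllib's header object subclasses Message).
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

# Read the charset from the header; fall back to utf-8 (an assumption) if absent.
charset = headers.get_content_charset() or 'utf-8'
text = body.decode(charset)
print(charset)      # iso-8859-1
print(ascii(text))  # 'caf\xe9', i.e. a real 4-character str
```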
