HTML parsing: from source code to text in Python

-1

I have already read this question (Python HTML parsing from url), but I still don't understand it. Here is the code:

import urllib.request
from html.parser import HTMLParser


# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:" + tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :" + tag)

    def handle_data(self, data):
        print("Encountered some data :" + data)


parser = MyHTMLParser()
info = "http://www.calendario-365.it/js/365.php?page=moon"
response = urllib.request.urlopen(info)
content = response.read()
# decode the bytes (UTF-8 assumed); str(content) would feed the b'...' repr instead
parser.feed(content.decode("utf-8", errors="replace"))

Running this code against my site gives me: http://pastebin.com/m4YV38uM. I want to save these values into variables:

10,6 giorni

82%

How can I do that? Thanks for your answers. Python version: 3.5.


2016-04-17 MarcoBuster

-1

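OK, if you're looking for a simple solution, you could run a regular expression over the result, or use one to limit your output in the first place. I can't really tell you how this data will be output, but you might want to try these patterns: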

"\d?\d,\d giorni" 
"\d?\d%"

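The first should find any occurrence of one or two digits followed by a comma, another digit and " giorni"; the second, one or two digits followed by "%". You can also use the "+" or "*" operators, depending on how variable the input is (for example, "\d+%" would also cover "100%").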
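A minimal sketch of how patterns like these could be applied with Python's `re` module to get the two values into variables. The sample text is made up for illustration, since the real page content is not shown in this thread, and the percentage pattern is widened to `\d{1,3}` (an adjustment, not part of the answer above) so that "100%" is matched in full. The function name `extract_moon_values` is hypothetical.

```python
import re

# Hypothetical sample text standing in for the page content;
# the real markup is not shown in the thread.
sample = "Età della Luna: 10,6 giorni Percentuale visibile: 82%"

# One or two digits, a comma, one digit, then " giorni".
AGE_PATTERN = re.compile(r"\d{1,2},\d giorni")
# One to three digits followed by "%", so that "100%" also matches.
PERCENT_PATTERN = re.compile(r"\d{1,3}%")


def extract_moon_values(text):
    """Return (age, percent) as strings, or None where nothing matches."""
    age = AGE_PATTERN.search(text)
    percent = PERCENT_PATTERN.search(text)
    return (age.group(0) if age else None,
            percent.group(0) if percent else None)


moon_age, moon_percent = extract_moon_values(sample)
print(moon_age)      # 10,6 giorni
print(moon_percent)  # 82%
```

With the real page you would pass the decoded response text instead of `sample`. Because `search` returns the first match only, this relies on no earlier text in the page happening to fit the same patterns.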


2016-04-17 14:47:55 patrick

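This is not an answer... If you intend to get more information from the OP, post it as a comment; and if you can't comment, please don't post it as an answer. –

These two values can change every day, for example to "54%" or "100%" instead of "82%". Maybe two (or three) digits before the "%"? Or maybe three digits before "giorni"? I don't know if there is a real standard :) – MarcoBuster

@Iron Fist: My bad, I'm new here. This wasn't meant to be an answer. It seems it won't let me comment because of my low reputation, though... – patrick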

-1

Try this:

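# -*- coding: utf-8 -*-
import urllib.request  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

page = urllib.request.urlopen('http://www.calendario-365.it/js/365.php?page=moon').read()

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page, 'html.parser')

# find the label text, then take the text of the next <div> after it
print(soup.find(string='Età della Luna:').find_next('div').text)

print(soup.find(string='Percentuale visibile:').find_next('div').text)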

Output:

10,6 giorni 
82%


2016-04-17 14:53:58

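Syntax error: print *soup*.find(text='Età della Luna:').findNext('div').text – MarcoBuster

Maybe you are using Python 3; see the updated answer. Also make sure you have bs4 installed. –

Can you give me a download link for BeautifulSoup? I don't have these modules. – MarcoBuster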

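While regular expressions are concise, and parsers like lxml and BeautifulSoup are convenient, for this particular problem I wouldn't mind using HTMLParser. Even if you don't end up using it, here is the procedure. There is a little subtlety to using it. If the target element looks like the following (you haven't shown an actual element, so I'm assuming one)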

<div id="x" class="y"> 82% </div>

then the implementation would be as follows

from html.parser import HTMLParser


class My(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise HTMLParser's own state first
        self.percent = 0
        self.flag = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # {"id": someid, "class": someclass}
        if attrs.get("id") == "x":  # or attrs.get("class") == "y"
            self.flag = True  # we entered the target

    def handle_data(self, data):
        if self.flag:  # we are inside our target
            self.percent = data  # do str -> int conversion

    def handle_endtag(self, tag):
        if self.flag:  # reached the end of target
            self.flag = False

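For each value you want to capture:

add an instance attribute (like self.percent)
add a flag (like self.flag)
implement the corresponding logic in all three methods: recognise entry, extract the data, and recognise exit.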
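The steps above could be sketched for both values at once roughly as follows. This is only an illustration under assumptions: the markup and the ids "age" and "percent" are invented, because the real page's HTML is not shown in this thread, and the class name `MoonParser` is hypothetical. Instead of one attribute and one flag per value, the sketch keeps a single dict of results plus a marker for the element currently being read, which amounts to the same entry / extract / exit logic.

```python
from html.parser import HTMLParser

# Hypothetical markup: the ids "age" and "percent" are assumptions,
# since the real page's HTML is not shown in the thread.
SAMPLE_HTML = """
<div>Età della Luna:</div><div id="age"> 10,6 giorni </div>
<div>Percentuale visibile:</div><div id="percent"> 82% </div>
"""


class MoonParser(HTMLParser):
    """Capture the text of elements whose id is listed in TARGET_IDS."""

    TARGET_IDS = ("age", "percent")

    def __init__(self):
        super().__init__()
        self.values = {}     # captured text, keyed by element id
        self.current = None  # id of the target we are inside, if any

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.TARGET_IDS:
            self.current = element_id      # recognise entry
            self.values[element_id] = ""

    def handle_data(self, data):
        if self.current is not None:
            # extract the data; append in case the text arrives in pieces
            self.values[self.current] += data

    def handle_endtag(self, tag):
        if self.current is not None:
            # recognise exit (simplification: assumes no nested tags in the target)
            self.values[self.current] = self.values[self.current].strip()
            self.current = None


parser = MoonParser()
parser.feed(SAMPLE_HTML)
moon_age = parser.values["age"]
moon_percent = parser.values["percent"]
print(moon_age)      # 10,6 giorni
print(moon_percent)  # 82%
```

For the real page you would feed the decoded response text and replace the id check with whatever attribute actually identifies the target elements.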


2016-04-17 16:54:10
