Parsing HTML and writing to a file

I have a question about parsing HTML tags with Python. My code looks like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from lxml import html
import requests
import urllib2
import sys
import re
import time
import urllib
import datetime


def get_web():
    try:
        input_sat = open('rtc.xml', 'w')
        godina = datetime.date.today().strftime("%Y")
        print godina
        mjesec = datetime.date.today().strftime("%m")
        print mjesec
        for x in range(32):
            if x < 1:
                x = x + 1
                var = x

                url = 'http://www.rts.rs/page/tv/sr/broadcast/20/RTS+1.html?month={}&year={}&day={}&type=0'.format(mjesec, godina, var)

                page = requests.get(url)
                tree = html.fromstring(page.text)
                a = tree.xpath('//div[@id="center"]/h1/text()')  # datum
                b = tree.xpath('//div[@class="ProgramTime"]/text()')  # time
                c = tree.xpath('//div[@class="ProgramName"]/text()')
                e = tree.xpath('//div[@class="ProgramName"]/a[@class="recnik"]/text()')

                for line in zip(a, b, c, e):
                    var = line[0]
                    print >> input_sat, line+'\n'
    except:
        pass


get_web()

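The script runs and fetches the tags from the URL, but how can I write them to the file I opened? When I run my code with the for loop, it does not work. I don't know where the problem is.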

I have rewritten my code, and it still does not write the page content to the file.

2014-01-08 Fox_01

Is this your entire code? I tried running it and got "NameError: global name 'logging' is not defined". – Kevin

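Take a look at Python file I/O, for example http://www.tutorialspoint.com/python/python_files_io.htm (the second hit on Google)... you just need to open the file, write what you want, and then close it. –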

I have now rewritten my code. The problem is in the FOR loop over the sercont tags from the URL: it does not write to the file. –

As far as I understand, your print() call is not correct. You have to use the write() function of the file handle, and also encode the text as UTF-8:

for line in zip(a,b,c,e): 
    var = line[0] 
    input_sat.write(line[0].encode('utf-8') + '\n')

It yields:

Programska šema - sreda, 01. jan 2014
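The write() approach above can also be sketched as a small self-contained helper. This is an illustrative sketch and not the answerer's code: it assumes Python 3 (the thread itself uses Python 2), the function name write_lines and the sample strings are hypothetical, and it passes an explicit encoding to open() so the text is encoded on write instead of calling .encode('utf-8') on every line.

```python
def write_lines(path, lines):
    """Write each text item in `lines` to `path`, one per line, as UTF-8."""
    # The `with` block closes the file even if a write fails.
    with open(path, 'w', encoding='utf-8') as handle:
        for line in lines:
            handle.write(line + '\n')


if __name__ == '__main__':
    # Hypothetical sample data standing in for the scraped heading.
    write_lines('rtc.xml', ['Programska šema - sreda, 01. jan 2014'])
```

Opening the file with an explicit encoding keeps non-ASCII characters such as "š" intact regardless of the platform default.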

2014-01-08 15:20:59 Birei

Yes, but there are three more tags to parse, and this only outputs one pass of the loop. –

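@Fox_01: Inside the first loop you have the instruction 'if x < 1', which allows only one iteration. In the second loop you use the 'zip()' function, which iterates as many times as the shortest of the four lists is long; since the 'a' variable has only one element, that loop runs only once. Take a look at that, because it has nothing to do with how to write to a file. – Birei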
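The zip() behaviour described in this comment can be reproduced with a short sketch. The sample lists below are made-up stand-ins for the XPath results (a with one heading, b and c with several entries), and the helper name pair_with_date is hypothetical. Repeating the single date for every row is one possible fix under those assumptions, not something proposed in the thread.

```python
def pair_with_date(dates, times, names):
    """Return (date, time, name) rows, reusing the single date for each row."""
    date = dates[0] if dates else ''
    return [(date, t, n) for t, n in zip(times, names)]


# Hypothetical stand-ins for the XPath results in the question.
a = ['Programska šema - sreda, 01. jan 2014']  # one heading
b = ['06:00', '08:00', '09:05']                # several times
c = ['Show A', 'Show B', 'Show C']             # several names

# zip() stops at the shortest input, so only one row comes out.
truncated = list(zip(a, b, c))

# Zipping only the equally long lists keeps every programme entry.
rows = pair_with_date(a, b, c)
```

With this sample data, truncated holds a single tuple while rows holds one tuple per programme entry.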

I tried this –
