2013-03-02 90 views
0

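Capturing a block of data after reading a URL

I have the expected output below. I am trying to read a URL and can read it successfully. However, when I try to capture the block of data under "Combo", I run into an error. Any input on how to resolve this?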

# Version YYYYMMDD 
version = "20121112" 

# File type to be output to logs 
# Should be changed to exe before building the exe. 
fileType = "py" 

# Import sys to read command line arguments 
import sys, getopt 
#import pdb 
#pdb.set_trace() 

import argparse 
import urllib 
import urllib2 
import getpass 
import re 

def update (url): 
    print url 

    authhost = 'https://login.company.com' 
    # Siteminder test server 
    user = getpass.getuser() 
    password = getpass.getpass() 
    realm = None 

    # handle the authentication and cookies 
    cookiehand = urllib2.HTTPCookieProcessor() 
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() 
    password_mgr.add_password(user=user, 
           passwd=password, 
           uri=authhost, 
           realm=realm) 
    auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr) 
    opener = urllib2.build_opener(auth_handler, cookiehand) 
    urllib2.install_opener(opener) 
    #make the request 
    req = urllib2.Request(url=url) 
    try: 
        f = urllib2.urlopen(req) 
        txt = f.read() 
        f.close() 
    except urllib2.HTTPError, e: 
        txt = '' 
        print 'An error occurred connecting to the wiki. No wiki page will be generated.' 
        return '<font color=\"red\">QWiki</font>' 
    # Find the start tag of the textarea with Regular Expressions 
    print txt 
    p = re.compile('<Combo[^>]*>') 
    m = p.search(txt) 
    (tagStart, tagEnd) = m.span() 
    # Find the end of the textarea 
    endTag = txt.index("</textarea>") 

def main(): 
    #For logging 
    print "test" 
    parser = argparse.ArgumentParser(description='This is the update.py script created by test') 
    parser.add_argument('-u','--url',action='store',dest='url',default=None,help='<Required> url link',required=True) 
    results = parser.parse_args()# collect cmd line args 
    url = results.url 
    #print url 
    update(url) 
if __name__ == '__main__': 
    main() 

Current output:

C:\Dropbox\scripts>python announce_update.py --u "http://qwiki.company.com/component/w/index.php?title=Test1&action=raw" 
test 
http://qwiki.company.com/component/w/index.php?title=Test1&action=raw 
Password: 
==== <font color="#008000">Combo</font> ==== 

{| border="1" cellspacing="1" cellpadding="1" 
|- 
! bgcolor="#67B0F9" scope="col" | test1 
! bgcolor="#67B0F9" scope="col" | test2 
! bgcolor="#67B0F9" scope="col" | test3 
! bgcolor="#67B0F9" scope="col" | test4 
|- 
| [http:link.com] 
|} 

==== <font color="#008000">COde:</font> ==== 
Traceback (most recent call last): 
    File "announce_update.py", line 66, in <module> 
    main() 
    File "announce_update.py", line 64, in main 
    update(url) 
    File "announce_update.py", line 52, in update 
    (tagStart, tagEnd) = m.span() 
AttributeError: 'NoneType' object has no attribute 'span' 

Expected output:

{| border="1" cellspacing="1" cellpadding="1" 
|- 
! bgcolor="#67B0F9" scope="col" | test1 
! bgcolor="#67B0F9" scope="col" | test2 
! bgcolor="#67B0F9" scope="col" | test3 
! bgcolor="#67B0F9" scope="col" | test4 
|- 
| [http:link.com] 
|} 
+2

Consider 'Beautiful Soup' or another HTML parsing library. Regular expressions are not suited to this task. – mpen 2013-03-02 07:50:41

Answers

0

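The error indicates that m is None, that is, the search found no match.

Also, it seems your regex would not capture the right text anyway: even if it matched, [^>]*> stops at the next closing angle bracket, the one ending </font>, so the table would not be included.

I found a good reference on using re at http://docs.python.org/2/howto/regex.html

After reading it, I think you need something like this: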

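p = re.compile(r'>Combo<.*?(\{.*?\})', re.DOTALL) 

Note the r prefix, which marks a raw string and tells Python not to interpret backslashes and so on. The re.DOTALL flag lets . match newlines, since the table spans several lines, and the ? after each * makes the match stop at the first brace rather than the last. I used parentheses to create a "group" so you can extract just that part of the match. Now, when you search with

m = p.search(txt) 

you should be able to extract the first parenthesized group after >Combo< with

myText = m.group(1) 

This may not extract exactly the bit you want, but it should be very close. What I am trying to say is that you need to find "from the first opening curly brace after >Combo< up to the next closing curly brace". The parentheses mean "this is the bit I want", and calling group on the match object extracts it.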
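
As a rough illustration of the capture-group idea described above, here is a minimal sketch run against a shortened copy of the text printed in the question. The names sample, pattern and combo_block are made up for this example, and the sketch assumes the table under Combo contains no curly braces other than its opening and closing ones.

```python
import re

# Shortened copy of the raw wiki text printed in the question (illustrative only).
sample = '''==== <font color="#008000">Combo</font> ====

{| border="1" cellspacing="1" cellpadding="1"
|-
! bgcolor="#67B0F9" scope="col" | test1
|-
| [http:link.com]
|}

==== <font color="#008000">COde:</font> ====
'''

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first "{"
# after ">Combo<" and then at the first "}" after that.
pattern = re.compile(r'>Combo<.*?(\{.*?\})', re.DOTALL)

# search() scans the whole string; match() would only test the very start.
m = pattern.search(sample)
combo_block = m.group(1) if m else None
print(combo_block)
```

If the heading text or the table layout differs on the real page, the pattern would need adjusting.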

+0

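I am actually searching for Combo.. I updated the script.. the source contains the string "Combo", and I am actually printing the source too.. – user2125827 2013-03-02 06:56:00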

+0

Any regex experts? – user2125827 2013-03-02 07:09:27

+0

I see what you printed, but I still think your regex is wrong, which is why I added the tag. – Floris 2013-03-02 07:09:46

1

p.search(txt) returns None if the pattern p is not found in the text txt. Calling None.span() then causes the error.
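
One way to avoid the AttributeError is to check the result of the search before using it. The following is only a sketch: find_span is a hypothetical helper name, and raw is a one-line excerpt of the text printed in the question.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    if m is None:
        return None  # caller decides what to do when the pattern is absent
    return m.span()

# One-line excerpt of the page text from the question.
raw = '==== <font color="#008000">Combo</font> ===='

tag_span = find_span(r'<Combo[^>]*>', raw)   # the question's pattern: no <Combo ...> tag exists
text_span = find_span(r'>Combo<', raw)       # Combo as text between two tags
print(tag_span, text_span)
```

Here the question's pattern yields None instead of raising, while the second pattern finds the heading text.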

To extract the text of the first <textarea> element in the HTML, you could use BeautifulSoup (an HTML parser) instead of a regex:

from bs4 import BeautifulSoup # pip install beautifulsoup4 

soup = BeautifulSoup(txt, "html.parser") # name the parser explicitly 
print(soup.textarea.string) 

You could try to do the same using only HTMLParser from the stdlib:

#!/usr/bin/env python 
import cgi 

try: 
    from html.parser import HTMLParser 
except ImportError: # Python 2 
    from HTMLParser import HTMLParser 

try: 
    from urllib.request import urlopen 
except ImportError: # Python 2 
    from urllib2 import urlopen 

url = 'http://qwiki.company.com/component/w/index.php?title=Test1&action=raw' 
tag = 'textarea' 

class Parser(HTMLParser): 
    """Extract tag's text content from html.""" 
    def __init__(self, html, tag): 
        HTMLParser.__init__(self) 
        self.contents = [] 
        self.intag = None 
        self.tag = tag 
        self.feed(html) 

    def handle_starttag(self, tag, attrs): 
        self.intag = (tag == self.tag) 
    def handle_endtag(self, tag): 
        self.intag = False 
    def handle_data(self, data): 
        if self.intag: 
            self.contents.append(data) 

# download and convert to Unicode 
response = urlopen(url) 
_, params = cgi.parse_header(response.headers.get('Content-Type', '')) 
# fall back to UTF-8 when the server does not declare a charset 
html = response.read().decode(params.get('charset', 'utf-8')) 

# parse html (extract text from the first `<tag>` element) 
content = Parser(html, tag).contents[0] 
print(content)
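

To try the HTMLParser approach without a network connection, a sketch along the following lines might help. It is Python 3 only. FirstTagText is a hypothetical class name, and page is a made-up HTML string standing in for the downloaded document, so the real page may look different.

```python
from html.parser import HTMLParser  # Python 3 stdlib

class FirstTagText(HTMLParser):
    """Collect the text inside the first <tag> element (hypothetical helper)."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.intag = False   # True while inside the first matching element
        self.done = False    # True once that element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.intag = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.intag:
            self.intag = False
            self.done = True

    def handle_data(self, data):
        if self.intag:
            self.chunks.append(data)

# Made-up page standing in for the downloaded HTML (an assumption, not the real wiki page).
page = ('<html><body><p>intro</p>'
        '<textarea name="wpTextbox1">{| border="1"\n| cell\n|}</textarea>'
        '</body></html>')

parser = FirstTagText('textarea')
parser.feed(page)
parser.close()
content = ''.join(parser.chunks)  # join in case the text arrives in several pieces
print(content)
```

Joining the chunks, rather than taking only the first one, guards against the parser delivering the element's text in more than one handle_data call.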