2015-08-17 70 views
-1

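Python for re.match re.sub

I am working with a CSV file. It contains a list of sources (plain https links), places, websites (`<a>` tags whose links are not https), Direcciones, and emails. When a piece of data is unavailable, it is simply left out. Like this: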

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, [email protected] 

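However, the website `<a>` tag link always appears twice, followed by several commas. Those commas are in turn followed sometimes by Direcciones and sometimes by a source (https). As a result, if the process is not interrupted at EOF, it can keep "replacing" for hours and produce a file containing gigabytes of redundant and misplaced information. Let's take four entries as an example of Reutput.csv: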

> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> ,,Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected] 

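So the idea is to remove the unnecessary website `<a>` tag links and the extra commas, while respecting newlines (\n) and without getting stuck in a loop. Like this: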

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>" 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected] 

This is the latest version of the code:

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf: 
    text = str(reuf.read()) 
    for lines in text: 
     d = re.match('</a>".*D?',text,re.DOTALL) 
     if d is not None: 
      if not 'https' in d: 
       replace = re.sub(d,'</a>",Direc',lines) 
     h = re.match('</a>".*?http',text,re.DOTALL|re.MULTILINE) 
     if h is not None: 
      if not 'Direc' in h: 
       replace = re.sub(h,'</a>"\nhttp',lines) 
     replace = str(replace) 
     putuf.write(replace) 

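Now I get a Put.csv file in which the last row is repeated forever. Why does this loop? I have tried several ways of handling this code, but unfortunately I am still stuck on it. Thanks in advance.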

Answers

0

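In the end I worked out the code myself. I am posting it here in the hope that someone finds it useful. Anyway, thanks for the help and for the downvotes!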

import re

# Read the whole file at once so the patterns can span line breaks.
with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = reuf.read()

    def join_direc(match):
        # Collapse '</a>" ... Direc' only when no new 'https' source lies in between.
        if 'https' in match.group(0):
            return match.group(0)
        return '</a>",Direc'

    def split_source(match):
        # Collapse '</a>" ... https' only when no 'Direc' lies in between.
        if 'Direc' in match.group(0):
            return match.group(0)
        return '</a>"\nhttps'

    # The check runs per match, and the second pass works on the first pass's result.
    replace = re.sub('</a>".*?Direc', join_direc, text, flags=re.DOTALL)
    replace = re.sub('</a>".*?https', split_source, replace, flags=re.DOTALL)
    # Write once, after both substitutions have been applied.
    putuf.write(replace)

0

When nothing matches, groups will be None. You need to guard against that (or refactor the regex so that it always matches something).

groups = re.search('</a>".*?Direc', lines, re.DOTALL)
if groups is not None:
    if 'https' not in groups.group(0):

Note the added `not None` condition and the extra indentation of the subsequent lines that it governs.
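
A minimal sketch of that guard, using a made-up sample string (`sample`, `anchored`, and `found` are names assumed for illustration, not data or code from the question). It also shows two details that are easy to miss: `re.match` only matches at the start of the string, whereas `re.search` scans the whole string, and a match object has to be turned into text with `.group(0)` before testing it for a substring.

```python
import re

# Hypothetical sample, loosely modelled on the rows in the question.
sample = 'Website, "<a href=\'x\'></a>",,,,\n,,Direcciones, Montain View'
pattern = '</a>".*?Direc'

# re.match anchors at position 0, so it finds nothing here.
anchored = re.match(pattern, sample, re.DOTALL)

# re.search scans the whole string and does find the span.
groups = re.search(pattern, sample, re.DOTALL)

found = None
if groups is not None:              # guard against "no match"
    matched_text = groups.group(0)  # the matched substring
    if 'https' not in matched_text:
        found = matched_text
```

With this sample, `anchored` stays `None` while `found` holds the span from `</a>"` up to `Direc`, including the line break.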

+0

I tried adding `else: replace = lines`, but no luck – Abueesp

+0

See the example code – tripleee

+0

Update: I tried it and got a blank file, so you are right, the match for groups must be None. Why? And how can I fix Reutput.csv then? Thanks in advance, tripleee – Abueesp
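
One likely explanation, sketched with a made-up string: iterating over the result of `read()` yields single characters rather than lines, so a pattern that needs several characters can never match inside such a loop. Searching the whole text once avoids that. The `text` value and the variable names below are assumptions for illustration only.

```python
import re

text = 'a</a>",,\n,,Direc b\n'  # hypothetical stand-in for reuf.read()
pattern = '</a>".*?Direc'

# Iterating a str yields one character at a time, not one line.
pieces = [piece for piece in text]
per_char_hits = [p for p in pieces if re.search(pattern, p, re.DOTALL)]

# str.splitlines() is what yields lines; this pattern spans a line break,
# so it still cannot match within any single line.
per_line_hits = [ln for ln in text.splitlines()
                 if re.search(pattern, ln, re.DOTALL)]

# Searching the whole text at once lets DOTALL span the newline.
whole = re.search(pattern, text, re.DOTALL)
```

Here both `per_char_hits` and `per_line_hits` come out empty, while `whole` is a match object covering the text from `</a>"` to `Direc`.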