Python for re.match re.sub


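I'm working with a CSV file. It contains a list of sources (plain SSL links), places, websites (<a>non-SSL links</a>), Direcciones (addresses) and emails. When a piece of data is not available, its field is simply not shown. Like this: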

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, [email protected]

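However, the Website link (an HTML 'a' tag) always appears twice, followed by a run of commas. The commas are in turn followed sometimes by Direcciones and sometimes by the next source (https). As a result, if the process is not interrupted at EOF, it can keep 'replacing' for hours and produce an output file containing gigabytes of redundant and misplaced information. Let's take four entries as an example of Reutput.csv: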

> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> ,,Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected]

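So the idea is to remove the unnecessary duplicate Website link (the HTML 'a' tag) and the extra commas, while respecting the newlines (\n) and without getting stuck in a loop. Like this: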

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>" 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected]

This is the latest version of the code:

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = str(reuf.read())
    for lines in text:
        d = re.match('</a>".*D?', text, re.DOTALL)
        if d is not None:
            if not 'https' in d:
                replace = re.sub(d, '</a>",Direc', lines)
        h = re.match('</a>".*?http', text, re.DOTALL|re.MULTILINE)
        if h is not None:
            if not 'Direc' in h:
                replace = re.sub(h, '</a>"\nhttp', lines)
        replace = str(replace)
        putuf.write(replace)

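Right now I get a Put.csv in which the last row is repeated forever. Why does it loop? I have tried several ways of handling this code, but unfortunately I'm still stuck. Thanks in advance.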

Source

2015-08-17 Abueesp

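In the end I worked out the code myself. I'm posting it here in the hope that someone finds it useful. Anyway, thanks for your help and for the downvotes!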

import re

# Lazy match from a closing tag to 'Direc'. The lookahead stops it from
# crossing an 'https', i.e. from running into the next record.
to_direc = re.compile(r'</a>"(?:(?!https).)*?Direc', re.DOTALL)
# Lazy match from a closing tag to 'https' that may not cross a 'Direc'.
to_https = re.compile(r'</a>"(?:(?!Direc).)*?https', re.DOTALL)

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = reuf.read()
    # Drop the duplicate tag and the commas in front of Direcciones.
    text = to_direc.sub('</a>",Direc', text)
    # Otherwise the next record follows: start it on a new line.
    text = to_https.sub('</a>"\nhttps', text)
    # Write the cleaned text once, after both substitutions.
    putuf.write(text)
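
The same cleanup can probably be done in a single pass, which makes it easy to try on a few rows in memory before touching the real file. The following is only a sketch: the names tidy and JUNK and the sample rows are made up for illustration, and it assumes that 'https' appears only at the start of a record and 'Direc' only in the Direcciones field.

```python
import re

# From a closing tag, lazily skip ahead to whichever comes first:
# the Direcciones field of the same record or the 'https' of the next one.
JUNK = re.compile(r'</a>".*?(Direc|https)', re.DOTALL)


def tidy(text):
    """Collapse the duplicate tag and the comma run after each closing tag."""
    def fix(match):
        if match.group(1) == 'Direc':
            return '</a>",Direc'    # the same record continues
        return '</a>"\nhttps'       # the next record starts on a new line
    return JUNK.sub(fix, text)


# Hypothetical sample, shortened from the rows shown in the question.
sample = '''\
httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",,,,
"<a href='httpgoogledotcom'></a>",,,
,,Direcciones, Montain View
httpsbingdotcom, BingPlace, Direcciones, MicroWorld
httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",,,,
"<a href='httpgoogledotcom'></a>",,,
httpsbingdotcom, BingPlace, Direcciones, MicroWorld
'''

print(tidy(sample))
```

Because re.sub accepts a function as the replacement, the choice between a comma and a newline is made per match, and the text is scanned only once.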

Source

2015-08-17 08:02:35 Abueesp

When there is no match, groups will be None. You need to guard against this (or refactor the regular expression so that it always matches something).

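groups = re.search('</a>".*?Direc', lines, re.DOTALL)
if groups is not None:
    # Test the matched text; the match object itself is not a string.
    if 'https' not in groups.group(0):
        # ... the existing body goes here, one level deeper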

Notice the added not None condition, and the extra indentation of the lines below it that it now governs.
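
A small sketch of that guard in isolation may help. The function name first_junk and the sample row are made up for illustration. It also shows that re.match only tries the pattern at the very start of the string, so it returns None for a row where re.search finds the tag further in.

```python
import re

PATTERN = re.compile(r'</a>".*?Direc', re.DOTALL)


def first_junk(text):
    """Return the matched text, or None when the pattern is absent."""
    match = PATTERN.search(text)
    if match is None:           # no match: there is nothing to inspect
        return None
    return match.group(0)       # the matched string, not the match object


row = 'Website, "<a></a>",,,Direcciones'   # hypothetical sample row

print(first_junk(row))                      # the tag, the commas and 'Direc'
print(first_junk('no tag in this line'))    # None
print(PATTERN.match(row))                   # None: the row does not start with the tag
```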

Source

2015-08-17 06:14:48 tripleee

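I tried adding else: replace = lines, but no luck – Abueesp

See the example code – tripleee

Update: I tried it and got a blank file, so you are right, the groups match must be None. Why? And how can I fix Reutput.csv, then? Thanks in advance, tripleee – Abueesp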
