2012-11-22 26 views

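Writing to a file and getting strange indentation

I have the following code snippet. It takes a URL, opens it, parses out just the text, and then searches for widgets. It detects a widget by looking for the word widget1, followed later by endwidget, which marks the end of the widget.

Basically, once the code finds the word widget1 it writes every line of text to a file, and it stops when it reads endwidget. However, my code indents every line after the first widget1 line.

This is my output: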

widget1 this is a really cool widget 
     it does x, y and z 
     and also a, b and c 
     endwidget 

What I want is:

widget1 this is a really cool widget 
it does x, y and z 
and also a, b and c 
endwidget 

Why does this indentation appear? Here is my code:

import re

# mech, urls and BeautifulSoup come from setup code not shown here

for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    def visible(element):
        # Ignore text whose parent is one of these non-visible tags
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        # Ignore HTML comments
        elif re.match('<!--.*-->', str(element)):
            return False
        # Otherwise keep it, as these are the elements we need
        else:
            return True

    visible_texts = filter(visible, texts)

    inwidget = 0
    # open a file for write
    for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
        if ".widget1" in line and inwidget == 0:
            match = re.search(r'\.widget1 (\w+)', line)
            line = line.split(".widget1")[1]
            # make the next word after .widget1 the name of the file
            filename = "%s" % match.group(1) + ".txt"
            textfile = open(filename, 'w+b')
            textfile.write("source:" + url + "\n\n")
            textfile.write(".widget1" + line)
            inwidget = 1
        elif inwidget == 1 and ".endwidget" not in line:
            print line
            textfile.write(line)
        elif ".endwidget" in line and inwidget == 1:
            textfile.write(line)
            inwidget = 0
        else:
            pass

Answers

1

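Every line except the first is indented because the first line is written with textfile.write(".widget1" + line), where line has already been edited by the split, while the remaining lines are taken straight from the HTML file, indentation included. You can remove the unwanted whitespace by calling str.strip() on each line, changing textfile.write(line) to textfile.write(line.strip()).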
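As a rough sketch of that change, the snippet below strips each line before writing it. It is written for Python 3 rather than the Python 2 of the question, and the helper name write_widget_lines, the sample list, and the in-memory buffer are made up for illustration. Because strip() also removes the trailing newline, the sketch adds "\n" back after each line, and it skips lines that are only whitespace; both are assumptions about the desired output.

```python
import io


def write_widget_lines(lines, out):
    """Write each line to `out` with surrounding whitespace removed."""
    for line in lines:
        cleaned = line.strip()  # drops indentation and the trailing newline
        if cleaned:  # assumption: whitespace-only text nodes are not wanted
            out.write(cleaned + "\n")  # put one newline back per line


# Hypothetical text nodes, shaped like the output shown in the question
sample = [
    "widget1 this is a really cool widget\n",
    "     it does x, y and z\n",
    "     and also a, b and c\n",
    "     endwidget\n",
]

buf = io.StringIO()  # stands in for the real text file
write_widget_lines(sample, buf)
print(buf.getvalue())
```

In the original loop the same idea would be textfile.write(line.strip() + "\n") in both branches that write line unchanged.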

0

To get from your output to the output you want, do this:

# a is your output
a = '\n'.join(x.strip() for x in a.split('\n'))

Thanks. Is 'a' supposed to be the variable 'texts' or 'visible_texts'? – user1328021


Also, what exactly does it do to each line? It strips the carriage return, and what else? – user1328021


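It splits on \n, creating a list of strings with one string per line, then strips each line (meaning it removes the whitespace at the beginning and end, though you can use lstrip to remove it only at the beginning), and then joins the strings back together using \n as the separator. – LtWorf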
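The three steps described in that comment can be written out one at a time, which may make the one-liner easier to follow. This is only an illustrative Python 3 sketch: the string a is copied from the output shown in the question, and the intermediate names lines, stripped, result and left_only are invented here.

```python
# The indented output from the question, as one string
a = (
    "widget1 this is a really cool widget \n"
    "     it does x, y and z \n"
    "     and also a, b and c \n"
    "     endwidget "
)

lines = a.split("\n")                  # 1. one string per line
stripped = [x.strip() for x in lines]  # 2. drop leading and trailing whitespace
result = "\n".join(stripped)           # 3. join the lines back with "\n"

# Variant: lstrip() removes only the leading whitespace,
# so the trailing space on each line is kept
left_only = "\n".join(x.lstrip() for x in a.split("\n"))

print(result)
```

With strip() the trailing space at the end of each line disappears as well; with lstrip() only the indentation is removed.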