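I have the following code snippet. It takes a URL, opens it, extracts just the text, and then searches for widgets. It detects a widget by looking for the word widget1, followed later by endwidget, which marks the end of the widget. When I write the result to a file, I get strange indentation.
Basically, as soon as the code finds the text widget1 it writes every following line of text to a file, and it stops when it reads endwidget. However, my code indents every line after the first widget1 line.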
This is my output:
widget1 this is a really cool widget
    it does x, y and z
    and also a, b and c
    endwidget
What I want is:
widget1 this is a really cool widget
it does x, y and z
and also a, b and c
endwidget
Why does this indentation appear? Here is my code...
for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            # If the parent of your element is any of those ignore it
            return False
        elif re.match('<!--.*-->', str(element)):
            # If the element matches an html tag, ignore it
            return False
        else:
            # Otherwise, return True as these are the elements we need
            return True

    visible_texts = filter(visible, texts)

    inwidget = 0
    # open a file for write
    for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
        if ".widget1" in line and inwidget == 0:
            match = re.search(r'\.widget1 (\w+)', line)
            line = line.split(".widget1")[1]
            # make the next word after .widget1 the name of the file
            filename = "%s" % match.group(1) + ".txt"
            textfile = open(filename, 'w+b')
            textfile.write("source:" + url + "\n\n")
            textfile.write(".widget1" + line)
            inwidget = 1
        elif inwidget == 1 and ".endwidget" not in line:
            print line
            textfile.write(line)
        elif ".endwidget" in line and inwidget == 1:
            textfile.write(line)
            inwidget = 0
        else:
            pass
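Thanks. Should the variable there be 'texts' or 'visible_texts'? – user1328021
Also, for each 'line', what exactly does it do? It strips the carriage returns, and what else? – user1328021
It splits on \n, creating a list of strings with one entry per line, then strips each line (meaning it removes the whitespace at the start and end, although you can change it to remove only leading whitespace with lstrip), and then joins those strings back together using \n as the separator. – LtWorf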
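For anyone reading along, here is a minimal sketch of the split / strip / join approach described in the comment above. It is written for Python 3, and the helper name strip_lines, its leading_only flag, and the sample string are all made up for illustration; they are not taken from the original answer.

```python
def strip_lines(s, leading_only=False):
    """Strip whitespace from every line of s and rejoin the lines with '\n'.

    strip_lines and leading_only are hypothetical names used for this sketch.
    """
    # str.strip removes whitespace at both ends; str.lstrip only at the start.
    strip = str.lstrip if leading_only else str.strip
    # Split on '\n', clean each piece, then glue the pieces back together.
    return "\n".join(strip(line) for line in s.split("\n"))


# Assumed sample input that mimics the indented output from the question.
sample = (
    "widget1 this is a really cool widget\n"
    "    it does x, y and z\n"
    "    and also a, b and c\n"
    "    endwidget"
)

cleaned = strip_lines(sample)
print(cleaned)
```

In the question's loop, the same idea would presumably be applied to each piece of text before it is passed to textfile.write. Note that stripping both ends also drops a trailing newline, so a "\n" may need to be written back explicitly.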