2017-01-12

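Preserve \n in string content and write it on one line

I have the following code to parse some HTML. I need to save the output (the HTML result) as a single line in which escape sequences such as \n are kept as literal text. However, I either get a representation from repr() that I cannot use because of the surrounding single quotes, or the output is written across multiple lines like this (with the escape sequences interpreted):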

<section class="prog__container"> 
<span class="prog__sub">Title</span> 
<p>PEP 336 - Make None Callable</p> 
<span class="prog__sub">Description</span> 
<p> 
<p> 
<code> 
     None 
    </code> 
    should be a callable object that when called with any 
arguments has no side effect and returns 
    <code> 
     None 
    </code> 
    . 
    </p> 
</p> 
</section> 

What I need (including the escape sequences):

<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n  None\n  </code>\n  should be a callable object that when called with any\n arguments has no side effect and returns\n  <code>\n  None\n  </code>\n  .\n </p>\n </p>\n </section> 

My code:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Remove the <div> tags but keep their contents
for match in soup.find_all(['div']):
    match.unwrap()

# Remove the <a> tags but keep their contents
for match in soup.find_all(['a']):
    match.unwrap()

html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repr(html)  # my current solution works, but the result is unusable

The above is my solution, but the object representation is no good; I need the string representation. How can I do this?
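A note on the re.sub line in the code above: re.sub processes backslash escapes in the replacement string, so a replacement of "\\n" is converted back into a real newline and the call changes nothing. The sketch below illustrates this with a short stand-in string (the variable names are made up for the example) and shows two ways to get a literal backslash followed by n instead.

```python
import re

text = "<p>\n</p>"  # stand-in for the parsed HTML

# re.sub processes backslash escapes in the replacement string, so the
# two-character replacement (backslash + n) becomes a real newline again.
unchanged = re.sub("\n", "\\n", text)

# Doubling the backslash in a raw string yields a literal backslash + n.
escaped_re = re.sub("\n", r"\\n", text)

# str.replace does no escape processing, so it is the simpler tool here.
escaped_replace = text.replace("\n", "\\n")

print(unchanged == text)
print(escaped_re)
print(escaped_replace)
```

For a fixed substitution like this one, str.replace is probably the easier choice, since no regular expression is needed.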

Answers

1

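For the HTML code, why not just use repr()?

a = """this is the first line
this is the second line"""
print(repr(a))

Or even, if I understand correctly that the exact output should not include the literal quotes:

print(repr(a).strip("'"))

Output: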

'this is the first line\nthis is the second line' 
this is the first line\nthis is the second line 
+0

This works. Accepting it as the simplest solution. – lkdjf0293
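One caveat with the strip("'") approach: repr() switches to double quotes when the text contains a single quote, so the quotes are not always removed. A minimal sketch of an alternative that never adds quotes in the first place is shown below; escape_newlines is a hypothetical helper name, not part of any library, and it only handles backslashes, \r and \n.

```python
def escape_newlines(text):
    """Return text with backslashes and line breaks written as escapes.

    Hypothetical helper for illustration, not part of any library.
    """
    return (
        text.replace("\\", "\\\\")  # escape existing backslashes first
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


a = "this is the first line\nthis is the second line"
print(escape_newlines(a))

# repr() picks double quotes when the text contains a single quote,
# so strip("'") leaves the quotes in place here.
tricky = "it's line one\nline two"
print(repr(tricky).strip("'"))
print(escape_newlines(tricky))
```

Escaping the backslashes before the newlines keeps the result unambiguous if the escaping ever needs to be reversed.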

2
import bs4 

html = '''<section class="prog__container"> 
<span class="prog__sub">Title</span> 
<p>PEP 336 - Make None Callable</p> 
<span class="prog__sub">Description</span> 
<p> 
<p> 
<code> 
     None 
    </code> 
    should be a callable object that when called with any 
arguments has no side effect and returns 
    <code> 
     None 
    </code> 
    . 
    </p> 
</p> 
</section>''' 
soup = bs4.BeautifulSoup(html, 'lxml') 
str(soup) 

Out:

'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n  None\n  </code>\n  should be a callable object that when called with any\n arguments has no side effect and returns\n  <code>\n  None\n  </code>\n  .\n </p>\n</section></body></html>' 

More elaborate ways of producing output are described in the documentation.

+0

Thanks for your answer! The problem here is the same as with the repr() function: the single quotes. – lkdjf0293

0
from bs4 import BeautifulSoup 
import urllib.request 

r = urllib.request.urlopen('https://www.example.com') 
soup = BeautifulSoup(r.read(), 'html.parser') 
html = str(soup) 

This gives you the HTML as a string containing \n newline characters.
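Note that a string obtained this way contains real newline characters, and the surrounding quotes only appear when it is displayed with repr(). The following is a rough sketch of saving such a string on a single line; the html literal is a stand-in for the value of str(soup), and the temporary file is just an assumed destination for the example.

```python
import os
import tempfile

# Stand-in for the value of str(soup); it contains real newline characters.
html = '<section class="prog__container">\n<span class="prog__sub">Title</span>\n</section>'

# Escape backslashes first, then turn each newline into backslash + n.
one_line = html.replace("\\", "\\\\").replace("\n", "\\n")

# Write the escaped text to a temporary file (assumed destination).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write(one_line)
    path = f.name

# Read it back to check that everything ended up on one line.
with open(path, encoding="utf-8") as f:
    saved = f.read()
os.remove(path)

print(len(html.splitlines()))
print(len(saved.splitlines()))
```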
