Remove all whitespace from an HTML string

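I am trying to write code that removes all spaces and whitespace characters from a page's HTML and then counts the top 3 alphanumeric characters that occur in it. My problem is twofold.

1) The method I am using for the split does not seem to work, and I don't know why. As far as I understand, a split followed by a join should remove all spaces and whitespace from the HTML source, but it does not (see the first returned value in the amazon example below).

2) I am not very familiar with the most_common operation. When I test my code against "http://amazon.com" I get the following output: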

The top 3 occuring alphanumeric characters in the html of http://amazon.com 
: [(u' ', 258), (u'a', 126), (u'e', 126)]

What does the u mean in the values returned by most_common(3)?

My current code:

import requests 
import collections 


url = raw_input("please eneter the url of the desired website (include http://): ") 

response = requests.get(url) 
responseString = response.text 

print responseString 

topThreeAlphaString = " ".join(filter(None, responseString.split())) 

lineNumber = 0 

for line in topThreeAlphaString: 
    line = line.strip() 
    lineNumber += 1 

topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3) 

print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha


2017-03-02 CFalco

The u means it is a unicode string. You are joining with a space, " ".join(...); use "".join(...) to join on an empty string instead. – AChampion
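The point in the comment above can be checked with a small sketch. The `sample` string here is a made-up stand-in for the page source, not anything fetched from amazon.com:

```python
# Hypothetical sample text; an assumption for illustration only.
sample = "<p>  Hello\n\tworld </p>"

tokens = sample.split()          # split on any run of whitespace
with_spaces = " ".join(tokens)   # puts one space back between tokens
no_spaces = "".join(tokens)      # joins tokens with nothing between them

print(repr(with_spaces))
print(repr(no_spaces))
```

Joining on " " only collapses each run of whitespace to a single space, which is why the space character still tops the count in the question's output.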

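To take care of the whitespace, you want to use an instance of HTMLParser.HTMLParser and its unescape method to get rid of any raw HTML entities lying around. To count the characters, you should look at collections.Counter.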

import requests
from collections import Counter
from HTMLParser import HTMLParser  # Python 2 module name

response = requests.get('http://www.example.com')
responseString = response.text

# Convert HTML entities (e.g. &amp;) to characters, then drop all whitespace
parser = HTMLParser()
c = Counter(''.join(parser.unescape(responseString).split()))

print(c.most_common(3))
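For readers on Python 3, a rough equivalent might look like the sketch below. It swaps in `html.unescape` from the standard library in place of `HTMLParser.unescape`, and it filters with `str.isalnum` so that only alphanumeric characters are counted, as the question asks. The helper name `top_alnum` and the `sample` string are assumptions made for illustration; in real use the input would be `response.text`.

```python
from collections import Counter
from html import unescape


def top_alnum(html_text, n=3):
    """Return the n most common alphanumeric characters in html_text."""
    text = unescape(html_text)                   # e.g. "&amp;" becomes "&"
    chars = (ch for ch in text if ch.isalnum())  # skips whitespace and punctuation
    return Counter(chars).most_common(n)


# Hypothetical input standing in for response.text
sample = "<p>aaa &amp; bb&nbsp;c</p>"
result = top_alnum(sample)
print(result)
```

Note that this still counts the letters inside tag and attribute names, because it works on the raw HTML source just as the original code does.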


2017-03-02 02:23:05 pml
