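Removing all whitespace from an HTML string
I'm trying to write code that removes all spaces and other whitespace characters from a page's HTML and then reports the 3 most frequently occurring alphanumeric characters. My problem is twofold.
1) The split approach I'm using doesn't seem to work, and I don't know why. As far as I understand, splitting and then joining should remove all whitespace from the HTML source, but it doesn't (see the first returned value in the amazon example below).
2) I'm not very familiar with most_common. When I test my code against "http://amazon.com" I get the following output: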
The top 3 occuring alphanumeric characters in the html of http://amazon.com
: [(u' ', 258), (u'a', 126), (u'e', 126)]
What does the u mean in the values returned by most_common(3)?
My current code:
import requests
import collections
url = raw_input("please enter the url of the desired website (include http://): ")
response = requests.get(url)
responseString = response.text
print responseString
topThreeAlphaString = " ".join(filter(None, responseString.split()))
lineNumber = 0
for line in topThreeAlphaString:
    line = line.strip()
    lineNumber += 1
topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3)
print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha
The u means it is a unicode string. You are joining with a space, " ".join(...); join on an empty string instead: "".join(...) – AChampion
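
To illustrate the comment above, here is a minimal sketch of the empty-string join, combined with a filter that keeps only alphanumeric characters before counting. The helper names strip_whitespace and top_alnum and the sample string are made up for illustration and are not part of the original code. It uses an in-memory string instead of a live request so it can be run on its own.

```python
import collections


def strip_whitespace(text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines); joining on "" drops it entirely.
    return "".join(text.split())


def top_alnum(text, n=3):
    # Count only alphanumeric characters, so whitespace and
    # punctuation can never show up in the result.
    counts = collections.Counter(ch for ch in text if ch.isalnum())
    return counts.most_common(n)


sample = "<b>aaa  bb\n\tc</b>"  # hypothetical stand-in for response.text
print(strip_whitespace(sample))  # <b>aaabbc</b>
print(top_alnum(sample))         # [('b', 4), ('a', 3), ('c', 1)]
```

Because top_alnum filters with isalnum(), it does not strictly need the whitespace removed first. Under Python 2 each character in the result is printed with the u prefix when the input is a unicode string such as response.text; under Python 3 the prefix no longer appears.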