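How do I display a sentence from a website?

I decided to make this small project to learn how to use mechanize. Right now it goes to urbandictionary, fills in the word "skid" in the search form, presses submit, and prints out the HTML.

What I want it to do is find the first definition and print that out. How would I go about doing that?

This is my source code so far: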

import mechanize 

br = mechanize.Browser() 
page = br.open("http://www.urbandictionary.com/") 

br.select_form(nr=0) 
br["term"] = "skid" 
br.submit() 

print br.response().read()

Here is where the definition is stored:

<div class="definition">Canadian definition: Commonly used to refer to someone who  stopped evolving, and bathing, during the 80&#x27;s hair band era. Generally can be found wearing AC/DC muscle shirts, leather jackets, and sporting a <a href="/define.php?term=mullet">mullet</a>. The term &quot;skid&quot; is in part derived from &quot;skid row&quot;, which is both a band enjoyed by those the term refers to, as well as their address. See also <a href="/define.php?term=white%20trash">white trash</a> and <a href="/define.php?term=trailer%20park%20trash">trailer park trash</a></div><div class="example">The skid next door got drunk and beat up his old lady.</div>

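You can see it's stored in the div with class "definition". I know how to search the source code for the div, but I don't know how to take everything between the tags and then display it.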

2013-08-23 Natasha Bysouth

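I'm not familiar with mechanize, but anyway... the first thing that comes to mind is XPath (lxml) or BeautifulSoup. – Sheena

Look into [Scrapy](http://scrapy.org/) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) for this kind of task. If the site offers an API, that is probably the best option. For example, Urban Dictionary appears to have a JSON API, but it is not freely available to everyone. –

Welcome to StackOverflow! Please take a look at the FAQ; it will help us help you. You usually don't need a "please" or "thank you"; your upvote is the measure of that. Make sure you accept an answer if it solves your problem. – Hooked

You can use lxml to parse the HTML fragment: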

import lxml.html as html 
import mechanize 

br = mechanize.Browser() 
page = br.open("http://www.urbandictionary.com/") 

br.select_form(nr=0) 
br["term"] = "skid" 
br.submit() 

fragment = html.fromstring(br.response().read()) 

print fragment.find_class('definition')[0].text_content()

This solution strips the tags inside the div and flattens the remaining text.
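
To make "strips the tags and flattens the text" concrete without a network call, here is a rough standard-library sketch of the same idea. It does not use lxml; the class `DefinitionText`, the helper `first_definition`, and the shortened `sample` string are all made up for illustration.

```python
from html.parser import HTMLParser


class DefinitionText(HTMLParser):
    """Collect the flattened text of the first <div class="definition">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &quot; and friends
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # set once the first definition has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1  # a div nested inside the definition
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "definition" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)  # text only; the tags themselves are dropped


def first_definition(html_source):
    parser = DefinitionText()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.parts)


# Hypothetical, shortened stand-in for the page returned by br.response().read()
sample = (
    '<div class="definition">Sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. The term &quot;skid&quot; '
    'comes from &quot;skid row&quot;.</div>'
    '<div class="example">The skid next door.</div>'
)
print(first_definition(sample))
```

The link text "mullet" survives while the surrounding `<a>` tag does not, which is the behaviour the one-line `text_content()` call gives you with far less code.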

2013-08-23 16:14:43

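Wow, this works like a charm! Guess I'll have to look into lxml. –

You can find more information about the API here (http://lxml.de/lxmlhtml.html). @NatashaBysouth –

I think a regular expression is enough for this task (based on your description). Try this code: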

import mechanize, re 

br = mechanize.Browser() 
page = br.open("http://www.urbandictionary.com/") 

br.select_form(nr=0) 
br["term"] = "skid" 
br.submit() 

source = br.response().read() 

regex = r'<div class="definition">(.+?)</div>'
pattern = re.compile(regex)
r = re.findall(pattern, source)
print r[0]

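This will show the content between the tags (it does not include the example part, but that works exactly the same way). I don't know what you want to do with the tags inside this content, though. If you want to keep them, this is it. If you want to remove them, you can use something like re.sub().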
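
As a hedged sketch of that last step: Python's `re` module has no `replace` function, and the substitution call is `re.sub`. The helper name `first_definition_text`, the tag pattern, and the shortened `sample` string below are illustrative assumptions, and the tag pattern is only good enough for simple inline markup like the links in this definition.

```python
import html
import re

# Non-greedy match up to the first closing </div>; DOTALL lets "." span newlines.
DEFINITION_RE = re.compile(r'<div class="definition">(.+?)</div>', re.DOTALL)
# Deliberately naive tag pattern: fine for simple inline tags, not arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


def first_definition_text(source):
    """Return the first definition with tags removed, or None if there is none."""
    match = DEFINITION_RE.search(source)
    if match is None:
        return None
    without_tags = TAG_RE.sub("", match.group(1))
    return html.unescape(without_tags)  # turn &quot; and &#x27; into characters


# Hypothetical, shortened stand-in for the downloaded page source
sample = (
    '<div class="definition">Stuck in the 80&#x27;s, sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. See &quot;skid row&quot;.</div>'
    '<div class="example">The skid next door got drunk.</div>'
)
print(first_definition_text(sample))
```

Checking for `None` avoids the `IndexError` that `r[0]` would raise when the page contains no definition at all.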

2013-08-23 15:42:09 labyrlnth

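I understand that you may just be giving an example, but you really shouldn't use regular expressions to match HTML. See [this classic answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Someone suggested BeautifulSoup, which is designed for exactly this kind of thing. –

@PeteTinkler Wow, that answer is really cool! Thanks for sharing. To be honest, I wasn't aware of this before, because it has worked every time for me. I guess I need some time to figure it out. Thanks :-) – labyrlnth

Since it was mentioned, I thought I'd provide a BeautifulSoup answer. Use whichever approach works best for you.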

import bs4, urllib2 

# Use urllib2 to get the html from the web 
url  = r"http://www.urbandictionary.com/define.php?term={term}" 
request = url.format(term="skid") 
raw  = urllib2.urlopen(request).read() 

# Convert it into a soup 
soup = bs4.BeautifulSoup(raw) 

# Find the requested info 
for word_def in soup.findAll(class_ = 'definition'): 
    print word_def.string

2013-08-23 19:35:01 Hooked

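This solution has a problem if other elements, such as links, are children of the element being printed. To get the whole string, use word_def.text instead of word_def.string in the last line. –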
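
A small offline illustration of that difference, assuming the `bs4` package is installed. The `sample` markup is a shortened, made-up stand-in for the real page, and the explicit `"html.parser"` argument is a choice made here, not something the answer above uses.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the downloaded page
sample = (
    '<div class="definition">Sporting a '
    '<a href="/define.php?term=mullet">mullet</a>.</div>'
    '<div class="example">The skid next door got drunk.</div>'
)

soup = BeautifulSoup(sample, "html.parser")
definition = soup.find("div", class_="definition")
example = soup.find("div", class_="example")

# .string only works when the tag has exactly one child
print(example.string)         # the example div holds a single text node
print(definition.string)      # None: this div mixes text with an <a> child
print(definition.get_text())  # text of all descendants joined together
```

`.text` is a property that returns the same result as `get_text()`, so either spelling fixes the last line of the answer.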
