Is it possible to read the tweet text from a Twitter URL without using the Twitter API?

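I am using Goose to read the title and body text of an article from a URL. However, this does not work for Twitter URLs, I assume because of the different HTML tag structure. Is there a way to read the tweet text from such a link?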

An example of such a tweet (shortened link) is:

https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1

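Note: I know how to read tweets through the Twitter API, but I am not interested in that. I only want to get the text by parsing the HTML source, without all the Twitter authentication hassle.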


2017-08-23 utengr

Scrape it yourself

Open the URL of the tweet, pass it to the HTML parser of your choice, and extract the XPaths you are interested in.

Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/

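You can obtain an XPath by right-clicking the element you want, selecting "Inspect", then right-clicking the highlighted line in the Inspector and selecting "Copy" > "Copy XPath". This works if the structure of the site is always the same. Otherwise, choose attributes that exactly define the element you want.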
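To illustrate why a copied XPath only works when the structure stays the same, here is a minimal sketch. Both pages are invented for illustration (they are not Twitter's markup), and POSITIONAL only imitates the style of path that "Copy XPath" tends to produce.

```python
from lxml import html

# Two hypothetical versions of the same page; in the second one an extra
# banner <div> has been added ahead of the content.
PAGE_V1 = "<html><body><div><p class='tweet-text'>hello</p></div></body></html>"
PAGE_V2 = ("<html><body><div>banner</div>"
           "<div><p class='tweet-text'>hello</p></div></body></html>")

POSITIONAL = "/html/body/div[1]/p/text()"                    # depends on element positions
BY_ATTRIBUTE = "//p[contains(@class, 'tweet-text')]/text()"  # depends on the class attribute

results = {}
for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    tree = html.fromstring(page)
    results[name] = (tree.xpath(POSITIONAL), tree.xpath(BY_ATTRIBUTE))
    print(name, results[name])
```

In the second page the positional path lands on the banner and finds nothing, while the attribute-based path still finds the paragraph.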

In your case:

//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()

will get you the name of the author, and

//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()

will get you the content of the tweet.
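The two expressions end differently: the first uses /text() and the second //text(). A small sketch of the difference, run on a hand-written fragment (the markup is hypothetical; only the class name is taken from the expressions above):

```python
from lxml import html

# Hypothetical markup: a paragraph with a link nested inside it.
fragment = html.fromstring(
    "<div><p class='tweet-text'>Read <a href='#'>this link</a> now</p></div>"
)

P = "//p[contains(@class, 'tweet-text')]"
direct = fragment.xpath(P + "/text()")       # text nodes that are direct children of <p>
everything = fragment.xpath(P + "//text()")  # text nodes at any depth under <p>

print(direct)
print(everything)
```

Because a tweet body can contain nested links, //text() is the form that also picks up the link text.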

Full working example:

from lxml import html
import requests

# Fetch the tweet page and parse the returned HTML
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)

# Collect every text node inside the tweet's <p class="tweet-text"> element
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')

Result:

['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
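The result is a list of text fragments rather than a single string. One possible way to turn it into readable text is sketched below; join_tweet_text is a hypothetical helper, not part of lxml, and the fragment list is a shortened stand-in for the output above, with example.com replacing the real link.

```python
def join_tweet_text(fragments):
    """Concatenate text nodes and collapse runs of whitespace (hypothetical helper)."""
    joined = "".join(fragments)
    # str.split() with no arguments also splits on the non-breaking space \xa0.
    return " ".join(joined.split())


# Shortened stand-in for the XPath result shown above.
fragments = [
    "Breaking:\n10 sailors missing...\n\n",
    "https://www.",
    "example.com/story",
    "\xa0",
    "\u2026",
    "pic.twitter.com/UiGEZq7Eq6",
]

print(join_tweet_text(fragments))
```

Joining with an empty string first keeps the split-up link in one piece; whitespace is normalized only afterwards.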


2017-08-23 08:31:42 petrpulc

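Just to clarify the XPath used: '//' searches anywhere; 'div[contains(@class, 'permalink-tweet-container')]' matches a div with the class 'permalink-tweet-container'; '//' again searches anywhere inside it; 'strong[contains(@class, 'fullname')]' matches a strong element whose class contains 'fullname'; '/' steps directly below it; and 'text()' gets the text. – petrpulc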

You can test your own XPaths at http://videlibri.sourceforge.net/cgi-bin/xidelcgi – petrpulc

If this answers your question, please accept it. – petrpulc
