Parsing HTML for web scraping

I can't correctly parse the HTML on this website: https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=332857117043301

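I want to extract the line: Latitude 34°02'48.57", Longitude 117°02'09.16". Although it shows up on line 862 of the page source (web developer tools), it does not show up when I parse the page with BeautifulSoup. Using the lxml parser doesn't produce the desired result either.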

import requests 
import re 
from bs4 import BeautifulSoup 

page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902') 
soup = BeautifulSoup(page.content, 'html.parser') 

print (soup.prettify())

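The page content from my print statement doesn't show the latitude/longitude line. How can I adjust my code to scrape this information?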

2017-09-25 saoirse

import requests 
from bs4 import BeautifulSoup 

html = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902') 
soup = BeautifulSoup(html.text, 'lxml') 

data = soup.find_all('div', attrs={'align': 'left'}) 

latitude = ''.join(x.contents[0].split(',')[0] for x in data if 'Latitude' in x.contents[0]) 
longitude = ''.join(x.contents[0].split(',')[1].strip().replace('\n', '') for x in data if 'Longitude' in x.contents[0]) 

print(latitude) 
print(longitude)

Output:

Latitude  34°02'48.57" 
Longitude 117°02'09.16" NAD83

2017-09-25 22:06:56 mentalita

How are you looking for the specific content? You can use .findAll('div') to find the data, and then search the text of each tag for "Latitude":

import requests 
from bs4 import BeautifulSoup 

page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902') 
soup = BeautifulSoup(page.content, 'html.parser') 

divs = soup.findAll('div') 
texts = [div.text for div in divs] 

for text in texts: 
    if "Latitude" in text: 
        data = text

After that, it only takes a little parsing to get the numbers and assign them to variables. The resulting string:

>>> print(data) 
Latitude  34°02'48.57", Longitude 117°02'09.16" 
NAD83
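
As a rough sketch of that last parsing step, the snippet below pulls degrees, minutes and seconds out of a string shaped like the output above and converts them to decimal degrees. The names `parse_dms` and `to_decimal` are made up for illustration, and the sample string is copied from the printed output rather than fetched from the site. The text carries no hemisphere marker, so no sign is applied; that would have to be decided separately.

```python
import re

# Matches e.g. 'Latitude  34°02'48.57"' and captures ('34', '02', '48.57').
DMS_PATTERN = r"{label}\s*(\d+)°(\d+)'(\d+(?:\.\d+)?)\""


def parse_dms(label, text):
    """Return (degrees, minutes, seconds) for the given label, or None."""
    match = re.search(DMS_PATTERN.format(label=re.escape(label)), text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to unsigned decimal degrees."""
    return degrees + minutes / 60.0 + seconds / 3600.0


# Sample copied from the printed output above (not fetched live).
data = 'Latitude  34°02\'48.57", Longitude 117°02\'09.16" \nNAD83'

lat = parse_dms("Latitude", data)
lon = parse_dms("Longitude", data)
print(lat, lon)
print(round(to_decimal(*lat), 6), round(to_decimal(*lon), 6))
```

On Python 3, `\s` in a str pattern also matches the non-breaking space that the real page places after the label.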

2017-09-25 22:07:14

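Thanks, Vinicius. I assumed the content would appear when I printed the page content with print(soup.prettify()). Can you explain why it doesn't show up there but works with the findAll method? – saoirse

Glad to help! When I tried your code, it also showed up in print(soup), as it should. Have you tried doing that again? You can also choose the most helpful answer and accept it (https://meta.stackexchange.com/a/5235). –

I tried, and the print statement doesn't show it. My resulting HTML has 814 lines; does that sound right? – saoirse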

This page is a pure mess... just use a regular expression (working example, Python 2):

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import requests 
import re 


def find(prefix, string):
    # The raw HTML writes the non-breaking space as &nbsp; and the degree sign as &#176;
    pattern = r"{}&nbsp;\s*(\d+)&#176;(\d+)'(\d+)\.(\d+)\"".format(prefix)
    return re.search(pattern, string)


def format_result(result):
    # Rebuild the coordinate as degrees°minutes'seconds.fraction"
    return "{}°{}'{}.{}\"".format(
        result.group(1),
        result.group(2),
        result.group(3),
        result.group(4)
    )

# page.text is the decoded response body, so the str pattern also works on Python 3
page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
found_lat = find('Latitude', page.text)
found_lon = find('Longitude', page.text)
if found_lat and found_lon:
    latitude = format_result(found_lat)
    longitude = format_result(found_lon)
    print('Cords: {} {}'.format(latitude, longitude))

Result:

Cords: 34°02'48.57" 117°02'09.16"

As you can see, you can get each number from found_lat or found_lon like this:

print(found_lat.group(1)) # 34 
print(found_lat.group(2)) # 02 
print(found_lat.group(3)) # 48 
print(found_lat.group(4)) # 57

Or the whole latitude or longitude like this:

print(latitude) # 34°02'48.57" 
print(longitude) # 117°02'09.16"
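
On Python 3, one alternative sketch is to decode the HTML entities first with the standard library's `html.unescape`, so the pattern can be written against the visible text instead of `&nbsp;` and `&#176;`. The `raw` string below is an assumed fragment, reconstructed from what the regex above expects; the real page markup may differ. `find_coordinate` is a hypothetical helper name.

```python
import html
import re

# Assumed fragment of the raw page source, reconstructed from the regex above.
raw = 'Latitude&nbsp; 34&#176;02\'48.57", Longitude&nbsp;117&#176;02\'09.16"'


def find_coordinate(label, source):
    """Return the coordinate that follows `label`, or None if not found."""
    text = html.unescape(source)  # &nbsp; becomes '\xa0', &#176; becomes '°'
    # In Python 3 str patterns, \s also matches the non-breaking space.
    match = re.search(re.escape(label) + r"\s*(\d+°\d+'\d+\.\d+\")", text)
    return match.group(1) if match else None


print(find_coordinate("Latitude", raw))
print(find_coordinate("Longitude", raw))
```

Decoding first keeps the pattern readable, at the cost of one extra pass over the page text.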

2017-09-25 22:09:04 Salamek

It's there. If you run the following code you will get the latitude, and you can replicate it for the longitude.

# soup is the BeautifulSoup object built in the question's code
divs = soup.find_all('div')
lat_index = str(divs).find("Latitude")
lat = str(divs)[lat_index:lat_index+22]  # 'Latitude\xa0 34°02\'48.57"'
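
The fixed width of 22 characters only works while the value keeps exactly that length, and the longitude, with three degree digits, is longer. A slightly less brittle sketch, assuming each value ends at the first double quote after its label, slices up to that quote instead. `slice_field` is a hypothetical helper, and the sample text only mimics the string shown in the comment above.

```python
def slice_field(text, label):
    """Return the substring from `label` through the next double quote."""
    start = text.find(label)
    if start == -1:
        return None
    end = text.find('"', start)
    if end == -1:
        return None
    return text[start:end + 1]


# Sample mimicking the text found in the page (not fetched live).
sample = 'Latitude\xa0 34°02\'48.57", Longitude\xa0117°02\'09.16" NAD83'

print(slice_field(sample, "Latitude"))
print(slice_field(sample, "Longitude"))
```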

2017-09-25 22:13:07 manbearpig
