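Ignoring internal header rows in a pandas DataFrame
I am scraping multiple tables from the web that are structured exactly like this one (the large batting game log table), and I need the DataFrame to ignore the internal header rows that start each new part of the season.
Here is my script so far: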
from bs4 import BeautifulSoup
import pandas as pd
import csv
import urllib2

def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        #open the url and create bs object
        player_link = urllib2.urlopen(url)
        bs = BeautifulSoup(player_link, 'html5lib')
        #identify which table is needed
        table_id = ""
        if url[-12] == 'b':
            table_id = "batting"
        elif url[-12] == 'p':
            table_id = "pitching"
        #find the table and create dataframe
        table = str(bs.find('table', {'id' : (table_id + '_gamelogs')}))
        df = pd.read_html(table, header=0)
        df2 = df[0]
        df2 = df2[df2.PA != 'PA']
        #for the name of the file and file path
        file_path = '/Users/kramerbaseball/Desktop/MLB_Web_Scraping_Program/game_logs_non_concussed/'
        name_of_file = str(id_nums[idx])
        df2.to_csv(path_or_buf=(file_path + name_of_file + '.csv'), sep=',', encoding='utf-8')
        idx += 1

if __name__ == "__main__":
    stir_the_soup()
I am trying to get the DataFrame to ignore the rows where PA == 'PA' or HR == 'HR', but those rows are not being deleted. Any help is appreciated.
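For reference, here is a minimal sketch of the filter described above, run on a small hand-made frame rather than the scraped table. The sample values and the helper name `drop_repeated_headers` are made up for illustration. It assumes the repeated header text really does sit under the column of the same name, which may not be true for the scraped table.

```python
import pandas as pd

# Hypothetical stand-in for a parsed game log: the third row is a
# repeated header row that was read in as ordinary data.
df = pd.DataFrame(
    {
        "Date": ["Apr 4", "Apr 5", "Date", "May 1"],
        "PA": ["4", "5", "PA", "3"],
        "HR": ["0", "1", "HR", "0"],
    }
)


def drop_repeated_headers(frame, column):
    """Return the rows whose value in `column` is not the column's own name."""
    # Compare as stripped strings so stray whitespace or mixed dtypes
    # do not hide a match.
    is_header = frame[column].astype(str).str.strip() == column
    return frame[~is_header].reset_index(drop=True)


cleaned = drop_repeated_headers(df, "PA")
print(cleaned)
```

If a filter like this leaves the rows in place, it may be worth printing `df.columns` and one of the offending rows to see which label actually ended up under which column.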
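Thanks, this works, but why is it the 'Gtm' column == 'Date'!? They are separate columns.
Don't know, but Captain Obvious suggests that the columns and the headers are somehow shifted relative to each other =)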
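If the labels and the data are shifted as the comment above guesses, one option is a filter that does not depend on a single column lining up. The sketch below flags a row as a repeated header when at least half of its cells are themselves column labels. The frame, its values, and the one-half threshold are assumptions chosen for illustration. It does not explain why the shift happens in the scraped table.

```python
import pandas as pd

# Hypothetical frame in which the repeated header row is shifted one
# position: the text 'Date' sits under the 'Gtm' label.
df = pd.DataFrame(
    [
        ["1", "Apr 4", "NYY", "BOS"],
        ["2", "Apr 5", "NYY", "BOS"],
        ["Gtm", "Date", "Tm", "Opp"],
    ],
    columns=["Rk", "Gtm", "Date", "Tm"],
)

# Count, per row, how many cells are themselves column labels.
labels = set(df.columns.astype(str))
hits = df.astype(str).isin(labels).sum(axis=1)

# Treat a row as a repeated header when at least half of its cells are labels.
is_header = hits >= len(df.columns) / 2

print(df[is_header])  # shows which label landed under which column
cleaned = df[~is_header].reset_index(drop=True)
print(cleaned)
```

In this made-up frame, `df.Tm != 'Tm'` would keep the header row, while `df.Gtm != 'Date'` would drop it, which matches the behaviour described in the comments.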