2016-02-05 38 views
Scraping in R with a loop - football statistics

I want R to loop over the player profiles on transfermarkt.com. First, I get the roster URLs with the following.

library(rvest) # loads read_html(), html_nodes(), html_text(), html_attr() and %>%

#/ Add the Team’s URL to scrape 

TeamScrape <- read_html("http://www.transfermarkt.com/jumplist/startseite/verein/2778") 


#// Get Club Name 

ClubName <- TeamScrape %>% 
html_nodes(".spielername-profil") %>% 
html_text() 

#// Get All Player URLs 

PlayerURLs <- TeamScrape %>% 
html_nodes(".spielprofil_tooltip") %>% 
html_attr("href") 

PlayerURLs <- unique(PlayerURLs) 
PlayerURLs <- na.omit(PlayerURLs) 

PlayerURLs <- paste0("http://www.transfermarkt.com", PlayerURLs) 

PlayerLinks = data.frame(ClubName, PlayerURLs) 

This gives me a data.frame containing the URLs that I want to loop through with my next scraper, the "player profile scraper".

#/ Add the Player’s URL that you want to scrape 
URLLink <- PlayerURLs[13] 
PlayerTest <- read_html(URLLink) 


#// Squad No 

SquadNo <- PlayerTest %>% 
html_nodes(".rueckennummer-profil") %>% 
html_text() 


#// Name 

Name <- PlayerTest %>% 
html_nodes(".spielername-profil") %>% 
html_text() 

#// Nationality 

Nationality <- PlayerTest %>% 
html_nodes(".flaggenrahmen+ span") %>% 
html_text() 

#// Club 

Club <- PlayerTest %>% 
html_nodes(".vereinprofil_tooltip+ .vereinprofil_tooltip") %>% 
html_text() 

#// Position 

Position <- PlayerTest %>% 
html_nodes(".list+ .list tr:nth-child(3) td") %>% 
html_text() 

#// DOB 

DOB <- PlayerTest %>% 
html_nodes(".wsnw") %>% 
html_text() 

#// Age 

Age <- PlayerTest %>% 
html_nodes(".profilheader .hide-for-small td") %>% 
html_text() %>% 
as.numeric() 

#// Value 

Value <- PlayerTest %>% 
html_nodes(".marktwert a") %>% 
html_text() 

#// Matches Played this Season 

Matches <- PlayerTest %>% 
html_nodes(".hide.hide-for-small+ .zentriert") %>% 
html_text() %>% 
as.numeric() 

#// Goals Scored this Season 

Goals <- PlayerTest %>% 
html_nodes("#yw1 tfoot .zentriert:nth-child(4)") %>% 
html_text() %>% 
as.numeric() 

#// Assists Made this Season 

Assists <- PlayerTest %>% 
html_nodes("tfoot .zentriert:nth-child(5)") %>% 
html_text() %>% 
as.numeric() 

#// Mins Played this Season 

Minutes <- PlayerTest %>% 
html_nodes("tfoot .zentriert:nth-child(7)") %>% 
html_text() %>% 
as.numeric() 

#// Some Cleaning Up of the Data 

# to_remove_SquadNo <- paste(c("#")) 
# SquadNo <- gsub(to_remove_SquadNo, "", SquadNo) 

# Minutes <- regmatches(Minutes, gregexpr("[[:digit:]]+", Minutes)) 
# as.numeric(unlist(Minutes)) 

#// Create the Data Frame 

output = data.frame(SquadNo, Name, Nationality, Club, Position, DOB, Age, Value, Matches, Goals, Assists, Minutes) 

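My goal is to loop the player profile scraper over the URLs from the team scraper. I have tried many different loops and I am lost! I would really appreciate some advice!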

Answer

Use

lapply(PlayerURLs, FUN=function(URLLink){ 

in place of

URLLink <- PlayerURLs[13] 

and add at the end

output 
}) 
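
Since lapply() returns a list, here one data.frame per player, the results still need to be stacked into a single table. A minimal sketch of that pattern follows. The function name scrape_player, the example.com URLs, and the names PlayerList and AllPlayers are made up for illustration; the function body is a stand-in for the read_html() and html_nodes() steps in the question, so the sketch runs without network access.

```r
# Hypothetical stand-in for the player profile scraper. In the real version the
# body would call read_html(URLLink), run the html_nodes()/html_text() steps
# from the question, and build `output` with data.frame().
scrape_player <- function(URLLink) {
  output <- data.frame(URL = URLLink,
                       Name = basename(URLLink),
                       stringsAsFactors = FALSE)
  output
}

# Made-up URLs for illustration only
PlayerURLs <- c("http://example.com/player-a", "http://example.com/player-b")

# lapply() calls the function once per URL and returns a list of data.frames
PlayerList <- lapply(PlayerURLs, scrape_player)

# Stack the per-player data.frames into one table
AllPlayers <- do.call(rbind, PlayerList)
```

do.call(rbind, ...) only works if every per-player data.frame has the same columns, which is one reason to keep the column list fixed inside the function.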

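HubertL - thanks for the quick response. I did what you said and I got this: Error in data.frame(SquadNo, Name, Nationality, Club, ContractUntil, Position, : arguments imply differing number of rows: 0, 1 In addition: Warning messages: 1: In function_list[[k]](value) : NAs introduced by coercion 2: In function_list[[k]](value) : NAs introduced by coercion 3: In function_list[[k]](value) : NAs introduced by coercion Called from: data.frame(SquadNo, Name, Nationality, Club, ContractUntil, Position, DOB, Age) – user1593995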

That is because some of the data is missing – HubertL
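
When a selector matches nothing on a page, html_text() returns a zero-length vector, and data.frame() cannot combine a length-0 column with length-1 columns. One possible workaround is to turn an empty result into NA before building the data.frame. The helper name first_or_na and the sample values below are assumptions for illustration, not part of the original scraper.

```r
# Hypothetical helper: return the first match, or NA when nothing matched
first_or_na <- function(x) {
  if (length(x) == 0) NA else x[1]
}

# Made-up values: pretend the squad number selector found nothing
SquadNo <- character(0)
Name <- "Example Player"

# Every column now has length 1, so data.frame() no longer fails
output <- data.frame(SquadNo = first_or_na(SquadNo),
                     Name = first_or_na(Name),
                     stringsAsFactors = FALSE)
```

Taking x[1] also silently drops any extra matches, so it is worth checking that the first match is the one you actually want.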

Strangely, even when I reduce the scrape to 2 fields, the row counts differ... – user1593995
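
A selector can match zero nodes or several nodes as well as exactly one, so two fields can easily end up with different lengths. A quick way to see which fields are the problem is to collect them in a list and inspect their lengths before calling data.frame(). The values below are made up to mimic one page; fields, field_lengths and problem_fields are illustrative names.

```r
# Made-up values standing in for what the selectors might return on one page
fields <- list(
  Name    = "Example Player",   # exactly one match
  Age     = c(NA, 27, NA),      # several cells matched, most not numeric
  SquadNo = character(0)        # nothing matched
)

# Number of values each selector produced
field_lengths <- lengths(fields)

# Fields that would not fit into a one-row data.frame
problem_fields <- names(field_lengths)[field_lengths != 1]

print(field_lengths)
print(problem_fields)
```

Any field listed in problem_fields probably needs a narrower selector or an explicit rule for picking one value.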