Could anyone please explain the R web scraping code below?
I found this code on Stack Overflow; it scrapes Apple's financial information from Yahoo Finance.
Specifically:
Step 2: how do I find the ".fi-row" node? Using Inspect in Google Chrome, I was unable to find it. How is this node found in practice?
Step 4: how does the code within this loop actually work? It seems to be doing all the scraping, but I cannot follow what each call contributes.
Step 5: how are the headings being scraped? The code seems very complicated.
Please note that I wrote the numbered comments in the code myself to help me understand it, so they could be incorrect.
library(rvest)
library(stringr)
library(magrittr)
## 1 ## read URL into HTML
url <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
## 2 ## set specific nodes
nodes <- url %>% html_nodes(".fi-row")
## 3 ## start with NULL; rbind() in step 4 turns it into a data frame
df <- NULL
## 4 ## loop within nodes to extract tabular financial data
for (i in nodes) {
  r <- list(i %>% html_nodes("[title],[data-test='fin-col']") %>% html_text())
  df <- rbind(df, as.data.frame(matrix(r[[1]], ncol = length(r[[1]]), byrow = TRUE), stringsAsFactors = FALSE))
}
## 5 ## extract column heading names
matches <- str_match_all(
  url %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(),
  '\\d{1,2}/\\d{1,2}/\\d{4}'
)
## 6 ## combine custom column names with column names from step 5
headers <- c('Breakdown','TTM', matches[[1]][,1])
## 7 ## set dataframe column names
names(df) <- headers
View(df)
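To test my own reading of steps 2, 4 and 5, I put together the minimal sketch below. The HTML in it is made up by me and only mirrors the structure I assume the Yahoo page has (rows with class "fi-row", a label cell with a title attribute, value cells with data-test="fin-col"); the figures and dates are placeholders, not real data. It also uses lapply() and do.call(rbind, ...) instead of the original for loop, which I believe is equivalent here.

```r
library(rvest)
library(stringr)

# Made-up page (an assumption: the real Yahoo markup may differ)
page <- read_html('
<div id="Col1-3-Financials-Proxy">
  <div><span>Breakdown</span> <span>ttm</span>
       <span>9/30/2019</span> <span>9/30/2018</span></div>
  <div class="fi-row"><div title="Row A">Row A</div>
    <div data-test="fin-col">1</div><div data-test="fin-col">2</div>
    <div data-test="fin-col">3</div></div>
  <div class="fi-row"><div title="Row B">Row B</div>
    <div data-test="fin-col">4</div><div data-test="fin-col">5</div>
    <div data-test="fin-col">6</div></div>
</div>')

# Step 2: "." is the CSS class selector, so this matches class="fi-row"
rows <- html_nodes(page, ".fi-row")

# Step 4: for each row, collect the text of every element that has a
# title attribute or data-test="fin-col", giving one character vector per row
cells <- lapply(rows, function(row) {
  html_text(html_nodes(row, "[title],[data-test='fin-col']"))
})

# Turn each vector into a one-row data frame and stack the rows
df <- do.call(rbind, lapply(cells, function(x) {
  as.data.frame(matrix(x, nrow = 1), stringsAsFactors = FALSE)
}))

# Step 5: take all text inside the container with that id and keep only
# the pieces that look like dates (1-2 digits / 1-2 digits / 4 digits)
container_text <- html_text(html_node(page, "#Col1-3-Financials-Proxy"))
dates <- str_match_all(container_text, "\\d{1,2}/\\d{1,2}/\\d{4}")[[1]][, 1]

# Steps 6 and 7: two fixed names plus one name per date found
names(df) <- c("Breakdown", "TTM", dates)
print(df)
```

If this reading is right, step 5 does not locate the heading cells at all; it just pulls anything date-shaped out of the table's text, which is why 'Breakdown' and 'TTM' have to be added by hand in step 6.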
Any clarification is very much appreciated.
Cheers