I am scraping a database of companies available in this format, where each company has its own page, identified by the number at the end of the URL (the example above is 15310; see URL). I am using rvest.
I want to extract all entries shown under "Organización". Each variable name is in bold, followed by the value in normal text. In the example above there are 16 variables to extract.
There are two problems with these websites:
- the normal text is not inside an element (whereas the variable names are). In effect, the code goes something like this (notice "Value" sits outside the label):
<div class="form-group">
<label for="variable_code">variable_name</label>
Value
</div>
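To make the problem concrete, here is a minimal sketch (using rvest's minimal_html to build a hypothetical page with the same structure; the names are placeholders, not the real site's markup) showing that the text of the div mixes the label and the loose value together:

```r
library(rvest)  # minimal_html() and the html_* helpers

# Hypothetical page reproducing the structure described above
page <- minimal_html('
  <div class="form-group">
    <label for="variable_code">variable_name</label>
    Value
  </div>')

div <- html_node(page, "div.form-group")
html_text(div)  # the label text and the loose "Value" come back as one string
```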
- not all companies have the same number of variables (16 in the example above). Some have more, others have less. Still, variable_code and variable_name are the same throughout the database.
I can think of two options to scrape the data. One is to scrape based on fixed positions. For this, I can use "nth-child"-type CSS selectors to get each variable. However, because the number of variables changes across companies, I need to save both the variable name and the value as R variables. This is shown in the code below (for one website; for more I just need to add a loop, irrelevant here):
library(xml2)
library(rvest)
library(stringr)
url <- "https://tramites.economia.gob.cl/Organizacion/Details/15310"
webpage <- read_html(url)  # read the page once

# Select the element by its position in the division; this returns both the
# variable name and its value. Ideally you want only the value, to allocate
# to a variable in a data frame.
title_html <- html_nodes(webpage, "body > div > div:nth-child(5) > div > div:nth-child(1) > div:nth-child(3)")
title <- html_text(title_html)

parts <- strsplit(title, "\r\n")[[1]]  # split the name and the value apart
variable_name <- trimws(parts[2])
value <- trimws(parts[3])
So, the above works, but it is time consuming: it saves the variable names as R variables, after which I still need to reshape the data.
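The name/value split could be factored into a small helper so the same cleanup is not repeated per variable. A sketch in base R; split_pair is a hypothetical name, and the sample string mimics what html_text returns for one form-group:

```r
# Hypothetical helper: split one scraped block into a (name, value) pair
split_pair <- function(title) {
  parts <- trimws(strsplit(title, "\r\n")[[1]])  # split on CRLF, trim whitespace
  parts <- parts[parts != ""]                    # drop empty fragments
  c(name = parts[1], value = parts[2])
}

split_pair("\r\nRazón Social\r\nACME S.A.\r\n")
# returns c(name = "Razón Social", value = "ACME S.A.")
```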
Another option is to scrape based on labels, i.e., to search for each variable name in the code and get its value. Something like:
title_html <- html_nodes(webpage, "body > div > div:nth-child(5) > div > div:nth-child(1) label[for=RazonSocial]")
The problem with this approach is that the value of each variable is free text (i.e. outside any specific element). Thus, it cannot be obtained through CSS selectors, as explained in many places (e.g. here, here, or here). Evidently, I cannot change the HTML code.
What can I do to improve the scraping process? Am I stuck with the brute-force first method, extracting everything as variables? Or can I somehow gain efficiency?
PS: one way I was thinking of is to somehow get the position where the label is found using the second method and then get the value using the first. But I doubt R has this option (like address or cell in Excel).