Channel: Active questions tagged r - Stack Overflow

Remove a specific html node from a set of nodes

I want to scrape the reports from https://ipaidabribe.com/reports/paid in R. My code below works quite well, except that some reports contain an embedded element below the report text that is part of the same CSS node as the report text.

For example, https://ipaidabribe.com/reports/paid?page=10 contains one embedded text block, "How to get a LPG gas connection".

As a result, I end up with character vectors of different lengths for different pages, depending on the number of reports with embedded elements. My question is: how can I remove this specific element from the node and scrape only the text of the report?

SelectorGadget suggested that this node can be selected with "em", so I tried the following:

# DO NOT RUN
scraper <- function(pages){
  bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", pages, sep = "="))
  bribe <- bribe %>% html_nodes("em")
  bribe <- xml_remove(bribe)
  all.nodes <- c(".paid-amount span", ".date", ".location", ".transaction a", ".body-copy-lg")
  map(all.nodes, ~ html_nodes(bribe, .x) %>% html_text())
}

pages <- seq(10, 50, by = 10)
bribe.test <- map(pages, ~ scraper(.x))

The problem seems to be that the embedded text cannot be selected with the "em" selector. So how can I remove this embedded node?
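A possible contributing factor, offered as a sketch rather than a confirmed diagnosis: in the attempt above, `bribe` is overwritten with the selected node set, so the later `html_nodes(bribe, .x)` calls no longer search the whole page. `xml2::xml_remove()` modifies the parsed document in place, so the document object can be kept and only the unwanted nodes detached from it. The HTML below is invented for illustration, and the class name `embedded` is hypothetical; the selector that actually matches the embedded element on the real page would have to be found by inspecting the page source.

```r
library(rvest)
library(xml2)

# Toy page: the second report contains an extra embedded block.
# The markup and the "embedded" class are invented for illustration.
html <- '
<div class="report"><p class="body-copy-lg">First report text</p></div>
<div class="report">
  <p class="body-copy-lg">Second report text</p>
  <div class="embedded"><p class="body-copy-lg">How to get a LPG gas connection</p></div>
</div>'

page <- read_html(html)

# Select the unwanted nodes, but keep `page` pointing at the whole document.
unwanted <- html_nodes(page, ".embedded")

# xml_remove() detaches the selected nodes from `page` in place.
xml_remove(unwanted)

# The remaining .body-copy-lg nodes are only the report texts.
report_text <- page %>% html_nodes(".body-copy-lg") %>% html_text()
print(report_text)
```

In the scraper, the same two lines (select, then `xml_remove()`) would go directly after `read_html()`, before the `map()` over `all.nodes`.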

MWE (produces a list of the scraped content; as you can see, the vector of report texts sometimes differs in length from the other character vectors):

library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(purrr)  # map()

scraper <- function(pages){
  bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", pages, sep = "="))
  all.nodes <- c(".paid-amount span", ".date", ".location", ".transaction a", ".body-copy-lg")
  map(all.nodes, ~ html_nodes(bribe, .x) %>% html_text())
}

pages <- seq(10, 50, by = 10)
bribe.test <- map(pages, ~ scraper(.x))
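An alternative way to keep the vectors aligned, sketched under assumptions: select one container node per report first, then call `html_node()` (singular) inside each container. `html_node()` returns exactly one result per input node, the first match or a missing node, so every field vector has one entry per report and `html_text()` yields `NA` where a field is absent. The markup, the `.report` container class, and the values below are invented for illustration; the real page's container selector is not known from the question.

```r
library(rvest)
library(purrr)

# Toy page with three reports. The markup, the "report" container class,
# and the values are invented for illustration.
html <- '
<div class="report">
  <span class="date">01 Jan</span>
  <p class="body-copy-lg">First report text</p>
</div>
<div class="report">
  <span class="date">02 Jan</span>
  <p class="body-copy-lg">Second report text</p>
  <div class="embedded"><p class="body-copy-lg">Embedded extra text</p></div>
</div>
<div class="report">
  <p class="body-copy-lg">Third report, no date</p>
</div>'

page <- read_html(html)

# One node per report.
reports <- html_nodes(page, ".report")

# html_node() picks the first match inside each report (or a missing node),
# so each resulting vector has length(reports) entries.
fields <- c(date = ".date", text = ".body-copy-lg")
result <- map(fields, ~ html_text(html_node(reports, .x)))

str(result)
```

Because the embedded block comes after the report text inside its container in this toy markup, the first match is the report text itself; if the order on the real page differs, the embedded node would still need to be removed or excluded.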
