Channel: Active questions tagged r - Stack Overflow

How to deal with this website in a webscraping format?

I am trying to webscrape this website.

I am applying the same code that I always use to webscrape pages:

url_dv1 <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"

url_dv1 <- paste(html_text(html_nodes(read_html(url_dv1), "#inline-nav-1 .ecl-paragraph")), collapse = "")

For this website, though, the code doesn't seem to work. I get: Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')".

Why does this happen, and how can I fix it?

Thanks a lot!


How to make each row a new set of variables and rename them dynamically in r

First, I want to convert this data:

datinput = read.table(header = TRUE, text = "
var1 var2 var3
A 3 10
B 2 6
")

datinput 
  var1 var2 var3
1    A    3   10
2    B    2    6

into this format:

datoutput = read.table(header = TRUE, text = "
var2.A var3.A var2.B var3.B
3 10 2 6
")

  var2.A var3.A var2.B var3.B
1      3     10      2      6

I tried reshape2::dcast, but it does not deliver the desired output.

Instead, dcast gives this:

datinput%>%reshape2::dcast(var1~var2, value.var="var3")

    var1  2  3
    1    A NA 10
    2    B  6 NA

datinput%>%reshape2::dcast(var1, value.var=c("var2", "var3"))
Error in is.formula(formula) : object 'var1' not found

datinput%>%reshape2::dcast(var1~var1, value.var=c("var2", "var3"))
Error in .subset2(x, i, exact = exact) : subscript out of bounds
In addition: Warning message:
In if (!(value.var %in% names(data))) { :
  the condition has length > 1 and only the first element will be used

Then, I want to make the names_from come first in the new names.

I want the new columns to be named A.var2, B.var2, A.var3 and B.var3, so that I can then arrange the resulting variables alphabetically by name as A.var2, A.var3, B.var2, B.var3.

Thanks for any help.
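
One possible approach to the reshaping described above, sketched in base R rather than reshape2 (a swapped-in technique, not dcast). It assumes the value columns are var2 and var3 and that var1 holds the row labels; the helper names value_cols, values and new_names are made up for illustration. The label is placed first in each new name, which gives the alphabetical order asked for.

```r
datinput <- read.table(header = TRUE, text = "
var1 var2 var3
A 3 10
B 2 6
")

# Columns whose values should be spread into a single row (assumed).
value_cols <- c("var2", "var3")

# Read the values row by row: A's var2, A's var3, B's var2, B's var3.
values <- as.vector(t(as.matrix(datinput[value_cols])))

# Build the matching names with the row label first, e.g. "A.var2".
new_names <- as.vector(t(outer(as.character(datinput$var1), value_cols,
                               paste, sep = ".")))

# One-row data frame with dynamically generated column names.
datoutput <- as.data.frame(setNames(as.list(values), new_names))
datoutput
```

Because the names are generated from the data, the same lines should work for more rows or more value columns, as long as the labels in var1 are unique.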

Cleaning Data & Association Rules - R

I am trying to tidy the following dataset (linked below) in R and then run an association rules analysis on it.

https://www.kaggle.com/fanatiks/shopping-cart

install.packages("dplyr")
library(dplyr)

df <- read.csv("Groceries (2).csv", header = F, stringsAsFactors = F, na.strings=c("","","NA"))
install.packages("stringr")
library(stringr)
temp1<- (str_extract(df$V1, "[a-z]+"))
temp2<- (str_extract(df$V1, "[^a-z]+"))
df<- cbind(temp1,df)
df[2] <- NULL
df[35] <- NULL
View(df)

summary(df)
str(df)

trans <- as(df,"transactions")

I get the following warning when I run the trans <- as(df,"transactions") line above:

Warning message: Column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 not logical or factor. Applying default discretization (see '? discretizeDF').

summary(trans)

When I run the above code, I get the following:

transactions as itemMatrix in sparse format with
 1499 rows (elements/itemsets/transactions) and
 1268 columns (items) and a density of 0.01529042 

most frequent items:
  V5= vegetables   V6= vegetables temp1=vegetables   V2= vegetables 
             140              113              109              108 
  V9= vegetables          (Other) 
             103            28490 

The results show each vegetables value as a separate item (one per column) instead of a single combined vegetables item, which inflates the number of columns. I am not sure why this is happening.

fit<-apriori(trans,parameter=list(support=0.006,confidence=0.25,minlen=2))
fit<-sort(fit,by="support")
inspect(head(fit))

About the failure of replication in tidygraph

I have a question. I am using igraph and tidygraph. In igraph, the node information looks like this:

1     A young
2     B young
3     C young
4     D adult
5     E adult
6     F   old
7     G   old
8     H   old
9     I   old
10    J   old

However, when I use the same data with tidygraph, node C is labeled as adult, not young. What is wrong with my code? How can I assign nodes$carac correctly?

#https://www.r-graph-gallery.com/249-igraph-network-map-a-color.html

# library
library(igraph)
set.seed(1)
# create data:
links <- data.frame(
  source=c("A","A", "A", "A", "A","J", "B", "B", "C", "C", "D","I"),
  target=c("B","B", "C", "D", "J","A","E", "F", "G", "H", "I","I"),
  weight=(sample(1:4, 12, replace=T))

)
nodes <- data.frame(
  name=LETTERS[1:10],
  carac=c( rep("young",3),rep("adult",2), rep("old",5))
)

# Turn it into igraph object
network <- graph_from_data_frame(d=links, vertices=nodes, directed=F) 

# Make a palette of 3 colors
library(RColorBrewer)
coul  <- brewer.pal(3, "Set1") 

# Create a vector of color
my_color <- coul[as.numeric(as.factor(V(network)$carac))]

# Make the plot
plot(network, vertex.color=my_color)



library(ggraph)
library(tidygraph)
#
g<-as_tbl_graph(links, directed = FALSE)

g %>%
  mutate(degree = centrality_degree(),
         community = as.factor(V(network)$carac) )%>%
  ggraph(layout = "lgl") +
  geom_edge_link(aes(width = 1),
                 alpha = 0.8,
                 colour = "lightgray") +
  scale_edge_width(range = c(0.1, 1)) +geom_node_point(aes(colour = community, size = degree)) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph()

Split Dataframe into list of one-row Dataframes

I would like to split a dataframe

df <- data.frame(a = 1:4, b = letters[1:4])

  a b
1 1 a
2 2 b
3 3 c
4 4 d

into a list of one-row dataframes

list(
    data.frame(a = 1, b = letters[1])
    , data.frame(a = 2, b = letters[2])
    , data.frame(a = 3, b = letters[3])
    , data.frame(a = 4, b = letters[4])
)

[[1]]
  a b
1 1 a

[[2]]
  a b
1 2 b

[[3]]
  a b
1 3 c

[[4]]
  a b
1 4 d

Is there an elegant solution to this?
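
A minimal sketch of one way to do this in base R, using split() with one group per row. The name rows is made up for illustration; unname() is only there so the result prints as [[1]], [[2]], ... like the desired output.

```r
df <- data.frame(a = 1:4, b = letters[1:4])

# split() on the row index puts every row into its own group,
# returning a list of one-row data frames.
rows <- unname(split(df, seq_len(nrow(df))))

length(rows)   # one list element per row of df
rows[[2]]      # a one-row data frame holding the second row
```

Each element keeps its original row name (e.g. "2" for the second row); resetting it with rownames(x) <- NULL inside an lapply() would be an optional extra step.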

Rounding behavior of updateSliderInput in R shiny

I am trying to use the following R Shiny code so that the first slider updates the second slider. However, when updateSliderInput is called, it seems to override the round = T set in the original sliderInput. I know that, since I am dividing by 9 in the updateSliderInput call, the step size will not be an integer for some values of the first slider, but is there a way to show a rounded value in the recalculated slider so that I don't get 16 digits of precision?

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      p("The first slider controls the second"),
      sliderInput(inputId = "value", label = "The independent slider",
                  min = 1000, max = 1500, value = 1000, step = 100, round = T
      ),
      sliderInput(inputId = "value2", label = "The dependent slider",
                      min = 5, max = 500, value = 50, round = T
      )
    ),
    mainPanel()
  )
)

server <- function(input, output, session) {
  observe({
    val <- input$value

    updateSliderInput(session, "value2", value = (val * 0.3),
                      min = (val * 0.005), 
                      max = (val * 0.5), 
                      step = floor((val * 0.5) - floor(val * 0.005))/9)

  })     
}

shinyApp(ui, server)

Right now, I see this, no matter what I try:

[Image: what I see in the recalculated slider as-is]

R tmap sf Error: arguments imply differing number of rows when viewing map

I am trying to create a map of all school districts in each state. The code below works for all states, except in Florida I get this error: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 67, 121

require(dplyr)
require(sf)
library(tmap)

  temp <- tempfile()  ### create a temporary file to download zip file to
  temp2 <- tempfile() ### create a temporary file to put unzipped files in
  download.file("https://s3.amazonaws.com/data.edbuild.org/public/Processed+Data/SD+shapes/2018/shapefile_1718.zip", temp) # downloading the data into the tempfile

  unzip(zipfile = temp, exdir = temp2) # unzipping the temp file and putting unzipped data in temp2

  filename <- list.files(temp2, full.names = TRUE) # getting the filename of the downloaded data

  shp_file <- filename %>%
    subset(grepl("*.shp$", filename)) ## selecting only the .shp file to read in 

  state_shape <- sf::st_read(shp_file) %>% 
    filter(State == "Florida")

  povertyBlues <-  c('#dff3fe', '#92DCF0', '#49B4D6', '#2586a5', '#19596d')

  map <- tm_shape(shape.clean) + 
    tm_fill("StPovRate", breaks=c(0, .1, .2, .3, .4, 1), title = "Student Poverty",
            palette = povertyBlues, 
            legend.format=list(fun=function(x) paste0(formatC(x*100, digits=0, format="f"), " %"))) +
    tm_shape(shape.clean) +
    tm_borders(lwd=.25, col = "#e9e9e9", alpha = 1) +
    tm_layout(inner.margins = c(.05,.25,.1,.05)) 

  map  ### view the map

The length of the tm_shape$shp and state_shape are both 67. Does anyone know what could be causing the "arguments imply differing number of rows: 67, 121"?

Thanks!!

How can I print multiple sequences with different length.out in RStudio

Let's say I have a vector,

n <- c(1:100)

and I want to output multiple sequences/vectors for each value n takes in the vector above, I tried doing something like this:

x <- seq (0,5, length.out = n+1)
x <- x[-1]

I get the warning: Warning Message: In seq.default(0,5, length.out = n+1) First element used of length.out argument

I then want to use 'x' in a calculation:

fx <- dnorm(x, mean = 0 , sd = 4)

Where fx[1] will output the first value of fx and so on up to fx[n]

Sorry for overcomplicating :)
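
The warning arises because seq() accepts only a single length.out value. A minimal sketch of one way around it is to call seq() once per element of n with lapply(), which returns a list of vectors of different lengths. The names x_list and fx_list are made up for illustration.

```r
n <- 1:100

# One sequence per value of n: k + 1 equally spaced points on [0, 5],
# with the leading 0 dropped, so element k has exactly k values.
x_list <- lapply(n, function(k) seq(0, 5, length.out = k + 1)[-1])

# Apply dnorm() to every sequence; fx_list[[k]] matches x_list[[k]].
fx_list <- lapply(x_list, dnorm, mean = 0, sd = 4)

x_list[[4]]    # the sequence for n = 4
fx_list[[4]]   # the corresponding densities
```

A list is used because the sequences differ in length, so they cannot be stored as columns of a single matrix without padding.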


Remove axis labels from a plot in R after the plot has already been created

I am using the plot function of a particular package, namely the SPEI library. This function does not appear to accept any parameters to change the way the plot looks when it is generated.

I would like to know how to remove axis values, add new ones, and (ideally) rename the x-axis after the plot has already been created.

Please note that I have seen the other similar topics (e.g: Remove plot axis values) and they are not applicable to my situation. I know that when calling the base plot functions in R, you can set xaxt = "n", axes= FALSE, etc.

Here is a quick version of what I mean:

library(SPEI)
data(wichita)
x <- spei(wichita[,'BAL'], 1) 
plot.spei(x, main = "Here's a plot")
plot.spei(x, main = "Also a plot", xaxt = "n") #Note that xaxt does not affect output

Function for adjusted Rsquare on test data?

In R/RStudio, I know there are RMSE and R2 functions that I can use to calculate RMSE and R-squared on test data. Is there a similar function for adjusted R-squared on test data?

Using a .R file as a resource and create a batch directly in C#

I have a number of shiny applications with the file structure global.R, ui.R, server.R and something I call batchTrigger.R. The contents of the latter is simply the following-

.libPaths(*Path to my R Package Repository*)
require('shiny')
runApp(*Path to the folder with the aforementioned files*)

I created a batch file called application.cmd with the following code-

cls
@pushd ""
:::::::::::::::::::
@echo off
ECHO Loading...Please, wait. The Application will open automatically. 
ECHO --- 
ECHO Do not close this console window for the whole duration of your session 
ECHO in the application.
ECHO ---
@echo off

"C:\Program Files\R\bin\Rscript.exe" ".../**batchTrigger.R**"

:::::::::::::::::::
@popd
cmd /k

This batch file is working just fine. Then I went one step further, and decided to create a windows form with multiple R Applications. I have two buttons in the form, each of which goes something like this-

 private void application1_click(object sender, EventArgs e)
    {
        System.Diagnostics.Process cmd = new Process();
        cmd.StartInfo.UseShellExecute = false;
        cmd.StartInfo.FileName = "...\\**application1.cmd**";
        cmd.StartInfo.Arguments = "/K";
        cmd.StartInfo.CreateNoWindow = false;
        cmd.StartInfo.RedirectStandardInput = true;
        cmd.Start();
    }

So far, so good. Both buttons work exactly as intended. I want to go one step further, but since I am very new to C#, I need help. What I am hoping for is a dynamic location for the R files and the cmd files within the deployed application, inside the solution. In other words, I should be able to write the contents of the batch file in the C# code, and the path of batchTrigger.R should change with the location of the Windows Forms application (which will be a self-contained deployed executable). The idea is that the R package repository and the R installation may remain static and can be pointed at by batchTrigger.R and application.cmd respectively, but batchTrigger.R itself, along with the other R files, moves with the application. I think resource.resx could help with this, but I don't understand exactly how to go about it. Any suggestions would be highly appreciated.

matrix by vector multiplication - R versus Matlab

I observed that matrix by vector multiplication with a large matrix is significantly slower in R (version 3.6.1) than in Matlab (version 2019b), while both languages rely on the (same?) libblas library. See below a minimal example:

In Matlab:

n=900; 
p=900; 
A=reshape(1:(n*p),[n,p]); 
x=ones(p,1); 
tic()
for id = 1:1000
  x = A*x; 
end
toc()

In R:

n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
  x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)

I get an execution time of roughly 0.05 sec in Matlab versus 3.5 sec in R on the same computer. Any idea what causes such a difference?

Thanks

Turning numbers into one piece of string vector in R

When I use as.character(1:5), my current output is: "1" "2" "3" "4" "5". Is it possible to instead get the following desired output when we input any numeric vector such as 1:5?

Desired output:

"1 2 3 4 5"
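
A minimal sketch of one way to get this in base R: paste() with the collapse argument joins all elements of a vector into a single string. The wrapper name to_one_string is made up for illustration.

```r
# Collapse any vector into a single space-separated string.
to_one_string <- function(v) paste(v, collapse = " ")

to_one_string(1:5)
# [1] "1 2 3 4 5"
```

paste() converts its input to character itself, so a separate as.character() call is not needed.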

Plot points from one df, plot errorbar from another

Raw data looks like:

Restaurant     Question               rating

McDonalds      How was the food?      5       
McDonalds      How were the drinks?   3     
McDonalds      How were the workers?  2     
Burger_King    How was the food?      1       
Burger_King    How were the drinks?   3       
Burger_King    How were the workers?  4      

Averages look like:

restaurant    average_rating    error
McDonalds     3.13              0.7
Burger_King   2.37              0.56

How do I make a plot of points with the raw data, then plot the error bars on top of it?

tribbles for convenience:

tribble(
  ~restaurant, ~question,  ~rating,
  "McDonalds", "How was the food?", 5,
  "McDonalds", "How were the drinks?", 3,
  "McDonalds", "How were the workers?", 2,
  "Burger_King", "How was the food?", 1,
  "Burger_King", "How were the drinks?", 3,
  "Burger_King", "How were the workers?", 4
)
tribble(
  ~restaurant, ~average_rating, ~error,
  "McDonalds", 3.13, 0.7,
  "Burger_King", 2.37, 0.56
)

dplyr separate with regex

I have the following dataframe. It has 1 column of text that I would like to separate into multiple columns using the separate function (from tidyr, used alongside dplyr).

df <- structure(list(CPT.Codes = structure(c(2L, 1L, 3L, 4L, 5L), .Label = c("28296 - CORRECTION OF BUNION...., 64445P - N BLOCK INJ, SCIATIC, SNG, 76942P - US GUIDE, NEEDLE PLACEMENT", 
"36821 - AV FUSION DIRECT ANY SITE, 99100P - ANESTHESIA FOR PT OF EXTREME AGE", 
"41899 - DENTAL SURGERY PROCEDURE", "50593 - PERC CRYO ABLATE RENAL TUM, 99100P - ANESTHESIA FOR PT OF EXTREME AGE", 
"64721 - CARPAL TUNNEL SURGERY"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

My desired output is the dataframe below. Each 5-digit number or 5-digit number + letter represents the code, and the text following the dash is the code's description. Sometimes the code's description has single-digit numbers and multiple commas, so the regular expression will need to recognize the 5-digit number as a new code.

dfDesired <- structure(list(CPTcode1 = c(36821L, 28296L, 41899L, 50593L, 64721L
), CPTdescrip1 = structure(c(1L, 3L, 4L, 5L, 2L), .Label = c("AV FUSION DIRECT ANY SITE", 
"CARPAL TUNNEL SURGERY", "CORRECTION OF BUNION....", "DENTAL SURGERY PROCEDURE", 
"PERC CRYO ABLATE RENAL TUM"), class = "factor"), CPTcode2 = structure(c(2L, 
1L, NA, 2L, NA), .Label = c("64445P", "99100P"), class = "factor"), 
    CPTdescrip2 = structure(c(1L, 2L, NA, 1L, NA), .Label = c("ANESTHESIA FOR PT OF EXTREME AGE", 
    "N BLOCK INJ"), class = "factor"), CPTcode3 = structure(c(NA, 
    1L, NA, NA, NA), .Label = "76942P", class = "factor"), CPTdescrip3 = structure(c(NA, 
    1L, NA, NA, NA), .Label = "US GUIDE NEEDLE PLACEMENT", class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

I have tried variations of the code below. It is wrong. I am new to regular expressions, and cannot figure this out with existing examples.

 df %>%
  separate(CPT.Codes, 
           into = c("CPTcode1", "CPTdescrip1", "CPTcode2", "CPTdescrip2", "CPTcode3", "CPTdescrip3"),
           sep = "(?<=[A-Z]) ?(?=[0-9])", remove = F) %>% 
  glimpse

Thanks in advance.


overlay discrete and continuous layer in ggplot - surprised that layer order matters

Consider the following example dataset:

library(dplyr)
library(ggplot2)

d = mtcars %>% 
 as_tibble(rownames = "name") %>% 
 mutate(wt.cat = cut(wt, seq(1.5, 5.5, by = 1))) %>%
 group_by(wt.cat) %>%
 summarize(
   Mean = mean(mpg),
   Min = min(mpg),
   Max = max(mpg)
 )

Say I want to plot points for the "mean" value of each category in wt.cat and a ribbon showing the range. This works:

ggplot(d, aes(x = wt.cat)) + 
  geom_point(aes(y= Mean)) +
  geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") 

[Figure: example 1]

But the points are masked by the ribbon. However, if I change the order of the layers so that the points are plotted on top of the ribbon, I get an error:

ggplot(d, aes(x = wt.cat)) + 
  geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
  geom_point(aes(y= Mean))
## Error: Discrete value supplied to continuous scale

So even though I'm specifying the discrete axis as the "default" aesthetic, it gets overridden by the specification of the first plotted layer. The only way I can find around this is to plot a dummy point layer first:

ggplot(d, aes(x = wt.cat)) + 
  geom_point(aes(y= Mean), shape = NA) +
  geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
  geom_point(aes(y= Mean))
## Warning message:
## Removed 4 rows containing missing values (geom_point). 

[Figure: example 2]

Is there a more "proper" or correct way of combining discrete and continuous layers? Is there a solution that doesn't require creating a dummy layer?

How to get rid of anomalies using lapply in R

I have a list of data frames and I am trying to use lapply to get rid of anomalies in my data, trying to make the code as robust as possible as the data inputted will be constantly different.

I am trying to use:

newdata <- lapply(ChaseSubSet, function(){
  anomalies <- 0.02 > ChaseSubSet[,1] > 0.03
  anomalies = na
})

However a) this doesn't work and b) I'm thinking it would be more robust to get rid of values more than 0.1 away from the mean. I would have to apply different rules to each column of the data but have it apply through all the data.frames in the list. I want to use lapply to result in a list at the end.

I am still very new so I apologise if any of this is incorrect.

Thanks, Tash
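
A rough sketch of the second idea (values more than 0.1 away from the column mean become NA), applied to every data frame in a list with lapply(). The sample ChaseSubSet, the helper name remove_anomalies and the tol argument are all made up for illustration; the real data and thresholds will differ.

```r
# Hypothetical stand-in for ChaseSubSet: a list of numeric data frames.
ChaseSubSet <- list(
  data.frame(a = c(0.02, 0.03, 0.025, 0.02, 0.03, 0.40),
             b = c(1.00, 1.02, 0.98, 1.01, 0.99, 1.00)),
  data.frame(a = c(0.10, 0.12, 0.11),
             b = c(2.00, 2.01, 2.02))
)

# Replace values further than `tol` from their column mean with NA.
# `tol` may be one number or one value per column (a different rule each).
remove_anomalies <- function(df, tol = 0.1) {
  tol <- rep_len(tol, ncol(df))
  df[] <- Map(function(col, t) {
    if (is.numeric(col)) {
      col[which(abs(col - mean(col, na.rm = TRUE)) > t)] <- NA
    }
    col
  }, df, tol)
  df
}

# lapply() returns a list with one cleaned data frame per input.
newdata <- lapply(ChaseSubSet, remove_anomalies)
newdata[[1]]
```

Note that the mean is itself pulled towards extreme values, so a median-based rule may be more robust; that choice is left open here. Per-column tolerances can be passed as, for example, lapply(ChaseSubSet, remove_anomalies, tol = c(0.1, 0.5)).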

R: how can I split one row of a time period into multiple rows based on day and time

Hi, I am trying to split rows in an Excel file based on day and time. The data is from a study in which participants wear a tracking watch. Each row of the data set starts when a participant puts on the watch (variable: 'Wear Time Start') and ends when they take it off (variable: 'Wear Time End'). I need to calculate how many hours each participant wore the device on each day (NOT for each time period in one row).

Data set before split:

   ID          WearStart                WearEnd
1  01           2018-05-14 09:00:00      2018-05-14 20:00:00
2  01           2018-05-14 21:30:00      2018-05-15 02:00:00
3  01           2018-05-15 07:00:00      2018-05-16 22:30:00
4  01           2018-05-16 23:00:00      2018-05-16 23:40:00
5  01           2018-05-17 01:00:00      2018-05-19 15:00:00
6  02           ...

Some explanation about the data set before split: the data type of 'WearStart' and 'WearEnd' are POSIXlt.

Desired output after split:

  ID         WearStart                WearEnd                Interval
1 01         2018-05-14 09:00:00      2018-05-14 20:00:00    11
2 01         2018-05-14 21:30:00      2018-05-15 00:00:00    2.5
3 01         2018-05-15 00:00:00      2018-05-15 02:00:00    2
4 01         2018-05-15 07:00:00      2018-05-16 00:00:00    17
5 01         2018-05-16 00:00:00      2018-05-16 22:30:00    22.5
6 01         2018-05-16 23:00:00      2018-05-16 23:40:00    0.67
7 01         2018-05-17 01:00:00      2018-05-18 00:00:00    23
8 01         2018-05-18 00:00:00      2018-05-19 00:00:00    24
9 01         2018-05-19 00:00:00      2018-05-19 15:00:00    15

Then I need to accumulate hours based on day:

  ID         Wear_Day        Total_Hours
1 01         2018-05-14      13.5
2 01         2018-05-15      19
3 01         2018-05-16      23.17
4 01         2018-05-17      23
5 01         2018-05-18      24
6 01         2018-05-19      15

I have been stuck on this split step for hours. Any help would be greatly appreciated.
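
A rough base R sketch of the split-then-accumulate idea, under some assumptions: the times are POSIXct in UTC (POSIXlt columns would need as.POSIXct() first), days are cut at midnight UTC, and the sample rows copy the five rows shown above. The helper name split_at_midnight is made up for illustration. Intervals are in hours, so 40 minutes counts as 40/60 of an hour.

```r
# Hypothetical sample copying the question's first five rows (UTC assumed).
wear <- data.frame(
  ID = "01",
  WearStart = as.POSIXct(c("2018-05-14 09:00:00", "2018-05-14 21:30:00",
                           "2018-05-15 07:00:00", "2018-05-16 23:00:00",
                           "2018-05-17 01:00:00"), tz = "UTC"),
  WearEnd = as.POSIXct(c("2018-05-14 20:00:00", "2018-05-15 02:00:00",
                         "2018-05-16 22:30:00", "2018-05-16 23:40:00",
                         "2018-05-19 15:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

day_secs <- 86400  # seconds per day

# Split one wear period at every midnight that falls inside it.
split_at_midnight <- function(id, start, end) {
  s <- as.numeric(start)  # seconds since 1970-01-01 UTC
  e <- as.numeric(end)
  first_midnight <- (floor(s / day_secs) + 1) * day_secs
  midnights <- if (first_midnight < e) {
    seq(first_midnight, e, by = day_secs)
  } else {
    numeric(0)
  }
  breaks <- unique(c(s, midnights, e))
  data.frame(
    ID = id,
    WearStart = as.POSIXct(head(breaks, -1), origin = "1970-01-01", tz = "UTC"),
    WearEnd = as.POSIXct(tail(breaks, -1), origin = "1970-01-01", tz = "UTC"),
    Interval = diff(breaks) / 3600,  # hours
    stringsAsFactors = FALSE
  )
}

# Split every row, then stack the pieces into one data frame.
pieces <- do.call(rbind, mapply(split_at_midnight,
                                wear$ID, wear$WearStart, wear$WearEnd,
                                SIMPLIFY = FALSE, USE.NAMES = FALSE))

# Accumulate the hours per participant and day.
pieces$Wear_Day <- as.Date(pieces$WearStart, tz = "UTC")
daily <- aggregate(Interval ~ ID + Wear_Day, data = pieces, FUN = sum)
names(daily)[names(daily) == "Interval"] <- "Total_Hours"
daily
```

Working in seconds since the epoch keeps the midnight arithmetic simple, but it ignores local time zones and daylight saving; if the study days should follow local time, the midnight calculation would need to be adapted.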

Scraping with Rvest and Glue packages in R specifically

I'm trying to scrape multiple pages of sports data using rvest and glue packages. I'm having trouble with the nesting and I think it's because the table from the website has a two line header (some headers are one line some are two). Here's the code I have started with. I checked to make sure the site allowed scraping with python and all good there.

library(tidyverse) 
library(rvest) # interacting with html and webcontent
library(glue)

webpage: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1

Function to scrape a selected week 1:17 and position 1:4:

salary_scrape_19 <- function(week, position) {

Sys.sleep(3)  

cat(".")

url <- glue("https://fantasy.nfl.com/research/scoringleaders?position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
read_html(url) %>% 
    html_nodes("table") %>% 
    html_table() %>%
    purrr::flatten_df()
    #set_names(need to clean headers before I can set this)
}

scraped_df <- scaffold %>% 
mutate(data = map2(week, position, ~salary_scrape_19(.x, .y))) 

scraped_df

Ultimately, I want to build a scrape function to get all the positions with the same columns, which are QB, RB, WR, and TE, for all weeks in 2019. (I want to add a third variable to glue, {year}, eventually, but I need to get this working first.)

Again, I think the issue has to do with the wonky headers of the table on the site as some are one row and other headings are two rows.

How to change Gage R&R plot formatting in R using qualityTools package

I'm doing Gage R&R calculations in R, using the quality tools package. The "plot(gdo)" statement outputs the 6 typical GR&R plots. My issue is that the default formatting makes most of the plots difficult to read and interpret. Does anyone have any experience with editing the gageR&R source code in this package to make the plots look better? I believe I can do the actual edits to the plot formatting, but I'm not sure how to find the correct source file to do the edits in.
