Channel: Active questions tagged r - Stack Overflow

Shiny: Background picture for ggplot

I produced a ggplot with a PNG as background. Locally this works without a problem, but inside a Shiny app the plot is not rendered.

Server.R

library(shiny)
library(shinydashboard)
library(shinyWidgets)
library(dashboardthemes)
library(DT)
library(png)
library(rasterImage)
library(ggpubr)
library(plotly)

# Define server logic required to draw a histogram
shinyServer(function(input, output, session) {

    output$ShootPosition <- renderPlotly({

        data <- data.frame(x = rnorm(100),
                           y = rnorm(100))
        ggplot(data, aes(x, y), tooltip = TRUE) +
               background_image(readPNG("test.png")) +
               geom_point()
      }) 

    }
)

ui.R

library(shiny)
library(shinydashboard)
library(shinyWidgets)
library(dashboardthemes)
library(DT)
library(png)
library(rasterImage)
library(ggpubr)
library(plotly)

header <- dashboardHeader(
    title = "Test"
)

sidebar <- dashboardSidebar(
    sidebarMenu(
        menuItem("Tables", icon = icon("table"), 
                 menuSubItem("Players", tabName = "Tables_Players")
        )
    )
)

body <- dashboardBody(
    tabItems(
        tabItem("Tables_Players",
                fluidPage(
                    titlePanel("Charts Players"),
                    fluidRow(
                        plotlyOutput("ShootPosition", height = '800px')
                    )
                )
        )
    )
)


ui = dashboardPage(
    header,
    sidebar,
    body)

The goal is to produce a shot map for ice hockey.


Combine two datasets, based on POSIXct values

I am struggling with combining two datasets with each other.

Dataset1 contains a "before time", an "after time", and a "channel".

Dataset2 contains just one "time" and a "channel" column as well.

I want to add a binary column (Yes/No) to Dataset1, with this logic: if there is a row in Dataset2 where the channel matches and the time falls between the "before" and "after" times, I want "Yes". Otherwise "No".

Data1

ID   Channel   before_time   after_time 
1       A1  2019-09-02 20:13:00 2019-09-02 20:33:00
2       B1  2019-09-02 20:03:00 2019-09-02 20:23:00
3       C1  2019-09-02 20:23:00 2019-09-02 20:43:00
4       D1  2019-09-02 20:23:00 2019-09-02 20:43:00

Data2

ID_B     Channel_B    Time_B
Hallo       A1        2019-09-02 20:23:00
Hi          B2        2019-09-02 20:05:00
Hoi         C1        2019-09-02 22:23:00

Desired Output

ID   Channel   before_time   after_time                     Available
1       A1  2019-09-02 20:13:00 2019-09-02 20:33:00         Yes  # Channel == Channel, Time between before & after
2       B1  2019-09-02 20:03:00 2019-09-02 20:23:00          No  # Channel != Channel
3       C1  2019-09-02 20:23:00 2019-09-02 20:43:00          No  # Time is not between before & after
4       D1  2019-09-02 20:23:00 2019-09-02 20:43:00          No  # There is no matching data where channel is D1

Desired Output 2 (comments Solutions)

Adding extra columns from the second data set (Data2).

ID   Channel   before_time   after_time                     Available   ID_B     
1       A1  2019-09-02 20:13:00 2019-09-02 20:33:00          Yes        Hallo       
2       B1  2019-09-02 20:03:00 2019-09-02 20:23:00          No         x 
3       C1  2019-09-02 20:23:00 2019-09-02 20:43:00          No         x
4       D1  2019-09-02 20:23:00 2019-09-02 20:43:00          No         x

Reproducible example (the data):

ID <- c("1", "2", "3", "4")
channel <- c("A1", "B1", "C1", "D1")
#startdate <- as.POSIXct(c("2019-09-02 20:23:00", "2019-09-02 20:13:00", "2019-09-02 20:33:00", "2019-09-02 20:33:00"))
before_time <- as.POSIXct(c("2019-09-02 20:13:00", "2019-09-02 20:03:00", "2019-09-02 20:23:00", "2019-09-02 20:23:00"))
after_time  <- as.POSIXct(c("2019-09-02 20:33:00", "2019-09-02 20:23:00", "2019-09-02 20:43:00","2019-09-02 20:43:00"))
data1 <- data.frame(ID, channel,   before_time, after_time)
View(data1)


ID_B <- c("Hallo", "Hi", "Hoi")
channel_B <- c("A1", "B2", "C1")
Time_B <- as.POSIXct(c("2019-09-02 20:23:00", "2019-09-02 20:05:00", "2019-09-02 22:23:00"))
data2 <- data.frame(ID_B, channel_B, Time_B)
View(data2)
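
A minimal base R sketch of the check described above, not a definitive solution. It rebuilds the two example data frames (the explicit `tz = "UTC"` and `stringsAsFactors = FALSE` are assumptions added so the snippet behaves the same everywhere) and, for each row of data1, tests whether any row of data2 has the same channel and a time inside the interval. Interval endpoints are treated as inclusive, which is also an assumption.

```r
data1 <- data.frame(
  ID = c("1", "2", "3", "4"),
  channel = c("A1", "B1", "C1", "D1"),
  before_time = as.POSIXct(c("2019-09-02 20:13:00", "2019-09-02 20:03:00",
                             "2019-09-02 20:23:00", "2019-09-02 20:23:00"), tz = "UTC"),
  after_time = as.POSIXct(c("2019-09-02 20:33:00", "2019-09-02 20:23:00",
                            "2019-09-02 20:43:00", "2019-09-02 20:43:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ID_B = c("Hallo", "Hi", "Hoi"),
  channel_B = c("A1", "B2", "C1"),
  Time_B = as.POSIXct(c("2019-09-02 20:23:00", "2019-09-02 20:05:00",
                        "2019-09-02 22:23:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# For every row of data1, look for a data2 row with the same channel
# and a time between before_time and after_time (inclusive).
data1$Available <- vapply(seq_len(nrow(data1)), function(i) {
  hit <- data2$channel_B == data1$channel[i] &
    data2$Time_B >= data1$before_time[i] &
    data2$Time_B <= data1$after_time[i]
  if (any(hit)) "Yes" else "No"
}, character(1))

print(data1)
```

A row-wise loop like this is easy to read but scales with the product of both row counts; for large tables a non-equi join would be the usual alternative.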

R equivalent to the SAS "BY" statement in PRINCOMP Procedure

I am using R's princomp for PCA. However, I have a dataset with a factor variable, and I would like to run princomp separately for each level of that factor.

This can be done in SAS with the "BY" statement that "performs BY group processing, which enables you to obtain separate analyses on grouped observations" (from https://support.sas.com/rnd/app/stat/procedures/princomp.html)

Can this be done by princomp in R or do I have to split my data into several datasets and run princomp on each?

All the best,
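
One way to sketch BY-group processing in base R is to split the numeric columns by the factor and apply princomp to each piece. The example below uses the built-in iris data as a stand-in (an assumption; the actual dataset is not shown in the question), with Species playing the role of the BY variable.

```r
# Split the four numeric columns by the grouping factor,
# then fit one PCA per group.
pca_by_species <- lapply(split(iris[, 1:4], iris$Species), princomp)

# Each list element is an ordinary princomp object.
names(pca_by_species)
summary(pca_by_species[["setosa"]])
```

The data are still split internally, but only inside one expression, so no separate datasets need to be created by hand.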

Change count to percentage on faceted, filled geom_bar()/stat_count() plot in ggplot2 R

I have this dataset from a survey:

                         Var1                 by variable value
1           Strongly disagree  Cluster 1 (n = 9)        A     0
2           Strongly disagree Cluster 2 (n = 15)        A     0
3           Somewhat disagree  Cluster 1 (n = 9)        A     0
4           Somewhat disagree Cluster 2 (n = 15)        A     0
5  Neither agree nor disagree  Cluster 1 (n = 9)        A     2
6  Neither agree nor disagree Cluster 2 (n = 15)        A     0
7              Somewhat agree  Cluster 1 (n = 9)        A     1
8              Somewhat agree Cluster 2 (n = 15)        A     0
9              Strongly agree  Cluster 1 (n = 9)        A     6
10             Strongly agree Cluster 2 (n = 15)        A    15
11          Strongly disagree  Cluster 1 (n = 9)        B     1
12          Strongly disagree Cluster 2 (n = 15)        B     0
13          Somewhat disagree  Cluster 1 (n = 9)        B     0
14          Somewhat disagree Cluster 2 (n = 15)        B     0
15 Neither agree nor disagree  Cluster 1 (n = 9)        B     1
16 Neither agree nor disagree Cluster 2 (n = 15)        B     0
17             Somewhat agree  Cluster 1 (n = 9)        B     4
18             Somewhat agree Cluster 2 (n = 15)        B     1
19             Strongly agree  Cluster 1 (n = 9)        B     3
20             Strongly agree Cluster 2 (n = 15)        B    14
21          Strongly disagree  Cluster 1 (n = 9)        C     0
22          Strongly disagree Cluster 2 (n = 15)        C     0
23          Somewhat disagree  Cluster 1 (n = 9)        C     0
24          Somewhat disagree Cluster 2 (n = 15)        C     0
25 Neither agree nor disagree  Cluster 1 (n = 9)        C     3
26 Neither agree nor disagree Cluster 2 (n = 15)        C     0
27             Somewhat agree  Cluster 1 (n = 9)        C     1
28             Somewhat agree Cluster 2 (n = 15)        C     3
29             Strongly agree  Cluster 1 (n = 9)        C     5
30             Strongly agree Cluster 2 (n = 15)        C    12

I originally plotted it like so using ggplot2 to display the count of responses:

( p5 <- ggplot(q5, aes(x = Var1, y = value, fill = variable)) +
    geom_bar(stat = "identity", width = 0.5, position=position_dodge2(reverse = TRUE)) +
    coord_flip() +
    theme(plot.title = element_text(size = 16), axis.text.x = element_text(size = 16),
    axis.title.x = element_text(size = 16),      
    axis.title.y = element_text(size = 16),
    axis.text.y = element_text(size = 16),
    legend.text=element_text(size=16),
    legend.title=element_text(size=16),
    strip.text.x = element_text(size = 16)) +
    ylim(0,20) +
    scale_x_discrete(limits=c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree")) +
    labs(x = "", y = "# of Responses", fill = "Question") +
    facet_grid(. ~ by) )

which gave me this:

enter image description here

However, I want to display the data as a percentage rather than count.

Following this post, I changed the code accordingly to:

( p5 <- ggplot(q5, aes(x = Var1, group = by, fill = variable)) +
    stat_count(mapping = aes(y = ..prop..)) +
    coord_flip() +
    theme(plot.title = element_text(size = 16), axis.text.x = element_text(size = 16),
    axis.title.x = element_text(size = 16),      
    axis.title.y = element_text(size = 16),
    axis.text.y = element_text(size = 16),
    legend.text=element_text(size=16),
    legend.title=element_text(size=16),
    strip.text.x = element_text(size = 16)) +
    scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 5L)) +
    scale_x_discrete(limits=c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree")) +
    labs(x = "", y = "% of Responses", fill = "Question") +
    facet_grid(. ~ by) )

However, this gives me this plot:

enter image description here

It seems like the plot is not recognizing my fill argument or the ..prop.. argument for y.

How can I fix this?
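
Because the data already hold one count per answer, cluster and question, one possible route is to compute the share before plotting rather than asking stat_count() for it. The base R sketch below uses only the question A rows of the table above; the column name pct is hypothetical.

```r
q5 <- data.frame(
  Var1 = rep(c("Strongly disagree", "Somewhat disagree",
               "Neither agree nor disagree", "Somewhat agree",
               "Strongly agree"), times = 2),
  by = rep(c("Cluster 1 (n = 9)", "Cluster 2 (n = 15)"), each = 5),
  variable = "A",
  value = c(0, 0, 2, 1, 6, 0, 0, 0, 0, 15),
  stringsAsFactors = FALSE
)

# Share of responses within each cluster/question combination.
q5$pct <- ave(q5$value, q5$by, q5$variable, FUN = function(v) v / sum(v))

print(q5)
```

The precomputed pct column could then be mapped to y in the original geom_bar(stat = "identity") call, keeping fill = variable and the dodge position unchanged.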

How to identify values before and after a sequence of NAs

I have a data set of CO2 measurements taken from an instrument in the lab. Standards were also run sporadically throughout the data collection process. A mock dataset would look like this:

tibble(co2=c(464,345,389,831,374,323,486,542,429,624,359,612,738,720,520,454,499,616,952,805,582, 646,566,781,745,615,639,750,780,1119,584,1345,1020,1038,1419,1136),
number.stds=c(3,rep(NA,13),2,rep(NA,20),3),
std.value.1=c(618,rep(NA,13),534,rep(NA,20),546),
std.value.2=c(621,rep(NA,13),564,rep(NA,20),549),
std.value.3=c(625,rep(NA,34),553)) -> data

Column co2 are the measured data, number.stds is the number of standard measurements taken, and std.value.1 through std.value.3 are the different standard readings.

I want to generate a new column std.value that is the average of all standard values of adjacent standard runs, and assigned to all the samples measured in between these two standard runs.

For the example, this new column would have the value of 592.4 (mean(c(618,621,625,534,564))) for rows 1 through 15, inclusive. And it would have the value of 549.2 (mean(c(546,549,553,534,564))) for all rows from 16 to 36, inclusive.

Is there a simple way to do this with dplyr? Should the data be collected and organized in a different format to make this problem easier?
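
A base R sketch of the bracketing logic, under two assumptions: missing values are stored as real NA rather than the string 'NA', and a row that holds a standard run belongs to the interval that ends there (as in the expected values above). findInterval() with left.open = TRUE needs R 3.3.0 or later.

```r
data <- data.frame(
  co2 = c(464, 345, 389, 831, 374, 323, 486, 542, 429, 624, 359, 612, 738,
          720, 520, 454, 499, 616, 952, 805, 582, 646, 566, 781, 745, 615,
          639, 750, 780, 1119, 584, 1345, 1020, 1038, 1419, 1136),
  number.stds = c(3, rep(NA, 13), 2, rep(NA, 20), 3),
  std.value.1 = c(618, rep(NA, 13), 534, rep(NA, 20), 546),
  std.value.2 = c(621, rep(NA, 13), 564, rep(NA, 20), 549),
  std.value.3 = c(625, rep(NA, 34), 553)
)
std_cols <- c("std.value.1", "std.value.2", "std.value.3")

# Rows where a standard run was recorded.
std_rows <- which(!is.na(data$number.stds))

# Interval k covers the rows after std_rows[k] up to and including
# std_rows[k + 1]; the very first row is assigned to interval 1.
interval <- pmax(findInterval(seq_len(nrow(data)), std_rows, left.open = TRUE), 1)

# Mean of all standard readings from the two runs bracketing each interval.
interval_mean <- vapply(seq_len(length(std_rows) - 1), function(k) {
  mean(unlist(data[std_rows[c(k, k + 1)], std_cols]), na.rm = TRUE)
}, numeric(1))

data$std.value <- interval_mean[interval]
```

Storing the standards in a separate long table (run, reading) would likely make this simpler, since the wide std.value.* columns are mostly empty.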

Counting continuous sequences of months

If I have a vector of year and month coded like this:

ym <- c(
  201401,
  201403:201412,
  201501:201502,
  201505:201510,
  201403
)

And I'd like to end up with a vector that looks like this:

 [1]  1  1  2  3  4  5  6  7  8  9 10  1  2  1  2  3  4  5  6  1

That is, I want to count continuous sequences of month records. Can anyone recommend an approach? I've been spinning my wheels with something like this:

ym_date <- as.Date(paste0(ym, "01"), format = "%Y%m%d")

diff(ym_date)

but haven't been able to get any farther because I'm not sure how to flag that start of a sequence when we are dealing with months. Any base R, tidyverse, data.frame centric or not solution would be welcomed.
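
A possible base R sketch: turn each code into a running month index, start a new run wherever the step is not exactly one month, and number the positions within each run. The variable names are hypothetical. Note that a true month index treats 201412 to 201501 as consecutive, so the long run counts to 12; the desired vector shown above restarts at 201501, which is what you get by differencing the raw codes instead. Both variants are included.

```r
ym <- c(201401, 201403:201412, 201501:201502, 201505:201510, 201403)

# Months since year 0, so December -> January is a step of 1.
month_index <- (ym %/% 100) * 12 + ym %% 100

# A new run starts at the first element and after every gap.
run_id <- cumsum(c(TRUE, diff(month_index) != 1))
run_pos <- ave(month_index, run_id, FUN = seq_along)

# Variant on the raw codes: also restarts at every year boundary.
run_pos_raw <- ave(ym, cumsum(c(TRUE, diff(ym) != 1)), FUN = seq_along)

run_pos
run_pos_raw
```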

Using table() function from base with dplyr pipe-syntax?

I enjoy the syntax of dplyr, but I'm struggling with easily obtaining a contingency table in the same way that I can get with the base R table() function. table() is OK, but I can't figure out how to incorporate it into the dplyr pipe syntax.

Thank you for your help.

Here is some example data that has the output I'm trying to get to.

df <- tibble(id=c(rep("A",100),rep("B",100),rep("C",100)),
               val=c(rnorm(300,mean=500,sd=100))) %>%
  mutate(val_bin=cut(val,breaks=5))

table(df$id,df$val_bin)

Output:

    (210,325] (325,440] (440,554] (554,669] (669,784]
  A         4        22        55        18         1
  B         6        19        46        24         5
  C         3        23        44        22         8
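
One way this might be written is to wrap table() in with(), which evaluates its expression inside the piped data frame. The sketch below uses the native pipe and therefore assumes R 4.1 or later; the same with() call should also work after dplyr's %>%. The seed and the use of a plain data.frame are assumptions to keep the snippet free of package dependencies.

```r
set.seed(1)  # hypothetical seed, only for reproducibility
df <- data.frame(
  id = rep(c("A", "B", "C"), each = 100),
  val = rnorm(300, mean = 500, sd = 100)
)
df$val_bin <- cut(df$val, breaks = 5)

# with() looks up id and val_bin inside the data frame it receives.
tab <- df |> with(table(id, val_bin))
tab
```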

How can I correctly flag the end of the first sequence of a group?

This is an example of the type of dataframe that I'm using and the desired column output.

reprEx <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
             stage1 = c("open","open","open","approved","approved","open","declined","open","open","open","declined","open","approved","declined"))

Desireddf <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
             stage1 = c("open","open","open","approved","approved","open","declined","open","open","open","declined","open","approved","declined"),
             desiredResult = c(0,0,1,1,0,0,1,0,0,1,1,1,1,1))

I am trying to use dplyr to correctly flag whenever a stage changes within a grouped id. The approved and declined flags are easy because I only have to flag the first case of a declined or approved appearing with:

    reprExWrong <- reprEx %>% group_by(id,stage1) %>%
  mutate(desiredResult = ifelse(stage1 == last(stage1) & stage1 == "open",1,
                                ifelse(stage1 == first(stage1) & stage1 %in% c("approved","declined"),1,0)
                                )
  )

The issue is with the open stage. I only want to flag the point where the first sequence of opens ends within each id group. With the code that I have now, it chooses the last open within the group, even if it wasn't part of the first sequence of opens. For example:

reprExWrong <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
             stage1 = c("open","open","open","approved","approved","open","declined","open","open","open","declined","open","approved","declined"),
             notdesiredResult = c(0,0,0,1,0,1,1,0,0,1,1,1,1,1))

In this case, within id 1, I need the flag to appear at the last open before approved appears, not at the open after approved. I only need the flag on the last open of a run if that run is the first run of opens within the id. Sorry for any confusion; I would be happy to elaborate further. This is just to correctly identify stage transitions for recording purposes.
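
A base R sketch of the rule as described: within each id, flag the last row of the first run of "open", plus the first "approved" and the first "declined". The helper name flag_first_transitions is hypothetical, and stage1 is assumed to be a character column.

```r
reprEx <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  stage1 = c("open", "open", "open", "approved", "approved", "open", "declined",
             "open", "open", "open", "declined", "open", "approved", "declined"),
  stringsAsFactors = FALSE
)

flag_first_transitions <- function(stage) {
  n <- length(stage)
  # Consecutive identical stages share a run number.
  run_id <- cumsum(c(TRUE, stage[-1] != stage[-n]))
  flag <- integer(n)
  # Last row of the first run of "open".
  open_runs <- run_id[stage == "open"]
  if (length(open_runs) > 0) {
    flag[max(which(run_id == min(open_runs)))] <- 1L
  }
  # First occurrence of each closing stage.
  for (s in c("approved", "declined")) {
    pos <- which(stage == s)
    if (length(pos) > 0) flag[pos[1]] <- 1L
  }
  flag
}

# Apply the helper to the rows of each id.
reprEx$desiredResult <- ave(seq_along(reprEx$id), reprEx$id,
                            FUN = function(i) flag_first_transitions(reprEx$stage1[i]))
reprEx
```

The same helper could be called inside a grouped dplyr mutate(); base ave() is used here only to avoid a package dependency.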


Exchange data.table columns with most prevalent value of columns

I have data

test = data.table(
  a = c(1,1,3,4,5,6), 
  b = c("a", "be", "a", "c", "d", "c"), 
  c = rep(1, 6)
)

I wish to take the unique values of column a, store it in another data.table, and afterwards fill in the remaining columns with the most prevalent values of those remaining columns, such that my resulting data.table would be:

test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)

Column b has equal amounts of "a" and "c", but it doesn't matter which is chosen in those cases.

Attempt so far:

test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]

EDIT: I would preferably like a generic solution that is compatible with a function where I specify the column to be "uniqued", and the rest of the columns are filled with their single most prevalent value. Hence my use of lapply and .SD =)

Is there a built-in function for finding the mode?

In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object, not the value that occurs the most in its argument. But is there a standard library function that implements the statistical mode for a vector (or list)?
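
For illustration, a small hand-written helper is a common workaround; stat_mode is a hypothetical name, not a base R function. It tabulates each distinct value and returns the most frequent one, picking the first in order of appearance when there is a tie.

```r
stat_mode <- function(x) {
  ux <- unique(x)
  # match() maps every element to its position in ux; tabulate() counts them.
  ux[which.max(tabulate(match(x, ux)))]
}

stat_mode(c(1, 2, 2, 3))
stat_mode(c("a", "be", "a", "c", "d", "c"))
```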

Importing one dimensional dataset for Complete Spatial Randomness in spatstat

I have a set of one-dimensional data points (locations on a segment), and I would like to test for Complete Spatial randomness. I was planning to run Gest (nearest neighbor), Fest (empty space) and Kest (pairwise distances) functions on it.

I am not sure how I should import my data set though. I can use ppp by setting a second dimension to 0, e.g.:

myDistTEST<- data.frame(
  col1= sample(x = 1:100, size = 50, replace = FALSE),
  col2= paste('Event', 1:50, sep = ''), stringsAsFactors = FALSE)
myDistTEST<- myDistTEST[order(myDistTEST$col1),]                        
myPPPTest<- ppp(x = myDistTEST[,1], y = replicate(n = 50, expr = 0),
                c(1,120), c(0,0))

But I am not sure it is the proper way to format my data. I have also tried to use lpp, but I am not sure how to set the linnet object. What would be the correct way to import my data? Thank you for your kind attention.

R: List of files run in a for-loop through whole script

I am a beginner in R and I'm currently trying to run a script (which worked for a single .csv file) for a list of .csv files. So far I managed to read in all files as a list and tried to use a for-loop to set each file (one after the other) as the file the script should work with. But somehow it doesn't work. I appreciate any help. Thank you! The following code shows the for-loop as a short example:

           # load files 
        setwd("mypath/folder") # Files are stored in folder and used as wd
        list_of_files <- list.files(pattern = "*.csv",full.names = TRUE)
        length_lof <- length(list_of_files)
        counter = 0

        for(i in 1:length_lof) {
         df <- list_of_files[1+counter]
         name <- sub(pattern = "(.*)\\..*$", replacement = "\\1_output.csv", basename(df))
         counter = counter + 1
         write.csv2(df, name)
        }

Here is the original script, which works for a single file. I would now like to run it on a list of files (all files in a folder) with a for-loop or something similar:

       rm(list=ls())
        setwd("C:/filepath/folder")

    # load data & adjust to R
    filename <- file.choose()
    name <- sub(pattern = "(.*)\\..*$", replacement = "\\1_output.csv", basename(filename))
    myData <- read.csv2(filename, sep= ",", skip=40, header=T )
    df <- myData[ -c(7) ]
    # get all col as numeric
    df[, 2] <- as.numeric(as.character(df[, 2]))

   #..... more code

    # Save Selected Data to new Data Frame 
    slope<-data.frame(dist[BS:BB],load[BS:BB])
    slope$ID <- ID
    colnames(slope) <- c("disp", "load","ID") # Rename Columns

    #Data Save 
    write.csv2(slope, name)
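
A sketch of one way to restructure this: put the per-file work into a function that takes a file name, reads the file, and writes the result, then apply it to every path returned by list.files(). Note that the loop above passes the file name itself to write.csv2 and never reads the file. The function name process_file, the demo folder and the two small CSV files are hypothetical stand-ins created only so the snippet runs; the real script would keep its read.csv2(..., skip = 40) call and analysis steps.

```r
process_file <- function(filename) {
  out_name <- sub("(.*)\\..*$", "\\1_output.csv", basename(filename))
  my_data <- read.csv(filename)  # stand-in for the original read.csv2 call
  # ... per-file analysis from the original script would go here ...
  write.csv2(my_data, file.path(dirname(filename), out_name), row.names = FALSE)
  out_name
}

# Hypothetical demo input so the example is self-contained.
folder <- file.path(tempdir(), "demo_csv")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(a = 1:3), file.path(folder, "run1.csv"), row.names = FALSE)
write.csv(data.frame(a = 4:6), file.path(folder, "run2.csv"), row.names = FALSE)

files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
files <- files[!grepl("_output\\.csv$", files)]  # skip earlier results

outputs <- vapply(files, process_file, character(1))
unname(outputs)
```

Keeping everything file-specific inside the function also removes the need for rm(list = ls()) and the manual counter.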

How can I count the number of elements in a vector that fall within a particular range using r?

Let's say we have a vector a = (10,23,57,37,59,25,63,33) and we want to calculate the frequency in the bins 10-19, 20-29, 30-39, 40-49, 50-59, 60-69. The output should be in the form of a vector, in this case (1,2,2,0,2,1).
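
A short base R sketch using cut() and table(): the breaks at 10, 20, ..., 70 with right = FALSE give the half-open bins [10,20), [20,30) and so on, which correspond to 10-19, 20-29, ... for whole numbers. The result includes a zero for the empty 40-49 bin.

```r
a <- c(10, 23, 57, 37, 59, 25, 63, 33)

# Left-closed bins so that 10 falls into 10-19 and 20 into 20-29.
bins <- cut(a, breaks = seq(10, 70, by = 10), right = FALSE)
counts <- as.vector(table(bins))
counts
```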

Summing the variables with same column name in R [closed]

I have a dataframe with column names;

 [1] "sample_id" "seq91 Acinetobacter;junii"
 [3] "seq157 Acinetobacter;lwoffii" "seq139 Acinetobacter;johnsonii-lwoffii"
 [5] "seq225 Acinetobacter;johnsonii" "seq224 Acinetobacter;lwoffii"
 [7] "seq278 Acinetobacter;calcoaceticus" "seq327 Acinetobacter;lwoffii"
 [9] "seq309 Acinetobacter;lwoffii" "seq508 Acinetobacter;ursingii"
[11] "seq394 Acinetobacter;haemolyticus" "seq540 Acinetobacter;bouvetii"
[13] "seq558 Acinetobacter;bouvetii" "seq541 Acinetobacter;lwoffii"
[15] "seq575 Acinetobacter;haemolyticus-johnsonii-lwoffii" "seq665 Acinetobacter;junii"
[17] "seq707 Acinetobacter;lwoffii" "seq755 Acinetobacter;haemolyticus-johnsonii-lwoffii"
[19] "seq677 Acinetobacter;marinus" "seq758 Acinetobacter;johnsonii"
[21] "seq836 Acinetobacter;junii" "seq768 Acinetobacter;septicus-ursingii"
[23] "seq770 Acinetobacter;bouvetii-johnsonii" "seq928 Acinetobacter;tjernbergiae"
[25] "seq864 Acinetobacter;harbinensis" "seq902 Acinetobacter;parvus"

After removing the seqxxx numbers, I want to sum up the values of columns with the same name. As column names must be unique, how can I perform the summation after getting rid of the seq prefix and numbers? Thank you

Note: The previous version of my question also asked how to remove the seqxxx numbers, which violated this website's rule against asking more than one question in one post. Please excuse the unwitting mistake.
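
A base R sketch of one possible approach: keep the original unique column names, compute the cleaned names as a separate grouping vector, split the columns by that vector, and take row sums per group. The three-column data frame is a hypothetical miniature of the real one.

```r
df <- data.frame(
  sample_id = c("s1", "s2"),
  "seq91 Acinetobacter;junii" = c(1, 2),
  "seq157 Acinetobacter;lwoffii" = c(3, 4),
  "seq665 Acinetobacter;junii" = c(10, 20),
  check.names = FALSE
)

counts <- df[-1]  # everything except sample_id

# Cleaned names are used only for grouping, so duplicates are fine.
species <- sub("^seq[0-9]+ ", "", names(counts))

# split.default() splits the list of columns; rowSums() adds each group.
summed <- sapply(split.default(counts, species), rowSums)
result <- data.frame(sample_id = df$sample_id, summed, check.names = FALSE)
result
```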

How to fix toc numbering when using pandoc argument number-offset

The TOC of the HTML output from R Markdown does not have the correct section numbering when using the --number-offset pandoc argument.

yaml of Rmarkdown and first lines:

---
title: "Untitled"
output: 
  html_document:
    toc: true
    toc_depth: 2
    pandoc_args: ["--number-offset=4"]
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# R Markdown # corresponds to number 1, changed to 5 with pandoc arg

Workaround (inside R - bash - linux)

system("sed -z 's/toc-section-number\">1/toc-section-number\">5/g' -i path/filename.html")

Error: The dbplyr package is required to communicate with database backends

R Pick First Value Given Condition

data=data.frame("team"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                "score"=c(4,8,10,3,10,5,4,2,7,7,5,6,5,9,1),
                "trial"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                "sc3"=c(0,0,0,1,0,0,0,1,0,0,0,0,0,0,1),
                "sc7"=c(0,0,0,1,0,1,1,1,1,1,1,1,1,0,1),
                "sc9"=c(1,1,0,1,0,1,1,1,1,1,1,1,1,1,1),
                "sc3trial"=c(-99,-99,-99,1,1,1,2,2,2,-99,-99,-99,3,3,3),
                "sc7trial"=c(-99,-99,-99,1,1,1,1,1,1,1,1,1,1,1,1),
                "sc9trial"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

I have data with columns "team", "score" and "trial". I want to create variables "sc3", "sc7", "sc9", "sc3trial", "sc7trial", "sc9trial" that follow these rules:

The rules are simple for "sc#". Note these are for each group:

  1. For "sc3": if score <= 3, sc3 = 1. Otherwise sc3 = 0.
  2. For "sc7": if score <= 7, sc7 = 1. Otherwise sc7 = 0.
  3. For "sc9": if score <= 9, sc9 = 1. Otherwise sc9 = 0.

The rules are sort of also simple for "sc#trial". Note these are for each group:

  1. For "sc3trial": if any "sc3" == 1, sc3trial records the trial in which it first occurred. If no "sc3" equals 1, then "sc3trial" equals -99. The same logic applies to "sc7trial" and "sc9trial".
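
The rules above can be sketched in base R as follows; the helper name first_trial is hypothetical. Applied literally, rule 2 gives sc7 = 1 for team 1's first trial (score 4), which differs from the sc7 and sc7trial values in the posted data, so only sc3 and sc9 should be expected to match it exactly.

```r
data <- data.frame(
  team = rep(1:5, each = 3),
  score = c(4, 8, 10, 3, 10, 5, 4, 2, 7, 7, 5, 6, 5, 9, 1),
  trial = rep(1:3, times = 5)
)

# Smallest trial number with flag == 1 within each team, or -99 if none.
first_trial <- function(flag, trial, team) {
  ave(ifelse(flag == 1, trial, NA), team, FUN = function(t) {
    if (all(is.na(t))) -99 else min(t, na.rm = TRUE)
  })
}

for (thr in c(3, 7, 9)) {
  flag <- as.integer(data$score <= thr)
  data[[paste0("sc", thr)]] <- flag
  data[[paste0("sc", thr, "trial")]] <- first_trial(flag, data$trial, data$team)
}
data
```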

How to get value of column by row filtering?

I have the following data table in R:

threshold   ranking    size
0.70        11         100
0.65        9          102
0.60        12         150
0.55        10         110

I need to get the value of ranking for the row when threshold is equal to 0.60.

threshold_val <- 0.60
out <- as.numeric(filter(df, round(df["threshold"],2) == round(threshold_val,2))["ranking"])

But out is equal to NA instead of 12.

What is wrong in my code?

Thanks.
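
As a point of comparison, here is a base R sketch that selects the value by indexing the column vectors directly and compares with a small tolerance instead of rounding. The tolerance 1e-8 is an arbitrary assumption, and the data frame is rebuilt from the table above.

```r
df <- data.frame(
  threshold = c(0.70, 0.65, 0.60, 0.55),
  ranking = c(11, 9, 12, 10),
  size = c(100, 102, 150, 110)
)
threshold_val <- 0.60

# Compare plain numeric vectors; a tolerance guards against
# floating-point representation differences.
out <- df$ranking[abs(df$threshold - threshold_val) < 1e-8]
out
```

Using df$threshold (a vector) rather than df["threshold"] (a one-column data frame) keeps the comparison and the final conversion on plain numerics.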

Python equivalent of "library(help = PackageName)" and "?FunctionName" (i.e. help() function) from R

There are many handy functions in R, but the two most handy functions I found were

library(help = dplyr) 
?data.frame
# So the library(help = PackageName) & ?FunctionName

Are there similar tools in Python I can make use of?

panel data with 3 indices in r

I struggle to understand the mistake.

My panel data looks like this:

enter image description here

What I want to estimate is:

enter image description here

where a are country fixed effects and c are individual fixed effects and h is a time dummy.

When estimating in R:

plm(y ~ x + year, index=c("ID", "country", "year"), model="within", data=df)

I get the error:

Error in pdim.default(index[[1]], index[[2]]) : 
  duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
  duplicate couples (id-time) in resulting pdata.frame
 to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
2: In is.pbalanced.default(index[[1]], index[[2]]) :
  duplicate couples (id-time)

How can I solve this problem? Any help is appreciated.

A sample:

ID <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
country <- c("DE","DE","FI","FI","HU","HU","GB","GB","DE","DE","FI","FI","HU","HU","GB","GB")
year <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
y <- 1:16 
x <- dnorm(y, mean=0, sd=2)
df <- data.frame(ID, country, year, y, x)

plm(y ~ x + year, index=c("ID", "country", "year"), model="within", data=df)
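
A base R sketch for diagnosing the message, assuming plm reads a three-element index as (individual, time, group), so that "country" is taken as the time dimension here. Under that assumption the (ID, country) pairs repeat, which would explain the duplicate id-time error. One possible workaround is a combined unit identifier; the column name unit is hypothetical, and the plm call in the final comment is not run.

```r
ID <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
country <- rep(rep(c("DE", "FI", "HU", "GB"), each = 2), times = 2)
year <- rep(c(0, 1), times = 8)
df <- data.frame(ID, country, year, y = 1:16)

# Each (ID, country) pair appears once per year, so it is duplicated.
dup_id_country <- anyDuplicated(df[c("ID", "country")]) > 0

# Combining ID and country gives a unit that is unique within each year.
df$unit <- interaction(df$ID, df$country, drop = TRUE)
dup_unit_year <- anyDuplicated(df[c("unit", "year")]) > 0

c(dup_id_country = dup_id_country, dup_unit_year = dup_unit_year)
# A two-element index such as index = c("unit", "year") could then be tried.
```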