Missing levels when weighting data using ANESRake (weighting/raking)

February 18, 2020, 4:34 am

I have a survey data set and some population quotas (weighting targets). The population quotas are:

(1 = up to 29 years 0,00%)

2 = 30 to 39 years 18,10%

3 = 40 to 49 years 28,77%

4 = 50 to 59 years 33,11%

5 = 60 and more years 20,01%

In the data set I have to weight, category 5 is missing. Here is the distribution of the variable in the data set:

2 = 32,33%  
3 = 36,56%  
4 = 31,12%

If I perform the raking I get the following error:

library(anesrake)

r = anesrake(list_weights,
             d, 
             verbose = FALSE, 
             caseid =  d$RESPID, 
             maxit = 1500,
             cap = 5,
             choosemethod = "max", 
             type = "nolim")


Error in rakeonvar.default(mat[, i], inputter[[i]], weightvec) : variables must be coded continuously from 1 to n with no missing values

Any idea how to deal with missing levels in the data?

Here is a dput of the quotas

list(Rec_Age = c(`2` = 0.181, `3` = 0.2877, `4` = 0.3311))

and a small dput of the data

structure(list(RESPID = structure(c(459, 311, 223, 60, 613, 495, 
300, 273, 78, 170, 217, 61, 175, 619, 270, 218, 453, 492, 23, 
65, 33, 113, 532, 26, 119, 49, 208, 102, 200, 165, 435, 298, 
593, 220, 111, 53, 494, 271, 305, 420, 323, 607, 105, 19, 426, 
171, 330, 201, 332, 277), label = "RESPID - Respondent ID", format.spss = "F10.0", display_width = 0L), 
    Rec_Age = structure(c(4, 2, 4, 3, 4, 4, 4, 3, 2, 2, 3, 2, 
    3, 4, 4, 2, 4, 4, 2, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 
    3, 2, 3, 4, 3, 4, 3, 2, 3, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3), label = "Rec_Age - Recode Age")), row.names = c(NA, 
-50L), class = "data.frame")
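
One possible workaround, sketched under assumptions: the error message says the variable must be coded consecutively from 1 to n, so the observed categories 2, 3 and 4 can be recoded to 1, 2 and 3, and the targets rescaled to sum to 1 over the categories that are actually present. The names `age`, `age_rec` and `targets_norm` are illustrative, the sketch has not been run through anesrake itself, and whether dropping category 5 is acceptable is a substantive weighting decision that it does not settle.

```r
# Targets for the categories that occur in the data (from the dput above)
targets <- c(`2` = 0.181, `3` = 0.2877, `4` = 0.3311)

# A few example values of Rec_Age (illustrative subset of the data)
age <- c(4, 2, 4, 3, 4, 4, 4, 3, 2, 2)

# Recode the observed categories 2, 3, 4 to consecutive integers 1, 2, 3
age_rec <- as.integer(factor(age, levels = names(targets)))

# Rescale the targets so they sum to 1, and rename them to match the new codes
targets_norm <- targets / sum(targets)
names(targets_norm) <- seq_along(targets_norm)

print(table(age_rec))
print(round(targets_norm, 4))
```

The recoded variable and the rescaled targets would then replace `Rec_Age` and the original list entry in the call to anesrake.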

↧

R: How to match strings of variable length in dataframe columns with different row numbers?

February 18, 2020, 4:34 am

I have two data frames in R with different numbers of rows. df1 has 2 columns and 12000+ rows, and df2 has <200 rows. Both data frames hold strings as observations. Sample from df1:

Name = c("7PR SRO.","7PR","3XL INC","ACME INC","ACME INCORP")

abbr = c("7PR","7PR","3XL","ACME","ACME")

df1 <- data.frame(Name, abbr, stringsAsFactors = FALSE)

The second data frame, df2, has the same Name column as df1. It only had the Name column, so to match row entries I merged it with df1 using the merge function, but I still ended up with 100+ rows with NA in the abbr column. Example:

df2 <- data.frame(Name = c("7PR S.R.O.", "3XL LLC", "ACME Ltd"), stringsAsFactors = FALSE)

df2's Name column has company names similar to those in df1, but there can be changes in punctuation or changes from "Corp" to "Ltd" etc. Basically, I want 3XL as abbr regardless of whether LLC or Ltd comes after the name. I tried the stringdist function, stringdist(df2$Name,df1$Name,method ='jw'), but it gives an error saying "Longer object length is not a multiple of shorter object length".

What I need is for df1$Name to be matched with df2$Name, ignoring punctuation and changes in company type such as corp, inc, ltd, llc etc. Can someone please help me find a way to do this?
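
A minimal sketch of one way to do this in base R, assuming exact matching on a cleaned-up key is good enough: upper-case the names, drop punctuation, strip a hand-written list of company-type suffixes, and look the result up in df1. The helper `normalize_name` and its suffix list are hypothetical and would need extending for real data.

```r
df1 <- data.frame(
  Name = c("7PR SRO.", "7PR", "3XL INC", "ACME INC", "ACME INCORP"),
  abbr = c("7PR", "7PR", "3XL", "ACME", "ACME"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(Name = c("7PR S.R.O.", "3XL LLC", "ACME Ltd"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: upper-case, drop punctuation and company-type suffixes
normalize_name <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("\\b(INCORP|INC|CORP|LTD|LLC|SRO)\\b", "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Look up each cleaned df2 name among the cleaned df1 names (first match wins)
idx <- match(normalize_name(df2$Name), normalize_name(df1$Name))
df2$abbr <- df1$abbr[idx]
print(df2)
```

As for the message from stringdist(): that function compares its two arguments element by element and recycles the shorter one. For all pairwise distances, or for approximate lookups, stringdistmatrix() and amatch() from the same package are the usual candidates.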

↧

Row wise outlier/anomaly detection based on the values given in one column (in array format)

February 18, 2020, 4:34 am

I have been trying to find a way, in Python or R, to go through each row of a CSV file and detect outliers or anomalies based on the array values in one of the columns.

The issue seems quite complex to me, and I have exhausted myself going through methods for detecting anomalies or outliers.

Should I look into machine learning for such an issue?

There is already a question posted about a similar problem, but I feel it has not been answered correctly; the image of the data given there can be used to illustrate any plausible answer.

Thanks

↧

Minimum and maximum sequential values of a vector

February 18, 2020, 4:34 am

I need to find the minimum and the maximum values of a vector, but the minimum must occur before the maximum. So, given the vector:

v <- c (8,2,1,3,5,7,4)

the minimum value would be '1' (the 3rd element) and the maximum '7' (the 6th element). The maximum would not be '8' (the 1st element), because it occurs before the minimum '1' (the 3rd element). I looked at the functions 'min' and 'max' (and their parallel versions pmin and pmax), but apparently they can only find the overall minimum and maximum values, with no restrictions in place. The same goes for the range function. Is there anyone who could help? Thanks in advance. Arturo
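
A minimal sketch of one reading of this requirement: take the overall minimum first, then the largest value among the elements from that position onwards. The names `i_min`, `i_max` and `result` are illustrative. Under a different reading (for example, the pair with the largest rise) the logic would differ.

```r
v <- c(8, 2, 1, 3, 5, 7, 4)

# Position of the overall minimum
i_min <- which.min(v)

# Largest value among the elements at or after the minimum
tail_part <- v[i_min:length(v)]
i_max <- i_min + which.max(tail_part) - 1

result <- c(min = v[i_min], max = v[i_max])
print(result)
```

For the example vector this picks the 3rd element as the minimum and the 6th as the maximum.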

↧

Plotting on a geographical map the provenience of our patients

February 18, 2020, 4:41 am

I am trying to put dots on a geographical map of Italy showing the provenience ('provincia') of our patients. Ideally, the dot size should be proportional to the number of patients coming from that 'provincia'. An example of the list I would like to plot is the following.

MI  8319
CO  537
MB  436
VA  338
BG  310
PV  254
CR  244
NO  210
RM  189
CS  179

The first column holds the 'provincia' code: MI (Milano), CO (Como), MB (Monza-Brianza), etc. The second column holds the number of patients from that 'provincia'. So the output should be an Italian political map where the biggest dot is around the city of Milano (MI), the second biggest is near the city of Como (CO), the third is around Monza-Brianza (MB), etc. Is there any package that could do the plot I am looking for? I found a tool that could do the job at the link below, but apparently it expects me to supply the geographical coordinates in order to do the plot.

https://www.littlemissdata.com/blog/maps

Thanks in advance.

↧

R: Reliable conversion between gtable and ggplot objects. How to make lemon::reposition_legend() work after ggplot_build()?

February 18, 2020, 4:42 am

I have a pretty complicated case at hand with ggplot2. I tried to exemplify it with a MWE using iris data below.

I just have boxplots in facets, and wanted to move the legend to take the space of the empty facets.

This is all good, I use lemon::reposition_legend() for that and it works.

However, I then have to modify a bunch of things in the plot (namely add significant test results and other things that are not relevant for this question), and I am forced to use ggplot_build() on my output plot for that purpose.

After using ggplot_build() to modify my plot, I do not seem to be able to use reposition_legend() successfully anymore...

Check out my MWE below.

First I load the packages I need, and define a shift_legend() function (which uses reposition_legend()), based on an answer to this question.

library(tidyr)
library(ggplot2)
library(ggplotify)
library(gtable)
library(cowplot)
library(purrr)
library(lemon)
shift_legend <- function(p) {
  pnls <- NULL
  if (class(p)[1]=="gtable"){
    pnls <- p
  } else if (class(p)[2]=="ggplot"){
    pnls <- plot_to_gtable(p)
  } else{
    stop("Please provide a ggplot or a gtable object")
  }
  pnls <- gtable_filter(pnls, "panel")
  pnls <- setNames(pnls$grobs, pnls$layout$name)
  pnls <- keep(pnls, ~identical(.x, zeroGrob()))

  if(length(pnls) > 0){
    reposition_legend( p, "center", panel=names(pnls) )
  } else{
    p
  }
}

I then load the iris data and make my plot with shift_legend() successfully.

data(iris)
summary(iris)
iris_long <- gather(iris, "Variable", "Value", -Species)
P <- ggplot(iris_long, aes(x=Variable, y=Value)) +
  geom_boxplot(aes(fill=Variable), position=position_dodge(.9)) +
  facet_wrap(.~Species, ncol=2) +
  theme_light() +
  theme(legend.key.size = unit(0.5, "inch"))
out_file_name <- "test.pdf"
pdf(file=out_file_name, height=10, width=10, onefile=FALSE)
print(
  grid.draw(shift_legend(P))
)
dev.off()

This produces this output, all good till here: Note this is the arrangement I want to be able to reproduce (after using ggplot_build), with the legend taking the empty facets space.

But now I need to use ggplot_build() to add and modify things in my plot. After that I can plot it normally without using reposition_legend().

P2 <- ggplot_build(P)
#Do a bunch of things here...
out_file_name2 <- "test2.pdf"
pdf.options(reset=TRUE, onefile=FALSE)
pdf(file=out_file_name2, height=10, width=10)
print(
  plot(ggplot_gtable(P2))
)
dev.off()

Which produces this:

But I still want to reposition the legend, so I attempt to use reposition_legend() again, converting the ggplot_built object into a gtable object (which, according to the function documentation, it also accepts as input).

out_file_name22 <- "test22.pdf"
pdf.options(reset=TRUE, onefile=FALSE)
pdf(file=out_file_name22, height=10, width=10)
print(
  grid.draw(shift_legend(
    ggplot_gtable(P2)
  ))
)
dev.off()

Here I get this error:

Error in reposition_legend(p, "center", panel = names(pnls)) : No legend given in arguments, or could not extract legend from plot.

I tried again converting the gtable object into a ggplot one using ggplotify::as.ggplot(). This time I obtained no errors, but the legend was not repositioned as expected...

out_file_name222 <- "test222.pdf"
pdf.options(reset=TRUE, onefile=FALSE)
pdf(file=out_file_name222, height=10, width=10)
print(
  grid.draw(shift_legend(
    as.ggplot(ggplot_gtable(P2))
  ))
)
dev.off()

It produces this:

Help please!

↧

How to sum two rows by a simple condition in a data frame

February 18, 2020, 4:43 am

I'd like to sum rows two by two, in order to study the lag of a certain variable.

Suppose that I have the data base defined further down as df.

I expect to obtain the following, where eggs is the sum over each pair of "SE" indexes:

> df
    SE2    eggs
  "4+5"    3
  "6+7"    14
  "8+1"    3.5
  "2+3"    15

Where

df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))

Obs.: The order of the data frame does not matter, but I need to start from a certain number (in this case, 4), pair it with the next number (in this case, 5), and keep this logic: after SE 4+5 come SE 6+7, SE 8+1 and SE 2+3.

Any hint on how can I do that?
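
A minimal base R sketch, assuming SE takes each value from 1 to n exactly once and n is even: build the cyclic order starting at the chosen number, reorder the rows accordingly, and sum consecutive pairs. The names `start`, `ord`, `pair` and `res` are illustrative.

```r
df <- data.frame(SE = c(4, 6, 7, 8, 5, 1, 2, 3),
                 eggs = c(2, 4, 10, 0.5, 1, 3, 6, 9))

start <- 4
n <- nrow(df)

# Cyclic order starting at `start`: 4 5 6 7 8 1 2 3
ord <- ((start - 1 + 0:(n - 1)) %% n) + 1

# Reorder the rows to follow that order, then label consecutive pairs
d <- df[match(ord, df$SE), ]
pair <- rep(seq_len(n / 2), each = 2)

res <- data.frame(
  SE2  = as.vector(tapply(d$SE, pair, paste, collapse = "+")),
  eggs = as.vector(tapply(d$eggs, pair, sum)),
  stringsAsFactors = FALSE
)
print(res)
```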

↧

Plotting certain values in Rshiny

February 18, 2020, 4:46 am

I am a novice in R and Shiny programming, and I am currently working on an app that runs a principal component analysis on any uploaded data set. I am looking for a way to make an interactive plot of the eigenvalues obtained from my PCA, by row name. After some research online I found a way to draw the plot with ggplot, but it is static: to change the number of eigenvalues plotted I would have to edit the server-side code by hand, which defeats the purpose of the app. To be specific, I am looking for a way to make a reactive bar plot of all my eigenvalues against my row names (which are my components), and to be able to choose the eigenvalues I want to keep. If anyone could help me, it would be great!

The output is named output$eigplot. The code I have so far looks like this:

library(shiny)
library(ggplot2)
library(d3heatmap)
library(DT)
library(FactoMineR)  # provides PCA(), which the server code calls

shinyUI(navbarPage(
  "Spectrométrie",
  # Hea

  # Input in sidepanel:
  tabPanel(
    "Données",
    tags$style(type = 'text/css', ".well { max-width: 20em; }"),
    # Tags:
    tags$head(
      tags$style(type = "text/css", "select[multiple] { width: 100%; height:10em}"),
      tags$style(type = "text/css", "select { width: 100%}"),
      tags$style(type = "text/css", "input { width: 19em; max-width:100%}")
    ),
    fluidPage(
      fluidRow(
        column(3,
               selectInput(
                 "readFunction",
                 "Function to read data:",
                 c(
                   # Base R:
                   "read.table","read.csv","read.csv2","read.delim","read.delim2",
                   # foreign functions:
                   "read.spss","read.arff","read.dta","read.dbf","read.epiiinfo",
                   "read.mtp","read.octave","read.ssd","read.xport", "read.systat",
                   # Advanced functions:
                   "scan","readLines"
                 )
               )),
        column(4,
               htmlOutput("ArgSelect")),
        column(4,
               # Argument field:
               htmlOutput("ArgText"))
      ),
      fluidRow(
        column(4, fileInput("file", "Upload data-file:")),
        # Variable selection:
        column(4, htmlOutput("varselect")),
        column(4, textInput("name", "Dataset name:", "Data"))    
      )  
    ),
    mainPanel(dataTableOutput("table"))
  ),
  tabPanel(
    "ACP",
    fluidPage(fluidRow(column(
      12,
      p(
        "Visualisons quelques statistiques descriptives de nos variables :"
      )
    ))),
    mainPanel(
      fluidPage(fluidRow(column(
        12, dataTableOutput("table2", width = "100%")
      ))),
      fluidPage(fluidRow(
        column(6, p("La matrice de corrélations :")),
        d3heatmapOutput("heatmap", width = "100%", height =
                          "1000px")
      )),
      fluidPage(fluidRow(column(
        7, dataTableOutput("coord")
      ))),
      fluidPage(fluidRow(column(
        7, dataTableOutput("contrib")
      ))),
      fluidPage(fluidRow(column(
        7, dataTableOutput("cos2")
      ))),
      fluidPage(fluidRow(column(
        12, plotOutput("eigplot")
      ))),
      fluidPage(fluidRow(column(
        12, plotOutput("indivplot")
      )))
    )
  )
))

Server

shinyServer(function(input, output,session) {
    ### Argument names:
    ArgNames <- reactive({
        Names <- names(formals(input$readFunction)[-1])
        Names <- Names[Names!="..."]
        return(Names)
    })

    # Argument selector:
    output$ArgSelect <- renderUI({
        if (length(ArgNames())==0) return(NULL)

        selectInput("arg","Argument:",ArgNames())
    })

    ## Arg text field:
    output$ArgText <- renderUI({
        fun__arg <- paste0(input$readFunction,"__",input$arg)

        if (is.null(input$arg)) return(NULL)

        Defaults <- formals(input$readFunction)

        if (is.null(input[[fun__arg]]))
        {
            textInput(fun__arg, label = "Enter value:", value = deparse(Defaults[[input$arg]])) 
        } else {
            textInput(fun__arg, label = "Enter value:", value = input[[fun__arg]]) 
        }
    })


    ### Data import:
    Dataset <- reactive({
        if (is.null(input$file)) {
            # User has not uploaded a file yet
            return(data.frame())
        }

        args <- grep(paste0("^",input$readFunction,"__"), names(input), value = TRUE)

        argList <- list()
        for (i in seq_along(args))
        {
            argList[[i]] <- eval(parse(text=input[[args[i]]]))
        }
        names(argList) <- gsub(paste0("^",input$readFunction,"__"),"",args)

        argList <- argList[names(argList) %in% ArgNames()]

        Dataset <- as.data.frame(do.call(input$readFunction,c(list(input$file$datapath),argList)))
        return(Dataset)
    })

    # Select variables:
    output$varselect <- renderUI({

        if (identical(Dataset(), '') || identical(Dataset(),data.frame())) return(NULL)

        # Variable selection:    
        selectInput("vars", "Variables to use:",
                    names(Dataset()), names(Dataset()), multiple =TRUE)            
    })

    # Show table:
    output$table <- renderDataTable({
        datatable(Dataset()[,input$vars,drop=FALSE], rownames = FALSE)
    })


    output$table2 <- DT::renderDataTable(

        datatable(summary( Dataset()[,input$vars]),
                  rownames = FALSE,
                  options = list(columnDefs = list(list(className = 'dt-center')),
                                 pageLength = 6
                                 )
    ) 
    )

    output$heatmap <- renderD3heatmap({
        dat = Dataset()[,input$vars,drop=FALSE]
        corr = cor(dat)
        return(d3heatmap(corr, scale="column"))
    }) 

    output$fprinc <-DT::renderDataTable({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        u = res.pca["eig"]
        u = as.data.frame(u)
        names(u)[c(1:3)]<-c("valeurs propres", "Pourcentage de la variance", "pourcentage cumulé de la variance")
        datatable(u)
    })

    output$eigplot <- renderPlot({ 
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        u = res.pca["eig"]
        u = as.data.frame(u)
        ggplot(u, aes(x=rownames(u), y=u[,2])) + 
            geom_bar(stat="identity", fill="steelblue", color="grey50") + coord_flip() +labs(y="Composantes", x = "% de la variance")
    })

    output$coord <-DT::renderDataTable({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        u = res.pca$var["coord"]
        u = as.data.frame(u)
        datatable(u)
    })

    output$contrib <-DT::renderDataTable({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        u = res.pca$var["contrib"]
        u = as.data.frame(u)
        datatable(u)
    })

    output$cos2 <-DT::renderDataTable({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        u = res.pca$var["cos2"]
        u = as.data.frame(u)
        datatable(u)
    })

    output$indivplot<-renderPlot({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        plot(res.pca, choix = "ind", autoLab = "yes")
    })

    output$cercle<-renderPlot({
        dat = Dataset()[,input$vars,drop=FALSE]
        res.pca <- PCA(dat, graph = FALSE)
        plot(res.pca, choix = "var", autoLab = "yes")
    }) 
})

↧

Making adjacency matrix using group information

February 18, 2020, 4:46 am

I am relatively new to R and I am having issues creating an adjacency matrix using group characteristics.

I have a data frame that looks like this:

distid villageid hhid group1 group2 group3 group4
1      1         118  0      1      0      0
1      1         119  1      0      1      0
1      2         120  1      1      0      1
2      1         210  1      0      0      1
2      1         212  1      1      0      0
2      1         213  1      1      1      1
2      1         214  0      0      0      0

I need to create an adjacency matrix in which households that are in the same group are connected. I am having a couple of issues converting the data set into adjacency-matrix format, because some of these groups are defined at the district level and some at the village level.

For example, if group 1 is at the district level, then two households in district 1 that both belong to group 1 should get a 1 in the adjacency matrix.

Similarly, if group 2 is at the village level, then two households in district 1, village 1 that both belong to group 2 should get a 1 in the adjacency matrix.

My final matrix should combine all the groups. For example, for groups 1 and 2 my adjacency matrix would look something like this:

hhid  118    119  120    210    212  213 ....
118    0     1     0     0       0    0 
119    0     0     1     0       0    0 
120    0     1     0     0       0    0 
210    0     0     0     0       0    0
212    0     0     0     0       0    1
213    0     0     0     0       1    0
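
A minimal base R sketch of the rule as stated in the text, assuming two households are connected when both are members of a group and share the geographic unit at which that group is defined. The helper `group_adjacency` is hypothetical, and the result follows the stated rule literally, so it need not reproduce the illustrative matrix above cell for cell.

```r
hh <- data.frame(
  distid    = c(1, 1, 1, 2, 2, 2, 2),
  villageid = c(1, 1, 2, 1, 1, 1, 1),
  hhid      = c(118, 119, 120, 210, 212, 213, 214),
  group1    = c(0, 1, 1, 1, 1, 1, 0),
  group2    = c(1, 0, 1, 0, 1, 1, 0)
)

# Hypothetical helper: TRUE where both households are members of `group`
# and have the same values in the columns named by `level_cols`
group_adjacency <- function(df, group, level_cols) {
  member <- df[[group]] == 1
  key <- do.call(paste, c(unname(as.list(df[level_cols])), sep = "_"))
  adj <- outer(member, member, "&") & outer(key, key, "==")
  diag(adj) <- FALSE  # no self-links
  adj
}

# group1 is assumed district-level, group2 village-level (as in the text)
adj <- group_adjacency(hh, "group1", "distid") |
  group_adjacency(hh, "group2", c("distid", "villageid"))
adj <- adj * 1L
dimnames(adj) <- list(as.character(hh$hhid), as.character(hh$hhid))
print(adj)
```

Further groups can be added by extending the chain of `|` terms with the appropriate level columns.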

↧

How to delete variables and return a new statistical model in UI?

February 18, 2020, 5:12 pm

I am creating a Shiny app to analyze a statistical model. The user uploads a data set and a logistic regression model is returned. I would like non-significant variables to be excluded so that I can fit and analyze a new model. What is my mistake in this app?
Any help or direction would be appreciated. Thanks

     library(shiny)

        ui <- fluidPage(

          titlePanel("-"),

          sidebarLayout(

            sidebarPanel(

              fileInput(inputId="arquivo", "Selecione o arquivo",accept = ".csv"),

              selectInput(inputId = "insi", "Selecione as variaveis não significativas", choices = "" ,multiple = TRUE ),

              actionButton(inputId = "acao", label = "Refaça o modelo")

            ),

            mainPanel( 
              tabsetPanel(type = "tab",
                          tabPanel("Dados",tableOutput("dados")),
                          tabPanel("Ajuste", verbatimTextOutput("model"))
              )
            )
          )
        )

        server <- function(input, output, session) {

          k <- reactive({
            if(is.null(input$arquivo))     return(NULL)
            k <- read.csv(input$arquivo$datapath, header = TRUE, stringsAsFactors = FALSE)

            updateSelectInput(session = session, inputId = "insi", choices = names(k))
            return(k)
          })

           output$dados <- renderTable({
           k <- req(k())
           return(k)
           })

          output$model <- renderPrint({
            k <- req(k())

            v <- ncol(k)

            y <- k[, 1]
            x <- k[, -1]

            mData <- data.frame(vY = y, mX = x)
            colnames(mData) <- c("vY", paste("VX", 2:v, sep = ""))

            mod <- glm(vY ~ ., data = mData, family = binomial)

            print(summary(mod))

            if (input$acao == TRUE){ 

              v <- ncol(k)
              y <- k[, 1]
              x <- k[, -1]

              insi <- input$insi
              x <- x[ , - (insi)]

              mData <- data.frame(vY = y, mX = x)
              colnames(mData) <- c("vY", paste("VX", 2:v, sep = ""))

              mod <- glm(vY ~ ., data = mData, family = binomial)

              print(summary(mod))
            }

          })

        }

        shinyApp(ui = ui, server = server)

↧

ggplot - x-axis shows data beyond specified range for longer time periods

February 18, 2020, 5:19 pm

Here's some sample data for a company's Net revenue split by two cohorts:

data <- data.frame(dates = rep(seq(as.Date("2000/1/1"), by = "month", length.out = 48), each = 2),
                   revenue = rep(seq(10000, by = 1000, length.out = 48), each = 2) * rnorm(96, mean = 1, sd = 0.1),
                   cohort = c("Group 1", "Group 2"))

I can show one year's worth of data and it returns what I would expect:

start = "2000-01-01"
end = "2000-12-01"

library(ggplot2)

ggplot(data, aes(fill = cohort, x = dates, y = revenue)) +
    geom_bar(stat = "identity", position = position_dodge(width = NULL)) +
    xlab("Month") + 
    ylab("Net Revenue") +
    geom_text(aes(label = round(revenue, 0)), vjust = -0.5, size = 3, position = position_dodge(width = 25)) + 
    scale_x_date(date_breaks = "1 month", limits = as.Date(c(start, end))) +
    ggtitle("Monthly Revenue by Group") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 60, hjust = 1), plot.title = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("#00BFC4", "#F8766D"))

But if I expand the date range to two years or more and rerun the graph, it shows additional months on both sides of the x-axis despite not displaying any information on the y-axis.

start = "2000-01-01"
end = "2001-12-01"
#rerun the ggplot code from above

Note the nonexistent data points for 1999-12-01 and 2002-01-01. Why do these appear and how can I remove them?

↧

How to reference Timezone (TZ) from a separate column?

February 18, 2020, 5:21 pm

I am attempting to assign the correct timezone (TZ) to each observation in my dataset (attached screenshot). I have successfully mutated the TZ columns (Start_TimeZone) into new columns (Start_TimeZone_New) that hold the "normal" TZ designations (i.e. "America/Los_Angeles"). The issue I am running into is understanding how to apply this new TZ column to each date/time observation (2nd code snippet). The ultimate goal is to use the TZ assignment to calculate the duration between start and end date/time (end date/time not shown).

```
comp_report_tz %>% 
mutate(Start_TimeZone_New = case_when(is.na(Start_TimeZone) ~ "missing",
           Start_TimeZone == "-08:00" ~ "America/Los_Angeles",
           Start_TimeZone == "-07:00" ~ "America/Phoenix",
           Start_TimeZone == "-06:00" ~ "America/Chicago",
           Start_TimeZone == "-05:00" ~ "America/New_York",
                                       TRUE ~ "others")) %>% 
mutate(End_TimeZone_New = case_when(is.na(End_TimeZone) ~ "missing",
           End_TimeZone == "-08:00" ~ "America/Los_Angeles",
           End_TimeZone == "-07:00" ~ "America/Phoenix",
           End_TimeZone == "-06:00" ~ "America/Chicago",
           End_TimeZone == "-05:00" ~ "America/New_York",
                                       TRUE ~ "others"))
```

``` 
comp_report_adj %>% 
mutate(Start_Time_Final = as.POSIXct(comp_report_tz$Start_Date_Time, format = "%m/%d/%y %I:%M%p", tz = comp_report_adj$Start_TimeZone_New))
```
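
A minimal base R sketch of one way around this, given that the `tz` argument of as.POSIXct() takes a single time zone name rather than a column: parse each value with its own zone, keep the result as seconds since the epoch, and rebuild one POSIXct vector in UTC. The sample values and the helper `to_instant` are assumptions for illustration; rows labelled "missing" or "others" are not valid zone names and would have to be handled first.

```r
# Made-up sample values in the format used in the question
starts <- c("02/18/20 04:34AM", "02/18/20 04:34AM")
zones  <- c("America/Los_Angeles", "America/New_York")

# Hypothetical helper: parse each element in its own time zone
to_instant <- function(x, tz) {
  secs <- mapply(
    function(xi, tzi) {
      as.numeric(as.POSIXct(xi, format = "%m/%d/%y %I:%M%p", tz = tzi))
    },
    x, tz, USE.NAMES = FALSE
  )
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
}

start_utc <- to_instant(starts, zones)

# Durations can now be computed on a common clock
diff_hours <- as.numeric(difftime(start_utc[1], start_utc[2], units = "hours"))
print(start_utc)
print(diff_hours)
```

Applying the same helper to the end date/time and its zone column would give two UTC vectors whose difference is the duration.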

↧

Data Table Solution To New Variables By Group

February 18, 2020, 5:22 pm

library(data.table)
library(dplyr)

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3))


dataWANT=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3),
                    "score3"=c(1,1,1,0,0,0,1,1,1),
                    "timescore3"=c(1,1,1,3,3,3,2,2,2),
                    "score7"=c(1,1,1,1,1,1,1,1,1),
                    "timescore7"=c(1,1,1,1,1,1,2,2,2))




dataHAVE[, score3 := ifelse(score<=3,
                               time[which.min(score<=3)],
                               time[which.max(time)]), by=student]

I have 'dataHAVE' and want to generate 'dataWANT':

1) score3 equals 1 if a student has any score less than or equal to 3; otherwise 0

2) score7 equals 1 if a student has any score less than or equal to 7; otherwise 0

3) timescore3 equals the minimum time value at which a student scores 3 or less; if a student never scores 3 or less (as indicated by score3 equal to 0), then timescore3 is the maximum time for that student.

4) timescore7 equals the minimum time value at which a student scores 7 or less; if a student never scores 7 or less (as indicated by score7 equal to 0), then timescore7 is the maximum time for that student.

My attempt is shown above but does not work. Please note that I tried base R and dplyr, but the data set is so big that these take a very long time. A data.table solution would be ideal.
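
A minimal data.table sketch for the first version of 'dataHAVE' (no missing values), assuming the four rules above are read literally. The table name `dt` is illustrative. The attempt above calls `:=` on a plain data.frame, so the sketch starts by building a data.table; NA scores or times are not handled here.

```r
library(data.table)

dt <- data.table(student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 score   = c(0, 8, 8, 7, 9, 4, 9, 2, 7),
                 time    = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# One grouped pass: flag whether the threshold is ever met, and take the
# earliest time it is met (or the student's last time if it never is)
dt[, `:=`(
  score3     = as.numeric(any(score <= 3)),
  timescore3 = if (any(score <= 3)) min(time[score <= 3]) else max(time),
  score7     = as.numeric(any(score <= 7)),
  timescore7 = if (any(score <= 7)) min(time[score <= 7]) else max(time)
), by = student]

print(dt)
```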

New data to handle missing scores:

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))

Updated data with missing 'time' values:

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA,6,9,3,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,NA,2,NA,NA,NA,NA))

↧

How to write a proper for loop for API calls?

February 18, 2020, 5:34 pm

I am trying to write a for loop that provides the author's short ID to a database called RePEc and gets the affiliation data. So far, this is what I have (NB: USERCODE only works on my IP):

url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="
for(i in 1:length(df_affiliations)){
Sys.sleep(1)
affiliation_fun <- paste(url,df_affiliations$author_reg_1[i])
affiliation_run <- fromJSON(txt=affiliation_fun) %>% select("affiliation") %>% unlist(use.names=FALSE)
affiliation_1 <- paste(unlist(affiliation_run), collapse ="")
df_affiliations$vector <- rbind(affiliation_1)
}

Every time I try this, I either get only the last value or a HTTP Error 400. For loops are not my strong suit. Can anyone figure out where I've gone wrong? Thanks in advance for your help!
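
A minimal sketch of the loop structure, with the network call replaced by a stand-in so that it runs anywhere. Three things differ from the loop above: paste() inserts a space between its arguments (paste0() does not), length() of a data frame counts columns rather than rows, and assigning to the whole column on every pass keeps only the last value. The IDs, `fake_fetch` and `collect_affiliations` are hypothetical; in the real loop the fetch step would wrap the fromJSON() call and the Sys.sleep(1) pause.

```r
base_url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="

# paste() puts a space before the ID; paste0() does not
with_space <- paste(base_url, "pab1")
no_space   <- paste0(base_url, "pab1")

# Fill a preallocated vector, one element per ID
collect_affiliations <- function(ids, fetch) {
  out <- character(length(ids))
  for (i in seq_along(ids)) {
    out[i] <- fetch(paste0(base_url, ids[i]))
  }
  out
}

# Stand-in for the real request: echoes the ID found at the end of the URL
fake_fetch <- function(u) paste("affiliation for", sub(".*=", "", u))

ids <- c("pab1", "pcd2")  # made-up short IDs
res <- collect_affiliations(ids, fake_fetch)
print(res)
```

The filled vector can then be assigned to the data frame column once, after the loop.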

↧

How to get rid of degree symbols on axis ticks in geom_polygon?

February 18, 2020, 5:44 pm

I am mapping some data using ggplot. I have imported a map with map_data using the following code:

library(mapdata)
canada = map_data("worldHires", "Canada")

This is the code I have used to map it:

p = ggplot(data = canada) +
  geom_polygon(data = canada, aes(x=long, y = lat, group = group), fill = "grey") +
  #coordinates of my map
  coord_sf(xlim=c(-84, -41), ylim=c(24,51), expand = FALSE) +
  #map the receiver locations
  geom_point(data = mapindividual_dets,
             mapping = aes(x = longitude, 
                           y = latitude,
                           #fill in receiver colors by their associated zone
                           fill = Zone,
                           color = Zone),
             pch = 21, size = 5) +
  #fill the zones in with viridis
  scale_fill_manual(values=c("#01579B", "#4FC3F7", "#ffa600", "#ff6361")) +
  scale_color_manual(values=c("#01579B", "#4FC3F7", "#ffa600", "#ff6361")) +


  theme(panel.background = element_rect(fill = "lightblue"),


        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),

        legend.key = element_rect(fill = NA),


        text = element_text(size = 25))

p + labs(x=expression(paste("Longitude ",degree,"W",sep="")),
         y=expression(paste("Latitude ",degree,"N",sep="")))

It gives me this image

I am trying to get rid of the degree symbols on the ticks, and only have the degree symbols for the axis titles. Does anyone know how to fix that?

You shouldn't need an example of my dataframes, but if you do I can edit the question to include them.

↧

Use multiquantile groups from a large dataframe in a grouped dataframe in R

February 18, 2020, 5:56 pm

I have the following problem: I have a large data frame from which I have to extract the quantiles of a variable by group, for instance:

library(dplyr)

list_q <- list()

for (i in 3:5){

  tmp <- mtcars %>% 
    filter(gear == i) %>% 
    pull(mpg) %>% 
    quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)

  list_q[[i]] <- tmp  

}

list_q

With this output:

[[3]]
  0%  25%  50%  75% 100% 
10.4 14.5 15.5 18.4 21.5 

[[4]]
    0%    25%    50%    75%   100% 
17.800 21.000 22.800 28.075 33.900 

[[5]]
  0%  25%  50%  75% 100% 
15.0 15.8 19.7 26.0 30.4

Now I need to compute the group means of the variable and determine which quantile each mean falls into, using the quantiles of the original measurements:

a <- mtcars %>% 
  group_by(gear, carb) %>% 
  summarize(mpg_mean = mean(mpg)) %>% 
  ungroup()

    gear  carb mpg_mean
   <dbl> <dbl>    <dbl>
 1     3     1     20.3
 2     3     2     17.2
 3     3     3     16.3
 4     3     4     12.6
 5     4     1     29.1
 6     4     2     24.8
 7     4     4     19.8
 8     5     2     28.2
 9     5     4     15.8
10     5     6     19.7
11     5     8     15

So I could do this:


g3 <- a %>% 
  filter(gear == 3) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[3]], labels = FALSE, include.lowest = TRUE))

g4 <- a %>% 
  filter(gear == 4) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[4]], labels = FALSE, include.lowest = TRUE))

g5 <- a %>% 
  filter(gear == 5) %>% 
  mutate(quantile = cut(mpg_mean, list_q[[5]], labels = FALSE, include.lowest = TRUE))

bind_rows(g3, g4, g5)

Obtaining:

# A tibble: 11 x 4
    gear  carb mpg_mean quantile
   <dbl> <dbl>    <dbl>    <int>
 1     3     1     20.3        4
 2     3     2     17.2        3
 3     3     3     16.3        3
 4     3     4     12.6        1
 5     4     1     29.1        4
 6     4     2     24.8        3
 7     4     4     19.8        1
 8     5     2     28.2        4
 9     5     4     15.8        1
10     5     6     19.7        2
11     5     8     15          1

I would like to know whether there is a more efficient way to do this.
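
A minimal sketch of one way to avoid the per-group copies, assuming dplyr 1.0.0 or later (for the .groups argument): store the per-gear breaks in a list column, join them onto the table of means, and call cut() once per group. The names breaks, q and result are illustrative only.

```r
library(dplyr)

# Quartile breaks of the raw mpg values, one numeric vector per gear
breaks <- mtcars %>%
  group_by(gear) %>%
  summarize(q = list(quantile(mpg, probs = seq(0, 1, 0.25), na.rm = TRUE)),
            .groups = "drop")

result <- mtcars %>%
  group_by(gear, carb) %>%
  summarize(mpg_mean = mean(mpg), .groups = "drop") %>%
  left_join(breaks, by = "gear") %>%
  group_by(gear) %>%
  # within a gear group every row carries the same breaks, so use the first
  mutate(quantile = cut(mpg_mean, q[[1]], labels = FALSE, include.lowest = TRUE)) %>%
  ungroup() %>%
  select(-q)
```

On mtcars this should give the same quantile column as binding g3, g4 and g5 together, without needing one block per gear value.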

↧

ggmap and spatial data plotting issue

February 18, 2020, 6:05 pm

I am trying to update some old code that I inherited from before the Google API days to do a fairly simple (I think) plot.

By the time I get to the plot, my data consists of 50 triplets of latitude, longitude, and $K amount of investment at that location.

head(investData)
  amount latitude longitude
1   1404 42.45909 -71.27556
2      1 42.29076 -71.35368
3     25 42.34700 -71.10215
4      1 40.04492 -74.58916
5     15 43.16431 -75.51130

At this point I use the following:

library(ggmap)
register_google(key = "###myKey###")   #my actual key here

USAmap <- qmap("USA",zoom=4)

USAmap + geom_point(data=investData, aes(x=investData$longitude, y=investData$latitude,size=investData$amount))

I've been fighting all day with establishing accounts and enabling APIs with Google, so it's entirely possible I've simply failed to enable something I need. I have the geocoding, geolocation, and maps static APIs enabled.

I get the following output at the console

Source : https://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
Source : https://maps.googleapis.com/maps/api/geocode/json?address=USA&key=xxx

But I get no plot.

If I simply run

qmap("USA", zoom=4)

I get the map I expect. But when I try to overlay the investment data I get zilch. I'm told by the folks who handed this to me that it worked in 2017...

Any idea where I'm going wrong?

↧

Conditional statement within loop using multiple datasets R

February 18, 2020, 6:10 pm

I would like to figure out who was the most recent previous owner at a location within the last two years before the current owner. The locations are called reflo (reference location). Note that there is not always an exact match for reflo.x and reflo within two years (so a solution that allows me to add additional conditions, such as to find the next closest reflo, would be extra helpful).

The conditions:

The previous owner has to have lived at the same location (lifetime_census$reflo == owners$reflo.x[i]) within two years of the current owner's census year (lifetime_census$census_year no more than 2 years before owners$spr_census).
If there is none, assign NA.

Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:

id    previous_owner   reflo  census_year 
16161 5587            -310     2001   
17723 5587            -310     2002      
19345 5879            -310     2003    
16848 5101             Q1      2001         
17836 6501             Q1      2002      
19439 6501             Q1      2003      
21815 6057             Q1      2004

I then have an owners dataset (here is a sample):

squirrel_id spr_census reflo.x 
6391        2005        Q1 
6130        2005       -310
6288        2005        A12

To illustrate what I am trying to achieve:

squirrel_id spr_census reflo.x  previous_owner  census_year
6391              2005  Q1      6057            2004
6130              2005 -310     5879            2003
6288              2005  A12     NA              NA

What I have currently tried is this:

n <- nrow(owners)

for (i in seq_len(n)) {
  last_owner <- subset(lifetime_census,
    previous_owner != owners$squirrel_id[i] &  # previous owner != current owner
    reflo == owners$reflo.x[i] &
    census_year <= owners$spr_census[i])       # owners can be in current or past year

  # Put it all together: keep the most recent match, or NA if there is none
  if (nrow(last_owner) > 0) {
    owners[i, "spring_owner"] <- last_owner$previous_owner[which.max(last_owner$census_year)]
  } else {
    owners[i, "spring_owner"] <- NA
  }
}

This gives me a new column with the previous owner from any past year at reflo.x, and NA when the conditions are not met. I cannot figure out how to restrict the search to the last two years.

Any ideas?
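
A minimal base R sketch of the two-year restriction, using the sample rows from the question. It assumes the window is inclusive (a census year of spr_census minus 2 still counts) and that the previous_owner column identifies the animal; the name window is introduced here for illustration.

```r
# Sample data copied from the question
lifetime_census <- data.frame(
  id = c(16161, 17723, 19345, 16848, 17836, 19439, 21815),
  previous_owner = c(5587, 5587, 5879, 5101, 6501, 6501, 6057),
  reflo = c("-310", "-310", "-310", "Q1", "Q1", "Q1", "Q1"),
  census_year = c(2001, 2002, 2003, 2001, 2002, 2003, 2004),
  stringsAsFactors = FALSE
)
owners <- data.frame(
  squirrel_id = c(6391, 6130, 6288),
  spr_census = c(2005, 2005, 2005),
  reflo.x = c("Q1", "-310", "A12"),
  stringsAsFactors = FALSE
)

window <- 2  # assumed: look back at most this many years, inclusive

# Start with NA so rows without a match need no special handling
owners$previous_owner <- NA_real_
owners$census_year <- NA_real_

for (i in seq_len(nrow(owners))) {
  candidates <- lifetime_census[
    lifetime_census$reflo == owners$reflo.x[i] &
    lifetime_census$previous_owner != owners$squirrel_id[i] &
    lifetime_census$census_year <= owners$spr_census[i] &
    lifetime_census$census_year >= owners$spr_census[i] - window, ]

  if (nrow(candidates) > 0) {
    # keep the most recent record inside the window
    latest <- candidates[which.max(candidates$census_year), ]
    owners$previous_owner[i] <- latest$previous_owner
    owners$census_year[i] <- latest$census_year
  }
}
```

On the sample rows this yields 6057 (2004) for Q1, 5879 (2003) for -310, and NA for A12. Further conditions, such as a nearest-reflo fallback, could be added where candidates comes back empty.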

↧

Knitr cannot find img files in /static/ folder

February 18, 2020, 6:12 pm

I have a hugo-academic website (methods101.com) that has been working well for the last year.

I just went to edit some pages and I've started getting a new error.

The code that seems to be creating the problems is the knitr::include_graphics() function.

This is an example of the text that produces an error:

{r, echo=FALSE, out.width=600, 
fig.cap="Newspaper article in Word document, next to same article on internet.", 
fig.align='center'}

knitr::include_graphics("/img/soc224_qual_analysis_eg_figure_1.png")

This is the error message:

Rendering content/docs/SOC224_qual_analysis_eg.Rmd
Quitting from lines 80-81 (SOC224_qual_analysis_eg.Rmd) 
Error in knitr::include_graphics("/img/soc224_qual_analysis_eg_figure_1.png") : 
  Cannot find the file(s): "/img/soc224_qual_analysis_eg_figure_1.png"
Calls: local ... withCallingHandlers -> withVisible -> eval -> eval -> <Anonymous>
Execution halted
<simpleError in render_page(f): Failed to render 'content/docs/SOC224_qual_analysis_eg.Rmd'>

I get the same problem on different computers, and after fresh re-installing/downloading the website contents.

The image file is definitely inside the folder:

/static/img/
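
A possible workaround, sketched under the assumption that the installed knitr is version 1.28 or later, where include_graphics() checks that the file exists on disk and accepts an error argument. The path "/img/..." is how the rendered site addresses the file (Hugo serves static/img/ as /img/), so it is not a valid path on disk at knit time; error = FALSE asks knitr to skip that check.

```r
library(knitr)

# The site-relative path does not exist on disk while knitting,
# so turn off the existence check instead of changing the path.
img <- include_graphics("/img/soc224_qual_analysis_eg_figure_1.png",
                        error = FALSE)
```

Inside the chunk the call would be used on its own line, as in the original document; the assignment to img is only there to make the sketch inspectable.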

↧

Loop in R: if condition A is not matched, then condition B. If no condition is matched, then NA

February 18, 2020, 6:14 pm

I would like to figure out who lived in a location before the current owner. The locations are called reflo (reference location), and they have a corresponding x and y coordinate (called locx and locy, respectively).

The conditions:

If there is no owner at the exact same location, called reflo (i.e., lifetime_census$reflo==owners$reflo.x[i]), then the next closest owner (within 30 meters) will do.
If there is still no previous owner after this condition is relaxed, assign NA.

Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:

  id    previous_owner reflo  locx  locy   lifespan  
16161      5587       -310    -3     10    1810    
16848      5101        Q1     17.3   0.8     55    
21815      6077        M2     13     1.8    979

I then have an owners dataset (here is a sample):

squirrel_id      spr_census reflo.x    spring_locx      spring_locy 
6391              2005       M4           13.3           4  
6130              2005      -310          -3             10

To illustrate what I am trying to achieve:

squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner  previous_owner_lifespan  
6391        2005       M4       13.3       4           6077             979
6130        2005      -310     -3          10          5587             1810

What I have currently tried is this (which successfully adds a column for previous_owner at the exact reflo):

n <- nrow(owners)
distance <- 30

for (i in seq_len(n)) {
  last_owner <- subset(lifetime_census,
    reflo == owners$reflo.x[i])  # using the exact location

  if (nrow(last_owner) > 0) {
    owners[i, "previous_owner"] <- last_owner$previous_owner[1]
  } else {
    owners[i, "previous_owner"] <- NA  # assigning NA when no condition matches
  }
}

I've also tried replacing the lifetime_census$reflo==owners$reflo.x line with a search limit, without success:

(30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2  #set the search limit, which is not the same as finding the next closest previous owner

I cannot figure out how to use the distance filter only after no reflo.x match was found.

Any ideas?
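
A minimal sketch of the "exact match first, distance only as a fallback" logic, using the sample rows from the question. The helper find_previous and its arguments max_dist and unit_m are hypothetical names; unit_m = 30 (metres per grid unit) is an assumption read off the distance formula in the question.

```r
# Sample data copied from the question
lifetime_census <- data.frame(
  id = c(16161, 16848, 21815),
  previous_owner = c(5587, 5101, 6077),
  reflo = c("-310", "Q1", "M2"),
  locx = c(-3, 17.3, 13),
  locy = c(10, 0.8, 1.8),
  lifespan = c(1810, 55, 979),
  stringsAsFactors = FALSE
)
owners <- data.frame(
  squirrel_id = c(6391, 6130),
  spr_census = c(2005, 2005),
  reflo.x = c("M4", "-310"),
  spring_locx = c(13.3, -3),
  spring_locy = c(4, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact reflo match first, nearest record as fallback
find_previous <- function(owners, census, max_dist = 30, unit_m = 30) {
  owners$previous_owner <- NA_real_
  owners$previous_owner_lifespan <- NA_real_
  for (i in seq_len(nrow(owners))) {
    hit <- which(census$reflo == owners$reflo.x[i])
    if (length(hit) == 0) {
      # No exact match: distance in metres to every census record
      d <- unit_m * sqrt((census$locx - owners$spring_locx[i])^2 +
                         (census$locy - owners$spring_locy[i])^2)
      hit <- which(d <= max_dist)
      hit <- hit[which.min(d[hit])]  # closest one; stays empty if none in range
    }
    if (length(hit) > 0) {
      owners$previous_owner[i] <- census$previous_owner[hit[1]]
      owners$previous_owner_lifespan[i] <- census$lifespan[hit[1]]
    }
  }
  owners
}

strict  <- find_previous(owners, lifetime_census)                 # 30 m limit
relaxed <- find_previous(owners, lifetime_census, max_dist = 70)  # wider limit
```

With 30 metres per grid unit, M2 lies roughly 67 m from squirrel 6391's location, so the 30 m limit leaves that row as NA in strict; the wider limit in relaxed returns owner 6077 with lifespan 979, as in the illustrated output. Squirrel 6130 matches -310 exactly in both cases.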

↧