Changing multiple columns name, pasting at the beginning/end of column name

November 29, 2019, 8:56 am

≫ Next: Divide in multiple columns an extra-long legend, pheatmap

≪ Previous: How to make a for loop test for matches between columns of different data frames and then save to a new data frame

I have a very easy question but I am a bit struggling with it as I am not good with string manipulation, I have a dataset that looks something like this

df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = rnorm(9), x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
#    id  time    y         x1      x2     c2
# 1  1    1  0.2849573 -2.0675484  0  -0.07262881
# 2  1    2  0.7790181 -0.7575962  0  -0.58792408
# 3  1    3  1.5612293  0.6249859  0   1.19410761
# 4  2    1  0.5001897  3.4156129  0  -0.03577452
# 5  2    2  0.7155184 -0.5672982  1  -1.22208675
# 6  2    3  0.5086272 -0.7848763  0  -0.41084467
# 7  3    1 -0.4707959  0.1159467  1   0.77233201
# 8  3    2  0.8641184  0.2498162  1   0.49336869
# 9  3    3  1.3348043 -0.6803672  1  -0.33189217

I would simply like to change all the column names from x1 onwards adding a "_0". the final dataset should look like this.

final
#   id   time     y       x1_o     x2_o    c2_o
# 1  1    1  1.1251762 -0.7191008    0  -0.07478527
# 2  1    2  0.7585758  1.8694635    0  -0.42652822
# 3  1    3 -1.3180201 -0.4336776    0   0.38417779
# 4  2    1  1.7335904  2.2968254    0  -0.35639828
# 5  2    2  0.1506950 -0.5481873    1  -0.38523601
# 6  2    3 -1.9475207 -0.5302951    0   0.21721675
# 7  3    1 -0.1024133 -0.2872962    1  -0.06347213
# 8  3    2  0.1316069  0.1463118    1  -0.19518602
# 9  3    3 -1.1037682 -0.1129085    1  -0.24011278

I am able to change column names one by one, but I would like to find a one-liner command. I have tried this, but it is only able to paste at the beginning.

dp_o<-dp_o %>% rename_at(3:5, ~paste("_o",.))

Probably it is just a variation of the code above, but I am struggling a bit to understand which variation given that I do not understand well string manipulation

thanks in advance

↧

Divide in multiple columns an extra-long legend, pheatmap

November 29, 2019, 8:58 am

≫ Next: R Rhansontable - Color Specific Cells Conditionally On Other Table

≪ Previous: Changing multiple columns name, pasting at the beginning/end of column name

I'm a bit locked with my data because I've got multiple samples and, when I draw a heatmap (with pheatmap package), legend is break because is bigger than screen (or PDF) margin... some like this:

I'd like to break my legend in multiple cols in order to be able to see all samples... any idea, please? Thanks!

↧

R Rhansontable - Color Specific Cells Conditionally On Other Table

November 29, 2019, 8:59 am

≫ Next: Function imported from dependency not found, requires library(dependency)

≪ Previous: Divide in multiple columns an extra-long legend, pheatmap

This is my first post on SO, so please bear with me if my post is not fully aligned with the rules. Though I will do my best to give a clear description of my problem, the resources I already checked and provide a reproducible example.

First of all, let me explain the problem: I would like to highlight specific cells of a Rhandsontable filled with numeric values in Shiny using different colors (red and green) according to a condition from another table (same nb of columns/rows filled with boolean values, where TRUE=green, FALSE=red).

Let's say I want to start from the following two tables:

DF = data.frame(val = 1:3, big = LETTERS[1:3])

DF_condition = data.frame(val = c(TRUE, FALSE,FALSE), big = c(FALSE,TRUE,FALSE))

I would like the cells at (1,1) and (2,2), i.e where the other table is set at TRUE to be green and all other cells to be red.

I have looked at multiple other posts within SO (and other) like these:

However, none of them solve my problem. Indeed, I am facing the three following main issues:

I am NOT trying to highlight an entire row / column but only specific cells
The conditional formatting should be done using a condition from another table
My database has hundred thousands of rows and around 20 columns (and I would like if possible to use a vectorized approach and not a loop through columns/rows)

As I am not familiar with JScript, used within rhandsontable, I am a bit stuck.

Please find below a minimal reproduceable example:

ui <- shinyUI(bootstrapPage(
  rHandsontableOutput("hot")
))

server <- shinyServer(function(input, output) {
  output$hot <- renderRHandsontable({
    DF = data.frame(val = 1:3, big = LETTERS[1:3])
    DF_condition = data.frame(val = c(TRUE, FALSE,FALSE), big = c(FALSE,TRUE,FALSE))
    col_highlight = c(1,2)
    row_highlight = c(1,3)

    rhandsontable(DF, col_highlight = col_highlight-1, row_highlight = row_highlight-1) %>%
      hot_cols(renderer = "
               function(instance, td, row, col, prop, value, cellProperties) {
               Handsontable.renderers.NumericRenderer.apply(this, arguments);
               if (instance.params) {
               hcols = instance.params.col_highlight
               hcols = hcols instanceof Array ? hcols : [hcols]
               hrows = instance.params.row_highlight
               hrows = hrows instanceof Array ? hrows : [hrows]
               }
               if (instance.params && hcols.includes(col) && hrows.includes(row)) td.style.background = 'red';
               }")
    })
})
shinyApp(ui, server)

Which highlights the two first columns and the two first rows, whereas I would need only cells (1,1) and (2,2).

Thank you very much in advance for your kind help.

↧

Function imported from dependency not found, requires library(dependency)

November 29, 2019, 9:00 am

≫ Next: Multiple parents. No root in graph. Provide graph with one parentless node: How to fix this error for ggraph?

≪ Previous: R Rhansontable - Color Specific Cells Conditionally On Other Table

I am trying to create an R package that uses functions from another package (gamlss.tr).

The function I need from the dependency is gamlss.dist::TF (gamlss.dist is loaded alongside gamlss.tr), but it is referenced in my code as simply TF within a call to gamlss.tr::gen.trun.

When I load gamlss.tr manually with library(), this works. However, when I rely on the functions of the dependency automatically being imported by my package through @import, I get an "object not found" error as soon as TF is accessed.

My attempt to be more explicit and reference the function I need as gamlss.dist::TF resulted in a different error ("unexpected '::'").

Any tips on how to use this function in my package would be much appreciated!

The code below reproduces the problem if incorporated into a clean R package (as done in this .zip), built and loaded with document("/path/to/package"):

#' @import gamlss gamlss.tr gamlss.dist
NULL

#' Use GAMLSS
#'
#' Generate a truncated distribution and use it.
#' @export

use_gamlss <- function() {
  print("gen.trun():")

  gamlss.tr::gen.trun(par=0,family=TF)
    #Error in inherits(object, "gamlss.family") : object 'TF' not found
  #gamlss.tr::gen.trun(par=0,family=gamlss.dist::TF)
    #Error in parse(text = fname) : <text>:1:1: unexpected '::'
  y = rTFtr(1000,mu=10,sigma=5, nu=5)

  print("trun():")

  truncated_dist = gamlss.tr::trun(par=0,family=TF, local=TRUE)
  model = gamlss(y~1, family=truncated_dist)
  print(model)
}

use_gamlss() will only start working once a user calls library(gamlss.tr).

↧

Multiple parents. No root in graph. Provide graph with one parentless node: How to fix this error for ggraph?

November 29, 2019, 9:00 am

≫ Next: Method(s) to speed up st_crop (sf package) on large datasets

≪ Previous: Function imported from dependency not found, requires library(dependency)

I have data like this; two files:

Edge_list

node1   node2
x1      x2    
x1      x3      
x8      x4      
x4      x5     
x6      x7

vertice_list:

node    weight
    x1   100
    x2   50
    x3   10
    x4   20
    x5   10
    x6   20
    x7   60

and I want to circlepack it, so I wrote this R code:

vertices <-read.table('vertice_list',header=TRUE,sep='\t')
edges <-read.table('Edge_list',header=TRUE,sep='\t')
ggraph(mygraph, layout = 'circlepack', weight='weight') +  geom_node_circle() +theme_void()

and I get the error:

Multiple parents. Unfolding graph
Error in FUN(X[[i]], ...) : 
  No root in graph. Provide graph with one parentless node

Could someone explain how to fix this? Thanks

↧

Method(s) to speed up st_crop (sf package) on large datasets

November 29, 2019, 9:01 am

≫ Next: Gather duplicating rows

≪ Previous: Multiple parents. No root in graph. Provide graph with one parentless node: How to fix this error for ggraph?

I need to extract information across different shapefiles for ~ 4 mio grid-cells of 1 ha. Currently I am using st_crop on each layer in a for-loop over all cells, but this runs forever. I thought to speed up the process in using a 'data.table'(DT)-sort-of-way to crop shapefiles by coordinates. Let's consider the example below, where I am looking for the extent of polygon edges in an area of interest:

require(sf)
require(data.table)
require(ggplot2)
require(tidyverse)

# load shapefile
nc = st_read(system.file("shape/nc.shp", package="sf"))


# Define a bounding-box that mimic a mowing-window or area of interest
bb <- st_bbox(c(xmin= -79, xmax=-78,ymin= 34.5, ymax= 35.5))


# Commute 'nc' into some sort of data.table object for fast subsetting, in preserving object's integrity (i.e. same id to all points of a given polygon)
nobs <- mapview::npts(nc,by_feature=T)
NC <- data.table::data.table(id=rep(1:nrow(nc),nobs),st_coordinates(nc)[,1:2])
head(NC)

# Compare cropping methods amon
library(microbenchmark)
x = runif(100)
test <- microbenchmark(
  crop_nc <- st_crop(nc,bb),
  crop_NC <- NC[X >= bb[1] & X < bb[3] & Y>= bb[2] & Y < bb[4]]
)  

print(test)
Unit: microseconds
  expr      min       lq      mean   median        uq       max neval cld
 crop_nc  5205.051 5675.807 6837.9472 5903.219 6829.0865 16046.654   100   b
 crop_NC   405.334  528.356  624.8398  576.996  656.9245  1295.361   100  a 
There were 50 or more warnings (use warnings() to see the first 50)

As expected, the DT-way of subsetting is faster. Let's now go from our DT-object back to as sf-object as follow:

crop_NC_sf <- st_as_sf(crop_NC,coords=c("X","Y"),crs=st_crs(nc))  %>% group_by(id)  %>%  summarise(i=mean(id)) %>% st_cast("POLYGON")

Now compare the perimet of polygon's included in our study area:

sum(st_length(crop_nc),na.rm=T)
1307555 [m]

sum(st_length(crop_NC_sf),na.rm=T)
2610959 [m]

Obviously not working very well...

Result

Questions:

is there another way to speed up st_crop()
is there a way to recreate a polygon from points in preserving the 'original' order points are connected to each others?

↧

Gather duplicating rows

November 29, 2019, 9:03 am

≫ Next: using ggplot to add error bars to Barplot

≪ Previous: Method(s) to speed up st_crop (sf package) on large datasets

Newbie again in need for help.

I'm stuck with a transpose that is duplicating rows.

My data is structured like this:

id date client status  agent_1 agent_2 agent_3...agent_10  flag_1 Flag_2 Flag_3  order_num_1 Order_num_2
1  xxx   yyy   01        A1        C2    E3
2  xxx  yyyy   02        B1        D2    F3
3  xxx  yyy    03        C1        E2    G3

I want:

id  date   client  status  agent flag  order_num
1    xxx    yyy     01      A1
1    xxx   yyyy     01      C2
1    xxx   yyy      01      E3
2
2
3
3

The client information will repeat and the columns _1 to _10 will become rows.

I have the code below. The first gather agent works perfectly but then when I run the next gather it duplicates my df, with all possible combinations.

 df1 <- df %>% 
      select( id,
              created_on,
              date_reported,
              client_name,
              status_code, 
              starts_with('agent'), 
              starts_with('Flag'), 
              starts_with('order_num'),
              starts_with('order_date'),
              starts_with('manufacturer'))  %>%
      gather(key, value = "agent", starts_with('agent')) %>%
      select(-starts_with('key'))%>%
      filter(!is.na(agent))
      gather(key1, value = "flag", starts_with('Flag'))%>%
      gather(key2, value = "order_num", starts_with('order_num'))  %>%
      gather(key3, value = "order_date", starts_with('order_date'))   %>%
      gather(key4, value = "manufacturer", starts_with('manufacturer'))

What am I doing wrong? I tried group_by agent but it didn't work.

↧

using ggplot to add error bars to Barplot

November 29, 2019, 9:04 am

≫ Next: Pull Data for each Dow Jones 30 company with R

≪ Previous: Gather duplicating rows

I am drawing a bar plot of means for a diversity index for a study i have done. I have calculated the index for each sample and added it to my table (which i orignally read in). I then calculated the index means for two different environments and plotted those means. However, i cannot work out how to add error bars.i understand ggplot2 is a useful tool for doing this, but cannot get my head around the explanation.

SO, basically trying to take the means of two values from a table, and plot a bargraph with error bars. this is my code at the moment

mean Shannon of river and lake

`mean_river <- mean(parasite_data$Shannon.index[1:24])
mean_lake <- mean(parasite_data$Shannon.index[25:43])`

matrix of means #

Shannon_mean <- matrix(c(mean_river, mean_lake), nrow = 1, ncol = 2, dimnames = list(c("mean"), c("River","Lake")))

plot graph #

`barplot(Shannon_mean, 

# name axis 
    xlab = "Environment", ylab = "Shannon Diversity Index", 

# title of graph 
    main = "Diversity of Parasites found on Fish from River 
            and Lake Environments", 

# size of title text and colour of bars #
    cex.main = 1, col = "gray80")`

like i said, i have attempted to use ggplot, but cannot read the data in correctly. any help would be appreciated.

↧

Pull Data for each Dow Jones 30 company with R

November 29, 2019, 9:07 am

≫ Next: R - Unexpected NaN in rcorr results

≪ Previous: using ggplot to add error bars to Barplot

I am currently facing a problem regarding the pulling of data in R. With the getSymbols() function I can already obtain data for each Dow Jones company.

for example this gives me the data for apple from a chosen start_date until end_date:

'getSymbols("AAPL", src = "yahoo", from = start_date, to = end_date'

However I would somehow like to pull the data for all 30 companies. I dont want to do every company manually, so I was wondering if there is any way to pull the data all at once. So far I have not found any information regarding my problem.

I would be very thankfull for any help.

↧

R - Unexpected NaN in rcorr results

November 29, 2019, 9:08 am

≫ Next: How do I load an rds file into R

≪ Previous: Pull Data for each Dow Jones 30 company with R

I'm working with a dataset of survey responses, where I need to derive a pearson correlation between each of the questions and the overall mean of all responses.

So the survey responses are in a matrix x, 163 rows of 48 variables, num[1:163,1:48]. The numbers contained are any of 0,25,50,75,100 or NA (interval scale choices or no answer)

I have a vector v of length 163, all values are 95.1, which is the overall mean for the dataset (sum responses / count responses) with NAs removed.

Combine to a new matrix, run rcorr, and pull the $r result.

x.matrix.final <- as.matrix(cbind(x, v))
x.corr <- rcorr(x.matrix.final, type="pearson")
x.corr.r <- rcorr$r

The resulting matrix of x.corr.r gives me NaN for the full row and column of v except their intersection which is 1.

resulting matrix

x.matrix.final is definitely a matrix [1:163, 1:49]

I'm stumped.

↧

How do I load an rds file into R

November 29, 2019, 9:10 am

≫ Next: Count occurences of lists efficiently

≪ Previous: R - Unexpected NaN in rcorr results

I'm very new to using R, so this might hopefully be a very easy question for some of you.

I'm trying load a rds file which should contain some data.

Professor says to use readRDS(), however R gives this long error message which I've not been able to decipher myself.

I'm trying to load the file and place it in an Object.

I hope that someone are able to provide some clever solution to this problem.

Canteen_clean <- readRDS("C:/Users/a_s_j/OneDrive/Studie/Cand.merc.Business Intelligence/1. Semester/R for Business Analytics/.Rproj/39 - Graphics/Exercises02/canteen_clean.rds")

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file 'C:/Users/a_s_j/OneDrive/Studie/Cand.merc.Business Intelligence/1. Semester/R for Business Analytics/.Rproj/39 - Graphics/Exercises02/canteen_clean.rds', probable reason 'No such file or directory'

I'm using: pacman::p_load("pacman", "tidyverse") to load the packages that should be necessary.

↧

Count occurences of lists efficiently

November 29, 2019, 9:15 am

≫ Next: lapply misses a variable name when used with lm and wanting to reduce iteration?

≪ Previous: How do I load an rds file into R

I have a list containing a few millions of lists, these sublists have a few distinct possible values, maybe 10 to 100.

I want to count the number of occurrences of these values.

The code below works but it is very slow. Can we do this faster ?

count_by_list <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  res <- tibble::tibble(!!var_nm := unique_lst, !!count_nm := NA)
  for(i in seq_along(unique_lst)){
    res[[count_nm]][[i]] <- sum(lst %in% res[[var_nm]][i])
  }
  res
}

x <- list(
  list(a=1, b=2),
  list(a=1, b=2),
  list(b=3),
  list(b=3, c=4))

count_by_list(x)
#> # A tibble: 3 x 2
#>   x                    n
#>   <list>           <int>
#> 1 <named list [2]>     2
#> 2 <named list [1]>     1
#> 3 <named list [2]>     1

^{Created on 2019-11-29 by the reprex package (v0.3.0)}

I tried hashing with the library digest but it was actually slower, and getting worse as n increases :

library(digest)
count_by_list2 <- function(lst, var_nm = as.character(substitute(lst)), count_nm = "n"){
  unique_lst <- unique(lst)
  digested   <- vapply(lst, digest, character(1))
  res        <- as.data.frame(table(digested))
  names(res) <- c(var_nm, count_nm)
  res[[1]] <- unique_lst
  res
}

If you need to benchmark you can use x_big <- unlist(replicate(10000 ,x, F), recursive = FALSE).

I added the tags rcpp and parallel processing as these might help, these are not constraints on the answers.

↧

lapply misses a variable name when used with lm and wanting to reduce iteration?

November 29, 2019, 9:22 am

≫ Next: How to extract data from an interactive graph

≪ Previous: Count occurences of lists efficiently

I have used melt to combine all my 32 columns into a single column, their values into a single column, and the independent variable into a single column.

I then wanted to use lapply to generate a lm matching the rows of
Years Species Farmland

There were two ways of me wanting to do this;
1. To take the lm of one variable name i.e. Starling values across all years (1994:2013)
2. To take the lm of all variables names I.e. Starling, Skylark, Lapwing .... Farmland values together across each year.

An example of my data:

structure(list(Years = c(1994L, 1994L, 1995L, 1996L, 1997L, 1998L
), Species = structure(1:6, .Label = c("Starling", "Skylark", 
"YellowWagtail", "Kestrel", "Yellowhammer", "Greenfinch"), class = "factor"), 
    Farmland = c(13260L, 13520L, 8129L, 15575L, 18686L, 18541L
    )), row.names = c(1L, 20L, 40L, 60L, 80L, 100L), class = "data.frame")

A further example:

'data.frame':   570 obs. of  3 variables:
 $ Years   : int  1994 1995 1996 1997 1998 1999 2000 2002 2003 2004 ...
 $ Species : Factor w/ 30 levels "Starling","Skylark",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Farmland: int  13260 15551 16335 18997 18571 18376 15770 16054 15101 16276 ...

Code for lm for Q.1:

df_try <- lapply(1:n, function(x) lm(Farmland ~ Years + Species, work_practice))

The output:

Call:
lm(formula = Farmland ~ Years + Species, data = work_practice)

Coefficients:
         (Intercept)                 Years        SpeciesSkylark  
           -708278.6                 363.0                 578.8  
SpeciesYellowWagtail        SpeciesKestrel   SpeciesYellowhammer  
             -9329.8                -744.4                -238.7  
   SpeciesGreenfinch        SpeciesSwallow    SpeciesHousemartin  
               246.3                 506.6               -3928.5  
       SpeciesLinnet  SpeciesGreyPartridge     SpeciesTurtleDove  
              -680.2               -5825.1               -5417.4  
  SpeciesCornbunting      SpeciesBullfinch     SpeciesSongthrush  
            -12187.9               -5688.7                -279.1  
    SpeciesBlackbird        SpeciesDunnock    SpeciesWhitethroat  
               490.2                 299.0                 231.6  
         SpeciesRook    SpeciesReedBunting      SpeciesStockdove  
              -653.9               -6864.5               -1788.0  
    SpeciesGoldfinch        SpeciesJackdaw           SpeciesWren  
               156.6                -637.3                 553.1  
        SpeciesRobin        SpeciesBluetit       SpeciesGreatTit  
               328.7                 460.3                 384.3  
SpeciesLongtailedTit      SpeciesChaffinch        SpeciesBuzzard  
             -1359.8                 499.7               -6888.2  
  SpeciesSparrowhawk  
             -4458.5

The problem with this; Starling is missing (The first variable name), and Years is not necessary for the result (How can this be removed) this is iterated on call 19 times which I assume is because of the dataframe. Is there a way to call this only once?

I have tried doing this when the variable (Species) was in columns but the output only calls for one variable 19 times...

↧

How to extract data from an interactive graph

November 29, 2019, 9:22 am

≫ Next: Object '.' not found while piping with dplyr

≪ Previous: lapply misses a variable name when used with lm and wanting to reduce iteration?

I need to fetch datapoints from the website that provides aggregated polling number. The data was presented in an interactive graph. How should I fetch all the data points (date:number pair)for every candidate? I have tried to analyze and inspect the source code but couldn't find the data file it points to. I will be comfortable with solutions either in Python or R. Your help will be much appreciated.

↧

Object '.' not found while piping with dplyr

November 29, 2019, 9:25 am

≫ Next: Make a download button in shiny app that saves a data.frame in an excel file and downloads the excel file in the user's computer

≪ Previous: How to extract data from an interactive graph

I am trying to conduct a survival curve using the survival package. The MWE code is as follows:

df %>% 
  filter(fac <= "Limit") %>% 
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>% 
  ggsurvplot(fit = .)

I get the error Error in eval(fit$call$data) : object '.' not found

When I try to break this down further by:

 survfit <- df %>% 
  filter(fac <= "Limit") %>% 
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .)

 ggsurvplot(fit = survfit)

I get an identical error. Is anyone able to figure out how to pipe from my dataframe all the way through a survival curve? The reason I would like to do this is to streamline the filtering of my dataframe in order to produce a multitude of different survival curves without having to create many subsetted dataframes.

↧

Make a download button in shiny app that saves a data.frame in an excel file and downloads the excel file in the user's computer

November 29, 2019, 9:25 am

≫ Next: automate ggplots while using variable labels as title and axis titles

≪ Previous: Object '.' not found while piping with dplyr

I have made an application on R shiny that I share with friends on the same network. I host the application on my remote computer and people connect to it using dedicated ports. I am trying to make a download button that saves a data.frame in an excel file and downloads the excel file in the user's computer. Currently, I am able to make the download button but it only writes the file to a folder in the remote server. I need help with the downloadhandler essentially. Thank you for your time. Here a csc reproducible example

library(shiny) 
library(e1071)
library(rminer)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggvis)
library(corrplot)
library(DT)
library(caret)
ui <- navbarPage(title = "HR Analytics         ",

                 tabPanel("Data Import",
                          sidebarLayout(sidebarPanel(
                            fileInput('file1', 'Choose CSV File to upload',
                                      accept=c('text/csv', 
                                               'text/comma-separated-values,text/plain', 
                                               '.csv')),
                            helpText("Note: Please ensure that the the file is in .csv",
                                     "format and contains headers."),
                            tags$hr(),
                            actionButton("do", "Import")
                          ),
                          mainPanel(h2(helpText("Descriptive Statistics")),
                                    verbatimTextOutput('contents'))
                          )
                 ),#tabpanel
                 tabPanel("Predictive Model",
                          sidebarLayout(sidebarPanel(
                            uiOutput("model_select"),
                            actionButton("enter", "Enter")
                          ),
                          mainPanel(h2(helpText("Model Output")),
                                    verbatimTextOutput('modelOutput'))
                          )
                 ),#tabpanel
                 tabPanel("Report",
                          sidebarLayout(sidebarPanel(
                            tags$style(type="text/css",
                                       ".shiny-output-error { visibility: hidden; }",
                                       ".shiny-output-error:before { visibility: hidden; }"
                            ),
                            helpText("Download final list of employess to be retained"),
                            br(),
                            uiOutput("modsel"),
                            helpText("Select Model"),
                            uiOutput("noselect"),
                            helpText("Select number pf employess"),
                            downloadButton('downloadData', 'Download'),
                            helpText("Download final list of employees to be retained")
                          ),

                          mainPanel(h2(helpText("Retained Employees")),
                                    dataTableOutput("reportOutput"))
                          )
                 )#tabpanel


)


library(shiny)

server <- function(input, output) {

  hr = eventReactive(input$do,{
    inFile <- input$file1

    if (is.null(inFile))
      return(NULL)

    hr = read.csv(inFile$datapath, header=T, sep=",")
  })

  output$contents <- renderPrint({
    return(summary(hr()))
  })

  output$model_select<-renderUI({
    selectInput("modelselect","Select the model",choices = c("Tree Learning"="rpart","Logistic Regression"="LogitBoost", "Naive Bayes" = "nb"))
  })

  output$modsel<-renderUI({
    selectInput("modelselect2","Select Algo",choices = c("Logistic Regression","Naives Bayes","Tree Learning"),selected = "Logistic_reg")
  })

  output$noselect<- renderUI({
    sliderInput("noselect", "Number of observations:",
              min = 0, max = 300, value = 20)})


  algo = eventReactive(input$enter,{
    return(input$modelselect)
  })


  output$modelOutput <- renderPrint({
    hr_model <- hr() %>% filter(left==0 | last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
    hr_model$left <- as.factor(hr_model$left)
    train_control<- trainControl(method="cv", number=5, repeats=3)
    rpartmodel<- train(left~., data=hr_model, trControl=train_control, method=algo())
    # make predictions
    predictions<- predict(rpartmodel,hr_model)
    hr_model_tree<- cbind(hr_model,predictions)
    # summarize results
    confusionMatrix<- confusionMatrix(hr_model_tree$predictions,hr_model_tree$left)
    confusionMatrix
  })

  rt <- reactive(
    if(input$modelselect2== "Logistic Regression"){
      f1<-data()
      hr_model1 <- hr() %>% filter(left==0 |last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
      hr_model1$left <- as.factor(hr_model1$left)
      train_control<- trainControl(method="cv", number=5, repeats=3)
      # Keep some data to test again the final model
      set.seed(100)
      inTraining <- createDataPartition(hr_model1$left, p = .75, list = FALSE)
      training <- hr_model1[ inTraining,]
      testing  <- hr_model1[-inTraining,]
      # Estimate the drivers of attrition
      logreg = glm(left ~ ., family=binomial(logit), data=training)
      # Make predictions on the out-of-sample data
      probaToLeave=predict(logreg,newdata=testing,type="response")
      # Structure the prediction output in a table
      predattrition = data.frame(probaToLeave)
      # Add a column to the predattrition dataframe containing the performance
      predattrition$performance=testing$last_evaluation

      predattrition$priority=predattrition$performance*predattrition$probaToLeave
      orderpredattrition=predattrition[order(predattrition$priority,decreasing = TRUE),]
      orderpredattrition <- head(orderpredattrition, n=input$noselect)
      or<- data.frame(orderpredattrition)
      or
    }
    else if(input$modelselect2== "Naives Bayes"){
      f1<-data()
      hr_model1 <- hr() %>% filter(left==0 |last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
      hr_model1$left <- as.factor(hr_model1$left)
      train_control<- trainControl(method="cv", number=5, repeats=3)
      # Keep some data to test again the final model
      set.seed(100)
      inTraining <- createDataPartition(hr_model1$left, p = .75, list = FALSE)
      training <- hr_model1[ inTraining,]
      testing  <- hr_model1[-inTraining,]

      # Estimate the drivers of attrition
      e1071model2 = naiveBayes(left ~ ., data=training)
      # Make predictions on the out-of-sample data
      probaToLeave=predict( e1071model2,newdata=testing[,c(-7,-9,-10)],type="raw")
      # Structure the prediction output in a table
      predattrition = data.frame(probaToLeave)
      colnames(predattrition) <- c("c","probaToLeave")
      predattrition[1] <- NULL 
      # Add a column to the predattrition dataframe containing the performance
      predattrition$performance=testing$last_evaluation
      predattrition$priority=predattrition$performance*predattrition$probaToLeave
      orderpredattrition=predattrition[order(predattrition$priority,decreasing = TRUE),]
      orderpredattrition <- head(orderpredattrition, n=input$noselect)
      or<- data.frame(orderpredattrition)

    }

    else if(input$modelselect2== "Tree Learning"){
      f1<-data()
      hr_model1 <- hr() %>% filter(left==0 |last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
      hr_model1$left <- as.factor(hr_model1$left)
      train_control<- trainControl(method="cv", number=5, repeats=3)
      # Keep some data to test again the final model
      set.seed(100)
      inTraining <- createDataPartition(hr_model1$left, p = .75, list = FALSE)
      training <- hr_model1[ inTraining,]
      testing  <- hr_model1[-inTraining,]
      # Estimate the drivers of attrition
      rpartmodel = rpart(left ~ satisfaction_level+last_evaluation+number_project+average_montly_hours+time_spend_company+Work_accident+promotion_last_5years,method = "anova",data=training)
      # Make predictions on the out-of-sample data
      probaToLeave=predict(rpartmodel,newdata=testing[,c(-7,-9,-10)],type="vector")
      # Structure the prediction output in a table
      predattrition = data.frame(probaToLeave)*0.5
      # Add a column to the predattrition dataframe containing the performance
      predattrition$performance=testing$last_evaluation

      predattrition$priority=predattrition$performance*predattrition$probaToLeave
      orderpredattrition=predattrition[order(predattrition$priority,decreasing = TRUE),]
      orderpredattrition <- head(orderpredattrition, n=input$noselect)

      or<- data.frame(orderpredattrition)
      or
    }
  )


  output$reportOutput = renderDataTable({
    rt()
  })

  output$downloadData <- downloadHandler(
    filename = function() { paste(input$modelselect2, '.csv', sep='') },
    content = function(file){
      write.csv(rt(), file)
    }
  )

}
shinyApp(ui=ui, server = server)

↧

automate ggplots while using variable labels as title and axis titles

November 29, 2019, 9:31 am

≫ Next: cannot allocate vector of size 12.9 Gb - R [duplicate]

≪ Previous: Make a download button in shiny app that saves a data.frame in an excel file and downloads the excel file in the user's computer

I'm getting a bit mixed up with NSE and a plot function. I'm trying to automate plotting a few plots at once while labeling axis etc. using variable labels (not names). Lets say we have a large dataset with all variables already labelled. Small example here:

library(tidyverse)
library(sjlabelled)
library(ggplot2)
library(cowplot)
data("diamonds")
diamonds <- diamonds %>% 
  var_labels(
  cut ="nice cut",
  color = "all colours",
  clarity = "very claity all",
  depth = "test depth")

The basic plot I want is this:

p1 <- ggplot(diamonds, aes(x = cut, y = depth)) + geom_boxplot(aes(fill = cut)) +
  theme_cowplot() + 
  lab(title = "Plot of test depth ~ nice cut",   #based on label variable
                           x = "nice cut",      #based on label variable
                           y = "test depth",    #based on label variable
                         fill = "nice cut")    #based on label variable
p1

I want to automate this plot by cycling through other variables. So I want boxplots of columns in vars separately by depth The following is what I'm trying to do.

#firstly i think i should have labels separately (wondering is there a way I can use them directly from variable label?)

my_labels <-   c(
  cut = "nice cut",
  color = "all colours",
  clarity = "very claity all",
  depth = "test depth"
)

#plot function
plot_f <- function(df, x_var, y_var, x_var_label, y_var_label) {
  ggplot(df, aes(x = {{x_var}}, y = {{y_var}})) + geom_boxplot(aes(fill = {{x_var}})) +
    theme_cowplot() + labs(title = paste("Plot of", {{x_var_label}}, "~", {{y_var_label}}),
                           x = {{x_var_label}},
                           y = {{y_var_label}},
                           fill = {{x_var_label}})
}

#variables to cycle through
vars <- c("cut", "color", "clarity")
plot_list <- vars %>% 
  pmap(~plot_f(diamonds, .x, depth, my_labels)) #need to specify y_var_label & x_var_label, is there a 
#way I can just specify my_labels here?

#Finally plot lists
grid.arrange(grobs = plot_list, ncol = 1)

Other code attempts

Thats the approach I was thinking, I wonder am I better off trying to add the labels later separately as was done here using plot_list$labels?

#Also tried a for loop which worked but the fill didnt (and also missing the variable labels)
p <- list()
for(i in vars){
  p[[i]] <- ggplot(diamonds, aes_string(x = i, y = "depth", fill = i)) + geom_boxplot() +
    #note aes_string instead of aes
    theme_cowplot()
}
grid.arrange(grobs = p, ncol = 1)

EDIT

This simpler version plots but the plot isn't capturing the fill correctly and obviously missing the variable labels (paste etc) I want:

    #plot function
    plot_f <- function(df, x_var, y_var) {
      ggplot(df, aes(x = {{x_var}}, y = {{y_var}})) + geom_boxplot(aes(fill = {{x_var}})) +
        theme_cowplot() 
    }

    plot_f(diamonds, cut, depth )  #plots fine

    #variables to cycle through
    vars <- c("cut", "color", "clarity")

#runs
    plot_list <- vars %>% 
      map(., ~plot_f(diamonds, .x, depth))

    #plots but fill isn't correct
    grid.arrange(grobs = plot_list, ncol = 1)

Any help appreciated.

↧

cannot allocate vector of size 12.9 Gb - R [duplicate]

November 29, 2019, 9:36 am

≫ Next: Create new column based on if rows from 2 data frames match

≪ Previous: automate ggplots while using variable labels as title and axis titles

This question already has an answer here:

R memory management / cannot allocate vector of size n Mb 8 answers

Hi I face a unique problem in R. Some times back I attempted Prudential Life Insurance Data Assessment for practice. I used randomforest in R and submitted my submission on kaggle. Now whenever I again run same model with same code on my PC I get error "cannot allocate vector of size 12.9 Gb" Some days ago I attempted same with doParallel package in R. First time model executed but now again it gives message "cannot allocate vector of size 12.9 Gb". I am not able to run model again. I closed down R, shut down PC, restarted, restarted R many times. But the error never goes. Can any body help here? Thanks in advance.

↧

Create new column based on if rows from 2 data frames match

November 29, 2019, 9:37 am

≫ Next: Using fasterize in R to write a raster

≪ Previous: cannot allocate vector of size 12.9 Gb - R [duplicate]

This seems simple enough, but can't figure it out. I'd like to create a new column in df2 (impute_id) that identifies whether or not the value (measurement) was imputed or if it is the raw, observed value from df1. If the rows match, then in the new column in df2, impute_id, assign the string observed and if the rows do not match, then assign the string imputed. I'd like to do this using dplyr if possible. Also to note, the rows in the data frames may not be in the same order even though they are in the example.

Example

Raw data

df1
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a          NA
3     0     HPLC cis,cis-Muconic acid      a          NA
4     0     HPLC            D-Glucose      b          NA
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b          NA
7    22     HPLC            D-Glucose      a          NA
8    22    OD600      Optical Density      a          NA
9    24     HPLC cis,cis-Muconic acid      a          NA
10   24     HPLC            D-Glucose      b    33.95529

Imputed Data

df2
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a    33.95529
3     0     HPLC cis,cis-Muconic acid      a     0.57561
4     0     HPLC            D-Glucose      b    33.95529
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b     0.57561
7    22     HPLC            D-Glucose      a    33.95529
8    22    OD600      Optical Density      a     0.14430
9    24     HPLC cis,cis-Muconic acid      a     0.57561
10   24     HPLC            D-Glucose      b    33.95529

Desired Output

df2
   time protocol     measurement_type sample measurement  impute_id
1     0     HPLC cis,cis-Muconic acid      a     0.57561   observed
2     0     HPLC            D-Glucose      a    33.95529    imputed
3     0     HPLC cis,cis-Muconic acid      a     0.57561    imputed
4     0     HPLC            D-Glucose      b    33.95529    imputed
5     0    OD600      Optical Density      b     0.14430   observed
6    22     HPLC cis,cis-Muconic acid      b     0.57561    imputed
7    22     HPLC            D-Glucose      a    33.95529    imputed
8    22    OD600      Optical Density      a     0.14430    imputed
9    24     HPLC cis,cis-Muconic acid      a     0.57561    imputed
10   24     HPLC            D-Glucose      b    33.95529   observed

Reproducible Data

Raw Data

df1 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
NA, NA, NA, 0.1443, NA, NA, NA, NA, 33.95529)), row.names = c(NA, 
-10L), class = "data.frame")

Imputed Data

df2 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
33.95529, 0.57561, 33.95529, 0.1443, 0.57561, 33.95529, 0.1443, 
0.57561, 33.95529)), row.names = c(NA, -10L), class = "data.frame")

↧

Using fasterize in R to write a raster

November 29, 2019, 9:41 am

≫ Next: Does do() function from (dplyr package) automatically show a progress bar in R?

≪ Previous: Create new column based on if rows from 2 data frames match

I've been using the fasterize package lately to convert sf polygons to rasters:

https://cran.rstudio.com/web/packages/fasterize/fasterize.pdf

When I am dealing with large files, it would be better for me to write directly to disk instead of memory. So for example rather than doing this:

fasterize(polygon_file, raster_template, field = 'value')

I would do this:

fasterize(polygon_file, raster_template, field = 'value', file = 'output.tif')

This does not seem possible. Does anyone have a suggestion as to how to do this?

Thanks.

↧