Active questions tagged r - Stack Overflow

string abbreviation creating duplicates


I'm trying to use abbreviate to come up with short unique abbreviations, but it's returning some unexpected values. If I run:

abbreviate(c('moscowcity', 'ms'), minlength = 2)
moscowcity         ms 
    "mscw"       "ms" 

it returns "mscw" instead of a simpler two-letter abbreviation such as "mo", "mc", "mt", or "my".

If I set strict = TRUE, it returns duplicates.

Is there any way to get both as two-letter abbreviations that are also unique?
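
For reference, a minimal illustration of the strict = TRUE behaviour described above (my reconstruction of the output, not taken verbatim from a session):

abbreviate(c('moscowcity', 'ms'), minlength = 2, strict = TRUE)
# with strict = TRUE each result is truncated to exactly minlength
# characters, and no attempt is made to keep the results unique:
# moscowcity         ms 
#       "ms"       "ms" 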


Author vs Contributor for R package - which role for small code contribution?


If someone has provided a useful block of code to an R library, what is the appropriate role for them in package documentation?

Why it matters

I have seen the contributor role given in such cases, but after reviewing the definitions of author and contributor, I believe author is the appropriate role. There may also be something more appropriate (or perhaps both author and contributor).

What the Library of Congress says

Author:

A person, family, or organization responsible for creating a work that is primarily textual in content, regardless of media type (e.g., printed text, spoken word, electronic text, tactile text) or genre (e.g., poems, novels, screenplays, blogs). Use also for persons, etc., creating a new work by paraphrasing, rewriting, or adapting works by another creator such that the modification has substantially changed the nature and content of the original or changed the medium of expression

Contributor:

A person, family or organization responsible for making contributions to the resource. This includes those whose work has been contributed to a larger work, such as an anthology, serial publication, or other compilation of individual works. If a more specific role is available, prefer that, e.g. editor, compiler, illustrator

Possible points of confusion

  • When a pull request is accepted, GitHub will refer to the creator of the PR as a 'contributor'.
  • In day-to-day conversation, someone who provides useful input into a project could reasonably be called a 'contributor'.
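
For concreteness, this is how the two roles are usually encoded in a package's DESCRIPTION file (a sketch with made-up names; "aut" marks an author, "ctb" a contributor, and "cre" the maintainer):

Authors@R: c(
    person("Jane", "Doe", role = c("aut", "cre")),
    person("John", "Smith", role = "ctb"))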

Visualising two very different distributions in one plot


This is more of a "how would you do it" than a "how to do it" question.

I have two groups, "a" and "b". "a" consists of responses that are more or less normally distributed, bounded between 0 and 1. "b", however, is mostly 1s.

The process that generated the data is a questionnaire: people in group "a" made mistakes, while people in group "b" mostly figured it out.

How do I visualise these data side by side? Boxplots are messed up because the median of one of the groups is basically 1. Violin plots mess up the widths.

Here is a reproducible example with the rough idea.

library(tidyverse)

d = tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),
    rnorm(5, mean = 0.5, sd = 0.25),
    rep(1, 95)
  ),
  var = c(
    rep('a', 100), rep('b', 100)
  )
) %>%
  filter(
    val >= 0,
    val <= 1
  )

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin()

# No.

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_boxplot()

# No-no.
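
One option that might be worth trying (my suggestion, not part of the original question): geom_violin() can scale the violins to a common width, or to their counts, instead of a common area, which sometimes keeps very unequal distributions readable side by side.

d %>%
  ggplot(aes(x = var, y = val)) +
  geom_violin(scale = "width") +  # or scale = "count"
  geom_jitter(width = 0.1, alpha = 0.3)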

Is there a way not to download the entire CKAN data package with ckanr in R?


I am unfamiliar with CKAN and am struggling to get data from a CKAN data API loaded into R using the ckanr package from CRAN. Server-side, the download is limited to 100,000 entries, so trying to download all 420,000 entries will not return more than the first 100,000. The dataset provides an updated history of the CO2 emission from electricity consumed in Denmark, measured in g/kWh in 5-minute intervals.

Using the code snippet below, I am able to download the first 100,000 entries. I am interested in an interval of approximately 20,000 values in a specific date range. The data could be downloaded manually via a web interface, but as I will have to update the data regularly and would have to download multiple CSV files manually each time, a data API download would be much preferred.

Any help would be much appreciated.

The web interface: https://www.energidataservice.dk/dataset/co2emis/resource_extract/b5a8e0bc-44af-49d7-bb57-8f968f96932d

The Data API can be accessed via the following actions of the CKAN action API.

Query: https://api.energidataservice.dk/datastore_search

Query (via SQL): https://api.energidataservice.dk/datastore_search_sql

require(ckanr)
start_date <- min(opladning$start)
end_date <- max(opladning$slut)

ckanr_setup(url = "https://energidataservice.dk")
pkco2emis <- package_show("6e05f3b6-fcd7-4b40-8100-4416b9803881",
                          as = "table")

temp <- tempfile(fileext = ".csv")
download.file(pkco2emis$resources$url, temp)
co2emission <- read.csv(temp)
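
For what it's worth, a sketch of a filtered query with ckanr's datastore functions (untested; the resource id comes from the web-interface URL above, and the timestamp column name "Minutes5UTC" is my assumption about this dataset):

require(ckanr)
ckanr_setup(url = "https://api.energidataservice.dk")
# let the server filter by date, so only the ~20,000 rows of interest
# are transferred instead of hitting the 100,000-row cap
res <- ds_search_sql(
  sql = paste0('SELECT * FROM "b5a8e0bc-44af-49d7-bb57-8f968f96932d" ',
               "WHERE \"Minutes5UTC\" >= '2020-01-01' ",
               "AND \"Minutes5UTC\" < '2020-02-01'"),
  as = "table")
co2emission <- res$records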

How to time a reactive input update in R Shiny for inputs in a renderUI/uiOutput block that haven't been created yet?


In an app I'm trying to implement some "restore settings" functionality, but I'm having trouble restoring the values of dynamically created inputs. I've created a simple example to illustrate my problem.

library(shiny)

ui = fluidPage(
    sidebarLayout(
        sidebarPanel(
            sliderInput("num_sliders", "Number of sliders:",
                        min = 0,max = 10, value = 0),
            actionButton("load1", "Load v1"),
            actionButton("load2", "Load v2")
        ),

        # Show the dynamically generated sliders
        mainPanel(
           uiOutput("dynamic_sliders")
        )
    )
)

server = function(input, output, session) {
    rv = reactiveValues()
    rv$restored_inputs = FALSE
    output$dynamic_sliders = renderUI({
        req(input$num_sliders > 0)
        fluidRow(tagList(
            lapply(1:input$num_sliders,
            function (n)
                column(3, sliderInput(paste0("slider", n),
                                      paste("Slider", n),
                                      min = 0, max = 9, value = 0)))
        ))
    })

    observeEvent(input$load1, {
        updateSliderInput(session, "num_sliders", value = 3)
        updateSliderInput(session, "slider1", value = 5)
    })

    observeEvent(input$load2, {
        updateSliderInput(session, "num_sliders", value = 3)
        rv$restored_inputs = TRUE

    })

    observe({
        req(rv$restored_inputs)
        updateSliderInput(session, "slider2", value = 5)
        rv$restored_inputs = FALSE
    }, priority = -999)
}

shinyApp(ui = ui, server = server)

If I press "Load v1" it's clear the problem is that I'm trying to update an input that hasn't been created yet (the output is invalidated, but not recalculated before I try to update the value).

What's not clear is why "Load v2" doesn't solve the problem. After I update num_sliders, I instead trigger a reactive value. Both the renderUI and observe blocks get invalidated. I thought that because I lowered the observe block's priority, the renderUI block would recalculate first, and the input slider would then exist at the time I try to update its value.

I tried using reactLog to work through the invalidations, but I still can't figure it out. Does anyone know how I can force the desired execution order?

The problem obviously doesn't happen for either load button if num_sliders is manually changed to be greater than 0 before testing.

I've looked into some of Shiny's native bookmarking features and don't think they're a solution here, as in the full app I need very tight control over what gets restored, and when, in order to maintain performance.
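
One workaround sketch (my suggestion, not from the question): instead of updating the sliders after they exist, let renderUI itself read the values to restore, so the sliders are created with the restored values in the first place.

    observeEvent(input$load2, {
        rv$pending <- list(slider2 = 5)  # values to restore, keyed by input id
        updateSliderInput(session, "num_sliders", value = 3)
    })

    output$dynamic_sliders = renderUI({
        req(input$num_sliders > 0)
        fluidRow(tagList(
            lapply(1:input$num_sliders, function(n) {
                id <- paste0("slider", n)
                # create the slider with its pending restore value, if any
                val <- if (!is.null(rv$pending[[id]])) rv$pending[[id]] else 0
                column(3, sliderInput(id, paste("Slider", n),
                                      min = 0, max = 9, value = val))
            })
        ))
    })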

R: capture the value of the previous row


I have code which identifies when property ownership has transitioned to a new owner. In the column transition, the code assigns the value 1 every time there is a transition and 0 if the property has not transitioned. The 1 is assigned to the new owner in the transition. This helps me capture data on all new buyers.

I now need code which identifies the seller of the property (i.e. the row above the new owner with the value of 1). I believe this is possible with dplyr's lag function, but I am having trouble implementing it.

For example, if there is a transition between A (seller) and B (buyer), I can only currently identify the buyer B (transition=1) but want to identify the seller, too.

Here is the code for buyers:

transitions <- transitions %>%
  group_by(property) %>%
  mutate(transition = ifelse(name != dplyr::lag(name), 1, 0))
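
A sketch of one possibility (my suggestion, untested): the seller is the row before a transition, which can be flagged with dplyr::lead(), the mirror image of lag().

transitions <- transitions %>%
  group_by(property) %>%
  mutate(
    transition = ifelse(name != dplyr::lag(name), 1, 0),
    # the seller is a row whose *next* owner is different
    seller = ifelse(name != dplyr::lead(name), 1, 0)
  )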

Error in file(file, "rt") : cannot open the connection


I'm new to R, and after researching this error extensively, I'm still not able to find a solution for it. Here's the code. I've checked my working directory and made sure the files are in the right directory. I'd appreciate any help. Thanks!

pollutantmean <- function(directory, pollutant = "nitrate", id = 1:332) {
  if (grep("specdata", directory) == 1) {
    directory <- "./specdata"
  }
  mean_polldata <- c()
  specdatafiles <- as.character(list.files(directory))
  specdatapaths <- paste(directory, specdatafiles, sep = "")
  for (i in id) {
    curr_file <- read.csv(specdatapaths[i], header = TRUE, sep = ",")
    remove_na <- curr_file[!is.na(curr_file[, pollutant]), pollutant]
    mean_polldata <- c(mean_polldata, remove_na)
  }
  mean_results <- mean(mean_polldata)
  return(round(mean_results, 3))
}

The error I'm getting is below:

Error in file(file, "rt") : cannot open the connection

file(file, "rt")

read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)

read.csv(specdatapaths[i], header = T, sep = ",")

pollutantmean3("specdata", "sulfate", 1:10)

In addition: Warning message:
In file(file, "rt") :
  cannot open file './specdata001.csv': No such file or directory
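
One hint sketch (my inference from the warning text, not a confirmed fix): the warning shows './specdata001.csv' with no "/" between the directory and the file name, which points at the paste() call that builds the paths.

# paste(directory, specdatafiles, sep = "") yields "./specdata001.csv";
# file.path() inserts the separator, giving "./specdata/001.csv"
specdatapaths <- file.path(directory, specdatafiles)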

Removing groups with all NA in data.table or dplyr in R

dataHAVE = data.frame(
  "student" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  "time"    = c(1, 2, 3, 1, 2, 3, 1, 2, 3, NA, NA, NA, NA, 2, 3),
  "score"   = c(7, 9, 5, NA, NA, NA, NA, 3, 9, NA, NA, NA, 7, NA, 5))

dataWANT = data.frame(
  "student" = c(1, 1, 1, 3, 3, 3, 5, 5, 5),
  "time"    = c(1, 2, 3, 1, 2, 3, NA, 2, 3),
  "score"   = c(7, 9, 5, NA, 3, 9, 7, NA, 5))

I have a tall data frame, and in it I want to remove the student IDs whose 'score' values are all NA or whose 'time' values are all NA. This applies only when the values are all NA; if only some are NA, I want to keep all of that student's records...
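
A sketch of one possible dplyr approach (mine) that reproduces dataWANT above:

library(dplyr)
dataWANT <- dataHAVE %>%
  group_by(student) %>%
  # drop a student only when time or score is entirely NA for that student
  filter(!all(is.na(time)), !all(is.na(score))) %>%
  ungroup()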


Is there a size limit in Databricks for converting an R data frame to a Spark DataFrame?


I am new to Stack Overflow and have tried many ways to solve the error, without success. My problem: I CAN convert subsets of an R data frame to a Spark DataFrame, but not the whole data frame. Similar, but not identical, questions include: Not able to to convert R data frame to Spark DataFrame and Is there any size limit for Spark-Dataframe to process/hold columns at a time?

Here is some information about the R data frame:

library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"

dim(df)
[1] 101368     25
class(df)
[1] "data.frame"

When converting this to a Spark DataFrame:

sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) :

However, when I subset the R dataframe, it does NOT result in an error:

sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])

class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

How can I convert the whole data frame to a Spark DataFrame? (I want to use saveAsTable afterwards.) I suspect a capacity problem, but I do not have a clue how to solve it.

Thanks a lot!!
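
One workaround sketch (mine, untested): since the subsets convert fine individually, convert the data frame in chunks and union the resulting SparkDataFrames (SparkR's union appends rows, like SQL's UNION ALL).

chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / 25000))
sdf <- Reduce(union, lapply(chunks, function(idx) as.DataFrame(df[idx, ])))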

glmnet: extracting standardized coefficients


I am running regression models with the function cv.glmnet(). The argument standardize = TRUE standardises all x variables (predictors) prior to fitting the model. However, the coefficients are always returned on the original scale in the output. Is there a way of obtaining standardized coefficients (beta weights), so that the coefficients are comparable?
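
One possible workaround sketch (mine, untested; x and y stand for the predictor matrix and response): standardize the predictors yourself and fit with standardize = FALSE, so the returned coefficients are already on the standardized scale.

library(glmnet)
x_std <- scale(x)  # center and scale each predictor column
fit <- cv.glmnet(x_std, y, standardize = FALSE)
coef(fit, s = "lambda.min")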

Remove horizontal scrollbar on Shiny UI's input


This is a minimal app that reproduces my problem:

library(shiny)

ui <- fluidPage(
  sidebarLayout(

    sidebarPanel(
      sliderInput("input1", "input1", min = as.Date("2020-02-03"), max = as.Date("2020-12-30"), 
                  value = c(as.Date(Sys.Date()), as.Date("2020-12-30"))),
      hr(),
      splitLayout(checkboxGroupInput("input2", "input2", choices = c("a", "b")),
                  verticalLayout(checkboxInput("input3", "input3")))),

  mainPanel()))

server <- function(input, output, session) {

}

shinyApp(ui, server)

The generated app shows a horizontal scrollbar for input3, even when the screen size gives it more than enough space. In other similar questions, people recommend setting a CSS property with overflow: hidden, but I can't figure out where to put this piece of code. Other approaches are obviously welcome.

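If overflow: hidden is indeed the fix (as those other questions suggest), one place it can go is splitLayout's cell styling; a sketch (untested):

      splitLayout(cellArgs = list(style = "overflow: hidden;"),  # style each cell
                  checkboxGroupInput("input2", "input2", choices = c("a", "b")),
                  verticalLayout(checkboxInput("input3", "input3")))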

PDF: how to convert one-column lists to a multiple-column data frame? - lists of people in subgroups inside groups to multiple columns


I have nearly 15 PDFs containing lists of people. These PDFs are only one column wide, so each is a pure list, but the lists are nested (subgroups inside subgroups inside groups...). There is no numerical data apart from the number in front of each person in the list (which is very important for my analysis) and similar ordering information.

I need to pull these lists out of the PDFs and convert them into a conventional data frame.

Here is an example of the structure of one PDF:

TERRITORY ONE
1. GROUP ONE
1. Name Surname
2. Name Surname
3. Name Surname
4. Name Surname
2. GROUP TWO
1. Name Surname
2. Name Surname
3. Name Surname
4. Name Surname
TERRITORY TWO
(...)

This is the first PDF: http://bocyl.jcyl.es/boletines/1983/04/02/pdf/BOCYL-D-02041983-1.pdf

The structure continues as you would imagine (territory two, three, four..., each with subgroups one, two, three, four, etc.). This runs to nearly 600 lines per PDF, and more in the most recent ones.

I need to create a data frame that follows this example structure:

   PERSON    |    TERRITORY  |  GROUP  | POSITION IN LIST
Name Surname | TERRITORY ONE | GROUP 1 |         1
(...)
Name Surname | TERRITORY ONE | GROUP 2 |         4
(...)
Name Surname | TERRITORY TWO | GROUP 1 |         3

One row should be one person.

POSITION IN LIST should refer to the position at which a person appeared in a given year (each PDF is one year), within his TERRITORY and GROUP.

Consider it something like a ranking, in which the order of each person is important. Very few of the people in PDF 1 (year 1) will appear again in PDF 2 (year 2), and then in PDF 3 (year 3), etc. So one objective behind all of this is to know how many people, and who, repeat year after year in these lists.

It is also important for the analysis to know the position of each repeating person in every year, to trace that person's evolution, or to know whether the person disappears after year X, etc.
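
A rough parsing sketch (my assumptions, not verified against the real PDFs: pdftools for text extraction, territory headings in capitals without a leading number, group headings as "N. ALL CAPS", and people as "N. Name Surname"):

library(pdftools)

lines <- unlist(strsplit(pdf_text("BOCYL-D-02041983-1.pdf"), "\n"))
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

territory <- NA_character_
group <- NA_character_
rows <- list()

for (ln in lines) {
  if (!grepl("^[0-9]", ln)) {
    territory <- ln                          # e.g. "TERRITORY ONE"
  } else if (grepl("^[0-9]+\\.\\s*[[:upper:][:space:]]+$", ln)) {
    group <- sub("^[0-9]+\\.\\s*", "", ln)   # e.g. "GROUP ONE"
  } else {
    rows[[length(rows) + 1]] <- data.frame(
      PERSON    = sub("^[0-9]+\\.\\s*", "", ln),
      TERRITORY = territory,
      GROUP     = group,
      POSITION  = as.integer(sub("^([0-9]+)\\..*$", "\\1", ln)))
  }
}

result <- do.call(rbind, rows)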

PS: pardon my English, it is not my first language :(

Finding the maximum square sub-matrix with all equal elements


Does anybody know how to find, in a given N x N matrix, the maximum K such that some K x K submatrix has all identical elements, i.e., all the elements in this submatrix are the same? I found many examples in other programming languages, but not R. I would also prefer dplyr if you know a way.

There is a link to the solution with other languages: https://www.geeksforgeeks.org/maximum-size-sub-matrix-with-all-1s-in-a-binary-matrix/

But this link covers only the special case where all identical elements are next to each other: it retrieves a maximal contiguous block of equal elements, not a submatrix in general. I do not want to limit the problem with this condition.
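
For reference, a sketch of the contiguous-block dynamic programme from the linked article, translated to R (my translation; it solves only the "next to each other" special case the question wants to generalize, but it may be a useful starting point):

max_equal_square <- function(M) {
  n <- nrow(M); m <- ncol(M)
  # S[i, j] = side of the largest all-equal square whose
  # bottom-right corner is at (i, j)
  S <- matrix(1L, n, m)
  if (n >= 2 && m >= 2) {
    for (i in 2:n) {
      for (j in 2:m) {
        if (M[i, j] == M[i - 1, j] && M[i, j] == M[i, j - 1] &&
            M[i, j] == M[i - 1, j - 1]) {
          S[i, j] <- min(S[i - 1, j], S[i, j - 1], S[i - 1, j - 1]) + 1L
        }
      }
    }
  }
  max(S)  # the maximum K
}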

Data manipulation, kind of downsampling


I have a large CSV file; an example of the data is below. I will use eight teams to illustrate.

home_team    away_team      home_score       away_score         year
belgium      france         2                2                  1990
brazil       uruguay        3                1                  1990
italy        belgium        1                2                  1990
sweden       mexico         3                1                  1990

france       chile          3                1                  1991
brazil       england        2                1                  1991
italy        belgium        1                2                  1991
chile        switzerland    2                2                  1991

My data runs over many years. I would like the total number of scores for each team in every year; see the example below.

team            total_scores          year
belgium         4                     1990
france          2                     1990
brazil          3                     1990
uruguay         1                     1990
italy           1                     1990
sweden          3                     1990
mexico          1                     1990

france          3                     1991
chile           5                     1991
brazil          2                     1991
england         1                     1991
italy           1                     1991
belgium         2                     1991
switzerland     2                     1991

Thoughts?
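
A sketch of one way to do this with dplyr (mine; df stands for the data read from the CSV file): stack the home and away records into one long table, then sum per team and year.

library(dplyr)

totals <- bind_rows(
  df %>% select(team = home_team, score = home_score, year),
  df %>% select(team = away_team, score = away_score, year)
) %>%
  group_by(team, year) %>%
  summarise(total_scores = sum(score), .groups = "drop")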

Count and summation of positive and negative number sequences


I want to write code to count and sum any series of consecutive positive and negative numbers.
The numbers are either positive or negative (no zeros).
I have written versions with for loops. Is there a more creative alternative?

Data

R

set.seed(100)
x <- round(rnorm(20, sd = 0.02), 3)

python

x = [-0.01, 0.003, -0.002, 0.018, 0.002, 0.006, -0.012, 0.014, -0.017, -0.007,
     0.002, 0.002, -0.004, 0.015, 0.002, -0.001, -0.008, 0.01, -0.018, 0.046]

loops

R

sign_indicator <- ifelse(x > 0, 1, -1)
number_of_sequence <- rep(NA, 20)
n <- 1
for (i in 2:20) {
  if (sign_indicator[i] == sign_indicator[i - 1]) {
    n <- n + 1
  } else {
    n <- 1
  }
  number_of_sequence[i] <- n
}
number_of_sequence[1] <- 1

#############################

summation <- rep(NA, 20)

for (i in 1:20) {
  summation[i] <- sum(x[i:(i + 1 - number_of_sequence[i])])
}

python

sign_indicator = [1 if i > 0 else -1 for i in x]

number_of_sequence = [1]
N = 1
for i in range(1, len(sign_indicator)):
    if sign_indicator[i] == sign_indicator[i - 1]:
        N += 1
    else:
        N = 1
    number_of_sequence.append(N)

#############################
summation = []

for i in range(len(x)):
    if number_of_sequence[i] == 1:
        summation.append(x[i])
    else:
        summation.append(sum(x[(i + 1 - number_of_sequence[i]):(i + 1)]))

result

        x n_of_sequence    sum
1  -0.010             1 -0.010
2   0.003             1  0.003
3  -0.002             1 -0.002
4   0.018             1  0.018
5   0.002             2  0.020
6   0.006             3  0.026
7  -0.012             1 -0.012
8   0.014             1  0.014
9  -0.017             1 -0.017
10 -0.007             2 -0.024
11  0.002             1  0.002
12  0.002             2  0.004
13 -0.004             1 -0.004
14  0.015             1  0.015
15  0.002             2  0.017
16 -0.001             1 -0.001
17 -0.008             2 -0.009
18  0.010             1  0.010
19 -0.018             1 -0.018
20  0.046             1  0.046
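
One loop-free alternative sketch for the R side (my suggestion, using base R's rle() and ave()):

sign_indicator <- sign(x)
runs <- rle(sign_indicator)
# id of the run each element belongs to
run_id <- rep(seq_along(runs$lengths), runs$lengths)
# position within the current run, and running sum within it
n_of_sequence <- ave(x, run_id, FUN = seq_along)
run_sum <- ave(x, run_id, FUN = cumsum)
data.frame(x, n_of_sequence, sum = run_sum)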

How to avoid a broken line in ggplot2


I am trying to plot a graph in R using the ggplot2 package. Let's assume I have the following data:

df1 <- data.frame(xpos = c(1, 2, 3, 4), ypos = c(0.222, 0.222, 0.303, 0.285))

Now, I want to plot a simple line graph:

ggplot(data=df1, aes(x=xpos, y=ypos)) +
  geom_line()+
  geom_point()


Now, when I adjust the y-axis:

+ scale_y_continuous("Y rates upd", breaks = seq(0, 1, 0.2), limits = c(0, 1))

the lines get "broken".
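
A sketch of the usual advice for this symptom (mine; I have not verified it is the cause here): limits set inside a scale drop data outside the range, whereas coord_cartesian() merely zooms, keeping lines intact.

ggplot(data = df1, aes(x = xpos, y = ypos)) +
  geom_line() +
  geom_point() +
  scale_y_continuous("Y rates upd", breaks = seq(0, 1, 0.2)) +
  coord_cartesian(ylim = c(0, 1))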

Writing different .csv files based on different value and column combinations in R


I want to write different .csv files based on a value and column combination. A sample tbl can be found below:

# libs 
library(tidyverse)
library(data.table) 

# tbl
tbl <- tibble(
  Record = 1:100,
  B1     = c(rep("B1", 10), rep(NA, 90)),
  B2     = c(rep("B2", 20), rep(NA, 80)),
  B3     = c(rep("B3", 40), rep(NA, 60)),
  B4     = c(rep("B4", 70), rep(NA, 30)),
  B5     = c(rep("B5", 95), rep(NA, 5))
)

tbl

Writing the different .csv files one by one would look like this:

B1 <- tbl %>%
  filter(B1 == "B1") %>%
  select(Record, B1) %>%
  fwrite(., file = "B1.csv")

However, I want to automate this process with a custom function that writes the .csv files one by one for each value/column combination. I tried something like the code below.

Batch <- "B1"
f_stack <- function(Batch) {

  batch <- tbl %>%
    filter(Batch == Batch) %>% 
    select(Record, Batch)

  return(batch)

}

f_stack(Batch)

However, it doesn't filter the correct records. I left out the fwrite line because the function doesn't return the right tbl yet. Does someone know how to pull this off (preferably with purrr)? Any suggestions would be much appreciated.
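
A sketch of one possible fix (mine, untested): inside the function, filter(Batch == Batch) compares the column with itself, which is always TRUE. Referring to the column through the .data pronoun, with the argument kept as a plain string, disambiguates the two; purrr::walk() then iterates over the batch names.

library(purrr)

write_batch <- function(batch) {
  tbl %>%
    filter(.data[[batch]] == batch) %>%
    select(Record, all_of(batch)) %>%
    fwrite(file = paste0(batch, ".csv"))
}

walk(paste0("B", 1:5), write_batch)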

merging palettes with colorRampPalette and plotting with leaflet


I'm trying to merge two colorRampPalette schemes to use in leaflet and have been following this nice example. That example works fine, but I can't seem to get it to work for my data; a reproducible example is below. I'm using the RdYlGn palette, and I want numbers below the threshold to be dark green and numbers above the threshold to be more red (skipping some of the inner colors).

For my example the cut-off is nc$PERIMETER < 1.3, so I want numbers under this value to be green and everything above it more red (color #FDAE61 onwards).

library(sf)  
library(leaflet)
library(RColorBrewer)

# palette I'm using
palette <- rev(brewer.pal(11, "RdYlGn"))
# [1] "#006837" "#1A9850" "#66BD63" "#A6D96A" "#D9EF8B" "#FFFFBF" "#FEE08B" "#FDAE61" "#F46D43" "#D73027" "#A50026"
previewColors(colorNumeric(palette = palette, domain = 0:10), values = 0:10)


# preparing the shapefile
nc <- st_read(system.file("gpkg/nc.gpkg", package="sf"), quiet = TRUE) %>% 
  st_transform(st_crs(4326)) %>% 
  st_cast('POLYGON')
nc

x <- sum(nc$PERIMETER < 1.3)  
x # number of values below threshold = 21


### Create an asymmetric color range
## Make vector of colors for values smaller than 1.3 (21 colors)
rc1 <- colorRampPalette(colors = c("#006837", "#1A9850"), space = "Lab")(x)    #21 

## Make vector of colors for values larger than 1.3 
rc2 <- colorRampPalette(colors = c("#FDAE61", "#A50026"), space = "Lab")(length(nc$PERIMETER) - x)

## Combine the two color palettes
rampcols <- c(rc1, rc2)

mypal <- colorNumeric(palette = rampcols, domain = nc$PERIMETER)
previewColors(colorNumeric(palette = rampcols, domain = NULL), values = 1:length(nc$PERIMETER))

Looking at the preview, it seems to have worked (the 21 values under 1.3 should be green):


plotting it:

leaflet() %>%
  addTiles() %>%
  addPolygons(data = nc,
              fillOpacity = 0.7,
              fillColor = ~mypal(PERIMETER),
              popup = paste("PERIMETER: ", nc$PERIMETER) )

It plots OK but doesn't give the right color: the polygon highlighted is above the threshold (1.3) and so shouldn't be green, but it is:


I thought the way I was creating the palettes was wrong, but the preview seems to suggest I've done it right?

Does anyone have any ideas? Thanks.
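
One possible explanation and fix sketch (mine, untested): colorNumeric() stretches the supplied colors linearly over the range of the domain, while rc1/rc2 were sized by rank (21 of the sorted values), so the color at PERIMETER = 1.3 need not fall on the green/orange boundary. colorBin() encodes the threshold as an explicit cut-point instead:

bins  <- c(min(nc$PERIMETER), 1.3, max(nc$PERIMETER))
mypal <- colorBin(palette = c("#1A9850", "#FDAE61"),
                  domain = nc$PERIMETER, bins = bins)

More bins (with a correspondingly longer palette) would restore a graded ramp on each side of 1.3.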

Fix totals problem when using ggplot position "stack"


When using the "stack" position (not "dodge") with geom_bar or geom_col on a log scale, the totals get compromised. I manage to represent the correct total in a simple way when one of the values is conspicuously more frequent than the others (see the workaround below, which is not on a log scale), but the totals problem remains for other cases and for log scales. I am asking for a universal solution.

Case 1. Similar frequencies

mydf <- data.frame(
  date = c(rep("2020-02-01", 5), rep("2020-02-01", 4),
           rep("2020-02-02", 5), rep("2020-02-02", 4)),
  value = c(rep(LETTERS[1:3], 6)))
mydf
library(data.table)
setDT(mydf)[, .N, by=.(date, value)]
#          date value N
# 1: 2020-02-01     A 3
# 2: 2020-02-01     B 3
# 3: 2020-02-01     C 3
# 4: 2020-02-02     A 3
# 5: 2020-02-02     B 3
# 6: 2020-02-02     C 3

library(ggplot2)
library(scales)

simple1<-ggplot(mydf, aes(date, fill = value)) + 
  geom_bar() + scale_y_continuous(breaks= breaks_pretty())

simple1log<-ggplot(mydf, aes(date, fill = value)) + 
  geom_bar() +  scale_y_continuous(trans='log2', breaks = log_breaks(7), 
                                   labels= label_number_auto()
  )

# Total count problem, real total is 9
{
  require(grid)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(1, 2)))
  pushViewport(viewport(layout.pos.col = 1, layout.pos.row = 1))
  print(simple1,newpage=F) 
  popViewport()
  pushViewport(viewport(layout.pos.col = 2, layout.pos.row = 1))
  print( simple1log, newpage = F )
}


Case 2: One value more frequent, same problem, workaround.

mydf2 <- data.frame(
  date = c(rep("2020-02-01", 25), rep("2020-02-01", 20),
           rep("2020-02-02", 25), rep("2020-02-02", 20)),
  value = c(rep(LETTERS[1], 39), rep(LETTERS[1:3], 4), rep(LETTERS[1], 39)),
  stringsAsFactors = FALSE)
setDT(mydf2)[, .N, by=.(date, value)]
dateValueCount<-setDT(mydf2)[, .N, by=.(date, value)]
#          date value  N
# 1: 2020-02-01     A 41
# 2: 2020-02-01     B  2
# 3: 2020-02-01     C  2
# 4: 2020-02-02     A 41
# 5: 2020-02-02     B  2
# 6: 2020-02-02     C  2

prevalent1<-ggplot(mydf2, aes(date, fill = value)) + 
  geom_bar() + scale_y_continuous(breaks= breaks_pretty())
# total value = 45 
prevalent1log<-ggplot(mydf2, aes(date, fill = value)) + 
  geom_bar() +  scale_y_continuous(trans='log2', breaks = log_breaks(7), 
                                   labels= label_number_auto()
  )
# total Problem, real total is 45
{
  require(grid)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(1, 2)))
  pushViewport(viewport(layout.pos.col = 1, layout.pos.row = 1))
  print(prevalent1,newpage=F) 
  popViewport()
  pushViewport(viewport(layout.pos.col = 2, layout.pos.row = 1))
  print( prevalent1log, newpage = F )
  }


# workaround:

# get the most frequent per group
mydf2Max<-dateValueCount[, .SD[  N== max(N) ] , by=date]  
mydf2Max
#          date value  N
# 1: 2020-02-01     A 41
# 2: 2020-02-02     A 41

# totals per group
dateCount<-mydf2[, .N, by=.(date)]
#          date  N
# 1: 2020-02-01 45
# 2: 2020-02-02 45

# transfer column to previous table
mydf2Max$totalDay <- dateCount$N[match(mydf2Max$date, dateCount$date)]

# the final height of A will be dependent on the values of B and C
mydf2Max$diff<-mydf2Max$totalDay-mydf2Max$N

# shrinkFactor for the upper part of the plot, which begins at the threshold
shrinkFactor <- .05
threshold <- 6

# the part of our frequent value's (A) count that must not be shrunk
mydf2Max$notshrink <- threshold - mydf2Max$diff

# the part of A's count (above the threshold) that must be shrunk
mydf2Max$NToShrink <- mydf2Max$N - mydf2Max$notshrink

mydf2Max$NToShrinkShrinked <- mydf2Max$NToShrink * shrinkFactor

# now add the unshrunk part to the shrunk part to obtain the transformed height
mydf2Max$NToShrinkShrinkedPlusBase <- mydf2Max$NToShrinkShrinked + mydf2Max$notshrink

# transformation function  - works for "dodge" position
# https://stackoverflow.com/questions/44694496/y-break-with-scale-change-in-r
trans <- function(x){pmin(x,threshold) + shrinkFactor*pmax(x-threshold,0)}
# dateValueCount$transN <- trans(dateValueCount$N)

setDF(dateValueCount)
setDF(mydf2Max)

# pass transformed column to original d.f.
dateValueCount$N2 <- mydf2Max$NToShrinkShrinkedPlusBase[match(interaction( dateValueCount[c("value","date")]) ,
                                                             interaction( mydf2Max[c("value","date") ] )  )]

# substitute real N with transformed values
dateValueCount[which(!is.na(dateValueCount$N2)),]$N <- dateValueCount[which(!is.na(dateValueCount$N2)),]$N2

yticks <- c(0, 2,4,6,40,50)

ggplot(data=dateValueCount, aes(date, N, group=value, fill=value)) + #group=longName
  geom_col(position="stack") +
  geom_rect(aes(xmin=0, xmax=3, ymin=threshold, ymax=threshold+1), fill="white") +
  scale_y_continuous(breaks = trans(yticks), labels= yticks)


Newbie Q - Will this R code count how many survey submissions had values for all 4 of these variables?


I'm teaching myself R, and I think this code counts the number of surveys that have a value (not NA) for all 4 of the variables listed below. Is that right?

Can someone confirm or correct me? Thanks for helping out a nervous newbie. I need this number as a denominator (surveys without missing data).

sum(!is.na(Both_hwstations) & 
    !is.na(Both_latrines) & 
    !is.na(rapid_unique$hf_ipcfocal) & 
    !is.na(rapid_unique$water_avail))
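
The logic in the question is right: each !is.na() gives a logical vector, & combines them element-wise, and sum() counts the TRUEs, i.e. the surveys with values for all four variables. Assuming all four columns live in rapid_unique (the snippet mixes bare names with rapid_unique$...), an equivalent check is complete.cases(), which flags rows with no NA in the given columns; a sketch:

sum(complete.cases(rapid_unique[, c("Both_hwstations", "Both_latrines",
                                    "hf_ipcfocal", "water_avail")]))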