Channel: Active questions tagged r - Stack Overflow

Selecting variables with hierarchical preference


I have a dataframe with multiple records for the same persons (identical values in the ID column) but with different exam results. For each person (unique ID), I want to extract the exam with the highest rate and the highest prevalence.

df:

day   ID    Rate  Prevalence
1     1234  3     Occasional
2     1234  2     Frequent
1     4567  2.5   Rare
2     7899  1.5   Abundant
2     7899  4.5   Frequent

I was thinking of writing a loop:

uniqueID <- unique(df$ID)
for (count in 1:length(uniqueID)) {
  Curr_ID <- uniqueID[count]
  ID_set <- subset(df, ID == Curr_ID)
  Prevalence <- # ???

Here is the problem: inside ID_set I would like to choose the Prevalence value with a hierarchical preference, and I do not know how to do it: if "Abundant" is present, set "Abundant"; if there is no "Abundant", look for "Frequent" and, if found, set "Frequent"; if there is no "Frequent", look for "Occasional" and, if found, set "Occasional"; and so on.

I hope the problem is clear.
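A sketch of one way to encode that preference, assuming Prevalence only ever takes the four levels below: make it a factor whose level order is the preference order, then take the first value after sorting.

pref <- c("Abundant", "Frequent", "Occasional", "Rare")
ID_set$Prevalence <- factor(ID_set$Prevalence, levels = pref)
# sort() orders a factor by its levels, so element 1 is the most
# preferred value present in this ID's subset
Prevalence <- as.character(sort(ID_set$Prevalence)[1])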


Irregular/missing data when clustering time series


Whether it is dynamic time warping or some sort of Euclidean k-means clustering of time series, it is (nearly?) always required to deal with irregular spacing of data, unequal series lengths, and/or missingness of data.

While realizing that each of these issues has considerations unto itself, is there a general reason why one cannot pre-process each time series with a spline to interpolate (or very minimally extrapolate) the data and so ameliorate these issues?
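For concreteness, a minimal sketch of the pre-processing being proposed, assuming each series is stored as a pair of time/value vectors (series_list and the grid are illustrative):

# interpolate one irregular series onto a shared regular grid with a
# cubic spline; xout fixes the common time points
regularize <- function(t, y, grid) {
  spline(t, y, xout = grid)$y
}

grid <- seq(0, 10, by = 0.5)  # hypothetical common time grid
# series_list: hypothetical list of data.frames with columns t and y
# mat <- t(sapply(series_list, function(s) regularize(s$t, s$y, grid)))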

optim function in R returning a list with an error message


I ran the optim function on a function that has 3 numeric parameters.

It did return what look to be the correct values. However, in the message element of the returned list, I got this message:

$par
[1] 0.8974235 9.1095283 0.9110162

$value
[1] -614452.9

$counts
function gradient 
      60       60 

$convergence
[1] 52

$message
[1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"

What does that mean? How can it be remedied?
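For context, this message comes from the L-BFGS-B line search (per ?optim, convergence code 52 indicates an error from the "L-BFGS-B" method). A hedged sketch of the kind of adjustment sometimes tried, where my_fn and the starting values are placeholders:

# illustrative only: loosen the tolerance (larger factr) and allow more
# iterations; also check that my_fn never returns NaN/Inf near $par
res <- optim(par = c(0.9, 9.1, 0.9), fn = my_fn, method = "L-BFGS-B",
             control = list(factr = 1e10, maxit = 500))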

Downloading multiple files in parallel in R


I am trying to download 460,000 files from an FTP server (which I got from the TRMM archive data). I made a list of all the files and separated them into different jobs, but can anyone help me run those jobs at the same time in R? Here is an example of what I have tried to do:

my.list <- readLines("1998-2010.txt")  # the FTP address of each file
# name: vector of destination file names (built elsewhere)
job1 <- for (i in 1:1000) {
  download.file(my.list[i], name[i], mode = "wb")
}
job2 <- for (i in 1001:2000) {
  download.file(my.list[i], name[i], mode = "wb")
}
job3 <- for (i in 2001:3000) {
  download.file(my.list[i], name[i], mode = "wb")
}

Now I'm stuck on how to run all of the jobs at the same time.
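A sketch of one way to run the downloads concurrently with the base parallel package, assuming a vector dest of destination file names matching my.list:

library(parallel)

cl <- makeCluster(8)  # number of simultaneous workers/downloads
clusterExport(cl, c("my.list", "dest"))
results <- parLapply(cl, seq_along(my.list), function(i) {
  # wrap in try() so one failed file does not abort the whole run
  try(download.file(my.list[i], dest[i], mode = "wb", quiet = TRUE))
})
stopCluster(cl)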

Appreciate your help

R User-Defined Function Arguments - what can be defined as the argument?


I'm trying to write R functions that do similar tasks as macros in SAS, such as

  1. process variables
  2. name a new variable
  3. name a new data frame

I've tried some basic functions using the built-in df "iris" and dplyr as pasted below.

Functions f3 and f4 try to take in a variable name and process it. The error messages are "Error: object 'Species' not found" and "In mean.default(var): argument is not numeric or logical: returning NA".

Functions f5 and f6 try to name a new variable or a new df. After running the function, the new variable or df is literally named after the argument name (name, dfname) rather than after the value passed in.

f7 tries to build part of a variable name from the function argument.

library(dplyr)

data(iris)
View(iris)

### Char Variable
f3 <- function(var){
      iris %>% filter(var == "setosa")
}
f3(Species)

f4 <- function(var){
      iris %>% summarise(
            avg = mean(var)
      )
}
f4("Sepal.Length")

### Variable Name
f5 <- function(name){
      iris %>% 
            mutate(name = 1)
}
f5("newname")

### df Name
f6 <- function(dfname){
      dfname <- iris 
}
f6("newdf")

f7 <- function(name){
      test <- iris %>% 
            mutate(
                  v_name = 1
            )
}
f7("1")

R devtools not reading Windows path with spaces in it, not found in $PATH variable


I'm working with the devtools package in R, and I'm trying to use it to install a GitHub repo for my class. However, I have a path with spaces in it, and here is the error message I'm getting:

Installing package into ‘C:/Users/Kaelan McGurk/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
* installing *source* package 'buildings' ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Fatal error: cannot open file 'C:\Users\Kaelan': No such file or directory

But here is the output of Sys.getenv("PATH"):

C:\\Program Files\\R\\R-3.6.0\\bin\\x64;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\ProgramData\\Oracle\\Java\\javapath;C:\\Windows\\System32;C:\\Windows;C:\\Windows\\System32\\wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files (x86)\\PharosSystems\\Core;C:\\Program Files (x86)\\Skype\\Phone\\;C:\\Program Files\\PuTTY\\;C:\\Program Files (x86)\\Wolfram Research\\WolframScript\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files (x86)\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Intel\\Intel(R) Management Engine Components\\DAL;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Intel\\WiFi\\bin\\;C:\\Program Files\\Common Files\\Intel\\WirelessCommon\\;C:\\Program Files\\MySQL\\MySQL Shell 8.0\\bin\\;C:\\Users\\Kaelan McGurk\\Documents\\MATH335\\M335_WI20_McGurk_Kael\\%USERPROFILE%\\AppData\\Local\\Microsoft\\WindowsApps

So I'm not sure what's going on here. How can I avoid the fatal error "cannot open file"?
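One hedged workaround, sketched under the assumption that the failure comes from the space in the user-library path: install into a library whose path has no spaces (C:/Rlibs and the repo slug are placeholders):

dir.create("C:/Rlibs", showWarnings = FALSE)
.libPaths(c("C:/Rlibs", .libPaths()))   # put the space-free library first
devtools::install_github("owner/repo")  # placeholder repo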

Cumulative sums of a subset of columns in data.table using .SD


I have a data.table with a couple of IDs and a long list of numerical columns that represent probabilities adding up to one for each row. Minimal example:

library(data.table)
DT = data.table(
  ID = c("a","b","c"),
  ProbA = c(0.6, 0.25, 0.55),
  ProbB = c(0.25, 0.55, 0.35),
  ProbC = c(0.15, 0.2, 0.1)
);DT

   ID ProbA ProbB ProbC
1:  a  0.60  0.25  0.15
2:  b  0.25  0.55  0.20
3:  c  0.55  0.35  0.10

Now I want to add columns for each of the probabilities with the cumulative probabilities with respect to their column index, like so:

  ID ProbA ProbB ProbC ProbAcum ProbBcum ProbCcum
1:  a  0.60  0.25  0.15     0.60     0.85        1
2:  b  0.25  0.55  0.20     0.25     0.80        1
3:  c  0.55  0.35  0.10     0.55     0.90        1

I've tried it with lapply and subsetting the columns:

Probcols <- c("ProbA", "ProbB", "ProbC")
DT[, (paste0(Probcols, "cum")) := lapply(.SD, function(x) {
  rowSums(DT[, 2:which(Probcols == x)])
}), .SDcols = Probcols]

Can I make that work somehow?
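A sketch of an alternative that avoids indexing back into DT from inside lapply: Reduce() with accumulate = TRUE returns the running element-wise sums of the .SD columns, in column order.

DT[, (paste0(Probcols, "cum")) := Reduce(`+`, .SD, accumulate = TRUE),
   .SDcols = Probcols]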

How to select the remaining values after creating a selection?


I have a group of values called "estados" (the states of the USA), and I'm creating a selection of them which I'll call "pena_si" (each number indexes a state):

pena_si<-estados[c(10,14,11,27,41,36,16,42,51,34,45,28,17,25,18,46,29,43,3,37,4,44,19,26,2,48,31,9,5)]

So now I'd like to create a selection of the values that aren't in "pena_si", but nothing I try works, so I'd like to know how you would do it. I've tried things like:

estados[estados!==pena_si]

But, as I said, it doesn't work.
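For reference, two standard base-R ways to take the complement of a selection:

# logical negation of %in% keeps duplicates and the original order
pena_no <- estados[!(estados %in% pena_si)]

# setdiff() does the same but also drops duplicated values
pena_no <- setdiff(estados, pena_si)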


Fix totals problem when using ggplot position "stack" in split charts


When using the "stack" position (not "dodge") with geom_bar or geom_col, the totals get compromised. I managed to represent the correct total in a simple way when one of the values is conspicuously more frequent than the others; see the workaround (not log) below. But the totals problem remains for other cases and for log scales. I am asking for a universal solution.

EDIT: After reading ggplot scale_y_log10() issue, I found that it makes no sense to use log. So the answer to this question should be how to generalize the split approach (the workaround), not only for one frequent group.

Case 1. Similar frequencies

mydf <- data.frame(date = c(rep("2020-02-01", 5), rep("2020-02-01", 4),
                            rep("2020-02-02", 5), rep("2020-02-02", 4)),
                   value = rep(LETTERS[1:3], 6))
mydf
library(data.table)
setDT(mydf)[, .N, by=.(date, value)]
#          date value N
# 1: 2020-02-01     A 3
# 2: 2020-02-01     B 3
# 3: 2020-02-01     C 3
# 4: 2020-02-02     A 3
# 5: 2020-02-02     B 3
# 6: 2020-02-02     C 3

library(ggplot2)
library(scales)

simple1<-ggplot(mydf, aes(date, fill = value)) + 
  geom_bar() + scale_y_continuous(breaks= breaks_pretty())

simple1log<-ggplot(mydf, aes(date, fill = value)) + 
  geom_bar() +  scale_y_continuous(trans='log2', breaks = log_breaks(7), 
                                   labels= label_number_auto()
  )

# Total count problem, real total is 9
{
  require(grid)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(1, 2)))
  pushViewport(viewport(layout.pos.col = 1, layout.pos.row = 1))
  print(simple1,newpage=F) 
  popViewport()
  pushViewport(viewport(layout.pos.col = 2, layout.pos.row = 1))
  print( simple1log, newpage = F )
}


Case 2: One value more frequent, same problem, workaround.

mydf2 <- data.frame(date = c(rep("2020-02-01", 25), rep("2020-02-01", 20),
                             rep("2020-02-02", 25), rep("2020-02-02", 20)),
                    value = c(rep(LETTERS[1], 39), rep(LETTERS[1:3], 4),
                              rep(LETTERS[1], 39)), stringsAsFactors = FALSE)
dateValueCount <- setDT(mydf2)[, .N, by = .(date, value)]
dateValueCount
#          date value  N
# 1: 2020-02-01     A 41
# 2: 2020-02-01     B  2
# 3: 2020-02-01     C  2
# 4: 2020-02-02     A 41
# 5: 2020-02-02     B  2
# 6: 2020-02-02     C  2

prevalent1<-ggplot(mydf2, aes(date, fill = value)) + 
  geom_bar() + scale_y_continuous(breaks= breaks_pretty())
# total value = 45 
prevalent1log<-ggplot(mydf2, aes(date, fill = value)) + 
  geom_bar() +  scale_y_continuous(trans='log2', breaks = log_breaks(7), 
                                   labels= label_number_auto()
  )
# total Problem, real total is 45
{
  require(grid)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(1, 2)))
  pushViewport(viewport(layout.pos.col = 1, layout.pos.row = 1))
  print(prevalent1,newpage=F) 
  popViewport()
  pushViewport(viewport(layout.pos.col = 2, layout.pos.row = 1))
  print( prevalent1log, newpage = F )
  }


# workaround:

# get the most frequent per group
mydf2Max<-dateValueCount[, .SD[  N== max(N) ] , by=date]  
mydf2Max
#          date value  N
# 1: 2020-02-01     A 41
# 2: 2020-02-02     A 41

# totals per group
dateCount<-mydf2[, .N, by=.(date)]
#          date  N
# 1: 2020-02-01 45
# 2: 2020-02-02 45



# transfer column to previous table
mydf2Max$totalDay <- dateCount$N[match(mydf2Max$date, dateCount$date)]

threshold <- 6 # splitting threshold

# remove groups with total lower than threshold
mydf2Max<-mydf2Max[which(mydf2Max$totalDay>threshold),]

# the final height of A will be dependent on the values of B and C
mydf2Max$diff<-mydf2Max$totalDay-mydf2Max$N

# shrinkFactor for the upper part of the plot, which begins at the threshold
shrinkFactor<-.05

# the part of our frequent value's (A's) count that must not be shrunk
mydf2Max$notshrink <- threshold - mydf2Max$diff

# the part of A's data above the threshold must be shrunk
mydf2Max$NToShrink <- mydf2Max$N - mydf2Max$notshrink

mydf2Max$NToShrinkShrinked <- mydf2Max$NToShrink * shrinkFactor

# now sum the unshrunk part and the shrunk part to obtain the transformed height
mydf2Max$NToShrinkShrinkedPlusBase <- mydf2Max$NToShrinkShrinked + mydf2Max$notshrink

# transformation function  - works for "dodge" position
# https://stackoverflow.com/questions/44694496/y-break-with-scale-change-in-r
trans <- function(x){pmin(x,threshold) + shrinkFactor*pmax(x-threshold,0)}
# dateValueCount$transN <- trans(dateValueCount$N)

setDF(dateValueCount)
setDF(mydf2Max)

# pass transformed column to original d.f.
dateValueCount$N2 <- mydf2Max$NToShrinkShrinkedPlusBase[match(interaction( dateValueCount[c("value","date")]) ,
                                                             interaction( mydf2Max[c("value","date") ] )  )]

# substitute real N with transformed values
dateValueCount[which(!is.na(dateValueCount$N2)),]$N <- dateValueCount[which(!is.na(dateValueCount$N2)),]$N2

yticks <- c(0, 2,4,6,40,50)

ggplot(data=dateValueCount, aes(date, N, group=value, fill=value)) + #group=longName
  geom_col(position="stack") +
  geom_rect(aes(xmin=0, xmax=3, ymin=threshold, ymax=threshold+1), fill="white") +
  scale_y_continuous(breaks = trans(yticks), labels= yticks)


Remove special characters from entire dataframe in R


Question:

How can you use R to remove all special characters from a dataframe, quickly and efficiently?

Progress:

This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the entire dataframe.

Problem:

My dataframe consists of 100+ columns of integers, strings, etc. When I try to run gsub on the whole dataframe, it doesn't return the output I desire. Instead, I get what's shown in image 3.

df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)

df - This is the original dataframe:

Original dataframe

dfb - This is gsub applied to column b. Good!

gsub applied to column b

df_all - This is gsub applied to the entire dataframe. Bad!

gsub applied to entire dataframe

Summary:

Is there a way to gsub an entire dataframe? Else, should an apply function be used instead?
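A sketch of the lapply approach, applying gsub column by column so that non-text columns survive intact (the df[] <- assignment keeps the data.frame structure):

clean_df <- df
clean_df[] <- lapply(clean_df, function(col) {
  if (is.character(col) || is.factor(col)) {
    gsub("[[:punct:]]", "", as.character(col))
  } else {
    col  # leave integer, numeric, etc. columns untouched
  }
})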

Changing a list of strings based on certain conditions


I have a list of strings here:

List <- c('C8 H12 O1 Na1', 'C15 H20 O7 Na1', 'C18 H24 O6', 'C24 H32 O9 Na1', 'C26 H38 O5 Na1')

And I would like to change it to

Listnew <- c('C8 H12 O1', 'C15 H20 O7', 'C18 H23 O6', 'C24 H32 O9', 'C26 H38 O5')

Any string containing Na has the Na term removed, and any string without Na has its H count reduced by 1. In this case, 'C18 H24 O6' from List was changed to 'C18 H23 O6'. This list is contained in a matrix. I know how to change strings based on one condition.

I think that I need to create a TRUE/FALSE column first, indicating whether Na exists within the string, then use that to either subtract 1 from the H count or remove the Na. I have looked for similar questions, but I could not find an answer that worked for me.
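A base-R sketch, assuming the element counts are space-separated exactly as shown (fix_formula is an illustrative name):

fix_formula <- function(s) {
  if (grepl("Na", s)) {
    # drop the Na term and the space before it
    trimws(sub("\\s*Na\\d+", "", s))
  } else {
    # decrement the H count by 1
    h <- as.integer(sub(".*H(\\d+).*", "\\1", s))
    sub("H\\d+", paste0("H", h - 1), s)
  }
}

Listnew <- vapply(List, fix_formula, character(1), USE.NAMES = FALSE)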

How can I change individually the fill of a legend in scatter plot to match the label colors? [duplicate]


How can I change the filling of each key in the legend to match the labels?

If I do it in geom_label_repel using show.legend = TRUE, it doesn't look very good and it puts a letter "a" in place of the dots.


In the plot, yellow is for injured players, blue for owned players, green for free players and red for Hobbits players.

Here's the code used for the plot:

ggplot(fim, aes(Price, Average, 
                      label = Player, 
                      colour = color, 
                      fill = color,
                      #alpha = ih, 
                      group = Position
              )) +
    geom_smooth(method = "lm", se = FALSE, color = "lightblue", show.legend = FALSE) +
    geom_point(aes(fill = factor(color))) + # 
    geom_label_repel(size = 3.25,
                     family = "Bahnschrift",
                     #fontface = 'bold',
                     label.size = NA,
                     segment.alpha = 0.5,
                     alpha = 0.9,
                     show.legend = FALSE,
                     #label.padding = unit(.22, 'lines'),
                     #hjust = 0,
                     #vjust = 0,
                     #label.r = 0,
                     box.padding = unit(0.20, "lines"),
                     point.padding = unit(0.20, "lines"),
                     #force = 4
                     ) +
    #nudge_y = 0.005,
    #nudge_x = 0) +
    scale_x_continuous(labels=function(y) format(y, big.mark = ".", scientific = FALSE)) +
    ggtitle("Price and Average Points in LaLiga Fantasy",
            paste("Top", nrow(fim), pos, "by market value with at least", minapps, "appearances, excluding Messi & Benzema")) +
    labs(y="Average Points (Sofascore Rating System)",
         x = "Price (Market Value in Euros)",
         caption = "Sources: Biwenger, Jornada Perfecta plot by Weldata") +
    scale_color_manual(values = c("Hobbits" = WT,
                                  "Free" = WT,
                                  "Injured" = BK,
                                  "Owned" = WT)) +
    scale_fill_manual(values = c("Hobbits" = cl,
                                 "Free" = MF,
                                 "Injured" = GK,
                                 "Owned" = DF)) +
    scale_alpha(0.1) +
    dark_theme_gray() +
    theme(plot.title = element_text(family = "Bahnschrift", 
                                    face = "bold", 
                                    size = 18, 
                                    colour = YL),
          plot.background = element_rect(fill = BK),
          panel.background = element_blank(),
          panel.grid.major = element_line(color = "grey30", size = 0.2),
          panel.grid.minor = element_line(color = "grey30", size = 0.2),
          legend.title = element_blank(),
          #legend.background = element_blank(),
          axis.ticks = element_line(colour = "grey30"),
          axis.title = element_text(family = "Bahnschrift", size = 14, colour = WT),
          axis.text = element_text(size = 12, colour = "grey80", face = 'bold'),
          legend.position = c(0.9, 0.2), #legend.position = "none",
          plot.tag = element_text(),
          plot.caption = element_text(color = YL, face = "italic")
          )
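A hedged sketch of the usual approach (not a drop-in replacement for the full plot above): map only fill, draw the points with a fillable shape, and use override.aes to control how the legend keys are drawn. The palette objects (cl, MF, GK, DF) are the ones defined elsewhere in this script.

ggplot(fim, aes(Price, Average, fill = color)) +
  geom_point(shape = 21, size = 3) +  # shape 21 has both fill and border
  scale_fill_manual(values = c("Hobbits" = cl, "Free" = MF,
                               "Injured" = GK, "Owned" = DF)) +
  guides(fill = guide_legend(override.aes = list(shape = 21, size = 4)))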

How to select the data for the x axis to run a graph in R?


With this dataframe


            id power     hr    fr    VE     VO2    VCO2  PETCO2 percent_VO2 percent_power group
1  AC12-PRD-C1    25  88.75 22.75 22.75 0.73900 0.66700 39.2925    49.34068      21.73913   CHD
2  AC12-PRD-C1    40  93.25 23.00 23.75 0.81975 0.71500 39.6200    54.73210      34.78261   CHD
3  AC12-PRD-C1    55  99.75 22.75 26.75 0.95125 0.85400 41.4100    63.51193      47.82609   CHD
4  AC12-PRD-C1    70 109.75 23.00 32.50 1.07525 1.04700 42.0150    71.79102      60.86957   CHD
5  AC12-PRD-C1    85 118.75 22.75 39.50 1.19900 1.25125 41.8425    80.05341      73.91304   CHD
6  AC12-PRD-C1   100 127.00 26.00 48.25 1.34575 1.51850 41.0950    89.85144      86.95652   CHD
7  AC12-PRD-C1   115 135.75 28.00 55.75 1.49775 1.76025 40.7275   100.00000     100.00000   CHD
8  AL13-PRD-C1    25  69.50 16.50 24.00 0.66125 0.58050 31.2275    41.36691      19.23077   CHD
9  AL13-PRD-C1    40  73.00 17.50 26.50 0.74850 0.66425 32.1025    46.82515      30.76923   CHD
10 AL13-PRD-C1    55  83.25 15.50 29.00 0.85500 0.79425 33.6650    53.48764      42.30769   CHD
11 AL13-PRD-C1    70  93.75 16.00 36.50 0.98450 0.99925 34.5325    61.58899      53.84615   CHD
12 AL13-PRD-C1    85 104.50 16.00 44.75 1.14950 1.23475 34.4225    71.91117      65.38462   CHD
13 AL13-PRD-C1   100 114.25 19.25 55.25 1.34650 1.48375 33.1800    84.23522      76.92308   CHD
14 AL13-PRD-C1   115 125.25 20.75 63.75 1.45100 1.65775 32.6450    90.77260      88.46154   CHD
15 AL13-PRD-C1   130 136.25 24.75 78.00 1.59850 1.89075 30.9000   100.00000     100.00000   CHD
16 BM06-PRD-S1    25 119.25 18.25 19.00 0.61675 0.58225 37.6425    48.87084      25.00000 noCHD
17 BM06-PRD-S1    40 126.00 18.00 20.75 0.71700 0.65950 39.2175    56.81458      40.00000 noCHD
18 BM06-PRD-S1    55 133.50 20.75 25.00 0.86275 0.82750 41.2150    68.36371      55.00000 noCHD
19 BM06-PRD-S1    70 147.25 18.25 29.00 0.98575 1.04550 41.7050    78.11014      70.00000 noCHD
20 BM06-PRD-S1    85 158.50 22.25 39.25 1.13000 1.30525 41.1425    89.54041      85.00000 noCHD
21 BM06-PRD-S1   100 168.75 27.75 51.00 1.26200 1.61150 38.8925   100.00000     100.00000 noCHD
22 CB19-PRD-S1    25  98.75 18.50 25.00 0.88350 0.80475 40.7550    36.15715      13.15789 noCHD
23 CB19-PRD-S1    40  98.25 20.00 25.50 0.94575 0.82900 41.4675    38.70473      21.05263 noCHD
24 CB19-PRD-S1    55 102.00 19.75 28.50 1.08125 0.95800 42.2775    44.25005      28.94737 noCHD
25 CB19-PRD-S1    70 107.50 20.50 34.25 1.24400 1.14275 42.6450    50.91058      36.84211 noCHD
26 CB19-PRD-S1    85 111.00 21.25 35.50 1.30475 1.19925 43.3600    53.39677      44.73684 noCHD
27 CB19-PRD-S1   100 117.25 21.50 40.25 1.47350 1.42225 44.2650    60.30284      52.63158 noCHD
28 CB19-PRD-S1   115 123.00 22.75 47.00 1.67900 1.68475 44.6400    68.71291      60.52632 noCHD
29 CB19-PRD-S1   130 129.50 24.50 52.50 1.79075 1.87950 44.3425    73.28627      68.42105 noCHD
30 CB19-PRD-S1   145 135.50 25.25 59.50 1.96000 2.13525 44.7300    80.21281      76.31579 noCHD
31 CB19-PRD-S1   160 145.25 26.75 64.50 2.04050 2.28350 43.8825    83.50726      84.21053 noCHD
32 CB19-PRD-S1   175 151.25 30.50 83.00 2.34425 2.76050 41.6025    95.93820      92.10526 noCHD
33 CB19-PRD-S1   190 161.75 33.75 92.25 2.44350 2.96850 40.0400   100.00000     100.00000 noCHD

I am running this code:

ggscatter(dftest, x = "percent_power", y = "PETCO2", color = "group") +
  stat_cor(label.x = 20, label.y = 2.8) +
  stat_regline_equation(label.x = 20, label.y = 0.5, 
                        formula = y ~ poly(x, 2),
                        aes(label =  paste(..eq.label.., ..adj.rr.label.., sep = "~~~~")),) +
  geom_smooth(aes(colour=group), method = "lm", formula = y ~ poly(x, 2)) +
  xlab("Percentage of power (%)") + 
  ylab(expression(paste("PETC", O[2]," (mmHg)")))

I would like to run the same code but using only the data from 65 to 75 percent of power (x axis). We would then get a new equation based on the new selection.
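A sketch, assuming ggpubr is already loaded as above: filter the data to the 65-75% window first, then hand the subset to the same plotting code.

library(dplyr)

dfsub <- dftest %>%
  filter(percent_power >= 65, percent_power <= 75)

ggscatter(dfsub, x = "percent_power", y = "PETCO2", color = "group") +
  geom_smooth(aes(colour = group), method = "lm", formula = y ~ poly(x, 2))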

Thanks a lot!

dplyr & purrr - Overlapping Time Intervals by Group


I need to use the following code, which looks at overlapping time intervals. The overlaps logic below works great, but I need to apply it to each group in my tibble. The group would be the memberID column. I'm not sure how to use the group_map() function; perhaps that could work here?


suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))

memberships <- tibble::tibble(
    memberID     = c("A", "A", "A", "B"),
    membershipID = 1:4 %>% as.factor,
    start_date   = ymd(c(20100101, 20101220, 20120415, 20110605)),
    end_date     = ymd(c(20101231, 20111231, 20130430, 20120531)),
    mo_dur       = interval(start_date, end_date) %>%
        as.duration() / dyears() * 12
)

memberships <- tibble::rowid_to_column(memberships)

overlaps <- purrr::map(memberships$rowid, function(id) {
    if (id == nrow(memberships)) {
        NA
    } else {
        row <- memberships[memberships[["rowid"]] == id, ]
        intv <- lubridate::interval(row$start_date, row$end_date)
        # these are the id's of the rows following id
        other_ids <- (id):(nrow(memberships))
        ol <- purrr::map_int(other_ids, function(other_id) {
            other_row <- memberships[memberships[["rowid"]] == other_id, ]
            # either on end is inside of interval or start and end span it
            if (other_row$start_date %within% intv |
                    other_row$end_date %within% intv |
                    (other_row$start_date <= row$start_date &
                     other_row$end_date >= row$end_date)) {
                as.integer(other_row$rowid)
            } else {
                NA_integer_
            }
        })
        # drop the NA's
        ol <- ol[!is.na(ol)]
        # if nothing overlapped return NA
        if (length(ol > 0)) {
            ol
        } else {
            NA
        }
    }
})

# make it a tibble so you can bind it
overlaps <- tibble::tibble(following_overlaps = overlaps)
# add as column
memberships <- dplyr::bind_cols(memberships, overlaps)

https://community.rstudio.com/t/counting-overlapping-records-per-day-using-dplyr/4289
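A sketch of the grouping idea, assuming the logic above is wrapped in a hypothetical function find_overlaps() that takes one group's rows, rebuilds rowid within the group, and returns the rows with the extra column:

memberships %>%
  dplyr::group_by(memberID) %>%
  dplyr::group_modify(~ find_overlaps(.x)) %>%
  dplyr::ungroup()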

How do I get a paged_table without column types displayed?


I need to print a paged_table in an HTML Rmd document, but I don't want the column types displayed.

?paged_table indicates that there are printing options, but the only options I can find documented are about the maximum numbers of rows/columns to print and whether or not to print row names.

Reproducible example in Rmd:

---
output:
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r cars}
datasets::mtcars
```

Can I arrange my study labels in a forest plot using study years after specifying the byvar to be something different? (meta package, R)


Can I arrange all the study labels within subgroups in my forest plot by year of publication, after specifying that I want the subgroups divided by a certain variable?

Here is the code I am currently using.

brugia.forest <- metaprop(event = no.positive, n = no.tested,
                          studlab = studylabel, data = brugia,
                          byvar = diagnostics,
                          bylab = c("direct detection",
                                    "direct and indirect detection",
                                    "indirect detection"),
                          print.byvar = F, sm = "PLO",
                          method.tau = "REML", title = "", hakn = T)

I would like the studies within the "diagnostics" groups to be arranged from the oldest to the most recent, and not alphabetically as is currently the case. I am using the meta package of R because of its user-friendliness and would like to continue using it (so metafor suggestions may not be too helpful).
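Two hedged options, assuming brugia has a numeric year column:

# 1) sort the data before calling metaprop(); study order follows the data
brugia <- brugia[order(brugia$year), ]

# 2) forest.meta() also exposes a sortvar argument for ordering studies
forest(brugia.forest, sortvar = year)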

Thanks.

Combine two datasets with an interval time condition in R (I would like to avoid combinations and just have unique matches)


I have two separate datasets: df1 and df2. I would like to create a new dataset, df3, that matches the endtime column of df1 with the sent column of df2 if the datetimes are within 20 seconds of each other.

 df1

 endtime                     ID

 1/7/2020  1:35:08 AM         A
 1/7/2020  1:39:00 AM         B
 1/20/2020 1:45:00 AM         C



 df2

sent                         ID

1/7/2020  1:35:20 AM          E
1/7/2020  1:42:00 AM          F
1/20/2020 1:55:00 AM          G
1/20/2020 2:00:00 AM          E

This is my desired output for df3. There is only one row, because only one pair of values satisfies the condition of the endtime and sent columns being within 20 seconds of each other. I would like unique matches, not a combination. Essentially a merge with a time condition.

endtime                  sent 

1/7/2020 1:35:08 AM      1/7/2020  1:35:20 AM       

Here is the dput:

df1

structure(list(endtime = structure(c(2L, 3L, 1L), .Label = c("1/10/2020 1:45:00 AM", 
"1/7/2020 1:35:08 AM", "1/7/2020 1:39:00 AM"), class = "factor"), 
ID = structure(1:3, .Label = c("A", "B", "C"), class = "factor")), class = "data.frame", row.names =   c(NA, 
 -3L))





 df2

 structure(list(sent = structure(c(3L, 4L, 1L, 2L), .Label = c("1/20/2020 1:55:00 AM", 
 "1/20/2020 2:00:00 AM", "1/7/2020 1:35:20 AM", "1/7/2020 1:42:00 AM"
 ), class = "factor"), ID = structure(c(1L, 2L, 3L, 1L), .Label = c("E", 
"F", "G"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

This is what I have tried:

I am thinking of performing a left join and matching the values, or I can use merge(), but the tricky part is matching the values with the conditional statement. Any suggestion is appreciated.

library(dplyr)
left_join(df1, df2)
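A sketch using a cross join plus a filter, assuming lubridate's mdy_hms() parses these factor datetime strings; if several rows fell inside the window, slice_min() on the absolute gap per endtime would keep only the nearest match.

library(dplyr)
library(tidyr)
library(lubridate)

df1$endtime <- mdy_hms(as.character(df1$endtime))
df2$sent    <- mdy_hms(as.character(df2$sent))

df3 <- crossing(df1, rename(df2, ID2 = ID)) %>%   # all endtime/sent pairs
  filter(abs(as.numeric(difftime(sent, endtime, units = "secs"))) <= 20) %>%
  select(endtime, sent)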

Regression equation produces model outside of all data


I'm fairly confused as to why I produce a regression equation that is so far outside the range of all the data in the dataset. I have a feeling the equation is very sensitive to data with a big spread, but I'm still confused. Any assistance would be greatly appreciated; stats certainly isn't my first language!

For reference, this is a geochemical thermodynamics problem: I'm trying to fit the Maier-Kelley equation to some experimental data. The Maier-Kelley equation describes how the equilibrium constant (K), in this case for dolomite dissolving in water, changes with temperature (T, here in Kelvin):

log K = A + B*T + C/T + D*log10(T) + E/T^2

(so in the model below, A is the intercept and B, C, D and E are the coefficients of kelvin, kelvin^-1, log10(kelvin) and kelvin^-2, respectively).

The experimental data uses groundwater data from different locations and different depths (identified by FIELD and DepthID, which are my random effects).

I have included 3 datasets

(Problem)Dataset 1:https://pastebin.com/fe2r2ebA

(Working)Dataset 2:https://pastebin.com/gFgaJ2c8

(Working)Dataset 3:https://pastebin.com/X5USaaNA

Using the following code for dataset 1:

> library(lmerTest)  # loads lme4's lmer() with Satterthwaite t-tests
> dat1 <- read.csv("PATH_TO_DATASET_1.txt", header = TRUE, sep = "\t")
> fm1 <- lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1|FIELD) + (1|DepthID), data = dat1)

Warning messages:
1: Some predictor variables are on very different scales: consider rescaling 
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)
3: Some predictor variables are on very different

> summary(fm1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat1

REML criterion at convergence: -774.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5464 -0.4538 -0.0671  0.3736  6.4217 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.01035  0.1017  
 FIELD    (Intercept) 0.01081  0.1040  
 Residual             0.01905  0.1380  
Number of obs: 1175, groups:  DepthID, 675; FIELD, 410

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)       3.368e+03  1.706e+03  4.582e-02   1.974    0.876
kelvin            4.615e-01  2.375e-01  4.600e-02   1.943    0.876
I(kelvin^-1)     -1.975e+05  9.788e+04  4.591e-02  -2.018    0.875
I(log10(kelvin)) -1.205e+03  6.122e+02  4.582e-02  -1.968    0.876
I(kelvin^-2)      1.230e+07  5.933e+06  4.624e-02   2.073    0.873

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -1.000 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)

For Dataset 2

> summary(fm2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat2

REML criterion at convergence: -1073.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0816 -0.4772 -0.0581  0.3650  5.6209 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.007368 0.08584 
 FIELD    (Intercept) 0.014266 0.11944 
 Residual             0.023048 0.15182 
Number of obs: 1906, groups:  DepthID, 966; FIELD, 537

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)      -9.366e+01  2.948e+03  1.283e-03  -0.032    0.999
kelvin           -2.798e-02  4.371e-01  1.289e-03  -0.064    0.998
I(kelvin^-1)      2.623e+02  1.627e+05  1.285e-03   0.002    1.000
I(log10(kelvin))  3.965e+01  1.067e+03  1.283e-03   0.037    0.999
I(kelvin^-2)      2.917e+05  9.476e+06  1.294e-03   0.031    0.999

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -0.999 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196967 (tol = 0.002, component 1)

For Dataset 3

> summary(fm3)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) +      (1 | FIELD) + (1 | DepthID)
   Data: dat3

REML criterion at convergence: -1590.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2546 -0.4987 -0.0379  0.4313  4.5490 

Random effects:
 Groups   Name        Variance Std.Dev.
 DepthID  (Intercept) 0.01311  0.1145  
 FIELD    (Intercept) 0.01424  0.1193  
 Residual             0.03138  0.1771  
Number of obs: 6674, groups:  DepthID, 3422; FIELD, 1622

Fixed effects:
                   Estimate Std. Error         df t value Pr(>|t|)
(Intercept)       1.260e+03  1.835e+03  9.027e-02   0.687    0.871
kelvin            1.824e-01  2.783e-01  9.059e-02   0.655    0.874
I(kelvin^-1)     -7.289e+04  9.961e+04  9.044e-02  -0.732    0.866
I(log10(kelvin)) -4.529e+02  6.658e+02  9.028e-02  -0.680    0.872
I(kelvin^-2)      4.499e+06  5.690e+06  9.104e-02   0.791    0.860

Correlation of Fixed Effects:
            (Intr) kelvin I(^-1) I(10()
kelvin       0.999                     
I(kelvn^-1) -1.000 -0.997              
I(lg10(kl)) -1.000 -0.999  0.999       
I(kelvn^-2)  0.998  0.994 -0.999 -0.998
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
unable to evaluate scaled gradient
Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

I've plotted 'all the data', but for the regression analysis there is no data above the red line or below the green line. Only points with a Log_Ca_Mg value between the red and green lines at any temperature are included in the regression analysis.


So, looking at the regressions on a plot, dataset 1 is just way off, but as there is no data above the red line this confuses me no end. The regression sits in an area where there is no data. For the other two datasets this isn't a problem: even for smaller datasets (n = 200) the fit lands in approximately the same area. The three datasets look relatively similar when plotted individually.

I'm kind of lost. Any help in understanding this would be appreciated.
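As an aside on the fit warnings: a sketch of the rescaling they suggest, where each derived predictor is centred and scaled before fitting (the k1..k4 names are illustrative):

dat1s <- dat1
dat1s$k1 <- as.numeric(scale(dat1$kelvin))
dat1s$k2 <- as.numeric(scale(dat1$kelvin^-1))
dat1s$k3 <- as.numeric(scale(log10(dat1$kelvin)))
dat1s$k4 <- as.numeric(scale(dat1$kelvin^-2))

fm1s <- lmer(Log_Ca_Mg ~ k1 + k2 + k3 + k4 + (1 | FIELD) + (1 | DepthID),
             data = dat1s)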

How to compute a single mean of multiple columns?


I have a data frame with 4 columns and 8 observations:


> df1
  Rater1 Rater2 Rater4 Rater5
1      3      3      3      3
2      3      3      2      3
3      3      3      2      2
4      0      0      1      0
5      0      0      0      0
6      0      0      0      0
7      0      0      1      0
8      0      0      0      0

I would like to have the mean, median, IQR and sd of all Rater1 and Rater4 observations (16) and of all Rater2 and Rater5 observations (16), without creating a new df with 2 variables like this:

> df2
   var1 var2
1     3    3
2     3    3
3     3    3
4     0    0
5     0    0
6     0    0
7     0    0
8     0    0
9     3    3
10    2    3
11    2    2
12    1    0
13    0    0
14    0    0
15    1    0
16    0    0

I would like to obtain this (without a new data frame, just working on the first one):

> stat.desc(df2)
                   var1       var2
nbr.val      16.0000000 16.0000000
nbr.null      8.0000000 10.0000000
nbr.na        0.0000000  0.0000000
min           0.0000000  0.0000000
max           3.0000000  3.0000000
range         3.0000000  3.0000000
sum          18.0000000 17.0000000
median        0.5000000  0.0000000
mean          1.1250000  1.0625000
SE.mean       0.3275541  0.3590352
CI.mean.0.95  0.6981650  0.7652653
var           1.7166667  2.0625000
std.dev       1.3102163  1.4361407
coef.var      1.1646367  1.3516618

How can I do this in R?
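A sketch that skips df2 entirely: concatenate the paired columns with c() and compute the statistics on the pooled vectors.

v1 <- c(df1$Rater1, df1$Rater4)  # the 16 pooled observations
v2 <- c(df1$Rater2, df1$Rater5)

sapply(list(var1 = v1, var2 = v2),
       function(v) c(mean = mean(v), median = median(v),
                     IQR = IQR(v), sd = sd(v)))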

Thank you in advance

Using "Margins" Package to get the results from a "PLM" random effects model


Good day everyone!

I am using the plm package to run 6 mixed models (random effects). For comparison, in Stata I use the xtreg command.

Besides getting two very different R² values (in Stata I get ~.28 and in R ~.57), in Stata I can run the margins command and plot the results, which is very handy. I have installed the "margins" package in R but could not proceed with the analysis. Could I get any help, please? Thank you!

