Channel: Active questions tagged r - Stack Overflow

Non-linear regression with very noisy data with nls() in R [migrated]


I am trying to fit noisy data to a specific model with two parameters which I would like to estimate. Unfortunately, the model fit is just terrible with added noise. Is there anything I can do to improve this fit?

The formula/model looks like this:

model_form <- as.formula(y ~ 1/((1/i)-(r*x)))

Below I have created some sample data with added noise. Without this noise, the nls() fit works well and estimates r and i well enough. But with the added noise, the model fit and parameter estimation are poor despite visible patterns in the heatmap.

Create sample data with example parameters i and r (these are unknown in reality, but constrained to a specific known interval):

library(dplyr)  # for %>%, mutate(), bind_rows()

my_i <- 0.5   # example i parameter
my_r <- 100   # example r parameter

d <- data.frame(x = c(-rexp(500, rate = 10),
                      seq(-1, 0, length.out = 500))) %>%
    mutate(y = jitter(1/((1/my_i) - (my_r * x)), 1000))

noise <- data.frame(x = runif(1000, -1, 0),
                    y = runif(1000, 0, 0.5))

d <- bind_rows(d, noise)

The following function fits a nls to this data using model_form from above.

library(ggplot2)

fitPlot <- function(data) {
    fit <- nls(model_form, data,
               start = list(r = 10, i = 0.3))
    fit_r <- summary(fit)$coefficients["r", 1]
    fit_i <- summary(fit)$coefficients["i", 1]
    predicted <- data.frame(x = seq(-1, 0, length.out = 1000)) %>%
        mutate(y = 1 / (1/fit_i - fit_r * x))
    p <- ggplot(data) +
        geom_bin2d(aes(x, y), bins = 14) +
        geom_line(data = predicted, aes(x, y)) +
        annotate("label", x = -0.7, y = 0.5,
                 label = paste0("r: ", fit_r, ", i: ", fit_i))
    p
}

fitPlot(d)

Is there anything I can do at this point? I have tried

  • grid search with possible i and r values = same result as nls
  • optim with different optimization methods, function to minimize SS
  • nlrob from robustbase with different methods

I am lost at this point. There must be a way to build a more robust model: somehow I need to penalize outliers less, but how? Any help is welcome!
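For illustration, a minimal sketch of one direction not tried in the question: minimizing a Huber-type loss with optim(), so large residuals are penalized linearly rather than quadratically. The delta value here is an arbitrary assumption that would need tuning.

huber <- function(r, delta = 0.05) {
  # quadratic near zero, linear in the tails: outliers get less weight
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - 0.5 * delta))
}

obj <- function(par, data) {
  pred <- 1 / ((1 / par["i"]) - par["r"] * data$x)
  sum(huber(data$y - pred))
}

fit <- optim(c(r = 10, i = 0.3), obj, data = d)
fit$par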


Partykit predict function throws warning when predicting with new data


I'm trying to predict new data with a ctree object. I get this warning message when I run the function:

b1b2_party <- ctree(factor(final_category_bin) ~ ., data = train, control = ctree_control(maxsurrogate = 3))

predict.party(b1b2_party, newdata = test, type = "response")

Warning message:
In retid[indx] <- fitted_node(kids_node(node)[[i]], data, vmatch,  :
  number of items to replace is not a multiple of replacement length

The result of predict.party is a vector of all 1s (my target variable is a factor with levels 0 and 1), which leads me to believe that this warning is really messing with the results.

The test data was handled the same way as the training data. I haven't found a way to create a reproducible example for this issue, and I apologize for that. But I am wondering if anyone else has encountered this warning and what their way forward was.

I tried to predict new data with a ctree object. I expected a mix of conclusions but got all 1s returned and a warning.
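For anyone digging in, a quick diagnostic sketch (an assumption on my part, not a confirmed cause): this kind of length mismatch often traces back to factor columns whose levels differ between train and test, so comparing the levels is a cheap first check.

# hypothetical check: list factor columns whose levels differ between the sets
common <- intersect(names(train), names(test))
mismatch <- vapply(common, function(v) {
  is.factor(train[[v]]) && !identical(levels(train[[v]]), levels(test[[v]]))
}, logical(1))
names(mismatch)[mismatch]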

How do these two ways of creating an S4 object differ?


I have seen instances of each of these in various packages.

1)

foo <- setClass('foo', [slots defined])
bar <- foo([data])

or 2)

setClass('foo', [slots defined])
bar <- new('foo', [data])

Is there any difference in performance, or in the material which would be required in a NAMESPACE file?
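For concreteness, a minimal runnable sketch of both forms (the slot x and its values are made-up examples):

# Form 1: setClass() returns a generator function, captured and called directly
foo <- setClass("foo", slots = c(x = "numeric"))
bar <- foo(x = c(1, 2, 3))

# Form 2: setClass() registers the class; new() constructs it by name
setClass("foo", slots = c(x = "numeric"))
bar <- new("foo", x = c(1, 2, 3))

The two produce identical objects. As I understand it, the main practical NAMESPACE difference is that the generator from form 1 is exported like any function with export(foo), while the class itself is exported with exportClasses("foo") in either case.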

Using a custom regression model with terra


I would like to fit a custom regression model (mblm package) to each cell in a time series raster (each layer is a year). My raster manipulation is all happening with terra, so I'd prefer to continue in that package world if possible.

Example dataset:

library(terra)

d <- rast(lapply(1:10, FUN = function(x) {
  rast(matrix(rnorm(100, mean = x), ncol = 10))
}))
names(d) <- 1:10

I've looked through the regress function, but I think I might be misunderstanding something about how it works. For example, if I create a hypothetical 10-layer year raster:

d1 <- rast(lapply(names(d)[1:10], function(x) {
  rast(matrix(as.numeric(x), nrow = 10, ncol = 10))  # as.numeric() added: rast() needs a numeric matrix
}))
names(d1) <- 1:10

and use the built-in lm within regress:

plot(regress(d, d1))

The slope raster (labelled 'x') comes out as all zeros.
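For the custom-model part of the question, one direction to try (a sketch, not tested against this data): terra::app() applies a plain R function to each cell's vector of layer values, so mblm() can be fit cell by cell. The predictor years = 1:10 is an assumption mirroring the example above.

library(terra)
library(mblm)

years <- 1:10
sen_slope <- function(v) {
  if (any(is.na(v))) return(NA_real_)
  cell_df <- data.frame(y = as.numeric(v), t = years)
  coef(mblm(y ~ t, cell_df))[["t"]]  # Theil-Sen slope for this cell
}

slopes <- app(d, sen_slope)
plot(slopes)

Note that mblm() is slow, so expect a per-cell fit like this to take a while on large rasters.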

R: Alternatives/approaches to read_html() + html_text() that also work on strings without HTML/XML tags


In this solution to removing HTML tags from a string, the string is passed to rvest::read_html() to create an html_document object and then the object is passed to rvest::html_text() to return "HTML-less text."

However, read_html() throws an error if the string does not contain HTML tags because the string is treated as a file/connection path, as demonstrated below. This can be problematic when attempting to remove HTML from many strings that may not contain any tags.

library(rvest)

# Example data
dat <- c("<B>Positives:</B> Rangy, athletic build with room for additional growth. ...",
         "Positives: Better football player than his measureables would indicate. ...")

# Success: produces html_document object
rvest::read_html(dat[1])
#> {html_document}
#> <html>
#> [1] <body>\n<b>Positives:</b> Rangy, athletic build with room for additional  ...

# Error
rvest::read_html(dat[2])
#> Error in `path_to_connection()`:
#> ! 'Positives: Better football player than his measureables would
#>   indicate. ...' does not exist in current working directory
#>   ('C:/LONG_PATH_HERE').

Is there a fast way to ensure read_html() treats each string as HTML even if it does not contain any tags, or alternatively to remove HTML to the same effect as read_html() |> html_text()?

One idea was to simply append "" or "\r" to the end of each string. However, I imagine there is either a more efficient approach that returns the string without any computation when the string lacks any HTML or some way of accomplishing this using the function's arguments. Other alternatives would involve using regex to remove tags, although doing so violates the "don't use regex on html" principle.
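As one illustration of the tag-appending idea, a sketch (strip_html is a made-up helper name): wrapping each string in a throwaway tag guarantees read_html() sees markup rather than a file path.

library(rvest)

strip_html <- function(x) {
  # the <div> wrapper forces string-as-markup parsing; html_text() drops tags
  vapply(x,
         function(s) html_text(read_html(paste0("<div>", s, "</div>"))),
         character(1),
         USE.NAMES = FALSE)
}

strip_html(dat)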

Is there a way to find the combined support across rules for association rule mining?


I am working on a problem aiming to identify a rule set for the approval of applications. I am using the apriori algorithm in the arules package to find association rules which align with said approvals in historic data. What I want to do is understand whether the rule set I have covers all approvals in my dataset, as opposed to the supports of the individual rules. As a theoretical example, using the iris data, trying to find all rules for predicting Species = versicolor:

library(arules)

rules_1 <- apriori(iris,
                   parameter = list(support = 0.02,
                                    confidence = 0.95,
                                    target = 'rules'),
                   appearance = list(rhs = "Species=versicolor"))
inspect(rules_1)

The output looks like the following (truncated to 2 lines, with there being 21 other rules)

lhs                                                   rhs                    support    confidence coverage   lift count
{Sepal.Length=[4.3,5.4), Petal.Width=[0.867,1.6)}  => {Species=versicolor}   0.03333333 1          0.03333333 3    5
{Sepal.Width=[2.9,3.2), Petal.Width=[0.867,1.6)}   => {Species=versicolor}   0.11333333 1          0.11333333 3    17

The idea is, with there being only one rhs, how can I extract the lhs column in a way that lets me filter the data by all of these rules at once and then count the rows I get (knowing that there are 50 rows for versicolor)?
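One direction that may help (a sketch; it assumes transactions(iris) applies the same default discretization that apriori() used above): arules::supportingTransactions() returns, per rule, the transactions that rule matches, and the size of their union is the combined coverage.

library(arules)

trans <- transactions(iris)                        # assumed: same discretization as apriori()
supp  <- supportingTransactions(rules_1, trans)    # tidLists: transactions per rule
covered <- unique(unlist(as(supp, "list")))        # union across all rules
length(covered)                                    # rows matched by at least one rule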

In the `gt` nested table below, how can I change the text color and size?


In the gt nested table below, how can I change the text color and size? I added color and size options to gt, but it doesn't work. Can anyone help? Thanks!

library(gt)
library(tidyverse)

nested_data <- diamonds %>%
  head(10) %>%
  group_by(color) %>%
  summarise(x = list(x), y = list(y))

nested_data %>% gt() # result as attached image

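A sketch of the usual gt styling route (not from the question; the color and size values are arbitrary examples): tab_style() with cell_text() applied to the body cells.

nested_data %>%
  gt() %>%
  tab_style(
    style = cell_text(color = "steelblue", size = px(14)),  # example values
    locations = cells_body()
  )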

Improving performance of iteration in R


I originally created a for loop to calculate numerous variables that were sometimes dependent on previous iterations e.g. [i] and [i-1].

To improve the performance for larger datasets I attempted to use vectorization and produced the following code, which is quicker. The data frame I hope to use this code on has ~200,000 observations. Is there any way I can improve the performance of this code further? I intend to run different scenarios in the future, which will also increase computing time.

# Load required packages
library(dplyr)
library(purrr)

# Set seed for reproducibility
set.seed(42)

# Generate random data for storm
n <- 10000  # number of rows
storm <- data.frame(
  date_time = seq.POSIXt(from = as.POSIXct("2021-01-01"), by = "day", length.out = n),
  temp = runif(n, min = -10, max = 35),
  Rain = runif(n, min = 0, max = 50),
  PET = runif(n, min = 0, max = 10),
  Qin_m3 = runif(n, min = 0, max = 100)
)

# Placeholder functions
TSA_depth.fn <- function(volume, TSA_dimensions) {
  # Example behavior: return volume divided by some constant
  return(volume / TSA_dimensions)
}

soil_infiltration.fn <- function(volume, MRC_df, MRC_name, pipe_base_vol) {
  # Example behavior: return volume times some fraction
  return(volume * 0.1)
}

outlet_pipe.fn <- function(diameter, waterDepth, pipe_base, Cd, maxTSA_height) {
  # Example behavior: return some function of the inputs
  return(diameter * waterDepth * Cd)
}

# Constants
maxTSA_area <- 1000
TSA_dimensions <- 50
pipe_diameter <- 0.5
pipe_base_m <- 0.1
Cd <- 0.6
maxTSA_height <- 2.5
maxTSA_volume <- 2000
MRC_df <- data.frame()  # Example placeholder
MRC_name <- "example"
pipe_base_vol <- 0.1

# Initial data preparation
TSA_model <- storm %>%
  select(date_time, temp, Rain, PET, Qin_m3) %>%
  mutate(Qin_m3 = replace(Qin_m3, 1, 0))

# Initialize new columns with 0 and precompute PET_m3
TSA_model <- TSA_model %>%
  mutate(S = 0, dS = 0, depth = 0,
         PET_m3 = round((PET / 1000) * maxTSA_area, digits = 3),
         soil_m3 = 0, pipe_m3 = 0, overflow_m3 = 0, Qout = 0)

# Function to calculate the values for the rows
calculate_values <- function(df) {
  n <- nrow(df)
  for (i in 2:n) {
    df$dS[i] <- df$Qin_m3[i] - df$Qout[i - 1]
    df$S[i] <- df$S[i - 1] + df$dS[i]
    df$depth[i] <- TSA_depth.fn(volume = df$S[i], TSA_dimensions = TSA_dimensions)
    soil_infiltration <- soil_infiltration.fn(volume = df$S[i], MRC_df = MRC_df,
                                              MRC_name = MRC_name, pipe_base_vol = pipe_base_vol)
    df$soil_m3[i] <- max(soil_infiltration - df$PET_m3[i], 0)
    df$soil_m3[i] <- ifelse(df$S[i] - df$soil_m3[i] < 0, df$S[i], df$soil_m3[i])
    pipe_outflow <- outlet_pipe.fn(diameter = pipe_diameter, waterDepth = df$depth[i],
                                   pipe_base = pipe_base_m, Cd = Cd, maxTSA_height = maxTSA_height)
    df$pipe_m3[i] <- max(pipe_outflow, 0)
    df$overflow_m3[i] <- ifelse(df$S[i] > maxTSA_volume,
                                max(df$S[i] - maxTSA_volume - df$pipe_m3[i] - df$soil_m3[i] - df$PET_m3[i], 0),
                                0)
    df$Qout[i] <- df$PET_m3[i] + df$soil_m3[i] + df$pipe_m3[i] + df$overflow_m3[i]
    df$Qout[i] <- ifelse(df$S[i] - df$Qout[i] < 0, df$S[i], df$Qout[i])
  }
  return(df)
}

# Calculate the new values
TSA_model <- calculate_values(TSA_model)

# Display the updated TSA_model
str(TSA_model)
summary(TSA_model)
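Since Qout[i] depends on Qout[i-1], the recursion itself is inherently sequential. One safe speedup to try (a sketch of the same logic, not new behavior): run the loop over plain numeric vectors and write back to the data frame once at the end, since repeated df$col[i] <- assignment copies the column each iteration and is typically the dominant cost.

calculate_values_fast <- function(df) {
  # pull columns into plain vectors once
  Qin <- df$Qin_m3; PET <- df$PET_m3
  S <- df$S; dS <- df$dS; depth <- df$depth
  soil <- df$soil_m3; pipe <- df$pipe_m3
  over <- df$overflow_m3; Qout <- df$Qout
  for (i in 2:nrow(df)) {
    dS[i] <- Qin[i] - Qout[i - 1]
    S[i] <- S[i - 1] + dS[i]
    depth[i] <- TSA_depth.fn(volume = S[i], TSA_dimensions = TSA_dimensions)
    infil <- soil_infiltration.fn(volume = S[i], MRC_df = MRC_df,
                                  MRC_name = MRC_name, pipe_base_vol = pipe_base_vol)
    soil[i] <- max(infil - PET[i], 0)
    if (S[i] - soil[i] < 0) soil[i] <- S[i]
    pipe[i] <- max(outlet_pipe.fn(diameter = pipe_diameter, waterDepth = depth[i],
                                  pipe_base = pipe_base_m, Cd = Cd,
                                  maxTSA_height = maxTSA_height), 0)
    over[i] <- if (S[i] > maxTSA_volume)
      max(S[i] - maxTSA_volume - pipe[i] - soil[i] - PET[i], 0) else 0
    Qout[i] <- PET[i] + soil[i] + pipe[i] + over[i]
    if (S[i] - Qout[i] < 0) Qout[i] <- S[i]
  }
  # write back once at the end
  df$S <- S; df$dS <- dS; df$depth <- depth; df$soil_m3 <- soil
  df$pipe_m3 <- pipe; df$overflow_m3 <- over; df$Qout <- Qout
  df
}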

Deploying a Shiny App, using deployApp call


This code to upload my "Hello World" shiny App to ShinyApps.io works:

rsconnect::deployApp(account = 'richardrogers',
                     appDir = "*********************************************/SuperSimpleShiny",
                     appFiles = c("app.R", "config.yml"))

I receive this as a response:

-- Preparing for deployment ------------------------------------------------------------
v Re-deploying "supersimpleshiny" using "server: shinyapps.io / username: richardrogers"
i Looking up application with id "12017803"...
v Found application <https://richardrogers.shinyapps.io/supersimpleshiny/>
i Bundling 2 files: app.R and config.yml
i Capturing R dependencies with renv
v Found 30 dependencies
v Created 18,284b bundle
i Uploading bundle...
v Uploaded bundle with id 8625650
-- Deploying to server -----------------------------------------------------------------
Waiting for task: 1417136163
  building: Building image: 10503736
  building: Fetching packages
  building: Installing packages
  building: Installing files
  building: Pushing image: 10503736
  deploying: Starting instances
  rollforward: Activating new instances
  unstaging: Stopping old instances
-- Deployment complete -----------------------------------------------------------------
v Successfully deployed to <https://richardrogers.shinyapps.io/supersimpleshiny/>

I make only one change to the deployApp call, changing "app.R" to "SuperSimpleShiny.R":

rsconnect::deployApp(account = 'richardrogers',
                     appDir = "*********************************************/SuperSimpleShiny",
                     appFiles = c("SuperSimpleShiny.R", "config.yml"))

I receive this as a response:

-- Preparing for deployment ------------------------------------------------------------
v Re-deploying "supersimpleshiny" using "server: shinyapps.io / username: richardrogers"
Error in `quartoInspect()`:
! `quarto` not found.
i Check that it is installed and available on your `PATH`.
Run `rlang::last_trace()` to see where the error occurred.

I get the exact same results whether I execute these commands in RStudio or the standalone R console. I have used the fc (file compare) command-line utility to compare "app.R" and "SuperSimpleShiny.R", and they are identical. There must be some special significance to the file name "app.R". Is it similar to the C main() function?

I've looked at similar posts about this problem, including this one: r shinyapps deployment error when doing it manually

Maybe I've found a bug in the rsconnect::deployApp() function. I'm stumped.
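For what it's worth, a sketch of the usual workaround (relying on the documented Shiny convention that a single-file app's entry point must be named app.R, or the ui.R/server.R pair, much like main() in C): rename the file before deploying.

# restore the conventional entry-point name, then deploy as before
file.rename("SuperSimpleShiny.R", "app.R")
rsconnect::deployApp(account = 'richardrogers',
                     appDir = "*********************************************/SuperSimpleShiny",
                     appFiles = c("app.R", "config.yml"))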

Any pointers most appreciated.

Richard

R dplyr summarise over intervals


This is probably an R / dplyr / summarise question. I have a data.frame with values recorded for subjects at 5-minute time intervals, and it has these three columns: id (subject ID), value (the recorded value at the time point), and cum_time (the cumulative time value for each id):

library(dplyr)

set.seed(1)
df <- data.frame(id = c(rep("id1", 100), rep("id2", 100), rep("id3", 100)),
                 value = runif(300, 10, 20)) %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(cum_time = 5 * (dplyr::row_number() - 1))

I'd like to compute a data.frame with the medians of value over 60 minute intervals to give this resulting data.frame:

rbind(data.frame(id = "id1",
                 median_value = c(median(dplyr::filter(df, id == "id1" & cum_time >= 0 & cum_time <= 60)$value),
                                  median(dplyr::filter(df, id == "id1" & cum_time >= 65 & cum_time <= 120)$value),
                                  median(dplyr::filter(df, id == "id1" & cum_time >= 125 & cum_time <= 180)$value),
                                  median(dplyr::filter(df, id == "id1" & cum_time >= 185 & cum_time <= 240)$value),
                                  median(dplyr::filter(df, id == "id1" & cum_time >= 245 & cum_time <= 300)$value)),
                 cum_time = c(60, 120, 180, 240, 300)),
      data.frame(id = "id2",
                 median_value = c(median(dplyr::filter(df, id == "id2" & cum_time >= 0 & cum_time <= 60)$value),
                                  median(dplyr::filter(df, id == "id2" & cum_time >= 65 & cum_time <= 120)$value),
                                  median(dplyr::filter(df, id == "id2" & cum_time >= 125 & cum_time <= 180)$value),
                                  median(dplyr::filter(df, id == "id2" & cum_time >= 185 & cum_time <= 240)$value),
                                  median(dplyr::filter(df, id == "id2" & cum_time >= 245 & cum_time <= 300)$value)),
                 cum_time = c(60, 120, 180, 240, 300)),
      data.frame(id = "id3",
                 median_value = c(median(dplyr::filter(df, id == "id3" & cum_time >= 0 & cum_time <= 60)$value),
                                  median(dplyr::filter(df, id == "id3" & cum_time >= 65 & cum_time <= 120)$value),
                                  median(dplyr::filter(df, id == "id3" & cum_time >= 125 & cum_time <= 180)$value),
                                  median(dplyr::filter(df, id == "id3" & cum_time >= 185 & cum_time <= 240)$value),
                                  median(dplyr::filter(df, id == "id3" & cum_time >= 245 & cum_time <= 300)$value)),
                 cum_time = c(60, 120, 180, 240, 300)))

    id median_value cum_time
1  id1     15.72853       60
2  id1     15.74687      120
3  id1     14.87811      180
4  id1     16.00048      240
5  id1     14.57858      300
6  id2     15.98761       60
7  id2     14.65317      120
8  id2     15.36035      180
9  id2     15.16835      240
10 id2     13.90954      300
11 id3     12.68951       60
12 id3     15.79852      120
13 id3     14.03968      180
14 id3     14.29187      240
15 id3     15.11250      300
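For what it's worth, a compact sketch of one way to get there (an illustration, not verified against the exact output above; the pmax() keeps cum_time = 0 in the first window):

library(dplyr)

df %>%
  group_by(id, cum_time = 60 * pmax(ceiling(cum_time / 60), 1)) %>%
  summarise(median_value = median(value), .groups = "drop")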

Non-NA values being replaced as NA during Excel File Import into R


I have an Excel file which I have imported into R using read_excel. The Excel file contains 8000+ records with 58 columns. There are many cells with NA values. Across the different variables, they may be entered as "n/a", "N/A", "n/A", "N/a", "na", "NA", "n a", "(blank)".

When I'm importing the file, I use this code:

read_excel("path", col_names = TRUE, na = c("n/a", "N/A", "n/A", "N/a","na","NA", "n a",""))

In one column, I have records with the value NR, which stands for No Record and is not the same thing as NA.

When I run the read_excel code above, the NR becomes NA. If I take out the na = c("n/a", "N/A", "n/A", "N/a", "na", "NA", "n a", "") section, the NR is imported as it should be. How do I get R to recognize that NR is not the same thing? As a side note, I cannot change the NR notation to something else.
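One workaround to consider (a sketch, not a confirmed explanation of why NR gets caught): import without the na argument, then blank out only the exact NA spellings afterwards, which leaves "NR" untouched.

library(readxl)
library(dplyr)

na_strings <- c("n/a", "N/A", "n/A", "N/a", "na", "NA", "n a", "")

dat <- read_excel("path", col_names = TRUE) %>%
  # replace only exact matches of the NA spellings, column by column
  mutate(across(where(is.character),
                ~ replace(.x, .x %in% na_strings, NA)))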

Bar plots with identical bar width *and* identical bar spacing across *several* plots


Similarly to this question, I want to generate several separate bar plots that always have the same bar width. I also want the spacing between the bars to be the same in each of these plots.

I also need these plots to be saved individually, with each saved image's size proportional to the number of bars. I do not want to use facets or grids (which means similar questions on SO do not apply to me). Furthermore, I need a geom_bar() solution together with ggsave().

REPRODUCIBLE EXAMPLE

Libraries

library(tidyverse)
library(ggplot2)

Mock data (in reality, I have hundreds of variables and up to 15 levels for some factors)

df <- structure(list(Animal = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L),
                                        levels = c("Cat", "Dog", "Horse"), class = "factor"),
                     Sex = structure(c(2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L),
                                     levels = c("Female", "Male"), class = "factor")),
                row.names = c(NA, 10L), class = "data.frame")

Function to generate bar plots

# Create counter
counter <- 0

# Function
make_bar_plots <- function(df){

  # We loop over the columns
  for (col in colnames(df)){

    # (1) Preparation ##########################################################

    # Store N
    N <- sum(!is.na(df[col]))

    # Store number of levels
    n_levels <- length(unique(df[!is.na(df[col]), col]))[1]

    # Update counter at each iteration
    counter <- counter + 1

    # (2) Basic bar plot #######################################################

    # We only plot factors
    if (is.factor(df[[col]])) {

      p <- ggplot(data = df[col] %>% filter(!is.na(df[col])),
                  aes(x = fct_rev(fct_infreq(!!sym(col)))))

      # Compute percents for bars
      p <- p + geom_bar(aes(y = (..count..)/sum(..count..)),
                        width = 0.35*n_levels/3, # We make column width proportional to number of levels!
                        fill = 'black',
                        size = 1.5)

      # (3) Annotations ########################################################

      # Add percent as annotation
      p <- p + geom_text(aes(label = scales::percent((..count..)/sum(..count..),
                                                     accuracy = 0.1L),
                             y = ((..count..)/sum(..count..))),
                         stat = "count", hjust = -0.1, size = 4.2)

      # Add a title
      p <- p + labs(title = paste(col),
                    subtitle = paste0('N = ', format(N, big.mark = ',')))

      # (4) Aesthetics #########################################################

      p <- p + theme_minimal()
      p <- p + theme(axis.title.y = element_blank(),
                     axis.title.x = element_blank(),
                     axis.text.x = element_blank(),
                     axis.text.y = element_text(colour = 'black', size = 12.5),
                     plot.margin = margin(r = 25, l = 1, b = 0, t = 1),
                     plot.title = element_text(size = 16.5),
                     plot.subtitle = element_text(size = 12.5),
                     panel.grid = element_blank())

      # Flip coordinates
      p <- p + coord_flip(clip = 'off')

      # (5) Print and export   #################################################

      print(p)

      # Export
      ggsave(paste(counter, '_', col, '.png'), path = 'plots/',
             bg = 'white')
    }
  }
}

Output

make_bar_plots(df)

Plot 1

Plot 2

Discussion

We see that bar width is the same in both plots. This was achieved with the argument width = 0.35*n_levels/3 that is included in the function.

That said, the spacing between bars is different for Cat, Dog, and Horse (Plot 1), compared to Male and Female (Plot 2).

What I have tried:

  1. In the ggsave() call, to adjust the height of the generated image using a proportion (e.g., height = 2*n_levels/3). That said, the resulting plots have different bar width, see illustration below.
  2. In the aesthetics part of the function, to adjust aspect ratio (e.g., p + theme(aspect.ratio = 1/4)). This too makes the width of the bars different across plots.
  3. A combination of the above, with the same undesired results.

Illustration when using height = 2*n_levels/3

Plot 1

Plot 2

My 2 questions:

  1. How do we obtain the same spacing between bars across all plots, while also keeping bar width identical across all plots?
  2. Furthermore, how do we save an image whose height will be automatically adjusted to the number of bars? In my example (first two plots of this post), the height of the Sex plot (2 bars) would be smaller than that of the Animal plot (3 bars). Currently, both have the same height.
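On question 2, a sketch of one direction (the helper name and inch values are made-up examples): keep geom_bar(width = ...) fixed and make the saved height affine in the number of bars, so a constant padding term absorbs the title and margin overhead that a purely proportional height (like 2*n_levels/3) cannot.

# hypothetical helper: constant inches per bar plus constant padding
save_bar_plot <- function(p, n_bars, file,
                          per_bar_in = 0.6, padding_in = 1.8) {
  ggsave(file, plot = p, path = 'plots/', width = 7,
         height = padding_in + per_bar_in * n_bars, bg = 'white')
}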

I am having a problem with my logistic regression machine learning model


My model accuracy is pretty bad. The data is taken from https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original, where a logistic regression model reportedly reaches 96% accuracy, so the problem is really in my model. I have built the following model in R.

# Importing dataset
tumor_study <- read.csv("breast-cancer-wisconsin.data", header = FALSE, na.strings = "NA")

# Adding column names
features <- c("id_number", "ClumpThickness", "Uniformity_CellSize",
              "Uniformity_CellShape", "MarginalAdhesion",
              "SingleEpithelial_CellSize", "BareNuclei", "Bland_Chromatin",
              "Normal_Nucleoli", "Mitoses", "Class")
colnames(tumor_study) <- features

# Cleaning data
# Remove the 1st column (id_number)
tumor_study <- tumor_study[, -1]

# Convert "?" to NA in BareNuclei column and then to numeric
tumor_study$BareNuclei[tumor_study$BareNuclei == "?"] <- NA
tumor_study$BareNuclei <- as.numeric(tumor_study$BareNuclei)

# Remove rows with missing values in BareNuclei
tumor_study <- tumor_study[!is.na(tumor_study$BareNuclei), ]

# Convert Class to a factor
tumor_study$Class <- factor(tumor_study$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))

# Splitting the dataset into training and test sets
library(caTools)
set.seed(123)
split <- sample.split(tumor_study$Class, SplitRatio = 0.8)
training_set <- tumor_study[split == TRUE, ]
test_set <- tumor_study[split == FALSE, ]

# Applying feature scaling
training_set[, 1:9] <- scale(training_set[, 1:9])
test_set[, 1:9] <- scale(test_set[, 1:9])

# Building the logistic regression model
classifier <- glm(formula = Class ~ ., family = binomial, data = training_set)

# Predicting probabilities for the training set
prob_y_train <- predict(classifier, type = 'response', newdata = training_set[, -10])
predicted_y_training <- ifelse(prob_y_train >= 0.5, "Benign", "Malignant")

# prediction using test_set
prob_y_test <- predict(classifier, type = 'response', newdata = test_set[, -10])
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Benign", "Malignant")

# Checking the accuracy with confusion matrix
cm_test <- table(test_set[, 10], predicted_y_test)
print(cm_test)
## if you check the accuracy ... it is close to 2%

How can I figure out the problem in my model?
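One diagnostic worth running (an observation, not a guaranteed full fix): for a binomial glm() with a factor response, predict(type = 'response') returns the probability of the second factor level, here "Malignant", so the label order in the ifelse() calls matters.

# check which level the model treats as the "success" (second) level
levels(training_set$Class)   # "Benign" "Malignant" -> P(Malignant) is modelled

# so a probability >= 0.5 corresponds to "Malignant", not "Benign"
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Malignant", "Benign")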

Creating Loop for Triangles


I have multiple objects that I created which include data I need for each triangle. I now want to create triangles and export to csv, but I have so many datasets that I'd like to do this in a loop. Bonus points if I can also name each csv the same as the triangle name (for example below, LiabilityTriangleMonthly). My current code that works on a one-by-one basis is as follows:

library(ChainLadder)  # for as.triangle()

Liability_Triangle_Monthly = as.triangle(LiabilityData_Monthly,
                                         dev = "MaturityMonth",
                                         origin = "AY",
                                         value = "IncurredTotal")
LiabilityTriangleMonthly = as.matrix(Liability_Triangle_Monthly)
LiabilityTriangleMonthly[is.na(LiabilityTriangleMonthly)] <- 0
write.csv(LiabilityTriangleMonthly, "//filepath/LiabilityTriangleMonthly.csv")

For reference, the names of two of my objects: LiabilityData_Monthly and PhysDamData_Monthly.

I tried to create a list to loop through, but I'm very new to R and need a simpler version than what I've seen on this site already.
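A minimal sketch of one such loop (the name-derivation via sub() assumes all datasets follow the SomethingData_Monthly pattern shown above):

datasets <- c("LiabilityData_Monthly", "PhysDamData_Monthly")

for (nm in datasets) {
  # fetch the object by name, build the triangle, zero the NAs
  tri <- as.matrix(as.triangle(get(nm),
                               dev = "MaturityMonth",
                               origin = "AY",
                               value = "IncurredTotal"))
  tri[is.na(tri)] <- 0
  # "LiabilityData_Monthly" -> "LiabilityTriangleMonthly"
  out_name <- sub("Data_", "Triangle", nm)
  write.csv(tri, paste0("//filepath/", out_name, ".csv"))
}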

Problem with inserting a picture in R Markdown


I am using LaTeX syntax to insert a .png file into the text, since I am creating a PDF in R Markdown.

title: "Untitled"author: "Unnamed"date: "2024-05-17"output:  pdf_document: default  word_document: default---```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE)``````{r}pckgs = c('readr', 'kableExtra', 'magrittr', 'tidyverse', 'ggplot2', 'ggpubr','flextable','janitor', 'tidyr', 'readxl', 'car', 'effects', 'pastecs', 'compute.es', 'WRS2', 'multcomp', 'Hmisc', 'ggm', 'polycor', 'rstatix','babynames', 'hrbrthemes', 'viridis', 'WRS2', 'googledrive', 'ppcor', 'lattice', 'lmtest', 'betareg', 'ggcorrplot', 'corrplot', 'RColorBrewer', 'glmmTMB', 'DHARMa', 'gtsummary', 'knitr')sapply(pckgs, function(x){if(!require(x, character.only = T)){install.packages(x);require(x, character.only = T)}  else{require(x, character.only = T)}})extxtxtxxtxtxt\textbf{Fig.~\ref{fig:figure1}}. ```{r}setwd("C:/all path to picture folder")```\begin{figure}[h]\centering\includegraphics[width=0.5\textwidth]{AA.png}\caption{Fig 1: xxxxxxx}\label{fig:figure1}\end{figure}

But I am getting the following error.

A new version of TeX Live has been released. If you need to install or update any LaTeX packages, you have to upgrade TinyTeX with tinytex::reinstall_tinytex(repository = "illinois").

tlmgr.pl: Local TeX Live (2023) is older than remote repository (2024).
Cross release updates are only supported with
  update-tlmgr-latest(.sh/.exe) --update
See https://tug.org/texlive/upgrade.html for details.
! Package pdftex.def Error: File `AA.png' not found: using draft setting.
Errore: LaTeX failed to compile tr.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips. See tr.log for more info.

I am sure that AA.png is in the right folder. Do you know how I could fix it?
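One thing worth checking (an observation about knitr, not a confirmed diagnosis): setwd() inside a chunk does not persist, because knitr resets the working directory for each chunk, and LaTeX resolves \includegraphics paths relative to the .Rmd file's directory. A sketch that sidesteps this by giving the full path (reusing the question's placeholder path):

% full path instead of relying on setwd(), which knitr does not preserve
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{C:/all path to picture folder/AA.png}
\caption{Fig 1: xxxxxxx}
\label{fig:figure1}
\end{figure}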

Thanks


Remove value but keep color for insignificant correlations using ggcorrplot


I am using ggcorrplot to visualize a correlation matrix. How can I show the correlation coefficient value inside the plot only if it is significant? For non-significant correlations there should be no value displayed but the color should still be present.

library(ggplot2)
library(ggcorrplot)

# Make correlation coefficient matrix
corr <- round(cor(mtcars), 1)

# Make p-value matrix associated with the correlation coefficients
corr_p <- cor_pmat(mtcars)

# Shows all of the correlation coefficients
ggcorrplot(corr,
           lab = TRUE)

# Puts a X over the non-significant correlations
ggcorrplot(corr,
           lab = TRUE,
           p.mat = corr_p)

# Removes the value of non-significant correlations but also removes color
# I want to keep the color
ggcorrplot(corr,
           lab = TRUE,
           p.mat = corr_p,
           insig = 'blank')

Created on 2024-05-17 with reprex v2.1.0
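One workaround to experiment with (a sketch; it assumes ggcorrplot's discrete axes line up with the melted variable names, which is worth verifying): draw the colored matrix without labels, then overlay geom_text() only where p < 0.05.

library(reshape2)

# long format: one row per cell, with its coefficient and p-value
labs_df <- melt(corr, value.name = "r")
labs_df$p <- melt(corr_p)$value

ggcorrplot(corr) +
  geom_text(data = subset(labs_df, p < 0.05),
            aes(x = Var1, y = Var2, label = r),
            inherit.aes = FALSE, size = 3)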

Function containing for-loop in R not properly inserting results into data table


I am attempting to create a function that evaluates a data table column and returns the result in a new data table. However, when I create and run the function, the created data table comes back empty. When I run the for loop contained within the function by itself, it works, but when I put it within a function, it seems to break. I'm not sure what is happening.

Seems like it should be a very simple fix, but I'm new to R and coding in general and can't figure out what I'm doing wrong.

Below is a stripped-down function that just tries to pull a data table row, add one, and insert it into a new data table.

# Here's the function I'm trying to make
library(data.table)

og.dt <- data.table(c(1, 2, 3, 4))
dt <- data.table()

pull <- function(vec){
  for(i in 1:4){
    dd <- vec[i] + 1
    dt <- rbind(dt, dd)
  }
}

pull(og.dt)
dt
# dt returns an empty data table

The for loop on its own works and gives the desired output, but for some reason putting it within a function breaks it

dt <- data.table()
for(i in 1:4){
  dd <- og.dt[i]
  dt <- rbind(dt, dd)
}
dt
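For reference, a minimal sketch of one likely fix (the assignment dt <- rbind(...) inside the function creates a local copy, so the global dt never changes): build the table inside the function and return it.

library(data.table)

pull <- function(vec){
  out <- data.table()
  for(i in 1:4){
    dd <- vec[i] + 1
    out <- rbind(out, dd)   # grows the local table
  }
  out                        # return it instead of relying on side effects
}

dt <- pull(og.dt)
dt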

Thanks!

Rolling window quantile regression


I'm analyzing financial time series data, specifically returns. I need to build a quantile regression model. I've applied the stepwise function to find the best predictors among the available variables, and then I plan to eliminate variables with a p-value greater than 0.05. Now I have my quantile regression model. I need to estimate future Value at Risk (VaR) and compare my regression model with other models like CAViaR and GARCH through backtesting. My issue lies in performing the rolling window for my quantile regression model. I've divided my time series into an in-sample (dim=808) and out-of-sample (dim=540) sample. Unfortunately, I'm unable to obtain estimated VaR and conduct backtesting. How should I proceed?

library(readxl)
library(quantreg)

# Load the data
# Split the data into in-sample (808 observations) and out-of-sample (the remaining observations)
ETH <- ts(Crypto_index$`Ethereum (ETH)`)
Gold <- ts(Crypto_index$Gold)
SP500 <- ts(Crypto_index$`S&P500 (GSPC)`)
BTC <- ts(Crypto_index$`Bitcoin (BTC)`)
USDX <- ts(Crypto_index$`US dollar index (USDX)`)
Oil <- ts(Crypto_index$`WTI crude oil`)
LTC <- ts(Crypto_index$`Litecoin (LTC)`)
BCH <- ts(Crypto_index$`Bitcoin cash (BCH)`)
SPUSBT <- ts(Crypto_index$`S&P US Treasury Bond (SPUSBT)`)
XRP <- ts(Crypto_index$`Ripple (XRP)`)

ETH_in <- head(ETH, 808)
Gold_in <- head(Gold, 808)
SP500_in <- head(SP500, 808)
BTC_in <- head(BTC, 808)
USDX_in <- head(USDX, 808)
Oil_in <- head(Oil, 808)
LTC_in <- head(LTC, 808)
BCH_in <- head(BCH, 808)
SPUSBT_in <- head(SPUSBT, 808)
XRP_in <- head(XRP, 808)

ETH_out <- tail(ETH, 540)
Gold_out <- tail(Gold, 540)
SP500_out <- tail(SP500, 540)
BTC_out <- tail(BTC, 540)
USDX_out <- tail(USDX, 540)
Oil_out <- tail(Oil, 540)
LTC_out <- tail(LTC, 540)
BCH_out <- tail(BCH, 540)
SPUSBT_out <- tail(SPUSBT, 540)
XRP_out <- tail(XRP, 540)

# Train the quantile regression models on the in-sample data
rq_1 <- rq(ETH_in ~ BTC_in + LTC_in + XRP_in + SP500_in + BCH_in, tau = 0.05)
rq_2 <- rq(SP500_in ~ Oil_in + SPUSBT_in, tau = 0.05)
rq_3 <- rq(Gold_in ~ ETH_in + SPUSBT_in, tau = 0.05)

# Out-of-sample predictions
predictions_ETH <- predict.rq(rq_1, newdata = data.frame(BTC = BTC_out, LTC = LTC_out, XRP = XRP_out, SP500 = SP500_out, BCH = BCH_out))
predictions_SP500 <- predict.rq(rq_2, newdata = data.frame(Oil = Oil_out, SPUSBT = SPUSBT_out))
predictions_Gold <- predict.rq(rq_3, newdata = data.frame(ETH = ETH_out, SPUSBT = SPUSBT_out))

I'm facing problems with the dimensionality between the in-sample and out-of-sample variables. I've tried creating a rolling window, but I'm unsure how to proceed. How can I create one? I think that's the answer.
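One observation first: the models above were fit on variables named BTC_in etc., so newdata columns named BTC will not be matched by predict(). A minimal sketch of a fixed-width rolling window that avoids this by fitting on the columns of a single data.frame (assuming the full series have 808 + 540 observations; 1-step-ahead 5% quantile forecast = VaR estimate):

library(quantreg)

dat <- data.frame(ETH, BTC, LTC, XRP, SP500, BCH)

window <- 808        # in-sample width
n_out  <- 540        # out-of-sample points
VaR_hat <- numeric(n_out)

for (k in seq_len(n_out)) {
  idx <- k:(window + k - 1)                       # rolling in-sample window
  fit <- rq(ETH ~ BTC + LTC + XRP + SP500 + BCH,
            tau = 0.05, data = dat[idx, ])
  # 1-step-ahead forecast of the 5% quantile
  VaR_hat[k] <- predict(fit, newdata = dat[window + k, , drop = FALSE])
}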

How to identify whether rows within a group have values in the same columns?


I have a dataframe where each row represents the ratings a project group received from one of three raters (advisor, mentor, other). The columns titled ELA1, ELA2, MATH1, MATH2, ESS1, ESS2 are the items groups can be rated on. However, each group only needs to be rated on three of these items, which means each group can be rated on a different set of items. But all three raters should rate the same items within a group. I want to check whether all three raters rated the same set of items within a group. It would be ideal to create a dummy variable that indicates whether the group received ratings on the same set of items from each rater.

Below is an example of my data frame:

df <- data.frame(group = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 rater = c("Advisor", "Mentor", "Other", "Advisor", "Mentor", "Other", "Advisor", "Mentor", "Other"),
                 ELA1 = c(1, 2, 2, NA, NA, 1, NA, NA, NA),
                 ELA2 = c(NA, NA, NA, 2, NA, 1, NA, NA, NA),
                 MATH1 = c(3, 3, 2, NA, 2, NA, 3, 3, 2),
                 MATH2 = c(2, 3, 2, NA, NA, 1, 3, 3, 1),
                 ESS1 = c(NA, NA, NA, 2, 2, NA, 3, 3, 1),
                 ESS2 = c(NA, NA, NA, 2, 2, NA, NA, NA, NA))

In the example data frame above, two groups (A and C) were scored on the same items by all three raters. But the ratings provided to group B were not for the same set of items. I need help writing code that identifies instances where the raters did not rate the same set of items within a group and, ideally, creates an indicator column showing whether a group's raters rated the same set of items or not.

I have no code to add below because I don't have a clue how to approach this. Is there any way to get R to do this?
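A sketch of one possible approach (dplyr; the column names pattern and same_items are made up): encode each rater's rated-item pattern as a string of TRUE/FALSE flags, then flag groups where all raters share a single pattern.

library(dplyr)

df %>%
  rowwise() %>%
  # which items did this rater score? e.g. "TRUEFALSETRUETRUEFALSEFALSE"
  mutate(pattern = paste(!is.na(c_across(ELA1:ESS2)), collapse = "")) %>%
  ungroup() %>%
  group_by(group) %>%
  # TRUE when all raters in the group rated the same item set
  mutate(same_items = n_distinct(pattern) == 1) %>%
  ungroup() %>%
  select(-pattern)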

ggplotly produces a blank plot


I am having significant difficulty with plotly in R. The code below produces a blank plot, even though calling p without ggplotly() produces a plot. I have uninstalled and reinstalled both R and RStudio, but the problem persists.

library(plotly)
library(ggplot2)

data(mtcars)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()

ggplotly(p)

I have tried several possible fixes. R, RStudio, and plotly are all the latest versions: R 4.4.0, RStudio 2024.04.1, plotly 4.10.4.

The code below will produce a plot on a separate page in the web browser:

plot <- plot_ly(data = mtcars, x = ~wt, y = ~mpg, type = 'scatter', mode = 'markers')
htmlwidgets::saveWidget(plot, "test_plot.html")
browseURL("test_plot.html")