Quantcast
Channel: Active questions tagged r - Stack Overflow
Viewing all articles
Browse latest Browse all 206473

Allocating tasks to parallel workers so that expected cost is roughly equal

$
0
0

I have an assignment problem where I'm trying to allocate a number of tasks with a known expected cost (runtime in seconds) to X parallel workers, subject to the constraint that each worker receives the same number of tasks (save for remainders), so that the total expected runtime per worker is roughly equal.

I'm using a data frame that defines the tasks to be executed, and for each task I can calculate a pretty accurate expected cost (runtime in seconds). E.g. something like this:

library("tibble")

set.seed(1232)
tasks <- tibble(task = 1:20, cost = runif(20, min = 1, max = 5)^2)
head(tasks)
#> # A tibble: 6 x 2
#>    task  cost
#>   <int> <dbl>
#> 1     1 22.5 
#> 2     2 20.0 
#> 3     3 21.3 
#> 4     4  8.13
#> 5     5 18.3 
#> 6     6 19.6

Created on 2019-11-21 by the reprex package (v0.3.0)

This is then used with foreach::foreach(...) %dopar% ... to execute the tasks in parallel. foreach() splits the tasks into roughly equal sized groups with size nrow(tasks)/X where X is the number of parallel workers (cores).

I'm currently shuffling the task list so that the cost is roughly equal for each worker, but there can still be substantial deviations, i.e. some workers get finished much earlier than others and thus it would have been better if they had had some more costly tasks. E.g.:

# shuffle tasks (in the original application cost is not random initially)
tasks <- tasks[sample(1:nrow(tasks)), ]

# number of workers
X <- 4
tasks$worker <- rep(1:X, each = nrow(tasks)/X)

# expected total cost (runtime in s) per worker
sapply(split(tasks$cost, tasks$worker), sum)
#>        1        2        3        4 
#> 77.25278 35.25026 66.09959 64.05435

Created on 2019-11-21 by the reprex package (v0.3.0)

The second worker finishes in half the time as the other workers, so its capacity is wasted and the thing overall takes longer to finish.

What I'd like to do instead is have a way of re-ordering the task data frame so that when foreach splits it into X groups the total expected cost per group is more even.

I imagine this is a super-well known kind of problem and I just don't know the right verbiage to google (nor how to do it in R). Thanks for any help.

(EDIT) Mostly better alternative

For now, a relatively simple alternative that seems to do better than random shuffling. This orders the tasks by cost, assigns the first X tasks to workers 1 to X, then assigns the next chunk of X tasks in reverse order to workers X to 1, etc (this is "alt1" below).

library("tibble")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")

race <- function() {
  N_TASKS = 100
  X = 4
  tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)

  # random shuffle
  tasks$worker <- rep(1:X, each = nrow(tasks)/X)
  rando <- max(sapply(split(tasks$cost, tasks$worker), sum))

  # alternative 1
  tasks <- tasks[order(tasks$cost), ]
  tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
  alt1 <- max(sapply(split(tasks$cost, tasks$worker), sum))

  # modified version of ivan100sic's answer
  # sort by descending cost, after initial allocation, allocate costly tasks
  # first to workers with lowest total cost so far
  group <- factor(rep(1:(ceiling(nrow(tasks)/4)), each = X))
  tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
  tasks$worker <- c(1:X, rep(NA, length.out = nrow(tasks) - X))
  task_sets <- split(tasks, group)
  task_sets[[1]]$worker <- 1:X
  for (i in 2:length(task_sets)) {
    # get current total cost by worker
    total <- task_sets %>% 
      bind_rows() %>%
      filter(!is.na(worker)) %>%
      group_by(worker) %>%
      summarize(cost = sum(cost)) %>%
      arrange(cost)
    task_sets[[i]]$worker <- total[["worker"]]
  }
  tasks <- bind_rows(task_sets)
  alt2  <- max(sapply(split(tasks$cost, tasks$worker), sum))

  c(rando = rando, alt1 = alt1, alt2 = alt2)
}

set.seed(24332)
sims <- replicate(1e3, race())
sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(rando:alt2, names_to = "Method") %>%
  ggplot(aes(x = value, color = Method)) + 
  geom_density() +
  scale_x_continuous(limits = c(0, max(sims))) +
  labs(x = "Total runtime (s)")

Created on 2019-11-28 by the reprex package (v0.3.0)

  • "rando": randomly shuffle the task list
  • "alt1": sort tasks by cost and alternate assigning to worker 1 to X, X to 1, etc.
  • "alt2": based on ivan100sic's answer below, after the first allocation to workers 1 to X, allocate based on total cost per worker so far

Viewing all articles
Browse latest Browse all 206473

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>