I'm new to big data and the Hadoop ecosystem. I'm trying to find the median with MapReduce. As far as I understand it, every mapper emits its data under one key, so everything goes to a single reducer, and that reducer sorts the values and finds the middle one with R's median() function.
R runs in memory, so what happens if the data is too big to fit in one reducer, which runs on a single machine?
Here is an example of my code to find the median with RHadoop:
map <- function(k, v) {
  # emit every value under the same key, so one reducer receives all of them
  keyval("median", v)
}

reduce <- function(k, v) {
  # v holds all values for the key; median() needs them in memory
  keyval(k, median(v))
}

medianMR <- mapreduce(
  input = random, output = "/tmp/ex3",
  map = map, reduce = reduce
)
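To make my concern concrete, here is a toy simulation of the same single-key pattern in plain Python (not RHadoop or Hadoop, just a sketch of the map/shuffle/reduce flow): because every mapper emits the same key, the shuffle delivers the entire dataset to one reducer, which then computes the median in memory.

```python
# Toy in-process simulation of the single-key MapReduce median above.
# All names (map_fn, reduce_fn, run_job) are made up for illustration.
import statistics

def map_fn(key, value):
    # one shared key => the shuffle routes everything to one reducer
    return [("median", value)]

def reduce_fn(key, values):
    # the whole dataset sits in `values` here -- this is the bottleneck
    return (key, statistics.median(values))

def run_job(records):
    # shuffle phase: group mapper output by key
    groups = {}
    for k, v in records:
        for mk, mv in map_fn(k, v):
            groups.setdefault(mk, []).append(mv)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(run_job(enumerate([5, 1, 9, 3, 7])))  # [('median', 5)]
```

So my question is what the right approach is when `values` in the reduce step no longer fits on one machine.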