Channel: Active questions tagged r - Stack Overflow

Randomly split dataset but keep condition ratio 50/50 in each new dataset

My dataset looks like this (SubjectID is actually the second column and the condition, Conditie, the third). The condition is coded as 0 or 1. This is not the whole data frame, so only rows with condition 0 are shown here.

   SubjectID Conditie      V3     V4      V5      V6      V7      V8     V9     V10   V11    V12     V13     V14     V15    V16     V17
1          1        0  -20.44  -1.40  -10.93   86.06  -23.92  -0.359    6.1  75.216  44.4   83.0    4.18   24.01  -13.29  80.84   -5.85
2          2        0   49.20 -35.96    7.96   -6.47   46.89 -22.181   27.8 -16.873 -61.5    3.8   32.54  -15.80   17.19   2.32  -10.71
3          3        0  -38.00  19.68   31.21 -114.47  -43.31  66.569  -45.3 -86.840 102.4  253.7   -3.12    2.81  -21.25 -69.42  -40.64
4          4        0   51.88  52.34  -83.92  157.60   98.02  99.576   20.7 157.324 135.0  104.9  113.69   18.17  -18.41 137.90  -52.75
5          5        0  -25.46  30.87  -30.35   61.04   -1.29 175.212  -80.7 101.502  46.4  183.0  -35.98    9.91  -35.62  79.46  -66.25
6          6        0    2.85  18.49   41.63   99.97   22.35  64.988   35.4 122.737 113.8   89.8   -5.36   70.16  -32.12 140.91    1.45
7          7        0   21.13 -29.08  -34.97  -74.16   13.41 -63.383   15.3 -58.425 -59.0   58.9  -71.47 -105.64 -118.21 -64.33  -48.09
8          8        0  113.27 543.65  615.94   14.38  145.73 854.745 -140.9 851.710 725.8 -722.2 -221.21  652.29 -378.17 824.00  -54.44
9          9        0 -150.88 101.24 -199.41   -7.63 -130.06 117.425 -162.4 179.808  55.9 -200.9    1.12 -114.71 -231.54  17.47 -253.13
10        10        0 -179.69 272.76  174.75  -15.12 -162.90 207.947  186.1  94.898 268.6  634.0   29.91  -62.72  192.38 252.75  -92.70
11        11        0  417.49 101.05   11.69  -70.23  147.65 -19.403  536.7 285.809 283.5 -284.3  116.10  -68.84  214.01 181.62   56.99
12        12        0  -12.03   2.69   22.07  -39.80  -14.13   0.240   28.0 -24.242  20.2  123.7   14.48  -12.79   17.38  58.10  -38.29
13        13        0  -48.99  51.37  -48.54   82.99  -77.09  56.406  -39.6 113.484 -34.4   51.0  -39.91   -6.11   -7.92  32.38   25.54
14        14        0  -27.10  71.17  -32.10  102.32    6.53 216.710   75.1 138.506 159.0  -52.0   40.55   47.02  -28.68 164.09   43.74
15        15        0   85.85 124.12   85.09  -49.86   88.62 151.829   95.2 -54.738  34.9  -36.7  157.22 -147.66  102.82 -40.71  134.96
16        16        0  -56.60  -8.96 -111.23   16.75  -26.90 -46.913 -102.7   0.403  26.8   10.5  -26.29   44.60 -129.68  13.74  -83.49
17        17        0  -26.80 107.27  128.63  130.91  -31.72 105.698  173.4  82.380  55.7   19.5  299.54   66.69   -5.14 216.11   15.88
18        18        0  -38.43  25.52   26.88   -5.98  -21.63  11.358   42.1 -30.672 248.2  234.6  -62.65  -17.48  -76.02   7.84   -4.32
19        19        0   16.06   2.43   72.27   49.17    4.53  28.257   33.6  76.263  59.5  -46.1   31.77   54.86   60.13  51.81   36.70
20        20        0   18.79  91.35  231.11   64.18  -37.53   6.920  165.2  56.826  76.5   58.5  204.40  181.60  181.48 -85.98   55.63

I am trying to create a training and a validation dataset for a glmnet model and its predictions. Random sampling with respect to SubjectID is fine, but I want both datasets to have a 50/50 ratio of the two conditions (0 and 1). The code I used (see below) doesn't satisfy my needs, because it samples rows without taking the condition into account.

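set.seed(123)  # make the random split reproducible
# Draw 75% of the row indices at random, ignoring the condition column
train <- sample(nrow(classData), floor(0.75 * nrow(classData)), replace = FALSE)
TrainSet <- classData[train, ]   # sampled rows go to the training set
ValidSet <- classData[-train, ]  # all remaining rows go to the validation set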

I would like to hear your recommendations on this! Thanks in advance for your time.
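
For illustration, here is a minimal sketch of one way such a stratified split might be done in base R: the row indices are split by Conditie and the same fraction is sampled within each condition. The toy classData below (40 subjects, 20 per condition, a single V3 column) and the names train_frac and rows_by_condition are assumptions made up for this example, not the real data.

```r
# Toy stand-in for classData (assumption): 40 subjects, 20 per condition.
set.seed(123)
classData <- data.frame(
  SubjectID = 1:40,
  Conditie  = rep(c(0, 1), each = 20),
  V3        = rnorm(40)
)

train_frac <- 0.75  # hypothetical name for the training fraction

# Group the row indices by condition, then sample the same fraction
# of rows within each group.
rows_by_condition <- split(seq_len(nrow(classData)), classData$Conditie)
train <- unlist(lapply(rows_by_condition, function(idx) {
  # Index into idx with sample.int() so a group of size 1 is handled safely.
  idx[sample.int(length(idx), floor(train_frac * length(idx)))]
}), use.names = FALSE)

TrainSet <- classData[train, ]
ValidSet <- classData[-train, ]

# Check the condition counts in each set.
table(TrainSet$Conditie)
table(ValidSet$Conditie)
```

With the toy data, each condition contributes 15 rows to TrainSet and 5 rows to ValidSet. Note that this approach keeps whatever ratio the full data has, so both sets are 50/50 only if classData itself is balanced. If it is not, sampling the same fixed number of rows per condition, rather than the same fraction, would be one way to force the ratio.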

