I am dealing with a big timeseries with one column containing four different sensors and one column containing the mesured values. I need to assign an id to measurements that belong to the same time. The problem is, that the timing of measurements differs slightly for each device, thus i cannot simply group them by timestamp. In a data frame ordered by time, measurements that should be grouped can be identified by sequences of unique device Ids. The problem here is, that at one time 4 devices record a value and another time 3 devices record a value. My data looks like this.
timestamp device measurement
1 2019-08-27 07:29:20.671313 sdr_03 49.868820
2 2019-08-27 07:29:20.932043 sdr_02 54.160831
3 2019-08-27 07:29:21.839312 sdr_03 48.974476
4 2019-08-27 07:29:21.850454 sdr_02 50.808674
5 2019-08-27 08:57:01.990833 sdr_03 50.533058
6 2019-08-27 08:57:02.022798 sdr_04 51.143322
7 2019-08-27 09:16:56.454308 sdr_02 57.447151
8 2019-08-27 09:16:56.482433 sdr_04 50.012745
9 2019-08-27 09:16:56.761776 sdr_01 71.500305
10 2019-08-27 09:16:57.305510 sdr_02 56.851177
11 2019-08-27 09:16:57.333628 sdr_04 60.390141
12 2019-08-27 09:16:57.612972 sdr_01 73.470345
which you can reproduce with this:
my_data<-data.frame(timestamp = c("2019-08-27 07:29:20.671313","2019-08-27 07:29:20.932043","2019-08-27 07:29:21.839312",
"2019-08-27 07:29:21.850454", "2019-08-27 08:57:01.990833","2019-08-27 08:57:02.022798",
"2019-08-27 09:16:56.454308", "2019-08-27 09:16:56.482433", "2019-08-27 09:16:56.761776",
"2019-08-27 09:16:57.305510" ,"2019-08-27 09:16:57.333628", "2019-08-27 09:16:57.612972"),
device=c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03" ,"sdr_04", "sdr_02", "sdr_04" ,"sdr_01", "sdr_02" ,"sdr_04",
"sdr_01"),
measurement=c(49.868820, 54.160831, 48.974476, 50.808674, 50.533058, 51.143322,57.447151,50.012745, 71.500305,56.851177,
60.390141, 73.470345)
)
I need to assign the same value to consecutive rows as long as none of the elements in the previous rows of column device appears again
timestamp device measurement match_id
1 2019-08-27 07:29:20.671313 sdr_03 49.868820 1
2 2019-08-27 07:29:20.932043 sdr_02 54.160831 1
3 2019-08-27 07:29:21.839312 sdr_03 48.974476 2
4 2019-08-27 07:29:21.850454 sdr_02 50.808674 2
5 2019-08-27 08:57:01.990833 sdr_03 50.533058 3
6 2019-08-27 08:57:02.022798 sdr_04 51.143322 3
7 2019-08-27 09:16:56.454308 sdr_02 57.447151 3
8 2019-08-27 09:16:56.482433 sdr_04 50.012745 4
9 2019-08-27 09:16:56.761776 sdr_01 71.500305 4
10 2019-08-27 09:16:57.305510 sdr_02 56.851177 4
11 2019-08-27 09:16:57.333628 sdr_04 60.390141 5
12 2019-08-27 09:16:57.612972 sdr_01 73.470345 5
which you can get from:
my_data<-data.frame(timestamp = c("2019-08-27 07:29:20.671313","2019-08-27 07:29:20.932043","2019-08-27 07:29:21.839312",
"2019-08-27 07:29:21.850454", "2019-08-27 08:57:01.990833","2019-08-27 08:57:02.022798",
"2019-08-27 09:16:56.454308", "2019-08-27 09:16:56.482433", "2019-08-27 09:16:56.761776",
"2019-08-27 09:16:57.305510" ,"2019-08-27 09:16:57.333628", "2019-08-27 09:16:57.612972"),
device=c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03" ,"sdr_04", "sdr_02", "sdr_04" ,"sdr_01", "sdr_02" ,"sdr_04",
"sdr_01"),
measurement=c(49.868820, 54.160831, 48.974476, 50.808674, 50.533058, 51.143322,57.447151,50.012745, 71.500305,56.851177,
60.390141, 73.470345),match_id=c(1,1,2,2,3,3,3,4,4,4,5,5) )
I have been searching for answers for three days now. Any help is very much appreciated.
Allan Camerons dplyr solution results in match ids that reappear later in the data frame- see lines 1,2,6,9. There may be less than 4 devices recording at one time, thus solutions that always expect the same number of recording devices for each measurement won't work.
# A tibble: 12 x 4
# Groups: device [4]
timestamp device measurement new_id
<dttm> <fct> <dbl> <int>
1 2019-08-27 07:29:20.671313 sdr_03 49.9 1
2 2019-08-27 07:29:20.932043 sdr_02 54.2 1
3 2019-08-27 07:29:21.839312 sdr_03 49.0 2
4 2019-08-27 07:29:21.850454 sdr_02 50.8 2
5 2019-08-27 08:57:01.990833 sdr_03 50.5 3
6 2019-08-27 08:57:02.022798 sdr_04 51.1 1
7 2019-08-27 09:16:56.454308 sdr_02 57.4 3
8 2019-08-27 09:16:56.482433 sdr_04 50.0 2
9 2019-08-27 09:16:56.761775 sdr_01 71.5 1
10 2019-08-27 09:16:57.305510 sdr_02 56.9 4
11 2019-08-27 09:16:57.333627 sdr_04 60.4 3
12 2019-08-27 09:16:57.612972 sdr_01 73.5 2
While Sotos solution results in more consecutive match ids than unique devices exist. E.g. lines 5-9
# A tibble: 12 x 4
timestamp device measurement new_id
<chr> <fct> <dbl> <int>
1 2019-08-27 07:29:20 sdr_03 49.9 1
2 2019-08-27 07:29:20 sdr_02 54.2 1
3 2019-08-27 07:29:21 sdr_03 49.0 2
4 2019-08-27 07:29:21 sdr_02 50.8 2
5 2019-08-27 08:57:01 sdr_03 50.5 3
6 2019-08-27 08:57:02 sdr_04 51.1 3
7 2019-08-27 09:16:56 sdr_02 57.4 3
8 2019-08-27 09:16:56 sdr_04 50.0 3
9 2019-08-27 09:16:56 sdr_01 71.5 3
10 2019-08-27 09:16:57 sdr_02 56.9 4
11 2019-08-27 09:16:57 sdr_04 60.4 4
12 2019-08-27 09:16:57 sdr_01 73.5 4
Both solutions work great (thanks!) if timediffs between measurements are >0.7 sec or 4 devices recorded at the same time. Sadly, most of the time this is not the case. I think, a solution that ignores timestamps and rather checks for duplicates in consecutive rows might be better. I found many solutions for repeated values using rle() or data.table, but no solution to identify sequences of unique values. Please help me out here!