Channel: Active questions tagged r - Stack Overflow

Pandas equivalent to dplyr dot


Sorry for the rather lengthy explanation, but I hope you will get the idea.

I'm an R user, and I find the tidyverse's data-wrangling capabilities really powerful. Recently I started learning Python, and pandas in particular, to broaden my options for data analysis. Instinctively, I try to do things in pandas the way I used to do them with dplyr.

So my question is whether there is any equivalent to the dplyr dot when using method chaining in pandas.

The example here computes, for each row, the minimum of all values in test_df['data'] within the same group that are greater than the current value, and then repeats the same computation on the resulting new column.
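To pin down what is being computed, independent of dplyr or pandas, here is a minimal plain-Python sketch. The helper name min_of_greater is made up for illustration, and the three lists stand in for the three groups of the example data; infinity is used when no greater value exists.

```python
import math


def min_of_greater(values):
    """For each value, return the smallest value in the same list that is
    strictly greater than it, or infinity if there is none."""
    return [
        min((v for v in values if v > x), default=math.inf)
        for x in values
    ]


# Hypothetical stand-in for the grouped data: group key -> values in that group.
groups = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}

for key, data in groups.items():
    first = min_of_greater(data)     # plays the role of min_of_max
    second = min_of_greater(first)   # the same computation on the new column
    print(key, data, first, second)
```

The second call takes the result of the first as its input, which is exactly the dependency that makes the chained version tricky.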

R example:

library(dplyr)
library(purrr)

test_df = data.frame(group = rep(c(1, 2, 3), each = 3),
                     data = c(1:9))

test_df %>%
  group_by(group) %>%
  # smallest value in the group that is greater than the current value
  mutate(., min_of_max = map_dbl(data, ~data[data > .x] %>% min())) %>%
  # the same computation, applied to the new column
  mutate(., min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))

Output:

# A tibble: 9 x 4
# Groups:   group [3]
  group  data min_of_max min_of_max_2
  <dbl> <int>      <dbl>        <dbl>
1     1     1          2            3
2     1     2          3          Inf
3     1     3        Inf          Inf
4     2     4          5            6
5     2     5          6          Inf
6     2     6        Inf          Inf
7     3     7          8            9
8     3     8          9          Inf
9     3     9        Inf          Inf

I know that dplyr doesn't even require the dot here, but I included it to make the point of my question clearer.

Doing the same in pandas

Invalid Example:

import pandas as pd
import numpy as np

test_df = (
    pd.DataFrame({'A': np.array([1, 2, 3] * 3), 'B': np.array(range(1, 10))})
    .sort_values(by=['A', 'B'])
)

(
    test_df
    .assign(min_of_max=test_df.apply(
        lambda x: test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(),
        axis=1))
    # 'assume_dot_here' is a placeholder for the result of the previous .assign
    .assign(min_of_max2='assume_dot_here'.apply(
        lambda x: test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(),
        axis=1))
)

In this example, being able to put a dot in the second .assign would be great, but that doesn't work in pandas.

Valid example, which breaks the chain:

# First step: smallest B in the same group A that is greater than the row's B
test_df = test_df.assign(min_of_max=test_df.apply(
    lambda x: test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(),
    axis=1))

# Second step: same computation on min_of_max; relies on the reassigned test_df
test_df = test_df.assign(min_of_max2=test_df.apply(
    lambda x: test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(),
    axis=1))

Output:

   A  B  min_of_max  min_of_max2
0  1  1         4.0          7.0
3  1  4         7.0          NaN
6  1  7         NaN          NaN
1  2  2         5.0          8.0
4  2  5         8.0          NaN
7  2  8         NaN          NaN
2  3  3         6.0          9.0
5  3  6         9.0          NaN
8  3  9         NaN          NaN

So is there a convenient way to refer to the object from the previous step of the chain inside the second .assign? Calling test_df.apply() in the second .assign operates on the initial test_df, which does not yet have the computed test_df['min_of_max'] column.
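One direction that may be relevant here, offered as a sketch rather than a definitive answer: DataFrame.assign accepts a callable as a value, and pandas calls it with the DataFrame as it stands at that point in the chain. The lambda's argument then plays a role similar to the dplyr dot. The helper name min_of_greater is made up for illustration, and the group mask is written as df["A"] == row["A"], which I assume is what the examples above intend.

```python
import numpy as np
import pandas as pd


def min_of_greater(df, col):
    """Row-wise: smallest value of `col` within the same group `A` that is
    strictly greater than the row's own value (NaN if there is none)."""
    return df.apply(
        lambda row: df.loc[(df[col] > row[col]) & (df["A"] == row["A"]), col].min(),
        axis=1,
    )


result = (
    pd.DataFrame({"A": np.array([1, 2, 3] * 3), "B": np.arange(1, 10)})
    .sort_values(by=["A", "B"])
    # `df` is the sorted frame produced by the previous step of the chain
    .assign(min_of_max=lambda df: min_of_greater(df, "B"))
    # here `df` already contains the freshly computed min_of_max column
    .assign(min_of_max2=lambda df: min_of_greater(df, "min_of_max"))
)
print(result)
```

Because each lambda receives the intermediate result, the chain never has to refer to test_df by name. DataFrame.pipe is a related option when a whole step is easier to write as a function that takes the DataFrame.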

Sorry for the somewhat unreadable Python code; I'm still figuring out how to write it more clearly.

