Apologies for the rather lengthy explanation, but I hope you will get the idea.
I'm an R user and I find the tidyverse's data-wrangling capabilities really powerful. Recently I started learning Python, and pandas in particular, to broaden my options for data analysis. Instinctively, I try to do things in pandas the way I used to do them with dplyr.
So my question is whether there is any equivalent of the dplyr dot when using method chaining in pandas.
The example below computes, per group, the minimum of all values greater than the current value in test_df['data'], and then repeats the same computation on the newly created column.
R example:
require(dplyr)
require(purrr)
test_df = data.frame(group = rep(c(1,2,3), each = 3),
                     data = c(1:9))
test_df %>%
  group_by(group) %>%
  mutate(., min_of_max = map_dbl(data, ~data[data > .x] %>% min())) %>%
  mutate(., min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))
Output:
# A tibble: 9 x 4
# Groups:   group [3]
  group  data min_of_max min_of_max_2
  <dbl> <int>      <dbl>        <dbl>
1     1     1          2            3
2     1     2          3          Inf
3     1     3        Inf          Inf
4     2     4          5            6
5     2     5          6          Inf
6     2     6        Inf          Inf
7     3     7          8            9
8     3     8          9          Inf
9     3     9        Inf          Inf
I know that dplyr doesn't even require the dot here, but I included it to make the point of my question clearer.
Doing the same in pandas
Invalid example:
import pandas as pd
import numpy as np
test_df = (
    pd.DataFrame({'A': np.array([1,2,3]*3), 'B': np.array(range(1,10))})
    .sort_values(by = ['A', 'B'])
)
(test_df
    .assign(min_of_max = test_df.apply(
        lambda x: test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(), axis = 1))
    # 'assume_dot_here' stands for the intermediate result of the chain; this is not valid pandas
    .assign(min_of_max2 = 'assume_dot_here'.apply(
        lambda x: test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(), axis = 1)))
In this example, being able to put a dot in the second .assign would be great, but it doesn't work in pandas.
Valid example, which breaks the chain:
test_df = test_df.assign(min_of_max = test_df.apply(
    lambda x: test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(), axis = 1))
test_df = test_df.assign(min_of_max2 = test_df.apply(
    lambda x: test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(), axis = 1))
Output:
   A  B  min_of_max  min_of_max2
0  1  1         4.0          7.0
3  1  4         7.0          NaN
6  1  7         NaN          NaN
1  2  2         5.0          8.0
4  2  5         8.0          NaN
7  2  8         NaN          NaN
2  3  3         6.0          9.0
5  3  6         9.0          NaN
8  3  9         NaN          NaN
So is there any convenient way to refer to the object from the previous part of the chain in the second .assign? Using test_df.apply() in the second .assign takes the initial test_df, which does not yet contain the computed test_df['min_of_max'].
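For illustration, here is a minimal sketch of one way this might be approached. It relies on the fact that DataFrame.assign accepts a callable, which pandas calls with the DataFrame as it stands at that point in the chain, so the lambda argument plays a role similar to the dplyr dot. The helper name min_of_larger_in_group and the lambda argument name d are hypothetical, introduced only for this sketch; the column names and data follow the example above.

```python
import numpy as np
import pandas as pd


def min_of_larger_in_group(df, col, group_col="A"):
    """Hypothetical helper: for each row, return the smallest value of `col`
    in the same group that is strictly greater than the row's own value
    (NaN when no such value exists)."""
    def one_row(row):
        same_group = df[group_col] == row[group_col]
        larger = df[col] > row[col]
        return df.loc[same_group & larger, col].min()

    return df.apply(one_row, axis=1)


test_df = (
    pd.DataFrame({"A": np.array([1, 2, 3] * 3), "B": np.arange(1, 10)})
    .sort_values(by=["A", "B"])
    # `d` is the frame produced by the previous step of the chain
    .assign(min_of_max=lambda d: min_of_larger_in_group(d, "B"))
    # here `d` already contains the freshly computed `min_of_max` column
    .assign(min_of_max2=lambda d: min_of_larger_in_group(d, "min_of_max"))
)
print(test_df)
```

Under these assumptions the chain stays unbroken and the second .assign sees the column created by the first. When a whole step needs the intermediate frame rather than a single new column, DataFrame.pipe can be used in a similar way, for example .pipe(lambda d: ...).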
Sorry for the somewhat unreadable Python code; I'm still figuring out how to write it more clearly.