How can I use pandas factorize
with values that exist across two columns?
Specifically, I am trying to convert values that exist in two columns to numeric values, and put the corresponding factorized values into new columns, such that the factorization is consistent with the two input columns 'A' and 'B'.
Existing DataFrame:
A B
0 a b
1 c a
2 d a
3 e c
4 c b
5 b e
6 e f
Desired Output:
A B A_ID B_ID
0 a b 0 4
1 c a 1 0
2 d a 2 0
3 e c 3 1
4 c b 1 4
5 b e 4 3
6 e f 3 5
I am able to use factorize
successfully for one column using:
df['A_ID'] = pd.factorize(df.A)[0]
How could I accomplish this with consistent mappings for values across two columns? Do I need to resort to using a custom lambda
function instead, or is there a way to accomplish this with factorize
?
pd.factorize
, apply
+ pd.Categorical
:
_, b = pd.factorize(df.values.T.reshape(-1, ))
# or df.values.ravel('F'), as suggested by Zero
r = df.apply(lambda x: pd.Categorical(x, b).codes).add_suffix('_ID')
A_ID B_ID
0 0 4
1 1 0
2 2 0
3 3 1
4 1 4
5 4 3
6 3 5
pd.concat([df, r], 1)
A B A_ID B_ID
0 a b 0 4
1 c a 1 0
2 d a 2 0
3 e c 3 1
4 c b 1 4
5 b e 4 3
6 e f 3 5
Let's use apply
, add_suffix
with pd.factorize
and assign
:
f = pd.factorize(df.stack().drop_duplicates().sort_index(level=1))
s1 = pd.Series(f[0], index=f[1])
print(df.assign(**df.apply(lambda x: x.map(s1)).add_suffix('_ID')))
Output:
A B A_ID B_ID
0 a b 0 1
1 c a 2 0
2 d a 3 0
3 e c 4 2
4 c b 2 1
5 b e 1 4
Ouput with updated dataset:
A B A_ID B_ID
0 a b 0 4
1 c a 1 0
2 d a 2 0
3 e c 3 1
4 c b 1 4
5 b e 4 3
6 e f 3 5
Here's one way, if you want to reuse factorize values.
In [2637]: facts = np.unique(np.unique(df[['A', 'B']]), return_index=True)
In [2638]: mapping = dict(zip(*facts))
In [2639]: df.join(df[['A', 'B']].apply(lambda x: x.map(mapping)).add_suffix('_ID'))
Out[2639]:
A B A_ID B_ID
0 a b 0 1
1 c a 2 0
2 d a 3 0
3 e c 4 2
4 c b 2 1
5 b e 1 4
6 e f 4 5
Or, using replace
In [2640]: df.join(df[['A', 'B']].replace(mapping).add_suffix('_ID'))
Out[2640]:
A B A_ID B_ID
0 a b 0 1
1 c a 2 0
2 d a 3 0
3 e c 4 2
4 c b 2 1
5 b e 1 4
6 e f 4 5
And, to preserve your order of values use
In [2]: mapping = dict(zip(*pd.factorize(df['A'].append(df['B']).drop_duplicates())[::-1]))
In [2]: mapping
Out[2666]: {'a': 0, 'b': 4, 'c': 1, 'd': 2, 'e': 3, 'f': 5}
In [3]: df.join(df[['A', 'B']].replace(mapping).add_suffix('_ID'))
Out[3]:
A B A_ID B_ID
0 a b 0 4
1 c a 1 0
2 d a 2 0
3 e c 3 1
4 c b 1 4
5 b e 4 3
6 e f 3 5
Details
In [2641]: facts
Out[2641]:
(array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object),
array([0, 1, 2, 3, 4, 5], dtype=int64))
In [2642]: mapping
Out[2642]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5}