Base scenario
For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient to create this mapping, like so:
ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
{'user_id': 1, 'item_id': 3, 'rating': 1.0},
{'user_id': 3, 'item_id': 1, 'rating': 1.0},
{'user_id': 3, 'item_id': 3, 'rating': 1.0}]
df = pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])
df = df.set_index(['user_id', 'item_id'])
df
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
This then lets me get the continuous maps like so:
df.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1], dtype='int8')
df.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1], dtype='int8')
Afterwards, I can map them back using the df.index.levels[0].get_loc method. Great!
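The round trip between raw IDs and integer positions can be sketched as follows. This is only a minimal sketch and assumes a recent pandas, where the attribute is called `codes` (older releases, as used in this question, call it `labels`). Note the direction of each lookup: `get_loc` goes from raw ID to position, while indexing into the level goes from position back to raw ID.

```python
import pandas as pd

ratings = [
    {'user_id': 1, 'item_id': 1, 'rating': 1.0},
    {'user_id': 1, 'item_id': 3, 'rating': 1.0},
    {'user_id': 3, 'item_id': 1, 'rating': 1.0},
    {'user_id': 3, 'item_id': 3, 'rating': 1.0},
]
df = pd.DataFrame(ratings).set_index(['user_id', 'item_id'])

user_level = df.index.levels[0]   # sorted unique user IDs
user_codes = df.index.codes[0]    # integer position of each row's user

# Raw ID -> integer position
pos = user_level.get_loc(3)       # 1

# Integer position -> raw ID
raw = user_level[pos]             # 3

# Recover the raw user ID of every row from its code
raw_users = user_level.take(user_codes).tolist()   # [1, 1, 3, 3]
```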
Extension
But, now I'm trying to streamline my model training process, ideally by training it incrementally on new data, preserving the old ID mappings. Something like:
new_ratings = [{'user_id': 2, 'item_id': 1, 'rating': 1.0},
{'user_id': 2, 'item_id': 2, 'rating': 1.0}]
df2 = pd.DataFrame(new_ratings, columns=['user_id', 'item_id', 'rating'])
df2 = df2.set_index(['user_id', 'item_id'])
df2
Out:
rating
user_id item_id
2 1 1.0
2 2 1.0
Then, simply appending the new ratings to the old DataFrame:
df3 = df.append(df2)
df3
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
2 1 1.0
2 2 1.0
Looks good, but
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 2, 2, 1, 1], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 2, 0, 2, 0, 1], dtype='int8')
I added user_id=2 and item_id=2 in the later DataFrame on purpose, to illustrate where it goes wrong for me. In df3, ID 3 (for both user and item) has moved from integer position 1 to 2, so the mapping is no longer the same. What I'm looking for is [0, 0, 1, 1, 2, 2] and [0, 1, 0, 1, 0, 2] for the user and item mappings respectively.
This is probably because of ordering in pandas Index objects, and I'm unsure whether what I want is possible at all using a MultiIndex strategy. I'm looking for help on how to tackle this problem most effectively :)
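The shift can be reproduced in a few lines. A minimal sketch, assuming a recent pandas (`codes` rather than `labels`) and that the combined index is rebuilt with `set_index`: each level stores its unique values sorted, so adding ID 2 pushes ID 3 from position 1 to position 2.

```python
import pandas as pd

old = pd.DataFrame({'user_id': [1, 1, 3, 3], 'item_id': [1, 3, 1, 3], 'rating': 1.0})
new = pd.DataFrame({'user_id': [2, 2], 'item_id': [1, 2], 'rating': 1.0})

# Stack the flat frames, then build the MultiIndex from the combined columns
combined = pd.concat([old, new], ignore_index=True).set_index(['user_id', 'item_id'])

user_levels = combined.index.levels[0].tolist()   # [1, 2, 3]  (sorted)
user_codes = combined.index.codes[0].tolist()     # [0, 0, 2, 2, 1, 1]
item_codes = combined.index.codes[1].tolist()     # [0, 2, 0, 2, 0, 1]
```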
Some notes:
- I find using DataFrames convenient for several reasons, but I use the MultiIndex purely for the ID mappings. Alternatives without MultiIndex are completely acceptable.
- I cannot guarantee that new user_id and item_id entries in new ratings are larger than any values in the old dataset, hence my example of adding id 2 when [1, 3] were present.
- For my incremental training approach, I will need to store my ID maps somewhere. If I only load new ratings partially, I will have to store the old DataFrame and ID maps somewhere. Would be great if it could all be in one place, like it would be with an index, but columns work too.
- EDIT: An additional requirement is to allow for row re-ordering of the original DataFrame, as might happen when duplicate ratings exist, and I want to keep the most recent one.
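The last requirement (keeping only the most recent of several duplicate ratings) might look like the sketch below. It assumes rows are appended in chronological order, so that "most recent" means "last occurrence"; the updated rating of 5.0 is made up purely for illustration.

```python
import pandas as pd

old = pd.DataFrame({'user_id': [1, 1, 3, 3], 'item_id': [1, 3, 1, 3],
                    'rating': [1.0, 1.0, 1.0, 1.0]})
# User 1 rates item 1 again (hypothetical value), and a new pair (2, 2) arrives
new = pd.DataFrame({'user_id': [1, 2], 'item_id': [1, 2], 'rating': [5.0, 1.0]})

all_ratings = pd.concat([old, new], ignore_index=True)

# Keep the last occurrence of each (user_id, item_id) pair
latest = all_ratings.drop_duplicates(subset=['user_id', 'item_id'], keep='last')

pairs = latest[['user_id', 'item_id']].values.tolist()
# [[1, 3], [3, 1], [3, 3], [1, 1], [2, 2]]
```

The pair (1, 1) now sits after the rows for user 3, which is exactly the kind of row re-ordering that an ID mapping has to survive.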
Solution (credits to @jpp for original)
I've made a modification to @jpp's answer to satisfy the additional requirement I've added later (tagged with EDIT). This also truly satisfies the original question as posed in the title, since it preserves the old index integer positions, regardless of rows being reordered for whatever reason. I've also wrapped things into functions:
from itertools import chain

import pandas as pd
from toolz import unique

def expand_index(source, target, index_cols=['user_id', 'item_id']):
    # Elevate index to series, keeping source with index
    temp = source.reset_index()
    target = target.reset_index()
    # Convert columns to categorical, using the source index and target columns
    for col in index_cols:
        i = source.index.names.index(col)
        # Old level values first, so existing IDs keep their integer positions
        col_cats = list(unique(chain(source.index.levels[i], target[col])))
        temp[col] = pd.Categorical(temp[col], categories=col_cats)
        target[col] = pd.Categorical(target[col], categories=col_cats)
    # Convert series back to index
    source = temp.set_index(index_cols)
    target = target.set_index(index_cols)
    return source, target

def concat_expand_index(old, new):
    old, new = expand_index(old, new)
    return pd.concat([old, new])

df3 = concat_expand_index(df, df2)
The result:
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1, 2, 2], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1, 0, 2], dtype='int8')
I think the use of MultiIndex overcomplicates this objective, which is simply to map user and item IDs to a continuous range of integers starting at 0.
This solution falls into the category the question explicitly allows: alternatives without MultiIndex.
Explanation
This is how to maintain a mapping for the user_id values. The same holds for the item_id values as well.
Start with the initial (unique) user_id values. user_map maintains a mapping for the user_id values, as per your requirement. Next, take the new user_id values you got from df2, that is, the ones that you didn't see in df. Now update user_map for the total user base with the new users. Then, just map the values from user_map to df['user_id'].
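The code for these steps did not survive in the text above, so the following is only a sketch of how they could look. The dictionary-based `user_map` and the helper name `extend_map` are assumptions for illustration, not necessarily what the original answer used.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 3, 3], 'item_id': [1, 3, 1, 3], 'rating': 1.0})
df2 = pd.DataFrame({'user_id': [2, 2], 'item_id': [1, 2], 'rating': 1.0})


def extend_map(id_map, ids):
    """Give every ID not yet in id_map the next free integer (hypothetical helper)."""
    for raw_id in ids:
        if raw_id not in id_map:
            id_map[raw_id] = len(id_map)
    return id_map


# Initial mapping from the first batch, in order of first appearance
user_map = extend_map({}, df['user_id'].tolist())       # {1: 0, 3: 1}

# New users from df2 get the next integers; existing entries are untouched
user_map = extend_map(user_map, df2['user_id'].tolist())  # {1: 0, 3: 1, 2: 2}

# Map raw IDs to their stable integer IDs
combined = pd.concat([df, df2], ignore_index=True)
combined['user_idx'] = combined['user_id'].map(user_map)
user_idx = combined['user_idx'].tolist()                 # [0, 0, 1, 1, 2, 2]
```

Because `user_map` is a plain dict, it is independent of row order and can be stored next to the ratings between training runs.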
Forcing alignment of index labels after concatenation does not appear straightforward and, if there is a solution, it is poorly documented.
One option which may appeal to you is Categorical Data. With some careful manipulation, this can achieve the same purpose: each unique index value within a level has a one-to-one mapping to an integer, and this mapping persists even after concatenation with other dataframes.
I use toolz.unique to return an ordered unique list, but if you don't have access to this library, you can use the identical unique_everseen recipe from the itertools docs.
Now let's have a look at the category codes underlying the 0th index level, then perform our concatenation, and finally check that the categorical codes are aligned.
For each index level, note that we must take the union of all index values across dataframes to form col_cats; otherwise the concatenation will fail.
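Since the code for this approach is not shown above, here is a minimal sketch of the idea under a few assumptions: the categoricals are kept as ordinary columns rather than index levels, `dict.fromkeys` stands in for `toolz.unique` as the order-preserving de-duplication, and the helper name `shared_categories` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 3, 3], 'item_id': [1, 3, 1, 3], 'rating': 1.0})
df2 = pd.DataFrame({'user_id': [2, 2], 'item_id': [1, 2], 'rating': 1.0})


def shared_categories(old_values, new_values):
    """Ordered union: old values keep their positions, new ones are appended."""
    return list(dict.fromkeys([*old_values, *new_values]))


for col in ['user_id', 'item_id']:
    cats = shared_categories(df[col].tolist(), df2[col].tolist())
    # Both frames must share the same categories, in the same order
    df[col] = pd.Categorical(df[col], categories=cats)
    df2[col] = pd.Categorical(df2[col], categories=cats)

df3 = pd.concat([df, df2], ignore_index=True)

user_codes = df3['user_id'].cat.codes.tolist()   # [0, 0, 1, 1, 2, 2]
item_codes = df3['item_id'].cat.codes.tolist()   # [0, 1, 0, 1, 0, 2]
```

For a later batch, the existing `df3[col].cat.categories` could be passed as the old values so that earlier codes stay fixed.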