Plotting event density in Python with ggplot and p

Posted 2019-07-11 06:01

Question:

I am trying to visualize data of this form:

  timestamp               senderId
0     735217  106758968942084595234
1     735217  114647222927547413607
2     735217  106758968942084595234
3     735217  106758968942084595234
4     735217  114647222927547413607
5     etc...

geom_density works if I don't separate the senderIds:

import pandas as pd
from ggplot import *

# Density of all events over time, ignoring senderId
df = pd.read_pickle('data.pkl')
df.columns = ['timestamp', 'senderId']
plot = ggplot(aes(x='timestamp'), data=df) + geom_density()
print(plot)

The result looks as expected:

However, if I want to show the senderIds separately, as is done in the docs, it fails:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.

Trying out with a larger dataset (~40K events):

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix

Any idea? There are some answers on SO for those errors but none seems relevant.

This is the kind of graph I want (from ggplot's doc):

Answer 1:

With the smaller dataset:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.

This was because some senderIds had only one row, and a density cannot be estimated from a single point.
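
One way to work around this, as a minimal sketch rather than anything from ggplot itself, is to drop the senders with fewer than two rows before plotting. The column names follow the question; the helper name `drop_sparse_senders` and the sample values are made up for illustration.

```python
import pandas as pd

def drop_sparse_senders(df, min_rows=2):
    """Keep only the rows whose senderId occurs at least min_rows times."""
    # Number of rows per senderId, mapped back onto every row
    counts = df['senderId'].map(df['senderId'].value_counts())
    return df[counts >= min_rows]

# Hypothetical sample: sender 'b' has a single row and is removed
df = pd.DataFrame({
    'timestamp': [735217, 735217, 735218, 735219],
    'senderId': ['a', 'b', 'a', 'a'],
})
filtered = drop_sparse_senders(df)
```

The filtered frame can then be passed to ggplot in place of `df`.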

With the bigger dataset:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix

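This was because some senderIds had multiple rows at exactly the same timestamp. When all of a sender's timestamps are identical, they have zero variance, which is presumably what produces the singular matrix. ggplot does not support this case. I was able to solve it by using finer-grained timestamps.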
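
If finer timestamps are not available, another option is to filter out the problematic senders. The sketch below assumes the failure comes from senders whose timestamps are all identical, and keeps only senders with at least two distinct timestamps. The helper name `drop_constant_senders` and the sample values are hypothetical.

```python
import pandas as pd

def drop_constant_senders(df):
    """Keep only senders that have at least two distinct timestamps."""
    # Count distinct timestamps per senderId
    distinct = df.groupby('senderId')['timestamp'].nunique()
    keep = distinct[distinct >= 2].index
    return df[df['senderId'].isin(keep)]

# Hypothetical sample: every row of sender 'a' has the same timestamp
df = pd.DataFrame({
    'timestamp': [735217, 735217, 735217, 735217, 735218],
    'senderId': ['a', 'a', 'a', 'b', 'b'],
})
filtered = drop_constant_senders(df)
```

This filter also removes senders with a single row, since they have only one distinct timestamp.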