I am trying to make a frequency table based on a dataframe with pandas and Python. In fact, it's exactly the same as a previous question of mine, which used R.
Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):
node     | precedingWord
-------------------------
A-bom      de
A-bom      die
A-bom      de
A-bom      een
A-bom      n
A-bom      de
acroniem   het
acroniem   t
acroniem   het
acroniem   n
acroniem   een
act        de
act        het
act        die
act        dat
act        t
act        n
I'd like to use these values to make a count of the precedingWords per node, but with subcategories. For instance: one column titled neuter, another non-neuter, and a last one rest. The neuter column would count all rows for which precedingWord is one of these values: t, het, dat. The non-neuter column would count de and die, and rest would count everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words that rest is derived by inverting whatever is used for neuter and non-neuter, or by simply subtracting the neuter and non-neuter counts from the number of rows with that node.)
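To make the "dynamic rest" idea concrete, here is a minimal sketch. It assumes a dataframe df with the columns node and precedingWord; the names NEUTER, NON_NEUTER, is_neuter, is_non_neuter and is_rest are made up for illustration. The rest mask is simply the negation of the other two, so it never has to be listed by hand.

```python
import pandas as pd

# Shortened sample of the data shown above
df = pd.DataFrame({
    'node': ['A-bom', 'A-bom', 'A-bom', 'acroniem', 'acroniem', 'act'],
    'precedingWord': ['de', 'die', 'een', 'het', 'n', 'dat'],
})

# Hypothetical word lists for the two explicit categories
NEUTER = ['t', 'het', 'dat']
NON_NEUTER = ['de', 'die']

# Boolean masks: True where the preceding word belongs to the category
is_neuter = df['precedingWord'].isin(NEUTER)
is_non_neuter = df['precedingWord'].isin(NON_NEUTER)

# rest is whatever is in neither list, derived from the two masks above
is_rest = ~(is_neuter | is_non_neuter)

print(df[is_rest])
```

Changing either list automatically changes which rows end up in rest.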
Example output (in a new dataframe, let's say freqDf) would look like this:
node     | neuter | nonNeuter | rest
-----------------------------------------
A-bom      0        4           2
acroniem   3        0           2
act        3        2           1
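For illustration, one way a table of this shape could be produced is by mapping each word to a category and cross-tabulating with pd.crosstab. This is only a sketch under assumptions: the CATEGORIES mapping and the category variable are hypothetical names, and the data is the sample shown earlier.

```python
import pandas as pd

# The sample rows from the question
rows = [
    ('A-bom', 'de'), ('A-bom', 'die'), ('A-bom', 'de'),
    ('A-bom', 'een'), ('A-bom', 'n'), ('A-bom', 'de'),
    ('acroniem', 'het'), ('acroniem', 't'), ('acroniem', 'het'),
    ('acroniem', 'n'), ('acroniem', 'een'),
    ('act', 'de'), ('act', 'het'), ('act', 'die'),
    ('act', 'dat'), ('act', 't'), ('act', 'n'),
]
df = pd.DataFrame(rows, columns=['node', 'precedingWord'])

# Hypothetical mapping from each known word to its category
CATEGORIES = {
    't': 'neuter', 'het': 'neuter', 'dat': 'neuter',
    'de': 'nonNeuter', 'die': 'nonNeuter',
}

# Words missing from the mapping become NaN and are then labelled 'rest'
category = df['precedingWord'].map(CATEGORIES).fillna('rest')

# Count rows per node and category; reindex fixes the column order and
# keeps a category as a zero column even if it never occurs in the data
freqDf = (
    pd.crosstab(df['node'], category)
      .reindex(columns=['neuter', 'nonNeuter', 'rest'], fill_value=0)
)

print(freqDf)
```

Each node ends up as a single row because crosstab counts over all rows that share the same node value.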
I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. In my case, however, there are multiple rows with the same node, and these should all be collapsed into a single row of frequencies per node, as shown in the expected output above.
I thought something like this (untested):
import pandas as pd

def specificFreq(d):
    # d contains all rows of df that share one node
    neuter = d['precedingWord'].isin(['t', 'het', 'dat']).sum()
    nonNeuter = d['precedingWord'].isin(['de', 'die']).sum()
    return pd.Series({'neuter': neuter,
                      'nonNeuter': nonNeuter,
                      'rest': len(d) - neuter - nonNeuter})  # Number of rows for this node, minus the neuter and non-neuter counts above

freqDf = df.groupby('node').apply(specificFreq)
But I highly doubt this is the correct way of doing something like this.