Suppose we have a simple DataFrame:
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']
How can I count how many keywords contain each number of words, with output similar to:
1 word: 2
2 words: 2
3 words: 1
4 words: 1
If I understand correctly, you can do the following:
In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[89]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
Here we use the vectorised str.split to split each string on whitespace, then apply len to count the elements of each resulting list, and finally call value_counts to aggregate the frequency of each word count. We then rename the index and sort it to get the desired output.
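To make the intermediate results visible, here is a minimal sketch that runs the same pipeline one step at a time on the question's data. The variable names (words, words_per_row, freq) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace -> one list of words per row
words = df['fruits'].str.split()

# Step 2: length of each list -> number of words per row
words_per_row = words.apply(len)

# Step 3: count how many rows share each word count, ordered by word count
freq = words_per_row.value_counts().sort_index()

print(words_per_row.tolist())  # [2, 1, 3, 4, 2, 1]
print(freq.to_dict())          # {1: 2, 2: 2, 3: 1, 4: 1}
```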
UPDATE
This can also be done using str.len rather than apply, which should scale better:
In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[41]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
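One thing to watch with the relabel-then-sort order used here: once the index holds strings, sort_index orders them lexicographically, so a label such as '10 words:' would land before '2 words:'. The following is a minimal sketch, using made-up data with one ten-word phrase, that sorts while the index is still numeric and only then builds the labels.

```python
import pandas as pd

# Hypothetical data: one 10-word phrase pushes a label into two digits
s = pd.Series(['a b', 'a', 'a b c d e f g h i j', 'a b'], name='fruits')

n_words = s.str.split().str.len()

# Sort while the index is still numeric, then build the text labels
count = n_words.value_counts().sort_index()
count.index = count.index.astype(str) + ' words:'

labels = count.index.tolist()
print(labels)          # ['1 words:', '2 words:', '10 words:']
print(sorted(labels))  # ['1 words:', '10 words:', '2 words:'] (string order)
```

With at most nine words per keyword, as in the question, both orders coincide.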
Timings (the second statement omits value_counts, so it does slightly less work):
In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop
For a DataFrame with 6,000 rows:
In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
You could use str.count with a space ' ' as the delimiter and add 1, since a string containing n single spaces has n + 1 words.
In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)
In [1717]: count.index = count.index.astype('str') + ' words:'
In [1718]: count
Out[1718]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
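Counting spaces assumes words are separated by exactly one space, with no leading or trailing whitespace. The sketch below uses hypothetical messy input to show where the two approaches diverge. Counting runs of non-whitespace with the regex \S+ is one possible workaround, not part of the original answer.

```python
import pandas as pd

# Hypothetical messy input: doubled, leading and trailing spaces
s = pd.Series(['one  apple', ' banana', 'box of oranges '])

by_spaces = s.str.count(' ').add(1)   # counts delimiters, not words
by_tokens = s.str.count(r'\S+')       # counts runs of non-whitespace
by_split = s.str.split().str.len()    # whitespace split, for comparison

print(by_spaces.tolist())  # [3, 2, 4]
print(by_tokens.tolist())  # [2, 1, 3]
print(by_split.tolist())   # [2, 1, 3]
```

On clean data such as the question's, all three give the same counts.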
Timings
str.count is marginally faster:
Small
In [1724]: df.shape
Out[1724]: (6, 1)
In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop
In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop
Medium
In [1728]: df.shape
Out[1728]: (6000, 1)
In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop
In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop
Large
In [1732]: df.shape
Out[1732]: (60000, 1)
In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop
In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
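These numbers depend on the machine and the pandas version, so it may be worth re-running them. Here is a rough sketch using the standard timeit module outside IPython. It assumes the larger frames were built by repeating the six original rows, which the answer does not state, and the names big, via_count and via_split are illustrative.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Assumption: scale up by repeating the six rows (6 * 1000 = 6000 rows)
big = pd.concat([df] * 1000, ignore_index=True)

def via_count():
    return big['fruits'].str.count(' ').add(1).value_counts(sort=False)

def via_split():
    return big['fruits'].str.split().apply(len).value_counts()

# Check that both approaches agree before comparing their speed
result_count = via_count().sort_index().to_dict()
result_split = via_split().sort_index().to_dict()
assert result_count == result_split

for fn in (via_count, via_split):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```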