Add one row to a pandas.DataFrame

Posted 2018-12-31 09:01

I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?

I successfully created an empty DataFrame with:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works but seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to my DataFrame when the columns have different types?

Tags: python pandas
18 answers
与风俱净
#2 · 2018-12-31 09:41

You can also build up a list of lists and convert it to a DataFrame at the end:

import pandas as pd

rows = []
columns = ['i','double','square']

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

giving

   i  double  square
0  0       0       0
1  1       2       1
2  2       4       4
3  3       6       9
4  4       8      16
5  5      10      25
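
The same pattern also covers the mixed-type case from the question. Here is a minimal sketch, assuming the question's column names ('lib', 'qty1', 'qty2'); the row values are made up for illustration:

```python
import pandas as pd

# Column names come from the question; the values below are invented.
columns = ['lib', 'qty1', 'qty2']

rows = []
for i in range(3):
    # Each row mixes a string with two floats.
    rows.append([f'name{i}', float(i), i * 1.5])

# Build the frame once, after the loop; pandas infers one dtype per column.
res = pd.DataFrame(rows, columns=columns)
print(res.dtypes)
```

Because the DataFrame is only created after all rows are known, the string column and the float columns each get their own dtype and nothing has to be converted row by row.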
梦该遗忘
#3 · 2018-12-31 09:46

It's been a long time, but I ran into the same problem and found a lot of interesting answers here, which left me unsure which method to use.

When adding a lot of rows to a DataFrame, what I care about is speed. So I tried the 3 most popular methods and timed them.

SPEED PERFORMANCE

  1. Using .append (NPE's answer)
  2. Using .loc (fred's answer and FooBar's answer)
  3. Using dict and create DataFrame in the end (ShikharDua's answer)

Results (in secs):

Adding    1000 rows  5000 rows   10000 rows
.append   1.04       4.84        9.56
.loc      1.16       5.59        11.50
dict      0.23       0.26        0.34

So for my own code I collect the rows as dictionaries and create the DataFrame at the end.


Code:

import pandas
import numpy
import time

numOfRows = 10000

# Method 1: DataFrame.append (deprecated in pandas 1.4 and removed in 2.0,
# so this part only runs on older pandas versions)
startTime = time.perf_counter()
df1 = pandas.DataFrame(numpy.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df1 = df1.append(dict((a, numpy.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))

# Method 2: enlarge the frame in place, one row at a time, with .loc
startTime = time.perf_counter()
df2 = pandas.DataFrame(numpy.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    df2.loc[df2.index.max()+1] = numpy.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))

# Method 3: collect one dict per row in a list, build the DataFrame once at the end
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, numpy.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range(1, numOfRows):
    dict1 = dict((a, numpy.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df3 = pandas.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))

P.S. I believe my implementation isn't perfect, and there is probably room for optimization.
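
A note for readers on newer pandas: DataFrame.append was removed in pandas 2.0, so the .append variant of the benchmark does not run there. A rough equivalent, sketched below, is to collect the new rows and attach them with a single pd.concat call. The seeded generator and the row count of 10 are my own assumptions for illustration, not part of the benchmark above.

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Start from a small existing frame, as in the benchmark.
df = pd.DataFrame(rng.integers(100, size=(5, 5)), columns=cols)

# Collect the new rows as plain dicts instead of appending one at a time.
new_rows = [dict(zip(cols, rng.integers(100, size=5).tolist())) for _ in range(10)]

# Attach all of them in one call; ignore_index=True renumbers the rows 0..n-1.
df = pd.concat([df, pd.DataFrame(new_rows, columns=cols)], ignore_index=True)
print(df.shape)
```

Calling pd.concat inside the loop would copy the whole frame on every iteration, much like .append did, so the single call after the loop is the point of the sketch.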

深知你不懂我心
#4 · 2018-12-31 09:46

Figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6
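
For anyone who wants to run this, here is a self-contained sketch of the same idea. The starting frame is reconstructed from the output above; the second half, using len(df2) as the next label, is my own addition and assumes the index is the default 0..n-1.

```python
import pandas as pd

# Rebuild the one-row frame shown above.
df = pd.DataFrame([[1, 2, 3]], index=['one'], columns=['A', 'B', 'C'])

# Assigning to a label that is not in the index yet adds a new row.
df.loc['two'] = [4, 5, 6]
print(df)

# With a default integer index, len(df2) can serve as the next label
# (assumption: the index really is 0..n-1 with no gaps).
df2 = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'])
df2.loc[len(df2)] = [4, 5, 6]
print(df2)
```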
永恒的永恒
#5 · 2018-12-31 09:49

Create the new record as its own one-row DataFrame and concatenate it onto old_data_frame.
Pass a list of values and the corresponding column names to create new_record:

new_record = pd.DataFrame([[0,'abcd',0,1,123]],columns=['a','b','c','d','e'])

old_data_frame = pd.concat([old_data_frame,new_record])
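
One detail worth knowing: by default pd.concat keeps each frame's own index labels, so the result can contain duplicates. A small sketch of the difference follows; the contents of old_data_frame are made up here, since the answer does not show them.

```python
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e']

# Hypothetical existing data; only the column names match the answer.
old_data_frame = pd.DataFrame([[1, 'wxyz', 2, 3, 456]], columns=cols)
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=cols)

# Default behaviour: both rows keep their original label 0.
kept = pd.concat([old_data_frame, new_record])

# ignore_index=True renumbers the result 0, 1, ...
renumbered = pd.concat([old_data_frame, new_record], ignore_index=True)

print(kept.index.tolist(), renumbered.index.tolist())
```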
流年柔荑漫光年
#6 · 2018-12-31 09:52

If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )

# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]: 
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

Speed comparison

In[30]: %timeit tryThis() # function wrapper for this answer
1000 loops, best of 3: 1.23 ms per loop
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
100 loops, best of 3: 2.31 ms per loop

And, as pointed out in the comments, with a size of 6000 the speed difference becomes even larger:

"Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313 ms vs 2.29 s"
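
The bodies of tryThis() and tryOther() are not shown in this answer. The sketch below is one plausible reconstruction, with the hypothetical names fill_preallocated and fill_growing standing in for them, timed with the standard timeit module instead of %timeit. Treat it as a way to rerun the comparison on your own pandas version, not as the exact code behind the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

COLUMNS = ('lib', 'qty1', 'qty2')


def fill_preallocated(number_of_rows=5):
    # Hypothetical stand-in for tryThis(): the index is created up front.
    df = pd.DataFrame(index=np.arange(number_of_rows), columns=COLUMNS)
    for x in range(number_of_rows):
        df.loc[x] = [np.random.randint(-1, 1) for _ in range(3)]
    return df


def fill_growing(number_of_rows=5):
    # Hypothetical stand-in for tryOther(): the frame grows by one row per step.
    df = pd.DataFrame(columns=COLUMNS)
    for x in range(number_of_rows):
        df.loc[x] = [np.random.randint(-1, 1) for _ in range(3)]
    return df


# Total time for 20 calls of each variant; absolute numbers depend on the machine.
t_pre = timeit.timeit(fill_preallocated, number=20)
t_grow = timeit.timeit(fill_growing, number=20)
print('preallocated: {:.4f} s, growing: {:.4f} s'.format(t_pre, t_grow))
```

Passing a larger number_of_rows (for example via functools.partial) should show how the gap changes with size.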

大哥的爱人
#7 · 2018-12-31 09:53

This is not an answer to the OP's question but a toy example to illustrate @ShikharDua's answer above, which I found very useful.

While this fragment is trivial, in the actual data I had thousands of rows and many columns, and I wanted to group by different columns and then compute the stats below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you @ShikharDua!

import pandas as pd 

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd = {}  # start with an empty dict for this customer
    RecordtoAdd.update({'Customer': name})
    RecordtoAdd.update({'Num Unique Products': len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products': pd.unique(group['Product'])})

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)