Python - Efficient way to add rows to dataframe

Posted 2020-01-30 04:16

From this question and others, it seems that using concat or append to build a pandas DataFrame is not recommended, because each call recopies the whole DataFrame.

My project involves retrieving a small amount of data every 30 seconds. This might run over a 3-day weekend, so one could easily expect over 8,000 rows to be created one row at a time. What would be the most efficient way to add rows to this dataframe?

6 Answers
兄弟一词,经得起流年.
#2 · 2020-01-30 04:44

sundance's answer may be correct in terms of usage, but the benchmark is just wrong. As moobie correctly pointed out, index 3 already exists in that example, which makes access much quicker than with a non-existent index. Have a look at this:

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    testrow = pd.DataFrame([0,0,0])
    pd.concat([test[:1], testrow, test[1:]])

2.15 s ± 88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test2 = pd.DataFrame({'A': 0, 'B': 0, 'C': 0}, index=[i+0.5])
    test.append(test2, ignore_index=False)
test.sort_index().reset_index(drop=True)

972 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test3 = [0,0,0]
    test.loc[i+0.5] = test3
test.reset_index(drop=True)

1.13 s ± 46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Of course, this is purely synthetic, and I admittedly wasn't expecting these results, but it seems that with non-existent indices .loc and .append perform quite similarly. Just leaving this here.

倾城 Initia
#3 · 2020-01-30 04:45

I am editing the chosen answer here, since it was completely mistaken. What follows is an explanation of why you should not use setting with enlargement: it is actually worse than append.

The tl;dr here is that there is no efficient way to do this with a DataFrame, so if you need speed you should use another data structure instead. See other answers for better solutions.
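For example, one common alternative is to accumulate the rows in a plain Python list and build the DataFrame once at the end. A minimal sketch (the column names and values here are made up for illustration):

import pandas as pd

rows = []  # appending to a list is cheap; nothing is copied

for i in range(8000):
    # one dict per row, keyed by column name
    rows.append({"A": i, "B": 2 * i, "C": 3 * i})

# a single DataFrame construction instead of 8000 copies
df = pd.DataFrame(rows, columns=["A", "B", "C"])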

More on setting with enlargement

You can add rows to a DataFrame in-place using loc on a non-existent index, but that also performs a copy of all of the data (see this discussion). Here's how it would look, from the Pandas documentation:

In [119]: dfi
Out[119]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

In [120]: dfi.loc[3] = 5

In [121]: dfi
Out[121]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

For something like the use case described, setting with enlargement actually takes 50% longer than append:

With append(), 8000 rows took 6.59s (0.8ms per row)

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df = df.append(new_row, ignore_index=True)

# 6.59 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With .loc, 8000 rows took 10.2s (about 1.3ms per row)

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df.loc[i] = new_row

# 10.2 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What about a longer DataFrame?

As with all profiling in data-oriented code, YMMV and you should test this for your use case. Because append and "setting with enlargement" copy the entire DataFrame on every call, they get slower and slower as the DataFrame grows:

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(16000):
    df.loc[i] = new_row

# 23.7 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Building a 16k-row DataFrame with this method takes 2.3x as long as building an 8k-row one: the per-row cost rises with the size of the frame, so the total cost grows faster than linearly.

可以哭但决不认输i
#4 · 2020-01-30 04:53

Assuming that your dataframe is indexed in order, you can:

First, check what the next index value is, so you can create a new row:

myindex = df.shape[0]  # with a default 0-based index, the next free label equals the current row count

Then use .at to write to each desired column:

df.at[myindex, 'A'] = val1
df.at[myindex, 'B'] = val2
df.at[myindex, 'C'] = val3
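
Putting this together for the OP's 30-second polling loop, a rough sketch (the sample tuples are placeholders; note that, as discussed in the answer above, writing to a new label still enlarges, and therefore copies, the frame):

import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'])

for val1, val2, val3 in [(1, 2, 3), (4, 5, 6)]:  # stand-in for polled data
    myindex = df.shape[0]      # next free positional index (0-based)
    df.at[myindex, 'A'] = val1  # first write enlarges the frame by one row
    df.at[myindex, 'B'] = val2
    df.at[myindex, 'C'] = val3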
够拽才男人
#5 · 2020-01-30 04:57

You need to split the problem into two parts:

  1. Accepting the data (collecting it) every 30 seconds efficiently.
  2. Processing the data once it is collected.

If your data is critical (that is, you cannot afford to lose it), send it to a queue and then read it from the queue in batches.

The queue provides reliable (guaranteed) acceptance, ensuring that your data will not be lost.

You can read the data from the queue and dump it in a database.

Now your Python app simply reads from the database and does the analysis at whatever interval makes sense for the application. Perhaps you want hourly averages; in that case you would run your script each hour to pull the data from the db, and perhaps write the results to another database / table / file.

The bottom line: split the collecting and analyzing parts of your application, as sketched below.
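
As a minimal illustration of that split, using Python's standard-library queue and sqlite3 as stand-ins for a real message broker and database (all names here are placeholders, not a prescribed design):

import queue
import sqlite3

q = queue.Queue()  # in production, a durable broker (e.g. RabbitMQ) gives guaranteed delivery

def collect(sample):
    # collector side: push each 30-second sample onto the queue
    q.put(sample)

def flush_to_db(conn):
    # batch side: drain the queue and persist everything in one transaction
    cur = conn.cursor()
    while not q.empty():
        a, b, c = q.get()
        cur.execute("INSERT INTO samples (a, b, c) VALUES (?, ?, ?)", (a, b, c))
    conn.commit()

conn = sqlite3.connect("samples.db")
conn.execute("CREATE TABLE IF NOT EXISTS samples (a REAL, b REAL, c REAL)")
collect((1.0, 2.0, 3.0))
flush_to_db(conn)

The analysis script can then read from the database on its own schedule, for example with pandas.read_sql.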

【Aperson】
#6 · 2020-01-30 04:58

I used this answer's df.loc[i] = [new_data] suggestion, but I have > 500,000 rows and that was very slow.

While the answers given are good for the OP's question, I found it more efficient, when dealing with a large number of rows up front (instead of the trickling in described by the OP), to use csvwriter to add data to an in-memory CSV object, and then finally use pandas.read_csv(csv) to generate the desired DataFrame output.

from io import StringIO  # csv.writer needs a text buffer in Python 3, not BytesIO
from csv import writer
import pandas as pd

output = StringIO()
csv_writer = writer(output)

for row in iterable_object:  # iterable_object: whatever yields your rows
    csv_writer.writerow(row)

output.seek(0)  # rewind to the start of the buffer before reading it back
df = pd.read_csv(output)  # pass header=None if your rows do not include a header line

For ~500,000 rows this was 1000x faster, and the speed advantage will only grow with the row count (df.loc[i] = [data] gets comparatively slower and slower).

Hope this helps someone who needs efficiency when dealing with more rows than the OP.

倾城 Initia
#7 · 2020-01-30 05:11

Tom Harvey's response works well. However, I would like to add a simpler answer based on pandas.DataFrame.from_dict.

By collecting each row's data in a list and adding those lists to a dictionary, you can use pd.DataFrame.from_dict(dic) to create a DataFrame without iteration.

If each value of the dictionary is a row, you can use just pd.DataFrame.from_dict(dictionary, orient='index').

Small example:

# Dictionary containing the data
dic = {'row_1': ['some', 'test', 'values', 78, 90],
       'row_2': ['some', 'test', 'values', 100, 589]}

# Creation of the dataframe
df = pd.DataFrame.from_dict(dic, orient='index')
df
          0     1       2    3    4
row_1  some  test  values   78   90
row_2  some  test  values  100  589
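
Applied to the OP's scenario, the dictionary can be grown one row at a time and converted only once at the end; a sketch (the row values are made up):

import pandas as pd

dic = {}
for i in range(8000):  # e.g. one entry per 30-second sample
    dic['row_%d' % i] = ['some', 'test', 'values', i, 2 * i]

# single conversion at the end; no per-row copying
df = pd.DataFrame.from_dict(dic, orient='index')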