I feel like I'm overlooking something really simple, but I can't make it work. I'm using SQLite
now, but a solution in SQLAlchemy
would also be very helpful.
Let's create our original dataset:
### This is just the setup part
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.sqlite')
orig = pd.DataFrame({'COLUPC': [100001, 100002, 100003, 100004],
'L5': ['ABC ALE', 'ABC MALT LIQUOR', 'ABITA AMBER', 'ABITA AMBER'],
'attr1': [0.25, 0.25, 0.041, 0.041]})
orig.to_sql("UPCs", conn, if_exists='replace', index=False)
#Create an index just in case it's needed
conn.execute("""CREATE INDEX upc_index
ON UPCs (COLUPC);""")
Now suppose I take that orig
dataframe
and add a column called 'L5_lower'. Then I create the column in the SQLite database:
# Create new variable
orig['L5_lower'] = orig.L5.str.lower()
conn.execute("alter table UPCs add column L5_lower TEXT;")
Now suppose I want to fill in this single column L5_lower
to the SQLite table, without having to pass other columns (below I explain why I need this)
I tried passing the index and the new column as tuples:
query='''insert or replace into UPCs (COLUPC, L5_lower) values (?,?) '''
conn.executemany(query, orig[['COLUPC', 'L5_lower']].to_records(index=False))
conn.commit()
# But then:
df = pd.read_sql("SELECT * FROM UPCs;", conn)
conn.close()
gives this messed up result.
COLUPC L5 attr1 L5_lower
0 100001 ABC ALE 0.250 None
1 100002 ABC MALT LIQUOR 0.250 None
2 100003 ABITA AMBER 0.041 None
3 100004 ABITA AMBER 0.041 None
4 b'\xa1\x86\x01\x00\x00\x00\x00\x00' None NaN abc ale
5 b'\xa2\x86\x01\x00\x00\x00\x00\x00' None NaN abc malt liquor
6 b'\xa3\x86\x01\x00\x00\x00\x00\x00' None NaN abita amber
7 b'\xa4\x86\x01\x00\x00\x00\x00\x00' None NaN abita amber
Instead, the expected output is:
COLUPC L5 attr1 L5_lower
0 100001 ABC ALE 0.250 abc ale
1 100002 ABC MALT LIQUOR 0.250 abc malt liquor
2 100003 ABITA AMBER 0.041 abita amber
3 100004 ABITA AMBER 0.041 abita amber
So, why am I trying to pass a single column? I have a very big dataset and I won't be able to have the whole dataframe in memory. My intended workflow is to construct one column at a time and then update
or insert
into the SQLite database.