I have two tables. One (df below) has about 18,000 rows and the other (mapfile below) has about 800,000 rows, so I need a solution that scales to DataFrames of this size.
Here is a toy example.

Table 1 - df
Sample Chr Start End Value
S1 1 100 200 1
S1 2 200 250 1
S2 1 50 75 5
S2 2 150 225 4
Table 2 - mapfile
Name Chr Position
P1 1 105
P2 1 60
P3 1 500
P4 2 25
P5 2 220
P6 2 240
I'm trying to do something like the following (my first attempt had broken syntax; this nested loop captures the idea):

import pandas as pd

rows = []
for _, mapline in mapfile.iterrows():
    for _, dfline in df.iterrows():
        # same chromosome, and Position strictly between Start and End
        if mapline['Chr'] == dfline['Chr']:
            if dfline['Start'] < mapline['Position'] < dfline['End']:
                rows.append([mapline['Name'], mapline['Chr'], mapline['Position'],
                             dfline['Value'], dfline['Sample']])
newdf = pd.DataFrame(rows, columns=['Name', 'Chr', 'Position', 'Value', 'Sample'])
In words: I need to go through each line of mapfile and check whether its Position falls between the Start and End of any interval on the same Chr in df. If it does, I need to write a row to a new table with the Name, Chr, Position, Sample, and Value fields from the two tables.
Expected output for the toy data:
Name Chr Position Value Sample
P1 1 105 1 S1
P2 1 60 5 S2
P5 2 220 1 S1
P5 2 220 4 S2
P6 2 240 1 S1
So far: I have what's shown above, but I've been struggling with the syntax even for a plain Python loop. My understanding is that this may be much easier with a package such as pandas or NumPy. What is the most efficient way to do this? Some help with the syntax along the way would also be appreciated.
A few related posts that I tried but couldn't make work:

- What is the most efficient way to loop through dataframes with pandas?
- How to iterate over rows in a DataFrame in Pandas?
- Append column to pandas dataframe
- Conditionally fill column values based on another columns value in pandas
IIUC you can use read_csv and merge:
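Here is a minimal sketch of that approach. It assumes the two tables live in whitespace-separated files named df.txt and mapfile.txt (hypothetical names; point read_csv at your actual files and separator):

import pandas as pd

# read both tables; sep=r'\s+' assumes whitespace-delimited files like the toy data
df = pd.read_csv('df.txt', sep=r'\s+')
mapfile = pd.read_csv('mapfile.txt', sep=r'\s+')

# inner merge on Chr pairs each mapfile row with every interval on the same chromosome
merged = mapfile.merge(df, on='Chr')

# keep only the rows whose Position falls strictly between Start and End
inside = (merged['Position'] > merged['Start']) & (merged['Position'] < merged['End'])
newdf = merged.loc[inside, ['Name', 'Chr', 'Position', 'Value', 'Sample']].reset_index(drop=True)

print(newdf)
#   Name  Chr  Position  Value Sample
# 0   P1    1       105      1     S1
# 1   P2    1        60      5     S2
# 2   P5    2       220      1     S1
# 3   P5    2       220      4     S2
# 4   P6    2       240      1     S1

One caveat: the intermediate merge materializes every same-chromosome pair (roughly rows-per-Chr in mapfile times rows-per-Chr in df), which can get large with 800,000 map rows. If memory becomes an issue, read mapfile in chunks with read_csv(..., chunksize=...), apply the same merge-and-filter to each chunk, and pd.concat the filtered pieces.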