Python Pandas - Compare 2 dataframes, multiple par

I have two tables. One (df below) has approximately 18,000 rows, and the other (mapfile below) has ~800,000 rows. I need a solution that can work with such large DataFrames.

Here is a toy example: Table 1 - df

Sample    Chr    Start     End    Value
S1        1       100      200     1
S1        2       200      250     1
S2        1       50        75     5
S2        2       150      225     4

Table 2 - mapfile

Name    Chr    Position
P1       1      105
P2       1      60
P3       1      500
P4       2      25
P5       2      220
P6       2      240

I'm trying to do something like the following (my syntax is wrong, but I think the idea comes across):

for mapline in mapfile:
    for dfline in df:
       if df[dfline]['Chr'] == mapfile[mapline]['Chr']
           if mapfile[mapline]['Position'] > df[dfline]['Start'] & mapfile[mapline]['Position'] < df[dfline]['End']
                  newdf[['Name','Chr','Position','Value', 'Sample']] = pd.DataFrame([ mapfile[mapline]['Name'], mapfile[mapline]['Chr'], mapfile[mapline]['Position'], df[dfline]['Value'], df[dfline]['Sample'] ] )

In words: I need to go through each item (line) in mapfile and see if its position is between any of the START & ENDs on each CHR in the df. If it is, I need to add it to a new file with the Name, Chr, Position, Sample, and Value fields from the two tables.

Output table of toy data:

Name    Chr    Position    Value   Sample
P1       1      105         1       S1
P2       1      60          5       S2
P5       2      220         1       S1
P5       2      220         4       S2
P6       2      240         1       S1

So far: I've got what I have above, and have been having issues figuring out the syntax to just do a general loop in python. However, my understanding is that this may be much easier with a package such as pandas or NumPy? Please help me find the most efficient way to do this, and some help with the syntax along the way would be great.

标签： python numpy pandas

1条回答

姐就是有狂的资本

2楼-- · 2019-07-23 16:41

IIUC you can use read_csv and merge:

import pandas as pd
import io

temp1=u"""Sample;Chr;Start;End;Value
S1;1;100;200;1
S1;2;200;250;1
S2;1;50;75;5
S2;2;150;225;4"""
#after testing replace io.StringIO(temp1) to filename
dfline = pd.read_csv(io.StringIO(temp1), sep=";")

temp2=u"""Name;Chr;Position
P1;1;105
P2;1;60
P3;1;500
P4;2;25
P5;2;220
P6;2;240"""
#after testing replace io.StringIO(temp2) to filename
mapfile = pd.read_csv(io.StringIO(temp2), sep=";")

print dfline
  Sample  Chr  Start  End  Value
0     S1    1    100  200      1
1     S1    2    200  250      1
2     S2    1     50   75      5
3     S2    2    150  225      4
print mapfile
  Name  Chr  Position
0   P1    1       105
1   P2    1        60
2   P3    1       500
3   P4    2        25
4   P5    2       220
5   P6    2       240

#merge by column Chr
df = pd.merge(dfline, mapfile, on=['Chr'])

#select by conditions
df = df[(df.Position > df.Start) & (df.Position < df.End)]

#subset of df
df =  df[['Name','Chr','Position','Value', 'Sample']]

print df
   Name  Chr  Position  Value Sample
0    P1    1       105      1     S1
4    P2    1        60      5     S2
7    P5    2       220      1     S1
8    P6    2       240      1     S1
10   P5    2       220      4     S2

#if you need reset index
print df.reset_index(drop=True)
  Name  Chr  Position  Value Sample
0   P1    1       105      1     S1
1   P2    1        60      5     S2
2   P5    2       220      1     S1
3   P6    2       240      1     S1
4   P5    2       220      4     S2

0人赞添加讨论(0) 举报

Python Pandas - Compare 2 dataframes, multiple par

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间