How to filter dataframe based on ip range

2019-08-24 20:51发布

问题:

I have dataframe which has 2 columns. I want to filter this dataframe based on ip ranges present in json file.

ip_ranges.json

[
    {"start": "45.43.144.0", "end": "45.43.161.255"}
    {"start": "104.222.130.0", "end": "104.222.191.255"}
    ...
]

Dataframe:

ip,p_value
97.98.173.96,3.7
73.83.192.21,6.9
...

Note: ip_range.json contains 100k elements and my dataframe has 300k rows.

Currently, I implemented like this

Created python list to store all ips in each range. For example ["45.43.144.0", "45.43.144.1", "45.43.144.2", ..., "45.43.161.255"]. Similar way for all ip ranges.
Removed duplicate elements from this list
Constructed dataframe using this list
Merged two dataframes on 'ip'

This process works fine for small set of ip_ranges. But for large set of ip_ranges, the process takes longer time to complete.

Is there any better approach to perform this more efficiently?

回答1:

Just an idea: Put you ranges into a dataframe ip_range with columns From and To. Convert all ip-addresses (the ones in df, too) to decimal numbers with the fast code provided for example here.

Now generating the ranges can be done fast:

ip_range['Rng'] = ip_range.apply(lambda x: np.arange(x.From, x.To+1), axis=1)

These ranges can be converted into a DataFrame:

ips = pd.DataFrame(itertools.chain(*ip_range['Rng']))

This DataFrame can easily be merged with df.