I have a dataframe with 2 columns. I want to filter this dataframe based on the IP ranges present in a JSON file.
ip_ranges.json
[
{"start": "45.43.144.0", "end": "45.43.161.255"},
{"start": "104.222.130.0", "end": "104.222.191.255"},
...
]
Dataframe:
ip,p_value
97.98.173.96,3.7
73.83.192.21,6.9
...
Note: ip_ranges.json contains 100k elements and my dataframe has 300k rows.
Currently, I have implemented it like this:
- Created a Python list containing every individual IP in each range, e.g. ["45.43.144.0", "45.43.144.1", "45.43.144.2", ..., "45.43.161.255"], and did the same for all other ranges.
- Removed duplicate elements from this list.
- Constructed a dataframe from this list.
- Merged the two dataframes on 'ip' (a minimal sketch of these steps is shown below).
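Roughly, this is what the current implementation does. This is only a sketch; it assumes the dataframe above is already loaded as df:

import json
import ipaddress

import pandas as pd

# Load the ranges from the JSON file
with open("ip_ranges.json") as f:
    ip_ranges = json.load(f)

# Expand every range into individual address strings; a set removes duplicates
all_ips = set()
for rng in ip_ranges:
    start = int(ipaddress.IPv4Address(rng["start"]))
    end = int(ipaddress.IPv4Address(rng["end"]))
    for n in range(start, end + 1):
        all_ips.add(str(ipaddress.IPv4Address(n)))

# Build a one-column dataframe and merge it with df on 'ip'
ips_df = pd.DataFrame({"ip": list(all_ips)})
filtered = df.merge(ips_df, on="ip")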
This process works fine for a small set of IP ranges, but for a large set it takes a long time to complete.
Is there a better approach to perform this more efficiently?
Just an idea: put your ranges into a dataframe ip_range with columns From and To. Convert all IP addresses (the ones in df, too) to decimal numbers with the fast code provided, for example, here.
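The linked code is not reproduced here, but a minimal vectorized sketch of such a conversion could look like this (the helper name ip_to_int and the columns start, end, and ip_int are assumptions based on the question):

import numpy as np
import pandas as pd

def ip_to_int(s: pd.Series) -> pd.Series:
    # Split dotted-quad strings into four integer octet columns,
    # then combine them into one 32-bit integer per address
    parts = s.str.split('.', expand=True).astype(np.int64)
    return parts[0] * 2**24 + parts[1] * 2**16 + parts[2] * 2**8 + parts[3]

ip_range['From'] = ip_to_int(ip_range['start'])
ip_range['To'] = ip_to_int(ip_range['end'])
df['ip_int'] = ip_to_int(df['ip'])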
Now generating the ranges can be done fast:

# One numpy array of consecutive integers per range
ip_range['Rng'] = ip_range.apply(lambda x: np.arange(x.From, x.To + 1), axis=1)
These ranges can be converted into a DataFrame:
import itertools

ips = pd.DataFrame(itertools.chain(*ip_range['Rng']))
This DataFrame can easily be merged with df.
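For instance, assuming the ip_int column from the conversion sketch above:

# Name the single column, drop duplicate addresses, and merge
ips.columns = ['ip_int']
filtered = df.merge(ips.drop_duplicates(), on='ip_int')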