I generate an npz file as follows:
import numpy as np
import os
# Generate npz file
dataset_text_filepath = 'test_np_load.npz'
texts = []
for text_number in range(30000):
    texts.append(np.random.random_integers(0, 20000,
                 size = np.random.random_integers(0, 100)))
texts = np.array(texts, dtype=object)  # ragged arrays need an explicit object dtype in recent NumPy
np.savez(dataset_text_filepath, texts=texts)
This gives me a ~7 MiB npz file (basically a single variable, texts, which is a NumPy array of NumPy arrays), which I load with numpy.load():
# Load data
dataset = np.load(dataset_text_filepath, allow_pickle=True)  # object arrays require allow_pickle=True in recent NumPy
If I query it as follows, it takes several minutes:
# Querying data: the slow way
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(dataset['texts']), size=10)
    dataset['texts'][random_indices]
while if I query as follows, it takes less than 5 seconds:
# Querying data: the fast way
data_texts = dataset['texts']
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(data_texts), size=10)
    data_texts[random_indices]
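For anyone who wants to reproduce the gap, here is a minimal, self-contained sketch that times the two access patterns side by side. The helper names (make_npz, time_repeated_lookup, time_cached_lookup), the temporary file name, and the reduced array count and run count are illustrative assumptions rather than part of the script above; np.random.randint is used in place of random_integers with adjusted upper bounds so the value ranges stay the same.

```python
import os
import tempfile
import time

import numpy as np


def make_npz(path, n_texts=5000, seed=0):
    """Write an npz file holding one object array of variable-length int arrays."""
    rng = np.random.RandomState(seed)
    texts = np.empty(n_texts, dtype=object)
    for i in range(n_texts):
        # Values in [0, 20000], length in [0, 100], as in the original script.
        texts[i] = rng.randint(0, 20001, size=rng.randint(0, 101))
    np.savez(path, texts=texts)


def time_repeated_lookup(dataset, runs=5):
    """Time the slow pattern: dataset['texts'] is looked up inside the loop."""
    start = time.perf_counter()
    for _ in range(runs):
        random_indices = np.random.randint(0, len(dataset['texts']), size=10)
        dataset['texts'][random_indices]
    return time.perf_counter() - start


def time_cached_lookup(dataset, runs=5):
    """Time the fast pattern: dataset['texts'] is looked up once, before the loop."""
    start = time.perf_counter()
    data_texts = dataset['texts']
    for _ in range(runs):
        random_indices = np.random.randint(0, len(data_texts), size=10)
        data_texts[random_indices]
    return time.perf_counter() - start


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp_dir:
        npz_path = os.path.join(tmp_dir, 'timing_test.npz')  # hypothetical file name
        make_npz(npz_path)
        # allow_pickle=True is needed by recent NumPy to read object arrays.
        with np.load(npz_path, allow_pickle=True) as dataset:
            slow = time_repeated_lookup(dataset)
            fast = time_cached_lookup(dataset)
        print('lookup inside loop: {0:.3f} s'.format(slow))
        print('lookup before loop: {0:.3f} s'.format(fast))
```

The absolute timings depend on the machine and the NumPy version, so only the ratio between the two printed numbers is meaningful.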
Why is the second method so much faster than the first one?