Given a fixed-sized array A of objects, suppose a much smaller subset of these objects meet a certain criteria B. There are three tasks I want to complete with approximately equal frequency:
- I want to be able to change an object currently not meeting criteria B to meet criteria B at any time when accessing the object in A by index.
- I want to be able to change an object currently meeting criteria B to no longer meet criteria B when accessing the object in A by index.
- I also want to be able to choose a random object from ONLY those objects that meet criteria B.
All tasks should be able to be completed in constant time, or as close to constant time as possible, not dependent on the number of objects in A and not dependent on the number of objects meeting criteria B. If constant time isn't possible, which I suspect to be the case, then I want to complete both as quickly as possible, taking into consideration the frequencies I mentioned earlier. What data structure(s) are appropriate for these two tasks if they are being repeated huge numbers of times?
Take, for example, the C++ implementation that I have below. While the timed part (the section of the code that gets repeated a huge number of times) is independent of the overall size of A (alltiles), the time complexity linearly depends on B (bluetiles) (regardless of whether the number of alltiles increases or not), seriously slowing down the code.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
#include <algorithm>
using namespace std;
enum color {RED, GREEN, BLUE};
const int NUM_ATTEMPTS = 10000;
const int INITIAL_NUM_BLUE_TILES = 1000;
const int TOTAL_TILES = 1000000;
struct tile
{
int color = RED;
};
struct room
{
vector<tile> alltiles;
vector<tile*> bluetiles;
room(vector<tile> v) : alltiles(v) {}
};
int main()
{
srand (time(NULL));
// set up the initial room, time complexity here is irrelevant
room myroom(vector<tile>(1*TOTAL_TILES));
for(int i = 0; i < INITIAL_NUM_BLUE_TILES; i++)
{
myroom.alltiles[i].color = BLUE;
myroom.bluetiles.push_back(&myroom.alltiles[i]);
}
auto begin = std::chrono::high_resolution_clock::now();
for(int attempt_num = 0; attempt_num < NUM_ATTEMPTS; attempt_num++)
{
// access a BLUE tile by index from alltiles to change its color to RED
myroom.alltiles[5].color = RED; // constant time
myroom.bluetiles.erase(std::remove(myroom.bluetiles.begin(), myroom.bluetiles.end(), &myroom.alltiles[5]), myroom.bluetiles.end()); // linear time, oh no!
// access a RED tile by index from alltiles to change its color to BLUE
myroom.alltiles[5].color = BLUE; // constant time
myroom.bluetiles.push_back(&myroom.alltiles[5]); // constant time
// randomly choose from ONLY the blue tiles
int rand_index = rand() % myroom.bluetiles.size(); // constant time
myroom.bluetiles[rand_index]->color = GREEN; // constant time
myroom.bluetiles[rand_index]->color = BLUE; // constant time
// so now I have constant time access to a random blue tile
}
auto end = std::chrono::high_resolution_clock::now();
double runtime = std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count();
cout << runtime << " ms" << endl;
return 0;
}
The portions that are being timed are the operations that I am interested in performing frequently; in the real program, the logic behind choosing which tiles to change differs. Hopefully, a better data structure wouldn't require any probabilistic analysis, but I fear it still might.
I suspect that, perhaps, using double indirection by keeping a pointer (pointing to elements in the bluetiles vector) in the tile class could maybe allow me to achieve this in constant time, but I'm not certain. I guess it could at least speed it up in the sense that searching bluetiles would no longer be necessary, but then removal of an element in bluetiles would still be linear time (since I'm using a vector), so I'm really just not sure what to do here.
Can you devise the fastest data structure to achieve this goal, and provide a C++ implementation building from my example? Or is what I have as good as it will ever get?
UPDATE: This resembles the solution I propose for the SO question Random element from unordered_set in O(1)
You can implement something like the following
SubsetVector<T>
class which lets you to insert/remove element from the subset (i.e. mark them) in O(1). Then it lets you find the size of the subset in O(1), and access i-th item from this subset in O(1). I think this is what you wanted. Note that the subset does not guarantee any specific order, but this should be OK for your needs.The idea is to maintain two vectors.
m_entries
that contains the actual data.m_entries[i]
contains the element and and an index intom_subset_indices
, if the element is in the subset, and -1 otherwise.m_subset_indices
contains all indices ofm_entries
elements that are in the subset.Here is the code (compiled but not tested):