Data Structure(s) Allowing For Alteration Through

2019-08-09 11:11发布

Given a fixed-sized array A of objects, suppose a much smaller subset of these objects meet a certain criteria B. There are three tasks I want to complete with approximately equal frequency:

  1. I want to be able to change an object currently not meeting criteria B to meet criteria B at any time when accessing the object in A by index.
  2. I want to be able to change an object currently meeting criteria B to no longer meet criteria B when accessing the object in A by index.
  3. I also want to be able to choose a random object from ONLY those objects that meet criteria B.

All tasks should be able to be completed in constant time, or as close to constant time as possible, not dependent on the number of objects in A and not dependent on the number of objects meeting criteria B. If constant time isn't possible, which I suspect to be the case, then I want to complete both as quickly as possible, taking into consideration the frequencies I mentioned earlier. What data structure(s) are appropriate for these two tasks if they are being repeated huge numbers of times?

Take, for example, the C++ implementation that I have below. While the timed part (the section of the code that gets repeated a huge number of times) is independent of the overall size of A (alltiles), the time complexity linearly depends on B (bluetiles) (regardless of whether the number of alltiles increases or not), seriously slowing down the code.

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
#include <algorithm>

using namespace std;

enum color {RED, GREEN, BLUE};
const int NUM_ATTEMPTS = 10000;
const int INITIAL_NUM_BLUE_TILES = 1000;
const int TOTAL_TILES = 1000000;

struct tile
{
  int color = RED;
};

struct room
{
  vector<tile> alltiles;
  vector<tile*> bluetiles;
  room(vector<tile> v) : alltiles(v) {}
};

int main()
{
  srand (time(NULL));

  // set up the initial room, time complexity here is irrelevant
  room myroom(vector<tile>(1*TOTAL_TILES));
  for(int i = 0; i < INITIAL_NUM_BLUE_TILES; i++)
  {
    myroom.alltiles[i].color = BLUE;
    myroom.bluetiles.push_back(&myroom.alltiles[i]);
  }

  auto begin = std::chrono::high_resolution_clock::now();
  for(int attempt_num = 0; attempt_num < NUM_ATTEMPTS; attempt_num++)
  {
    // access a BLUE tile by index from alltiles to change its color to RED
    myroom.alltiles[5].color = RED; // constant time
    myroom.bluetiles.erase(std::remove(myroom.bluetiles.begin(), myroom.bluetiles.end(), &myroom.alltiles[5]), myroom.bluetiles.end()); // linear time, oh no!

    // access a RED tile by index from alltiles to change its color to BLUE
    myroom.alltiles[5].color = BLUE; // constant time
    myroom.bluetiles.push_back(&myroom.alltiles[5]); // constant time

    // randomly choose from ONLY the blue tiles
    int rand_index = rand() % myroom.bluetiles.size(); // constant time
    myroom.bluetiles[rand_index]->color = GREEN; // constant time
    myroom.bluetiles[rand_index]->color = BLUE; // constant time
    // so now I have constant time access to a random blue tile

  }
  auto end = std::chrono::high_resolution_clock::now();
  double runtime = std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count();
  cout << runtime << " ms" << endl; 
  return 0;
}

The portions that are being timed are the operations that I am interested in performing frequently; in the real program, the logic behind choosing which tiles to change differs. Hopefully, a better data structure wouldn't require any probabilistic analysis, but I fear it still might.

I suspect that, perhaps, using double indirection by keeping a pointer (pointing to elements in the bluetiles vector) in the tile class could maybe allow me to achieve this in constant time, but I'm not certain. I guess it could at least speed it up in the sense that searching bluetiles would no longer be necessary, but then removal of an element in bluetiles would still be linear time (since I'm using a vector), so I'm really just not sure what to do here.

Can you devise the fastest data structure to achieve this goal, and provide a C++ implementation building from my example? Or is what I have as good as it will ever get?

1条回答
SAY GOODBYE
2楼-- · 2019-08-09 12:04

UPDATE: This resembles the solution I propose for the SO question Random element from unordered_set in O(1)

You can implement something like the following SubsetVector<T> class which lets you to insert/remove element from the subset (i.e. mark them) in O(1). Then it lets you find the size of the subset in O(1), and access i-th item from this subset in O(1). I think this is what you wanted. Note that the subset does not guarantee any specific order, but this should be OK for your needs.

The idea is to maintain two vectors.

  1. m_entries that contains the actual data. m_entries[i] contains the element and and an index into m_subset_indices, if the element is in the subset, and -1 otherwise.
  2. m_subset_indices contains all indices of m_entries elements that are in the subset.

Here is the code (compiled but not tested):

template <class T>
class SubsetVector
{
private:
   struct Entry
   {
       T element;
       int index_in_subset = -1;
   };
public:
   explicit SubsetVector(unsigned size = 0) : m_entries(size) 
   {
       m_subset_indices.reserve(size);
   }

   void push_back(const T & element)
   {
       m_entries.push_back(Entry{element, -1});
   }
   const T & operator[](unsigned index) const { return m_entries[index].element; }
   T & operator[](unsigned index) { return m_entries[index].element; }

   void insert_in_subset(unsigned index)
   {
       if (m_entries[index].index_in_subset < 0) {
           m_entries[index].index_in_subset = m_subset_indices.size();
           m_subset_indices.push_back(index);
       }
   }
   void erase_from_subset(unsigned index)
   {
       if (m_entries[index].index_in_subset >= 0) {
           auto subset_index = m_entries[index].index_in_subset;
           auto & entry_to_fix = m_entries[m_subset_indices.back()];
           std::swap(m_subset_indices[subset_index], m_subset_indices.back());
           entry_to_fix.index_in_subset = subset_index;
           m_subset_indices.pop_back();
           m_entries[index].index_in_subset = -1;
       }
   }
   unsigned subset_size() const 
   {
       return m_subset_indices.size();
   }
   T & subset_at(unsigned subset_index)
   {
       auto index = m_subset_indices.at(subset_index);
       return m_entries.at(index).element;
   }
   const T & subset_at(unsigned subset_index) const
   {
       auto index = m_subset_indices.at(subset_index);
       return m_entries.at(index).element;
   }

private:
   std::vector<Entry> m_entries;
   std::vector<unsigned> m_subset_indices;
};
查看更多
登录 后发表回答