I have a possible solution to a problem I'm trying to address, but I wanted to run it by you here just to be on the safe side. The challenge is to ensure that a user who has gone through some test questions in an exam application doesn't encounter them again in a subsequent test.
I'm not using a SQL database, which would allow me to use left joins, subqueries, temporary tables and so on. I'm using Google App Engine's datastore and hope to get the information I need within a single HTTP request and a single thread, preferably in under a second.
Suppose there is a pool of 100,000 vocabulary questions of a certain type (e.g., synonyms). The application would select 30 questions from the pool for a given section of an exam. Here's what I was thinking of doing:
- When a question is first created, assign it a random integer position within the pool.
- At the time of the individual's first exam, choose a random number, then select the first 100 questions, ordered by position, whose positions are larger than that number. Keep track of the number as the lower bound of a window of questions and also as the starting position for the pool as a whole. Keep track of the (maximum) position of the last question in the result set as the upper bound of the new window.
- Select 30 questions at random from the window and then give them as the section. Store the remaining 70 questions for use later on in the exam and possibly in a subsequent exam.
- As the user goes through additional sections (for practice, say), and the list of remaining questions in the current window is depleted, select the next 100 questions from the pool whose positions are greater than the upper bound that was stored earlier. Make the old upper bound the new lower bound and find the upper bound of the new window.
- When a query returns fewer than 100 questions, wrap around to a position of 0 and continue until the original starting point is encountered (it's unlikely that anyone would go through the entire pool, but it's important to be sure).
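If it helps to make the window walk concrete, here's a minimal Python sketch. A plain sorted list stands in for the datastore query, and the names (`next_window`, `POSITION_RANGE`) and the position range are just illustrative assumptions; the wrap-around bookkeeping of stopping at the original starting point is omitted.

```python
import random

WINDOW, SECTION = 100, 30
POSITION_RANGE = 10**9          # hypothetical range for random positions

# Each question gets a random position at creation (duplicates tolerated).
questions = sorted((random.randrange(POSITION_RANGE), qid) for qid in range(1000))

def next_window(lower):
    # Stand-in for the datastore query: the first WINDOW questions, ordered
    # by position, whose position exceeds `lower`; wrap to 0 when short.
    batch = [q for q in questions if q[0] > lower][:WINDOW]
    if len(batch) < WINDOW:
        batch += [q for q in questions if q[0] <= lower][:WINDOW - len(batch)]
    return batch

# First exam: pick a random starting point, fetch a window, serve 30.
start = random.randrange(POSITION_RANGE)
window = next_window(start)
section = random.sample(window, SECTION)
remaining = [q for q in window if q not in section]   # saved for later sections
upper = max(p for p, _ in window)                     # next window's lower bound
```

The stored state per user would be `start`, `upper`, and the `remaining` list.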
The main reason the positions are randomly assigned is to cancel out the effects of changes in the style of how questions are written, e.g., questions that were written earlier on when there was less experience versus later ones.
The application would assign a position to a question without checking to see that the position is unique. If there are a large enough number of questions, the birthday paradox suggests that duplicate positions will become more and more common. My thought was that it won't hurt to have occasional duplicates, and that this would make things simpler than ensuring that a given position is unique, which might entail a retry and the associated network costs that go with it. It's more important that there be no repeat questions than to ensure that every question within a range of questions is shown to a user.
Is there a better way to do this? Is it ok not to worry about duplicate positions?
I don't think you need to "choose 30 at random" out of your 100 question pool. The 100 were selected randomly to begin with - if you take the first 30, those will already be randomized. Your code will be simpler, with no less randomness.
Put the items in some order and give each one a number. Maintain a counter for the user which starts at zero. Generate a random(ish) key for each user. Choose a function which maps the combination of an integer in the range 0...N and a key to another integer in the range 0...N, such that for any value of the key, the mapping between integers is bijective. When you need an item, take the value of the counter, put it through the function along with the key, and use that number to index into the list of items. Increment the counter.
All you need to store for each user is the key and the counter, and you have a guarantee that no item will ever be repeated. It's basically a way of constructing a huge permutation on the fly.
The problem is of course choosing the function!
A simple one would be

    f(counter, key) = (counter + key) mod N

and whilst this would work, it doesn't randomise the items at all, so everyone will get the items in the same order, just starting at different places. If N + 1 is prime, you can use

    f(counter, key) = (((counter + 1) * (key + 1)) mod (N + 1)) - 1

which would work pretty well. If the range was 0...2^64, you could use any block cipher which has a 64-bit block, which would give excellent randomisation, but it's not clear how you could adapt this to smaller sizes. I'll have a bit of a dig to see if I can come up with a generically applicable function. This is a problem I've come across myself, and it would be great to finally have a good answer. I'll edit this answer if I find anything.
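As a quick check that the multiplicative version really is a bijection when N + 1 is prime (N = 10 and key = 3 here are arbitrary choices for illustration):

```python
N = 10        # N + 1 = 11 is prime
key = 3       # any key in 0 .. N-1 works

# Multiplying by the nonzero value (key + 1) mod 11 permutes {1..10},
# so shifting back down by 1 permutes {0..9}.
mapped = [(((c + 1) * (key + 1)) % (N + 1)) - 1 for c in range(N)]
# mapped == [3, 7, 0, 4, 8, 1, 5, 9, 2, 6]
```

Every index 0...N-1 appears exactly once, just in a scrambled order.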
EDIT: Okay, here we go! I got the key ideas from a thread I started on sci.crypt, and in particular from one Paul Rubin, who is an all-round Usenet hero.
Slight change of strategy. Put your items in a list, so they can be accessed by an index. Pick a number B such that 2^B >= N - any such value will do, but you really want the smallest one (i.e., the ceiling of the base-2 logarithm of N). We then refer to 2^B as M. Set up a counter which will run from 0 to M-1, and a key for each user, which can be an arbitrary byte string - a random integer is probably simplest. Set up the Magic Function, which is a bijection, aka a permutation, on the set of integers 0 ... M-1 (see below!). When you want an item, take the value of the counter, increment the counter, then put the original value through the Magic Function to get an index; if the index is greater than N-1, throw it away and repeat the process until you get an index that is less than N. Sometimes you'll need to go through one or more useless indices before you get a good one, but on average it will take M/N tries, which is not too bad (less than 2 if you chose the smallest possible B).
Incidentally, calculating the ceiling of the base 2 logarithm of a number is easy:
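In Python, for instance (a sketch; `int.bit_length` does all the work):

```python
def ceil_log2(n):
    # Smallest B such that 2**B >= n, i.e. the ceiling of log2(n), for n >= 1.
    return (n - 1).bit_length()

# ceil_log2(100) == 7, ceil_log2(128) == 7, ceil_log2(129) == 8
```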
Okay, so the Magic Function, which maps a number from 0 ... M-1 to another such number. I mentioned block ciphers above, and that's what we're going to use. Except that our block is B bits long, which is variable and smaller than the normal 64 or 128 bits. So we need a cipher that works on small, variable-sized blocks. So we're going to write our own - a variable-sized mildly-unbalanced Feistel cipher. Easy!
To do a Feistel cipher, you need: a number of rounds (even a handful will do here), a key, and a round function - something that takes one half of the block, the key, and the round number, and produces a bit pattern the size of the other half. The round function does not have to be invertible; a hash over its inputs is fine, because the Feistel structure supplies the invertibility.

A Feistel cipher works like this: split the B-bit block into a left half and a right half (when B is odd, one half gets the extra bit - that's the "mildly unbalanced" part). In each round, XOR one half with the round function applied to the other half, alternating which half gets modified from round to round. After the last round, stick the halves back together.

The final value of the block is the encrypted value, and is the output from the cipher. Decrypting it is pretty simple - do the above in reverse - but since we don't need to decrypt, don't worry about that.
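Here is a sketch of the whole scheme in Python (in Python for brevity; the names `feistel` and `next_index` are mine, SHA-256 stands in for the round function, and four rounds is an assumption about "enough mixing", not a cryptographic claim):

```python
import hashlib

def _round_f(half, key, rnd, mask):
    # Round function: hash the half with the key and round number.
    # Not invertible, and doesn't need to be.
    digest = hashlib.sha256(f"{key}:{rnd}:{half}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & mask

def feistel(x, key, bits, rounds=4):
    # A mildly-unbalanced Feistel cipher: a bijection on 0 .. 2**bits - 1.
    a = bits // 2                    # size of the left half
    b = bits - a                     # right half takes the extra bit when odd
    left, right = x >> b, x & ((1 << b) - 1)
    for rnd in range(rounds):
        if rnd % 2 == 0:             # alternate sides so both get mixed
            right ^= _round_f(left, key, rnd, (1 << b) - 1)
        else:
            left ^= _round_f(right, key, rnd, (1 << a) - 1)
    return (left << b) | right

def next_index(state, key, n):
    # Encrypt the counter to get an index; throw away out-of-range values
    # and move on to the next counter value (M/N tries on average).
    bits = max(1, (n - 1).bit_length())   # smallest B with 2**B >= n
    while True:
        x = feistel(state["counter"], key, bits)
        state["counter"] += 1
        if x < n:
            return x
```

To serve a question, store `(key, counter)` per user and call `next_index`; since the cipher is a permutation of the counter range, no index is ever produced twice.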
So there you go. Maintain a counter (and a key, and the value of M), encrypt its value with a tiny cipher, and use the result as an index. Given that the cipher is a permutation, it's easy to show that this will never repeat, which should keep your clients happy. Even better, given the cryptographic properties of the cipher, you can also claim that the students will be unable to predict which question will come next (which is probably not important, but sounds cool).
All this is a bit more complicated than just incrementing a counter and giving them the items in order, but it's not that hard. You could do it in a hundred lines of Java. Well, I can do it in a hundred lines of Java - I don't know about you! :)
Incidentally, this will work with a growing pool of items, provided that you always add items at the end, although it will never pick items numbered M or higher. In many cases, though, that will give you some room for growth.
Here is an approach to consider. I don't have tons of time to write it up and edit, so apologies in advance for any issues to follow. Create a model with your 100K questions such that you can access them using a key name (e.g. question_000001, question_000002, etc.). When a student enrolls, create his/her record with three TextProperties. When the record is created, generate a randomized ordering of the numbers 1-100K and serialize it as a delimited text string. The second string records answered questions yet to be taskqueue-processed. The third string is for answered questions after TQ processing.
When the user logs in, send the first N entries of the to-be-asked string (N = more than enough questions to service any type of session), plus the whole answered-questions string. On the client side, split them into an array of questions to ask and a hash of questions answered. If a question is in the hash, skip it. As the user works through the questions, each is serviced by a simple get_by_id call to your on-line handler. As each question is answered, the client sends it to GAE, where you append the question number to the text field that holds recently answered questions, and set a boolean to flag it for later TQ processing. Once a day or thereabouts, run a taskqueue process to update the record: split the questions-to-be-asked text field and remove all of the questions answered since the last update, moving them to the third text field as completed questions.
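A rough Python sketch of the flow above, with a plain dict and strings standing in for the datastore record (the function names and the comma delimiter are just assumptions for illustration):

```python
import random

POOL = 100_000

def new_student_record():
    # Enrollment: shuffle all question numbers once and store the order
    # as a delimited string (stand-in for a TextProperty).
    order = list(range(1, POOL + 1))
    random.shuffle(order)
    return {"to_ask": ",".join(map(str, order)),
            "recently_answered": "",     # awaiting taskqueue consolidation
            "answered": ""}              # consolidated by the taskqueue

def questions_for_session(record, n):
    # First n not-yet-asked question numbers, skipping anything answered
    # since the last taskqueue run.
    done = set(record["answered"].split(",")) | \
           set(record["recently_answered"].split(","))
    return [q for q in record["to_ask"].split(",") if q not in done][:n]
```

The nightly taskqueue job would move the `recently_answered` numbers into `answered` and strip them out of `to_ask`.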
Note that you could skip a lot of this using just two text fields, and updating as the answers get posted. However, my inclination is to always do as little as possible with the on-line handler, and push things to TQs.
No counters (always available by getting the length of your split GAE text fields), no queries (except for the TQ boolean flag query). No complexity. There are numerous ways to make this more efficient re: amount of data transferred, etc.
Wish I had a bit more time to more ensure this is 100% aligned with your needs, but have to go. So I leave you with a "HTH" -stevep
Use a floating point number between 0 and 1 instead of an integer. It has a nice domain, which doesn't change with the number of entities you have, and doubles have a 52-bit mantissa, which gives us approximately 2^26 objects before we can expect a collision - substantially more than what you're dealing with.
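Plugging the numbers into the standard birthday approximation (using the 52-bit figure above, and the asker's pool of 100,000 questions) shows just how unlikely a duplicate position is:

```python
import math

n = 100_000            # questions getting random float positions
space = 2.0 ** 52      # distinct values, per the 52-bit mantissa figure

# Birthday approximation: P(at least one duplicate position)
p_duplicate = 1.0 - math.exp(-n * (n - 1) / (2.0 * space))
# roughly one in a million
```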