I'm building an application that needs to store and re-use large amounts of data per session.
So for example, the user selects a large list of items (say 2000 or significantly more), each of which has a numeric value as its key. They save that selection, go off to another page, do something else, and then come back to the original page, which needs to reload their selections.
What is the quickest and most efficient way of storing and reusing that data?
In a text file saved with the session ID?
In a temp DB table?
In the session data itself (DB sessions, so size isn't a limit), as a serialized string, possibly compressed with gzcompress or gzencode?
Any way you want. But whatever you choose, the array will be serialized into a string and kept either in a file (implicitly, when using sessions) or in a database field. Reads and writes are performed faster on a file, and so is searching. I see no reason to use a database for this.
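A minimal sketch of both variants, assuming the selection is an array of numeric keys ($selectedIds is a hypothetical variable):

    <?php
    session_start();

    // Hypothetical data: the numeric keys the user selected.
    $selectedIds = [17, 42, 105];

    // Kept in the session: PHP serializes the array into the session
    // record automatically at the end of the request.
    $_SESSION['selected_items'] = $selectedIds;

    // Kept in a database field: serialize explicitly and store the string.
    $blob = serialize($selectedIds);   // ready for a TEXT/BLOB column
    $restored = unserialize($blob);    // back to the original array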
For alternative serialization, check out this tool: http://msgpack.sourceforge.net/
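For instance, with the PECL msgpack extension installed (an assumption; msgpack_pack and msgpack_unpack come from that extension, not core PHP), you can compare the encoded sizes:

    <?php
    $ids = range(1, 2000);

    $native = serialize($ids);
    $packed = msgpack_pack($ids);      // binary MessagePack encoding
    $back   = msgpack_unpack($packed); // round-trips to the original array

    printf("serialize: %d bytes, msgpack: %d bytes\n",
           strlen($native), strlen($packed));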
A database will work fine for this. Just link the session to a visitors table and have a table called visitor_list_items that stores the selected items as rows.
2000 isn't an insane number to fetch. I mean, geez, if they're gonna sit there and select 2000 list items, they can wait one second for the page to load! (Are you sure there isn't a way to break this selection process up into steps?)
If it's in the DB, you can leverage conventional database uses (e.g., more easily run reports on which items visitors are selecting when they come to your site, etc.). See the sketch below.
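A sketch of that layout, assuming PDO with MySQL; the column names are illustrative, not a fixed schema:

    <?php
    // Assumed schema, one row per selected item:
    //   CREATE TABLE visitor_list_items (
    //       visitor_id INT NOT NULL,
    //       item_id    INT NOT NULL,
    //       PRIMARY KEY (visitor_id, item_id)
    //   );
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Save: replace the visitor's selection in one transaction.
    function saveSelection(PDO $pdo, int $visitorId, array $itemIds): void
    {
        $pdo->beginTransaction();
        $pdo->prepare('DELETE FROM visitor_list_items WHERE visitor_id = ?')
            ->execute([$visitorId]);
        $ins = $pdo->prepare(
            'INSERT INTO visitor_list_items (visitor_id, item_id) VALUES (?, ?)');
        foreach ($itemIds as $id) {
            $ins->execute([$visitorId, $id]);
        }
        $pdo->commit();
    }

    // Load: fetch the selection when the visitor returns to the page.
    function loadSelection(PDO $pdo, int $visitorId): array
    {
        $st = $pdo->prepare(
            'SELECT item_id FROM visitor_list_items WHERE visitor_id = ?');
        $st->execute([$visitorId]);
        return $st->fetchAll(PDO::FETCH_COLUMN);
    }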
Although normally I'd recommend keeping data in a database rather than in simple files, this is an exception. In general there is a slight overhead to storing data in a database compared with files, but the former provides a lot of flexibility over access and removes a lot of the locking problems. However, unless you expect your page turns to be particularly slow and your users to run multiple browsers against the same session, concurrency will not be a big problem; i.e. using any sort of database will be slower.
(Also, if you're going to be dealing with a large cluster of webservers, more than 200, sharing the same session, then yes, a distributed database may outperform a cluster filesystem on a SAN.)
You probably do want to think about how often the session will be written. The default handler writes the data back to disk every time, regardless of whether it has changed; for such a large session, I'd suggest you write your own session handler. When you read in the session, compute a hash of the serialized data and keep it in a static variable; in the save handler, generate a new hash and compare it with the one captured at load time, and only write the session if it has changed. You could extend this further by applying heuristics to separate the session into parts which update often and parts which change less frequently, then record these in separate files.
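A minimal sketch of such a handler, assuming PHP 8's SessionHandlerInterface and a hypothetical /tmp/php_sessions directory (an instance property stands in for the static variable):

    <?php
    // Change-detection handler: remembers a hash of the data as read,
    // so an unchanged session is never rewritten to disk.
    class ChangeAwareSessionHandler implements SessionHandlerInterface
    {
        private string $dir;
        private ?string $loadHash = null;

        public function __construct(string $dir)
        {
            $this->dir = $dir;
            @mkdir($dir, 0700, true);
        }

        public function open(string $path, string $name): bool { return true; }
        public function close(): bool { return true; }

        public function read(string $id): string
        {
            $file = $this->dir . '/sess_' . $id;
            $data = is_file($file) ? (string) file_get_contents($file) : '';
            $this->loadHash = md5($data);   // remember what was loaded
            return $data;
        }

        public function write(string $id, string $data): bool
        {
            if (md5($data) === $this->loadHash) {
                return true;                // unchanged: skip the disk write
            }
            return file_put_contents($this->dir . '/sess_' . $id, $data) !== false;
        }

        public function destroy(string $id): bool
        {
            @unlink($this->dir . '/sess_' . $id);
            return true;
        }

        public function gc(int $max_lifetime): int|false
        {
            $purged = 0;
            foreach (glob($this->dir . '/sess_*') ?: [] as $file) {
                if (filemtime($file) + $max_lifetime < time() && @unlink($file)) {
                    $purged++;
                }
            }
            return $purged;
        }
    }

    session_set_save_handler(new ChangeAwareSessionHandler('/tmp/php_sessions'), true);
    session_start();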
Using compression for this is not really going to help with performance.
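If you want to check that against your own data, a rough sketch of the trade-off (the numbers will vary by machine):

    <?php
    // Compression shrinks the payload but adds CPU work on every
    // read and write of the session.
    $data = serialize(range(1, 2000));

    $t0 = microtime(true);
    $gz = gzcompress($data);
    $t1 = microtime(true);
    gzuncompress($gz);
    $t2 = microtime(true);

    printf("plain: %d bytes, gzipped: %d bytes\n", strlen($data), strlen($gz));
    printf("compress: %.3f ms, decompress: %.3f ms\n",
           ($t1 - $t0) * 1000, ($t2 - $t1) * 1000);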
There's certainly scope for a lot of OS-level tuning to optimize this, but you don't say what your OS is. Assuming it's POSIX and your system isn't already on its knees, your main performance hit is going to be the latency of accessing the data file and parsing the data (the time to read the file itself is relatively small, and the write should be buffered). As long as there is enough cache, the file will be read from memory rather than disk, so that latency will be negligible.
C.