I'm working on a web application that uses a bunch of Amazon Web Services. I'd like to use DynamoDB for a particular part of the application but I'm not sure if it's an appropriate use-case.
When a registered user on the site performs a "job", an entry is recorded and stored for that job. The job has a bunch of details associated with it, but the most relevant thing is that each job has a unique identifier and an associated username. Usernames are unique too, but there can of course be multiple job entries for the same user, each with different job identifiers.
The only query that I need to perform on this data is: give me all the job entries (and their associated details) for username X.
I started to create a DynamoDB table but I'm not sure if it's right. My understanding is that the chosen hash key should be the key that's used for querying/indexing into the table, but it should be unique per item/row. Username is what I want to query by, but username will not be unique per item/row.
If I make the job identifier the primary hash key and the username a secondary index, will that work? Can I have duplicate values for a secondary index? But that means I will never use the primary hash key for querying/indexing into the table, which is the whole point of it, isn't it?
Is there something I'm missing, or is this just not a good fit for NoSQL?
Edit:
The accepted answer, as well as this question, helped me find what I was looking for.
It seems that username as the hash key and a unique job_id as the range key, as others have already suggested, would serve you well in DynamoDB. Using a Query, you can quickly retrieve all records for a given username.
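A minimal boto3 sketch of that schema, assuming a table named "Jobs" and string attributes `username` and `job_id` (all names here are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Hypothetical table: username is the hash (partition) key and job_id is
# the range (sort) key, so the pair uniquely identifies each item.
table = dynamodb.create_table(
    TableName="Jobs",
    KeySchema=[
        {"AttributeName": "username", "KeyType": "HASH"},
        {"AttributeName": "job_id", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "username", "AttributeType": "S"},
        {"AttributeName": "job_id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# A single Query fetches every job entry for a user.
response = table.query(KeyConditionExpression=Key("username").eq("alice"))
for item in response["Items"]:
    print(item["job_id"], item)
```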
Another option is to take advantage of local secondary indexes and sparse indexes. It seems that there is a status column, but based upon what I've read you could add another attribute, say 'not_processed': 'x', and build your local secondary index on username + not_processed. Only records that have this attribute are indexed, and once a job is complete you delete the attribute. This means you can effectively table-scan, via the index, for a username where not_processed = 'x', and your index stays small (see the sketch below).
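Here's a rough boto3 sketch of that sparse-index pattern; note the LSI has to be declared when the table is created, and every name below is illustrative:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

table = dynamodb.create_table(
    TableName="Jobs",
    KeySchema=[
        {"AttributeName": "username", "KeyType": "HASH"},
        {"AttributeName": "job_id", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "username", "AttributeType": "S"},
        {"AttributeName": "job_id", "AttributeType": "S"},
        {"AttributeName": "not_processed", "AttributeType": "S"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "pending-jobs",
            # Same hash key as the table; 'not_processed' as the index range key.
            "KeySchema": [
                {"AttributeName": "username", "KeyType": "HASH"},
                {"AttributeName": "not_processed", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Only items that actually carry a 'not_processed' attribute appear in the
# index, so this returns just the user's unfinished jobs.
pending = table.query(
    IndexName="pending-jobs",
    KeyConditionExpression=Key("username").eq("alice"),
)

# When a job finishes, removing the attribute drops it from the sparse index.
table.update_item(
    Key={"username": "alice", "job_id": "job-0001"},
    UpdateExpression="REMOVE not_processed",
)
```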
All my relational DB experience seems to be getting in the way of my understanding of DynamoDB. Good luck!
I'm not totally clear on what you're asking, but I'll give it a shot...
With DynamoDB, the combination of your hash key and range key must uniquely identify an item. Range key is optional; without it, hash key alone must uniquely identify an item.
You can also store a list of values (rather than just a single value) in a single item attribute. If, for example, each item represented a user, an attribute on that item could be the list of that user's job entries.
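For instance, a rough sketch (the "Users" table and attribute names are made up):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # hypothetical table keyed on username alone

# One item per user; the job entries live in a single list attribute.
users.put_item(
    Item={
        "username": "alice",
        "jobs": [
            {"job_id": "job-0001", "status": "complete"},
            {"job_id": "job-0002", "status": "pending"},
        ],
    }
)

# Append a new entry without rewriting the whole list.
users.update_item(
    Key={"username": "alice"},
    UpdateExpression="SET jobs = list_append(jobs, :new)",
    ExpressionAttributeValues={":new": [{"job_id": "job-0003", "status": "pending"}]},
)
```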
If you're concerned about hitting the size limitation of DynamoDB records, you can use S3 as backing storage for that list: essentially, use the DDB item to store a reference to the S3 resource containing the complete list for a given user. This gives you the flexibility to query for or store other attributes rather easily. Alternatively (as you suggested in your answer), you could put the user's entire record in S3, but you'd lose some of the flexibility and throughput of doing your querying/updating through DDB.
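A sketch of that pattern, with invented bucket and attribute names:

```python
import json
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # hypothetical names throughout

username = "alice"
jobs = [{"job_id": "job-0001", "status": "complete"}]

# The full job list lives in S3; the DynamoDB item just points at it.
key = "jobs/%s.json" % username
s3.put_object(Bucket="my-app-jobs", Key=key, Body=json.dumps(jobs).encode("utf-8"))
users.put_item(Item={"username": username, "jobs_bucket": "my-app-jobs", "jobs_key": key})

# Reading goes DynamoDB -> S3.
item = users.get_item(Key={"username": username})["Item"]
body = s3.get_object(Bucket=item["jobs_bucket"], Key=item["jobs_key"])["Body"]
jobs = json.loads(body.read())
```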
I reckon I didn't really play with the DynamoDB console long enough to get a good understanding before posting this question. I've only just now understood that a DynamoDB table (and presumably any other NoSQL table) is really just a giant dictionary/hash data structure. So to answer my question: yes, I can use DynamoDB, and each item/row would look something like this:
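```python
# One item per user, keyed on username, with the jobs hanging off it as a
# nested map keyed by job ID (field names here are just illustrative).
{
    "username": "alice",
    "jobs": {
        "job-0001": {"status": "complete", "created": "2015-06-01"},
        "job-0002": {"status": "pending", "created": "2015-06-02"},
    },
}
```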
But I'm not sure it's even worth using DynamoDB after all that. It might be simpler to just store a JSON file containing that content structure above in an S3 bucket, where the filename is <username>.json.
Edit:
For what it's worth, I just realized that DynamoDB has a 400KB size limit on items. That's a huge amount of data for my use-case, relatively speaking, but I can't take the chance, so I'll have to go with S3.
Perhaps a "Jobs" table would work better for you than a "User" table. Here's what I mean.
If you're worried about all of those jobs inside a user document adding up to more than the 400KB limit, why not store the jobs individually in a table like this (example values; FileRef points at the job's details stored in S3):
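```
Username | JobId    | FileRef
---------+----------+---------------------------
alice    | job-0001 | jobs/alice/job-0001.json
alice    | job-0002 | jobs/alice/job-0002.json
bob      | job-0001 | jobs/bob/job-0001.json
```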
Username is the hash and JobId is the range. You can query on the Username to get all the user's jobs.
Now that each item is much smaller, you could think about putting all the data for each job in the DynamoDB record itself, instead of storing a FileRef and looking the data up in S3. This would probably save a significant amount of latency.
Each record might then look like:
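```python
# Hypothetical record with the job's data held inline rather than behind a
# FileRef in S3 (attribute names and values are illustrative).
{
    "Username": "alice",                    # hash key
    "JobId": "job-0001",                    # range key
    "Status": "complete",
    "SubmittedAt": "2015-06-01T12:00:00Z",
    "Result": {"exit_code": 0, "duration_ms": 5400},
}
```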