Assuming I have a list of shifts for an event (in the format start date/time, end date/time) - is there some sort of algorithm I could use to create a generalized summary of the schedule? It is quite common for most of the shifts to fall into some sort of common recurrence pattern (ie. Mondays from 9:00 am to 1:00 pm, Tuesdays from 10:00 am to 3:00 pm, etc). However, there can (and will be) exceptions to this rule (eg. one of the shifts fell on a holiday and was rescheduled for the next day). It would be fine to exclude those from my "summary", as I'm looking to provide a more general answer of when does this event usually occur.
I guess I'm looking for some sort of statistical method to determine the day and time occurences and create a description based on the most frequent occurences found in the list. Is there some sort of general algorithm for something like this? Has anyone created something similar?
Ideally I'm looking for a solution in C# or VB.NET, but don't mind porting from any other language.
Thanks in advance!
You may use Cluster Analysis.
Clustering is a way to segregate a set of data into similar components (subsets). The "similarity" concept involves some definition of "distance" between points. Many usual formulas for the distance exists, among others the usual Euclidean distance.
Practical Case
Before pointing you to the quirks of the trade, let's show a practical case for your problem, so you may get involved in the algorithms and packages, or discard them upfront.
For easiness, I modelled the problem in Mathematica, because Cluster Analysis is included in the software and very straightforward to set up.
First, generate the data. The format is { DAY, START TIME, END TIME }.
The start and end times have a random variable added (+half hour, zero, -half hour} to show the capability of the algorithm to cope with "noise".
There are three days, three shifts per day and one extra (the last one) "anomalous" shift, which starts at 7 AM and ends at 9 AM (poor guys!).
There are 150 events in each "normal" shift and only two in the exceptional one.
As you can see, some shifts are not very far apart from each other.
I include the code in Mathematica, in case you have access to the software. I'm trying to avoid using the functional syntax, to make the code easier to read for "foreigners".
Here is the data generation code:
Now we join the data, obtaining one big dataset:
Let's run a cluster analysis for the data:
"Agglomerate" and "Linkage" -> "Complete" are two fine tuning options of the clustering methods implemented in Mathematica. They just specify we are trying to find very compact clusters.
I specified to try to detect 7 clusters. If the right number of shifts is unknown, you can try several reasonable values and see the results, or let the algorithm select the more proper value.
We can get a chart with the results, each cluster in a different color (don't mind the code)
And the result is:
alt text http://i28.tinypic.com/2hmdlab.png
Where you can see our seven clusters clearly apart.
That solves part of your problem: identifying the data. Now you also want to be able to label it.
So, we'll get each cluster and take means (rounded):
The result is:
And with that you get again your seven classes.
Now, perhaps you want to classify the shifts, no matter the day. If the same people make the same task at the same time everyday, so it's no useful to call it "Monday shift from 10 to 15", because it happens also on Weds and Fridays (as in our example).
Let's analyze the data disregarding the first column:
In this case, we are not selecting the number of clusters to retrieve, leaving the decision to the package.
The result is
image http://i27.tinypic.com/mise9.png
You can see that five clusters have been identified.
Let's try to "label" them as before:
The result is:
Which is exactly what we "suspected": there are repeated events each day at the same time that could be grouped together.
Edit: Overnight Shifts and Normalization
If you have (or plan to have) shifts that start one day and end on the following, it's better to model
than
That's because as with any statistical method, the correlation between the variables must be made explicit, or the method fails miserably. The principle could run something like "keep your candidate data normalized". Both concepts are almost the same (the attributes should be independent).
--- Edit end ---
By now I guess you understand pretty well what kind of things you can do with this kind if analysis.
Some references
HTH!
I don't think any ready made algorithm exists, so unfortunately you need to come up with something yourself. Because the problem is not really well defined (from mathematical perspective) it will require testing on some "real" data that would be reasonably representative, and a fair bit of tweaking.
I would start from dividing your shifts into weekdays (because if I understand correctly you are after a weekly view) - so for each weekday we have shifts that happen to be on that day. Then for each day I would group the shifts that happen at the same time (or "roughly" at the same time - here you need to come up with some heuristic, i.e. both start and end times do not deviate from average in the group by more than 15min or 30 mins). Now we need another heuristic to decide if this group is relevant, i.e. if a shift 1pm-3pm on a Monday happened only once it is probably not relevant, but if it happened on at least 70% of Mondays covered by the data then it is relevant. And now your relevant groups for each day of the week will form the schedule you are after.
Could we see an example data set? If it is really "clean" data then you could simply find the mode of the start and end times.
One option would be to label all the start times as +1 and the end times as -1 then create a three column table of times (both start and ends), label (+1 or -1), and number of staff at that time (starts with zero and adds or subtracts staff using the label) and sort the whole thing in time order.
This time series now is a summary descriptor of your staff levels and the labels are also a series as well. Now you can apply time series statistics to both to look for daily, weekly or monthly patterns.