NOTE: I am using a graph database (OrientDB to be specific). This gives me the freedom to write a server-side function in javascript or groovy rather than limit myself to SQL for this issue.*
NOTE 2: Since this is a graph database, the arrows below are simply describing the flow of data. I do not literally need the arrows to be returned in the query. The arrows represent relationships.*
I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.
Currently the data looks something like this:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
Where "next" is the out() edge to the event that comes next in the time-flow. On a graph this comes out to look like:
EventA-->EventB-->EventC
EventA-->EventD
Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:
-->EventB-->EventC
EventA--|
-->EventD
In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.
EDIT #2
If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
12:5 | EventB | 12:6
12:6 | EventE |
EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE
I need a query to produce a tree like:
-->EventC
-->EventB--|
| -->EventE
EventA--|
-->EventD
EDIT #3 and #4
Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:
# | event | ip_address | timestamp | in | out |
----------------------------------------------------------------------------
12:0 | EventA | 123.156.189.18 | 2015-04-17 12:48:01 | | 13:0 |
12:1 | EventB | 123.156.189.18 | 2015-04-17 12:48:32 | 13:0 | 13:1 |
12:2 | EventC | 123.156.189.18 | 2015-04-17 12:48:49 | 13:1 | |
12:3 | EventA | 103.145.187.22 | 2015-04-17 14:03:08 | | 13:2 |
12:4 | EventD | 103.145.187.22 | 2015-04-17 14:05:23 | 13:2 | |
12:5 | EventB | 96.109.199.184 | 2015-04-17 21:53:00 | | 13:3 |
12:6 | EventE | 96.109.199.184 | 2015-04-17 21:53:07 | 13:3 | |
The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).
TL;DR
Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.
Holy cow.
After wrestling with this for over a week I think I FINALLY have a working function. This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):
The function:
Takeaways: