可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm currently working on a project where I use natural language processing to extract emotions from text to correlate them with contextual information.

Definition of contextual information: Every information that is relevant to describe an entity's situation in time an space.

Description of the data structure I'm looking for:

There is a arbitrary number of entities (an entity can either be a person or a group for example (twitter hash tags)) of which I want to track contextual information and their conversations with other entities. Conversations between entities are processed in order to classify their emotional features. Basic emotional features consist of a vector that specifies their occurrence percentually: {fear: 0.1, happiness: 0.4, joy: 0.1, surprise: 0.9, anger: 0} Entities can also submit any contextual information they'd like to share, for example: location, room-temperature, blood pressure, ... and so on (will refer to this as contextual variables). Because neither the number of conversations of an entity, nor the number of contextual variables they want to share is clear at any point in time, the data structure needs to be able to adjust accordingly.

Important: Every change in the data must also represent an own state as I'm looking forward to correlate certain changes in state with each other.

Example: Bob and Alice have a conversation that shows high magnitude of fear. A couple of hours later they have another conversation that shows no more fear, but happiness. Now, one could argue that high magnitude fear, followed by happiness actually could be interpreted as the emotion relief.

However, in order to be able to extract this very information I need to be able to correlate different states with each other. Same goes for using contextual information to correlate them with the tracked emotions in conversations. This is why every state change must be recorded and available.

To make this more clear to you, I've created a graphic and attached it to the question.

Now, the actual question I have is: Which database/data structure can I use to solve this problem? I've looked into event-sourcing databases but wasn't quite convinced if I can easily recreate a graph structure with them. I also looked at graph databases but didn't find what I was looking for.

Therefore it would be nice if someone here could at least point me in the right direction or help me adjust my structure accordingly to solve the problem. If however there are data structures supporting, what I call it graph databases with snapshots then ease of usage is probably the most important feature to filter for.

回答1:

There's a database called Datomic by Rich Hickey (of Clojure fame) that stores facts over time. Every entry in the database is a fact with a timestamp, append-only as in Event Sourcing.

These facts can be queried with a relational/logical language ala Datalog (remiscent of Prolog). Please see This post by kisai for a quick overview. It has been used for querying graphs with some success in the past: Using Datomic as a Graph Database.

While I have no experience with Datomic, it does seem to be quite suitable for your particular problem.

回答2:

You have an interesting project, I do not work on things like this directly but for my 2 cents -

It seems to me your picture is a bit flawed. You are trying to represent a graph database overtime but there isn't really a way to represent time this way. If we examine the image, you have conversations and context data changing over time, but the fact of "Bob" and "Alice" and "Malory" actually doesn't change over time. So lets remove them from the equation.

Instead focus on the things you can model over time, a conversation, a context, a location. These things will change as new data comes in. These objects are an excellent candidate for an event sourced model. In your app, the conversation would be modeled as a series of individual events which your aggregate would use and combine and factor to generate a final state which would be your 'relief' determination.

For example you could write logic where if a conversation was angry then a very happy event came in then the subject is now feeling relief.

What I would do is model these conversation states in your graph db connected to your 'Fact' objects "Bob", "Alice", etc. And a query such as 'What is alice feeling right now?' would be a graph traversal through your conversation states factoring in the context data connected to alice.

To answer a question such as 'What was alice feeling 5 minutes ago?' you would take all the event streams for the conversations and rewind them to the appropriate point then examine the state of the conversations.

TLDR: Separate the time dependent variables from the time independent variables and use event sourcing to model time.

回答3:

There is an obvious 1:1 correspondence between your states at a given time and a relational database with a given schema. So there is an obvious 1:1 correspondence between your set of states over time and a changing-schema database, ie a variable whose value is a database plus metadata, manipulated by both DDL and DML update commands. So there is no evidence that you shouldn't just use a relational DBMS.

Relational DBMSs allow generic querying with automated implementation at a certain computational complexity with certain opportunities for optimization. Any application can have specialized queries that make a specialized data structure and operators a better choice. But you must design your application and know about such special aspects to justify this. As it is, with the obvious correspondences between your states and relational states, this has not been justified.

EAV is frequently used instead of DDL and a changing schema. But under EAV the DBMS does not know the real tables you are concerned with, which have columns that are EAV attributes, and which are explicit in the DDL/DML changing schema approach. So EAV foregoes simplicity, clarity, optimization and most of all integrity and ACID. It can only be justified (compared to DDL/DML, assuming a relational representation is otherwise appropriate) by demonstrating that DDL with schema updates (adding, deleting and changing columns and tables) is worse (per the above) than EAV in your particular application.

Just because you can draw a picture of your application state at some time using a graph does not mean that you need a graph database. What matters is what specialized queries/expressions you will be evaluating. You should understand what these are in terms of your problem domain, which is probably most easily expressible per some specialized data structure and operators and relationally. Then you can compare the expressive and computational demands to a specialized data structure, a relational representation, and the models of particular graph databases. Be sure to google stackoverflow.

According to Wikipedia "Neo4j is the most popular graph database in use today".