I'm working on a research project for school. I've written some text mining software that analyzes legal texts in a collection and spits out a score that indicates how similar they are. I ran the program to compare each text with every other text, and I have data like this (although with many more points):
codeofhammurabi.txt crete.txt 0.570737
codeofhammurabi.txt iraqi.txt 1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt 1.25546
crete.txt iraqi.txt 0.329545
crete.txt magnacarta.txt 0.589786
crete.txt us.txt 0.491903
iraqi.txt magnacarta.txt 0.834488
iraqi.txt us.txt 1.37718
magnacarta.txt us.txt 1.09582
Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the value can be the distance between points on a graph representing the texts.
codeofhammurabi.txt crete.txt 1.75212
codeofhammurabi.txt iraqi.txt 0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt 0.7965
crete.txt iraqi.txt 3.0344
crete.txt magnacarta.txt 1.6955
crete.txt us.txt 2.0329
iraqi.txt magnacarta.txt 1.1983
iraqi.txt us.txt 0.7261
magnacarta.txt us.txt 0.9125
SHORT VERSION: Those values directly above are distances between points on a scatter plot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.
If you can even point me in a direction, I'll be infinitely grateful.
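The inverted values in the second list appear to be plain reciprocals of the scores (for example, 1 / 0.570737 ≈ 1.75212). A minimal Python sketch of that step, assuming a reciprocal is the intended inversion and using a hypothetical helper name `to_distance_matrix`, also arranges the pairs into the symmetric matrix form that most plotting methods expect:

```python
# Similarity scores copied from the question (first three pairs only, for brevity).
scores = [
    ("codeofhammurabi.txt", "crete.txt", 0.570737),
    ("codeofhammurabi.txt", "iraqi.txt", 1.13475),
    ("crete.txt", "iraqi.txt", 0.329545),
]

def to_distance_matrix(pairs):
    """Return (labels, matrix) with matrix[i][j] = 1 / score (assumed inversion)."""
    labels = sorted({name for a, b, _ in pairs for name in (a, b)})
    index = {name: i for i, name in enumerate(labels)}
    n = len(labels)
    matrix = [[0.0] * n for _ in range(n)]  # zero distance from a text to itself
    for a, b, score in pairs:
        d = 1.0 / score
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d      # distances are symmetric
    return labels, matrix

labels, dist = to_distance_matrix(scores)
```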
Here's a potential solution for Matlab:
You can arrange your data into a formal 5x5 similarity matrix S where element S(i,j) represents your similarity (or dissimilarity) between the document i and document j. Assuming your distance measure is an actual metric, you can apply multi-dimensional scaling to this matrix via mdscale(S,2).
This function will attempt to find a 5x2 dimensional representation of your data that preserves the similarity (or dissimilarity) between your classes found in the higher dimensions. You can then visualize this data as a scatterplot of 5 points.
You could also potentially try this using mdscale(S,3) to project into a 5x3 dimensional matrix which you can then visualize with plot3().
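Whether the distances form "an actual metric" can be checked directly. The following is a small Python sketch (not part of the MATLAB workflow; `triangle_violations` is an illustrative name) that tests the triangle inequality on the distances given in the question:

```python
from itertools import combinations

labels = ["codeofhammurabi", "crete", "iraqi", "magnacarta", "us"]
# Symmetric matrix built from the distances listed in the question.
D = [
    [0.0,     1.75212, 0.8812, 1.0573, 0.7965],
    [1.75212, 0.0,     3.0344, 1.6955, 2.0329],
    [0.8812,  3.0344,  0.0,    1.1983, 0.7261],
    [1.0573,  1.6955,  1.1983, 0.0,    0.9125],
    [0.7965,  2.0329,  0.7261, 0.9125, 0.0],
]

def triangle_violations(dist):
    """List (i, j, k) where the detour i -> k -> j is shorter than i -> j."""
    n = len(dist)
    bad = []
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if k not in (i, j) and dist[i][j] > dist[i][k] + dist[k][j]:
                bad.append((i, j, k))
    return bad

violations = triangle_violations(D)
for i, j, k in violations:
    print(f"{labels[i]} - {labels[j]} is longer than the detour via {labels[k]}")
```

With these numbers the crete to iraqi distance (3.0344) exceeds the detour through us (2.0329 + 0.7261 = 2.7590), so the reciprocal scores are not strictly a metric and any low-dimensional plot can only approximate them.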
This Matlab snippet should work if you want to try a 3D bar view:
If the question is 'how can I do something like this guy did?' (from xiii1408's comment on the question), then the answer is to use Gephi’s built-in Force Atlas 2 algorithm on Euclidean distances of document topic posterior probabilities.
"This guy" is Matt Jockers, who is an innovative scholar in the digital humanities. He has documented some of his methods on his blog and elsewhere. Jockers mostly works in R and shares some of his code. His basic workflow seems to be:
Here's a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:
Get data...
Clean and reshape...
Part of speech tagging and sub-setting of nouns...
Topic modelling with latent Dirichlet allocation...
Calculate Euclidean distances of one document from another using topic probabilities as the document's 'DNA'
Visualize using a force-directed graph...
And if you want to use the Force Atlas 2 algorithm in Gephi, you simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:
Here's the Gephi plot with the Force Atlas 2 algorithm:
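To make the distance step of that workflow concrete, here is a minimal Python sketch of turning per-document topic probabilities into a weighted edge list. The probabilities below are made up for illustration (they are not Jockers' data or the output of any topic model), and `euclidean` and `edges` are hypothetical names:

```python
import math

# Hypothetical document-topic probabilities (each row sums to 1).
doc_topics = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.10, 0.80, 0.10],
    "doc_c": [0.65, 0.25, 0.10],
}

def euclidean(p, q):
    """Euclidean distance between two topic-probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

names = sorted(doc_topics)
# One (source, target, distance) triple per unordered pair of documents.
edges = [
    (a, b, euclidean(doc_topics[a], doc_topics[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
]
```

An edge list of this shape is what a graph tool would then lay out; documents with similar topic mixtures (here doc_a and doc_c) get the shortest edges.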
Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-d mapping of the dissimilarities. Principal coordinates analysis and non-metric multidimensional scaling are two such methods. I would suggest you plot the results of applying one or the other of these methods to your data. I provide examples of both below.
First, load in the data you supplied (without labels at this stage)
What you effectively have is the following distance matrix:
R, in general, needs a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight to the "dist" object like this:
Now we have an object of the right type we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the vegan package as I am somewhat familiar with it...
Principal coordinates
First I illustrate how to do principal coordinates analysis on your data using vegan.
The first PCO axis is by far the most important at explaining the between-text differences, as exhibited by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method, which produces
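For readers who want to see what principal coordinates analysis does under the hood, here is a self-contained Python sketch of classical scaling: square the distances, double-centre them, and take the leading eigenvectors scaled by the square roots of their eigenvalues. It is not the vegan code used in this answer; `jacobi_eigh` and `pcoa` are illustrative names, and negative eigenvalues (which can arise because these distances are not Euclidean) are simply clamped to zero here, which is an assumption of this sketch.

```python
import math

def jacobi_eigh(a, sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):      # right-multiply a and v by the rotation
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):    # left-multiply a by the transposed rotation
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v

def pcoa(dist, k=2):
    """Classical scaling: double-centre squared distances, then eigendecompose."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    vals, vecs = jacobi_eigh(b)
    order = sorted(range(n), key=lambda i: vals[i], reverse=True)[:k]
    coords = [[vecs[i][j] * math.sqrt(max(vals[j], 0.0)) for j in order]
              for i in range(n)]
    return coords, [vals[j] for j in order]

# Distances from the question, in the order:
# codeofhammurabi, crete, iraqi, magnacarta, us.
D = [
    [0.0,     1.75212, 0.8812, 1.0573, 0.7965],
    [1.75212, 0.0,     3.0344, 1.6955, 2.0329],
    [0.8812,  3.0344,  0.0,    1.1983, 0.7261],
    [1.0573,  1.6955,  1.1983, 0.0,    0.9125],
    [0.7965,  2.0329,  0.7261, 0.9125, 0.0],
]
coords, eigenvalues = pcoa(D)  # one (x, y) pair per text
```

Each row of `coords` can be plotted as a labelled point; the relative sizes of `eigenvalues` indicate how much each axis contributes.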
Non-metric multidimensional scaling
A non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in a Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above) and an iterative algorithm is required to provide a solution. Random starts are advised to assure yourself that the algorithm hasn't converged to a sub-optimal, locally optimal solution. Vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, then see isoMDS in package MASS.
With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be achieved using the plot method, which produces
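The idea of "preserving rank ordering" can be made concrete with Kruskal's stress-1, a common nMDS objective: plotted distances are compared with the closest monotone (rank-preserving) fit to them, so a layout that keeps the rank order exactly scores zero. Below is a minimal Python sketch of that measure only, not of the full iterative fitting and not of vegan's implementation; `isotonic_fit` and `stress1` are illustrative names, and ties in the dissimilarities are ignored for simplicity.

```python
import math

def isotonic_fit(values):
    """Pool-adjacent-violators: closest non-decreasing sequence to `values`."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the previous block's mean exceeds the latest one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def stress1(dissimilarities, plot_distances):
    """Kruskal's stress-1 for paired lists of dissimilarities and plotted distances."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    d = [plot_distances[i] for i in order]   # plotted distances, by rising dissimilarity
    d_hat = isotonic_fit(d)                  # monotone "disparities"
    num = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    return math.sqrt(num / sum(x * x for x in d))

# Same rank order, very different magnitudes: no penalty.
perfect = stress1([1.0, 2.0, 3.0], [1.0, 4.0, 9.0])
# Rank order reversed: penalised.
swapped = stress1([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

This is why, in an nMDS plot, only the ordering of the plotted distances is meaningful, not their absolute sizes.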
In both cases the distance on the plot between samples is the best 2-d approximation of their dissimilarity. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarity (3 dimensions are needed to represent all of the dissimilarities fully), whereas in the nMDS plot, the distance between samples on the plot reflects the rank dissimilarity not the actual dissimilarity between observations. But essentially distances on the plot represent the computed dissimilarities. Texts that are close together are most similar, texts located far apart on the plot are the most dissimilar to one another.
You could do a network graph using igraph. The Fruchterman-Reingold layout has a parameter to provide edge weights. Weights bigger than 1 result in more "attraction" along the edges, weights less than 1 do the opposite. In your example, crete.txt has the lowest distance, sits in the middle, and has shorter edges to the other vertices. In fact, it is closest to iraqi.txt. Note that you have to invert the data for E(g)$weight to get the correct distances.
Are you making all pairwise comparisons? Depending on how you calculate the distance (similarity), I am not sure it is possible to make such a scatter plot. When you have only 3 text files to consider, the scatter plot is easy to make (a triangle with sides equal to the distances). But when you add the fourth point, you might not be able to place it in a location where its distances to the existing 3 points satisfy all constraints.
But if you can do that, then you have a solution: just keep adding new points, I think. Or, if you don't need the distances on the scatter plot to be precise, you can simply draw a web and label each edge with its distance.
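When no exact placement exists, one option is to settle for the layout whose plotted distances are as close as possible to the requested ones. The Python sketch below illustrates this with stress majorization (the SMACOF Guttman update), a technique chosen here for illustration and not named in this answer; `raw_stress`, `guttman_step` and `layout` are hypothetical names, and the random start and iteration count are arbitrary.

```python
import math
import random

def raw_stress(coords, target):
    """Sum of squared gaps between target and plotted distances."""
    n = len(coords)
    return sum((target[i][j] - math.dist(coords[i], coords[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def guttman_step(coords, target):
    """One SMACOF update (unit weights); it never increases the raw stress."""
    n = len(coords)
    new = []
    for i in range(n):
        x = y = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            ratio = target[i][j] / d if d > 0 else 0.0
            x += ratio * (coords[i][0] - coords[j][0])
            y += ratio * (coords[i][1] - coords[j][1])
        new.append((x / n, y / n))
    return new

def layout(target, iterations=300, seed=0):
    """Iterate from a random 2-D start; return final coords and stress history."""
    rng = random.Random(seed)
    coords = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in target]
    history = [raw_stress(coords, target)]
    for _ in range(iterations):
        coords = guttman_step(coords, target)
        history.append(raw_stress(coords, target))
    return coords, history

# Distances from the question, in the order:
# codeofhammurabi, crete, iraqi, magnacarta, us.
D = [
    [0.0,     1.75212, 0.8812, 1.0573, 0.7965],
    [1.75212, 0.0,     3.0344, 1.6955, 2.0329],
    [0.8812,  3.0344,  0.0,    1.1983, 0.7261],
    [1.0573,  1.6955,  1.1983, 0.0,    0.9125],
    [0.7965,  2.0329,  0.7261, 0.9125, 0.0],
]
coords, history = layout(D)
```

The final value in `history` shows how much mismatch remains; a value well above zero means the plotted distances are a compromise, in which case labelling the edges with the true distances, as suggested above, is still useful.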