I am trying to prepare the best tools for efficient Data Analysis in Mathematica. I have a approximately 300 Columns & 100 000 Rows.
What would be the best tricks to :
"Remove", "Extract" or simply "Consider" parts of the data structure, for plotting for e.g.
One of the trickiest examples I could think of is :
Given a data structure,
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
I also welcome any general advice on data manipulation.
To pull out columns (or rows) you can do it by part indexing
The last line is just to view it pretty.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the
Span
(;;
) command. If you are joining multiple disjoint ranges, usingFlatten
to combine the separate ranges created withRange
is easier than entering them by hand.I read:
as meaning that we want:
AND
[[2]] == x && [[8]] != y
.This is what I hacked together:
Some useful commands to get pieces of matrices and list are
Span
(;;),Drop
,Take
,Select
,Cases
and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices,Part
([[...]]) in combination with;;
can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).Select
can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. It's twin brother isCases
which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument on which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix) I can refer to the columns by using thePart
([[..]]) function.For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for
Select
manually. However, for many columns, and many different queries, chances to mess up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:What happens here is that the function used in
Select
gets generated automatically from your specifications. For example (using @Yoda's example):We need to define the column names (must be strings or symbols without values):
(in practice, usually names are more descriptive, of course). Here is the table then:
Here is the select statement you need (I picked
x = 4
andy=2
):Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
and similar.
Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) found themselves from time to time writing similar code again and again. The fact that Mathematica has a concise syntax makes it often very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulations in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is a one possible way out - create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.