Call R from JAVA to get Chi-squared statistic and

Posted 2020-02-12 20:52

Question:

I have two 4*4 matrices in JAVA, where one matrix holds observed counts and the other expected counts.

I need an automated way to calculate the p-value from the chi-square statistic between these two matrices; however, JAVA has no such function as far as I am aware.

I can calculate the chi-square statistic and its p-value by reading the two matrices into R as .csv files and then using the chisq.test function as follows:

obs<-read.csv("obs.csv")
exp<-read.csv("exp.csv")
chisq.test(obs,exp)

where the format of the .csv files would be as follows:

A, C, G, T
A, 197.136, 124.32, 63.492, 59.052
C, 124.32, 78.4, 40.04, 37.24
G, 63.492, 40.04, 20.449, 19.019
T, 59.052, 37.24, 19.019, 17.689

Given these commands, R will give an output of the format:

X-squared = 20.6236, df = 9, p-value = 0.01443

which includes the p-value I was looking for.

Does anyone know of an efficient way to automate the process of:

1) Outputting my matrices from JAVA into .csv files
2) Uploading the .csv files into R
3) Calling the chisq.test on the .csv files into R
4) Returning the outputted p-value back into JAVA?

Thanks for any help....

Answer 1:

There are (at least) two ways of going about this.


Command Line & Scripts

You can execute Rscripts from the command line with Rscript.exe. E.g. in your script you would have:

# Parse arguments.
# ...
# ...

chisq.test(obs, exp)

Rather than creating CSVs in Java and having R read them, you should be able to pass the values straight to R as command-line arguments. I don't see the need to create CSVs and pass data that way, unless your matrices are quite big: there is a limit on the total size of the command-line arguments you can pass, and that limit varies across operating systems.

You can pass arguments to an R script and parse them using the commandArgs() function or with various packages (e.g. optparse or getopt). See this thread for more information.

There are several ways of calling and reading from the command line in Java. I don't know enough about it to give you advice but a bit of googling will give you a result. Calling a script from the command line is done like this:

Rscript my_script.R
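For the Java side, here is a minimal sketch using only the standard library. It assumes Rscript is on the PATH and that the script prints the usual chisq.test summary line; the class and method names (RscriptRunner, run, parsePValue) are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RscriptRunner {

    // Matches "p-value = 0.01443" as well as "p-value < 2.2e-16".
    private static final Pattern P_VALUE = Pattern.compile(
            "p-value\\s*[=<]\\s*([0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?)");

    /** Extracts the p-value from text printed by chisq.test; NaN if none is found. */
    public static double parsePValue(String rOutput) {
        Matcher m = P_VALUE.matcher(rOutput);
        return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
    }

    /** Runs "Rscript script args..." and returns everything the process printed. */
    public static String run(String script, String... args)
            throws IOException, InterruptedException {
        String[] command = new String[args.length + 2];
        command[0] = "Rscript";              // assumption: Rscript is on the PATH
        command[1] = script;
        System.arraycopy(args, 0, command, 2, args.length);

        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);        // merge stderr into stdout
        Process process = pb.start();

        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Rscript exited with code " + exit + ":\n" + out);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With arguments, run that script (e.g. "my_script.R obs.csv exp.csv").
        // Without arguments, just parse the sample line from the question.
        String output = args.length > 0
                ? run(args[0], Arrays.copyOfRange(args, 1, args.length))
                : "X-squared = 20.6236, df = 9, p-value = 0.01443";
        System.out.println("p-value: " + parsePValue(output));
    }
}
```

Parsing the printed summary is fragile. If you control the script, having it print only the number, for example with cat(chisq.test(obs)$p.value), leaves less to go wrong on the Java side.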

JRI

JRI lets you talk to R straight from Java. Here's an example of how you would pass a double array to R and have R sum it (this is Java now):

import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

// Start R session.
Rengine re = new Rengine(new String[] {"--vanilla"}, false, null);

// Check if the session is working.
if (!re.waitForR()) {
    return;
}

re.assign("x", new double[] {1.5, 2.5, 3.5});
REXP result = re.eval("(sum(x))");
System.out.println(result.asDouble());
re.end();

The function assign() here is the same as doing this in R:

x <- c(1.5, 2.5, 3.5)

You should be able to work out how to extend this to work with a matrix.
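One possible way to do that, as a rough sketch: R fills a matrix column by column, so you can flatten the Java double[][] in column-major order, assign the flat array as in the example above, and reshape it with matrix() on the R side. The helper below is plain Java (the names MatrixToR and toColumnMajor are made up); the JRI calls are left as comments because they need a running R session.

```java
import java.util.Arrays;

public class MatrixToR {

    /** Flattens a rectangular matrix column by column, the order R's matrix() fills in. */
    public static double[] toColumnMajor(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[] flat = new double[rows * cols];
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                flat[c * rows + r] = m[r][c];
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        double[][] obs = { {1, 2}, {3, 4}, {5, 6} };
        double[] flat = toColumnMajor(obs);
        System.out.println(Arrays.toString(flat)); // [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]

        // With the Rengine "re" from the example above (not executed here):
        //   re.assign("x", flat);
        //   re.eval("obs <- matrix(x, nrow = " + obs.length + ")");
        //   double p = re.eval("chisq.test(obs)$p.value").asDouble();
    }
}
```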


I think JRI is quite difficult at the beginning. So if you want to get this done quickly the command line option is probably best. I would say the JRI approach is less messy once you get it set up though. And if you have situations where you have a lot of back and forth between R and Java it is definitely better than calling multiple scripts.

  1. Link to JRI.
  2. Recommended Eclipse plugin to set up JRI.


Answer 2:

Check out the JRI page.

Description from their site:

JRI is a Java/R Interface, which allows to run R inside Java applications as a single thread. Basically it loads R dynamic library into Java and provides a Java API to R functionality. It supports both simple calls to R functions and a full running REPL.



Answer 3:

RCaller 2.2 can do what you want to do. Suppose the frequency matrix is given as in your question. The resulting p.value and df values can be calculated and returned using the code below:

double[][] data = new double[][]{
    {197.136, 124.32, 63.492, 59.052},
    {124.32, 78.4, 40.04, 37.24},
    {63.492, 40.04, 20.449, 19.019},
    {59.052, 37.24, 19.019, 17.689}
};
RCaller caller = new RCaller();
Globals.detect_current_rscript();
caller.setRscriptExecutable(Globals.Rscript_current);
RCode code = new RCode();

code.addDoubleMatrix("mydata", data);
code.addRCode("result <- chisq.test(mydata)");
code.addRCode("mylist <- list(pval = result$p.value, df=result$parameter)");

caller.setRCode(code);
caller.runAndReturnResult("mylist");

double pvalue = caller.getParser().getAsDoubleArray("pval")[0];
double df = caller.getParser().getAsDoubleArray("df")[0];
System.out.println("Pvalue is : "+pvalue);
System.out.println("Df is : "+df);

The output is:

Pvalue is : 1.0
Df is : 9.0

You can find the technical details here.



Answer 4:

Rserve is another way to get your data from Java to R and back. It is a server which takes R scripts as string inputs. You can use some string parsing and conversion in Java to convert the matrices into strings that can be input into R.

import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;


public class RtestScript {

private String emailTestScript = "open <- c('O', 'O', 'N', 'N', 'O', 'O', 'N', 'N', 'N', 'O', " +
        " 'O', 'N', 'N', 'O', 'O', 'N', 'N', 'N', 'O');" +
        "testgroup <- c('A', 'A', 'A','A','A','A','A','A','A','A', 'B'," +
        "'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B');" +
        "emailTest <- data.frame(open, testgroup);" +
        "emailTable<- table(emailTest$open, emailTest$testgroup);" +
        "emailResults<- prop.test(emailTable, correct=FALSE);" +
        "print(emailResults$p.value);";

public void executeRscript() {
    try {
        //Make sure to type in library(Rserve); Rserve() in Rstudio before running this
        RConnection testConnection = new RConnection();

        REXP testExpression = testConnection.eval(emailTestScript);
        System.out.println("P value: " + testExpression.asString());
    } catch(Exception e) {
        e.printStackTrace();
    }
}
}
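The string conversion mentioned above could look roughly like this. It is only a sketch: the class and method names (RMatrixLiteral, toRMatrix) are made up, and it assumes the matrix holds finite values, since Java prints infinities as "Infinity" while R expects Inf.

```java
import java.util.StringJoiner;

public class RMatrixLiteral {

    /** Builds an R expression such as "matrix(c(1.0,2.0,3.0,4.0), nrow=2, byrow=TRUE)". */
    public static String toRMatrix(double[][] m) {
        StringJoiner values = new StringJoiner(",");
        for (double[] row : m) {
            for (double v : row) {
                values.add(Double.toString(v)); // locale-independent, e.g. "197.136"
            }
        }
        // byrow=TRUE because the values were appended row by row.
        return "matrix(c(" + values + "), nrow=" + m.length + ", byrow=TRUE)";
    }

    public static void main(String[] args) {
        double[][] obs = { {1, 2}, {3, 4} };
        String script = "obs <- " + toRMatrix(obs) + "; chisq.test(obs)$p.value";
        System.out.println(script);
        // The script string could then go to testConnection.eval(...) as in the
        // class above (not executed here, since it needs a running Rserve).
    }
}
```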

Here is some more information on Rserve. Incidentally, this is also how Tableau communicates with R through its R connection.

https://cran.r-project.org/web/packages/Rserve/index.html



Answer 5:

1) Outputting my matrices from JAVA into .csv files

Use any of the CSV libraries; I would recommend http://opencsv.sourceforge.net/
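For a small numeric matrix you may not even need a library. Below is a sketch using only the standard library, writing the layout shown in the question, where the header row has one field fewer than the data rows so that read.csv takes the first column as row names. The names MatrixCsv and toCsv are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MatrixCsv {

    /** Formats a labelled square matrix: a header of labels, then one labelled row per line. */
    public static String toCsv(String[] labels, double[][] m) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.join(",", labels)).append('\n'); // header: column names only
        for (int r = 0; r < m.length; r++) {
            sb.append(labels[r]);                          // row name
            for (double v : m[r]) {
                sb.append(',').append(v);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String[] labels = {"A", "C", "G", "T"};
        double[][] obs = {
            {197.136, 124.32, 63.492, 59.052},
            {124.32, 78.4, 40.04, 37.24},
            {63.492, 40.04, 20.449, 19.019},
            {59.052, 37.24, 19.019, 17.689}
        };
        // createTempFile picks a file name that is unique for each run.
        Path file = Files.createTempFile("obs-", ".csv");
        Files.write(file, toCsv(labels, obs).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + file);
    }
}
```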

2) Uploading the .csv files into R 3) Calling the chisq.test on the .csv files into R

Steps 2 and 3 are pretty much the same thing. You are better off creating a parameterized script to be run in R:

args <- commandArgs(trailingOnly = TRUE)
obs<-read.csv(args[1])
exp<-read.csv(args[2])
chisq.test(obs,exp)

So you can run

Rscript your_script.r path_to_csv1 path_to_csv2

and use unique names for the csv files for example:

UUID.randomUUID().toString().replace("-","")

And then you use

Runtime.getRuntime().exec(command, environments, dataDir);

4) Returning the outputted p-value back into JAVA?

If you invoke R through Runtime.getRuntime().exec(), the only way to get the result back is to read what R prints, via getInputStream() on the returned Process.

I would also recommend taking a look at Apache's Statistics Lib & How to calculate PValue from ChiSquare. Maybe you can live without R at all :)



Answer 6:

I recommend simply using a Java library that does a chi-square test for you. There are several:

  • Apache commons math: http://commons.apache.org/proper/commons-math/
  • JSC: http://www.jsc.nildram.co.uk/
  • JDistlib: http://jdistlib.sourceforge.net/

This is not a complete list, but what I found in 5 minutes searching.
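To give an idea of what such a library does under the hood, here is a rough, self-contained sketch that uses none of the libraries above. It computes the Pearson statistic from an observed and an expected matrix and turns it into a p-value through the regularized incomplete gamma function (a textbook series and continued-fraction evaluation). All names are made up for this example, and the degrees of freedom are left to the caller as an assumption: the df = 9 used in main is simply the value from the R output quoted in the question.

```java
public class ChiSquareSketch {

    /** Pearson statistic: sum over all cells of (observed - expected)^2 / expected. */
    public static double statistic(double[][] observed, double[][] expected) {
        double sum = 0.0;
        for (int r = 0; r < observed.length; r++) {
            for (int c = 0; c < observed[r].length; c++) {
                double diff = observed[r][c] - expected[r][c];
                sum += diff * diff / expected[r][c];
            }
        }
        return sum;
    }

    /** Upper-tail probability P(X >= stat) for a chi-square variable with df degrees of freedom. */
    public static double pValue(double stat, int df) {
        if (df < 1) throw new IllegalArgumentException("df must be >= 1");
        if (stat <= 0.0) return 1.0;
        double a = df / 2.0;
        double x = stat / 2.0;
        double prefactor = Math.exp(-x + a * Math.log(x) - logGammaOfHalf(df));
        return x < a + 1.0
                ? 1.0 - lowerSeries(a, x) * prefactor       // series converges quickly here
                : upperContinuedFraction(a, x) * prefactor; // continued fraction otherwise
    }

    /** ln Gamma(df/2), from Gamma(1) = 1 or Gamma(1/2) = sqrt(pi) and Gamma(z+1) = z*Gamma(z). */
    private static double logGammaOfHalf(int df) {
        boolean even = df % 2 == 0;
        double z = even ? 1.0 : 0.5;
        double logGamma = even ? 0.0 : 0.5 * Math.log(Math.PI);
        for (int i = 0; i < (df - 1) / 2; i++) {
            logGamma += Math.log(z);
            z += 1.0;
        }
        return logGamma;
    }

    /** Series part of the lower incomplete gamma function (without the prefactor). */
    private static double lowerSeries(double a, double x) {
        double term = 1.0 / a;
        double sum = term;
        double ap = a;
        for (int n = 0; n < 1000; n++) {
            ap += 1.0;
            term *= x / ap;
            sum += term;
            if (Math.abs(term) < Math.abs(sum) * 1e-15) break;
        }
        return sum;
    }

    /** Continued fraction for the upper incomplete gamma function (modified Lentz method). */
    private static double upperContinuedFraction(double a, double x) {
        final double tiny = 1e-300;
        double b = x + 1.0 - a;
        double c = 1.0 / tiny;
        double d = 1.0 / b;
        double h = d;
        for (int i = 1; i <= 1000; i++) {
            double an = -i * (i - a);
            b += 2.0;
            d = an * d + b;
            if (Math.abs(d) < tiny) d = tiny;
            c = b + an / c;
            if (Math.abs(c) < tiny) c = tiny;
            d = 1.0 / d;
            double delta = d * c;
            h *= delta;
            if (Math.abs(delta - 1.0) < 1e-14) break;
        }
        return h;
    }

    public static void main(String[] args) {
        // Statistic and df as reported by R in the question.
        System.out.println("p-value: " + pValue(20.6236, 9));
    }
}
```

Running main should print a value close to the 0.01443 that R reported in the question. For anything beyond a quick check, one of the libraries listed above is the safer choice.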