Same Instances header ( arff ) for all my database

2020-05-03 10:24发布

问题:

I am using InstanceQuery , SQL queries, to construct my Instances. But my query results does not come in the same order always as it is normal in SQL. Beacuse of this Instances constucted from different SQL has different headers. A simple example can be seen below. I suspect my results changes because of this behavior.

Header 1

@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}

Header 2

@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}

My question is : How can I give correct header information to Instance construction.

Is something like below workflow is possible?

  1. get pre-prepared header information from arff file or another place.
  2. give instance construction this header information
  3. call sql function and get Instances (header + data)

I am using following sql function to get instances from database.

public static Instances getInstanceDataFromDatabase(String pSql
                                      ,String pInstanceRelationName){
    try {
        DatabaseUtils utils = new DatabaseUtils();

        InstanceQuery query = new InstanceQuery();

        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);

        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);

        if (data.classIndex() == -1)
        {
              data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

回答1:

I tried various approaches to my problem. But it seems that weka internal API does not allow solution to this problem right now. I modified weka.core.Instances append command line code for my purposes. This code is also given in this answer

According to this, here is my solution. I created a SampleWithKnownHeader.arff file , which contains correct header values. I read this file with following code.

public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    }
    catch (Exception e) {
        throw new RuntimeException(e);
    } 
    return data;

}

After that , I use following code to create instances. I had to use StringBuilder and string values of instance, then I save corresponding string to file.

public static void main(String[] args) {

    Instances SampleInstance = MyUtilsForWeka.getSampleInstances();

    DataSource source1 = new DataSource(SampleInstance);

    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);

    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");

    DataSource source2 = new DataSource(data2);

    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2)
                    .toString();
            sb.append(elementAsString);
            sb.append("\n");

        }

    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }

    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");

}

My save instances to file code is as below.

public static void saveInstancesToFile(String contents,String filename) {

     FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
      BufferedWriter out = new BufferedWriter(fstream);
      out.write(contents);
      out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }

This solves my problem but I wonder if more elegant solution exists.



回答2:

I solved a similar problem with the Add filter that allows adding attributes to Instances. You need to add a correct Attibute with proper list of values to both datasets (in my case - to test dataset only):

Load train and test data:

/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();

Set a new attribute with Add filter:

Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);

As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning)



标签: weka