Converting Time & Date to relative time (CSV proce

2019-08-15 10:03发布

I am currently in the early stages of writing a multi-faceted investment algorithm. The part I am currently working on is concerned with using a Graphical Gaussian Model with a LASSO penalty to find inter-dependencies which can be used to inform investment strategy. I am currently trying to use JAVA to pre-process historical CSV data input and create a new CSV output file with the relevant data.

The raw, small-scale example data I am using to test the processing algorithm (which will eventually be used on a Reuters Eikon live feed) is in txt/CSV format. I have a folder containing text files with historical data on many stocks on the NYSE. Although there are 8 columns, the three I am interested in (for the purposes of pre-processing before creating a covariance matrix which will feed into 'GLASSO') are the Date, Time & Opening prices. The opening prices column requires no pre-processing, so that can be fed into a new, less noisy output file.

My issue is how to convert the two columns (date and time) into a single time measurement. I was thinking the most obvious way to do this would be to find the earliest point in time in my data and use this as point 0 (in seconds). I would then need to convert every time and date combination into a single column showing how many seconds it is past the original time point in the output CSV file. Once this was done rather than a file path specification I would like to be able to specify a folder and the program loop through all text files finding the relevant columns and output all into a single CSV file.

What this would hopefully look like in practice:

CSV title and first entry in one NYSE txt file -

"Date,Time,Open,High,Low,Close,Volume,OpenInt

2016-02-03,15:35:00,37.27,37.36,37.17,37.29,25274,0"

So essentially if the first entry is the earliest time reference:

2016-02-03,15:35:00 = '0'

2016-02-03,15:40:00 = '300' (5 minutes is 300 seconds)

Just to re-iterate, input is a folder containing hundreds of the following formatted CSVs:

Columns - 1: Date 2: Time 3: Open 4: High 5: Low 6: Close 7: Volume 8: OpenInt

Output is a single CSV file containing:

Columns - 1: Time measure (distance in seconds from earliest entry point) 2: Stock price for each time measure entry.

Please let me know if you have any clues about how I could go about doing this, don't hesitate to let me know if there is anything I can clarify to make your lives easier, I realise I could have maybe explained this in a less convoluted manner.

2条回答
ゆ 、 Hurt°
2楼-- · 2019-08-15 10:38

Here's an example using the example CSV line you gave. I altered the input to change the seconds so you can see how the difference works:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class Dater {

    String s1[] = {"2016-02-03,15:35:01,37.27,37.36,37.17,37.29,25274,0",  //1 sec after minDate
                    "2016-02-03,15:35:00,37.27,37.36,37.17,37.29,25274,0", //<-- minDate
                    "2016-02-03,15:35:02,37.27,37.36,37.17,37.29,25274,0"  //2 sec after minDate
                    };
    Date [] dates;
    Date minDate;

    public Dater()
    {
        minDate = new Date();
        makeDates();

        for (Date d : dates)
        {
            System.out.println(diffSeconds(d));
        }
    }
    public void makeDates()
    {
        dates = new Date[s1.length];
        int index = 0;
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        for (String s : s1)
        {
            String [] split = s.split(",");
            String date = split[0];
            String time = split[1];

            try {
                dates[index] = formatter.parse(date + " " + time); //make Date objects
                if (dates[index].compareTo(minDate) < 0)           //establish origin
                {
                    minDate = dates[index];
                }
            } catch (ParseException e)
            {
                e.printStackTrace();
            }
            index++;
        }
    }

    public Long diffSeconds(Date d)
    {
        return (d.getTime() - minDate.getTime()) / 1000;
    }

    public static void main(String...args)
    {
        new Dater();
    }
}

Output:

1
0
2
查看更多
Bombasti
3楼-- · 2019-08-15 10:54

java.time

The Answer by Saviour Self looks correct. But it uses the old date-time classes that have been supplanted by the java.time framework built into Java 8 and later.

Apache Commons CSV

As a bonus, I show how to use the Apache Commons CSV library to handle the chore of reading/writing CSV files.

First we simulate a CSV file by making a StringReader.

RFC 4180 Specification

The RFC 4180 spec defines the CSV format formally. Variations on this also exist.

RFC 4180 requires Carriage Return + Line Feed (CRLF) as the newline (line terminator). The last line’s terminator is optional which we include here.

We omit the optional header line (column titles).

String newline = "\r\n";
StringBuilder input = new StringBuilder ();
input.append ( "2016-02-03,15:10:00,37" ).append ( newline );
input.append ( "2016-02-03,15:15:00,38" ).append ( newline );  // 5 minutes later.
input.append ( "2016-02-03,15:17:00,39" ).append ( newline );  // 2 minutes later.

Reader in = new StringReader ( input.toString () );

Next we read in the entire CSV file into memory, where the Commons CSV library creates CSVRecord objects to represent each row of incoming data. One line of code does all that work, with CSVFormat::parse producing a CSVParser object (an implementation of Interable).

Iterable<CSVRecord> records;
try {
    records = CSVFormat.DEFAULT.parse ( in );  // 'records' is a CSVParser.
} catch ( IOException ex ) {
    // FIXME: Handle exception.
    System.out.println ( "[ERROR] " + ex );
    return; // Bail-out.
}

Now we analyze that collection of CSVRecord objects. Remember the first one as our baseline, stored here as an Instant (discussed below). Then loop to compare each successive CSVRecord object, examining each field as a String.

Instant firstInstant = null; // Track the baseline against which we calculate the increasing time
for ( CSVRecord record : records ) {
    String dateInput = record.get ( 0 );  // Zero-based index.
    String timeInput = record.get ( 1 );
    String priceInput = record.get ( 2 );
    //System.out.println ( dateInput + " | " + timeInput + " | " + priceInput );  // Dump input strings for debugging.

Extract the strings for date-only and time-only, combine into a LocalDateTime.

    // Parse strings.
    LocalDate date = LocalDate.parse ( dateInput );
    LocalTime time = LocalTime.parse ( timeInput );
    Integer price = Integer.parseInt ( priceInput );
    // Combine date and time.
    LocalDateTime ldt = LocalDateTime.of ( date , time );  // Not a specific moment on the timeline.

This date-time object is not a point on the timeline as we do not know its offset-from-UTC or time zone. If you were to use these values to calculate a delta between LocalDateTime objects, you would be assuming generic 24-hour days free of anomalies such as Daylight Saving Time (DST). You might get away with this if your data happens to not occur during any anomaly, but it is a bad habit. Better to assign a time zone if known.

We know the source of the data, so we can assume the intended time zone, a ZoneId. By assigning that assumed time zone, we get a real moment on the timeline.

    // Generally best to assign the time zone known to apply to this incoming data.
    ZoneId zoneId = ZoneId.of ( "America/New_York" );  // Move this line somewhere else to eliminate needless repetition.
    ZonedDateTime zdt = ldt.atZone ( zoneId );  // Now this becomes a specific moment on the timeline.

From that ZonedDateTime we can extract the same moment in UTC (an Instant). Generally the Instant is what you should be using for data storage, data exchange, serialization, and so on. You only need the ZonedDateTime for presentation to the user in their expected time zone.

    Instant instant = zdt.toInstant ();  // Use Instant (moment on the timeline in UTC) for data storage, exchange, serialization, database, etc.
    if ( null == firstInstant ) {
        firstInstant = instant;  // Capture the first instant.
    }

The goal is to compare each CSVRecord to the original baseline date-time. The Duration.between method does just that.

    Duration duration = Duration.between ( firstInstant , instant );

We calculate the delta in total seconds.

    Long deltaInSeconds = duration.getSeconds ();

Writing these results to an output CSV file is left as an exercise for you the reader. The Apache Commons CSV library makes short work of that is it writes as well as reads CSV formats.

    // … output the deltaInSeconds & price to CSV. Apache Commons CSV can write as well as read CSV files.
    System.out.println ( "deltaInSeconds: " + deltaInSeconds + " | price: " + price );

}

When run.

deltaInSeconds: 0 | price: 37
deltaInSeconds: 300 | price: 38
deltaInSeconds: 420 | price: 39
查看更多
登录 后发表回答