I am currently in the early stages of writing a multi-faceted investment algorithm. The part I am currently working on is concerned with using a Graphical Gaussian Model with a LASSO penalty to find inter-dependencies which can be used to inform investment strategy. I am currently trying to use JAVA to pre-process historical CSV data input and create a new CSV output file with the relevant data.
The raw, small-scale example data I am using to test the processing algorithm (which will eventually be used on a Reuters Eikon live feed) is in txt/CSV format. I have a folder containing text files with historical data on many stocks on the NYSE. Although there are 8 columns, the three I am interested in (for the purposes of pre-processing before creating a covariance matrix which will feed into 'GLASSO') are the Date, Time & Opening prices. The opening prices column requires no pre-processing, so that can be fed into a new, less noisy output file.
My issue is how to convert the two columns (date and time) into a single time measurement. I was thinking the most obvious way to do this would be to find the earliest point in time in my data and use this as point 0 (in seconds). I would then need to convert every time and date combination into a single column showing how many seconds it is past the original time point in the output CSV file. Once this was done rather than a file path specification I would like to be able to specify a folder and the program loop through all text files finding the relevant columns and output all into a single CSV file.
What this would hopefully look like in practice:
CSV title and first entry in one NYSE txt file -
"Date,Time,Open,High,Low,Close,Volume,OpenInt
2016-02-03,15:35:00,37.27,37.36,37.17,37.29,25274,0"
So essentially if the first entry is the earliest time reference:
2016-02-03,15:35:00 = '0'
2016-02-03,15:40:00 = '300' (5 minutes is 300 seconds)
Just to re-iterate, input is a folder containing hundreds of the following formatted CSVs:
Columns - 1: Date 2: Time 3: Open 4: High 5: Low 6: Close 7: Volume 8: OpenInt
Output is a single CSV file containing:
Columns - 1: Time measure (distance in seconds from earliest entry point) 2: Stock price for each time measure entry.
Please let me know if you have any clues about how I could go about doing this, don't hesitate to let me know if there is anything I can clarify to make your lives easier, I realise I could have maybe explained this in a less convoluted manner.
Here's an example using the example CSV line you gave. I altered the input to change the seconds so you can see how the difference works:
Output:
java.time
The Answer by Saviour Self looks correct. But it uses the old date-time classes that have been supplanted by the java.time framework built into Java 8 and later.
Apache Commons CSV
As a bonus, I show how to use the Apache Commons CSV library to handle the chore of reading/writing CSV files.
First we simulate a CSV file by making a
StringReader
.RFC 4180 Specification
The RFC 4180 spec defines the CSV format formally. Variations on this also exist.
RFC 4180 requires Carriage Return + Line Feed (
CRLF
) as the newline (line terminator). The last line’s terminator is optional which we include here.We omit the optional header line (column titles).
Next we read in the entire CSV file into memory, where the Commons CSV library creates
CSVRecord
objects to represent each row of incoming data. One line of code does all that work, withCSVFormat::parse
producing aCSVParser
object (an implementation ofInterable
).Now we analyze that collection of
CSVRecord
objects. Remember the first one as our baseline, stored here as anInstant
(discussed below). Then loop to compare each successiveCSVRecord
object, examining each field as aString
.Extract the strings for date-only and time-only, combine into a
LocalDateTime
.This date-time object is not a point on the timeline as we do not know its offset-from-UTC or time zone. If you were to use these values to calculate a delta between
LocalDateTime
objects, you would be assuming generic 24-hour days free of anomalies such as Daylight Saving Time (DST). You might get away with this if your data happens to not occur during any anomaly, but it is a bad habit. Better to assign a time zone if known.We know the source of the data, so we can assume the intended time zone, a
ZoneId
. By assigning that assumed time zone, we get a real moment on the timeline.From that
ZonedDateTime
we can extract the same moment in UTC (anInstant
). Generally theInstant
is what you should be using for data storage, data exchange, serialization, and so on. You only need theZonedDateTime
for presentation to the user in their expected time zone.The goal is to compare each
CSVRecord
to the original baseline date-time. TheDuration.between
method does just that.We calculate the delta in total seconds.
Writing these results to an output CSV file is left as an exercise for you the reader. The Apache Commons CSV library makes short work of that is it writes as well as reads CSV formats.
When run.