Am an awk newbie. Using Windows-GNU gawk in UNXUTILS.
Have 2 kinds of records arranged sequentially in date and time order in my file(s), 30-field Order records (start with "O") where quantity is the 15th field, and 18-field Trade records (start with "T") where quantity is the 8th field. The underlying research data is historical-archival Indian stock market data spanning 15 days in April 2006, about 1000 firms, and comprising in all about 100 million separate order or trade records. My test data is 500 records for 2 dates, and some 200 firms.
My objective at this point is only to compute for each firm and each date, that firm-date's cumulative order quantity and trade quantity.
The raw data IS ordered by date and time (firms obviously jumbled up, just like voters who don't usually vote in alphabetical order!). And I do now have two separate text files, one containing a list of just the distinct firm symbols; and the other, the distinct dates, one per line.
I want try to complete the computations in a way that does not require making me go thru all of the records over and over again for each of the firms and dates. The basic computations given a firm=FIRM_1 and a date=DATE_1 are easy, for e.g. what I have resembles
# For each order record with firm_symbol = FIRM_1, date = DATE_1,
# cumulate its Order quantity ($15).
( /^O/ && $4~/FIRM_1/ ) && $2~/DATE_1/
{ Order_Q[FIRM_1_DATE_1]=Order_Q[FIRM_1_DATE_1]+$15] }
# For each trade record with firm_symbol = FIRM_1, date = DATE_1,
#cumulate its Trade quantity ($8).
( /^T/ && $4~/FIRM_1/ ) && $2~/DATE_1/
{ Trade_Q[FIRM_1_DATE_1]=Trade_Q[FIRM_1_DATE_1]+$8] }
END { print "FIRM_1 ", "DATE_1 ", Order_Q[FIRM_1_DATE_1], Trade_Q[FIRM_1_DATE_1] }
The question is how to construct an intelligent loop over all firms and dates, given the size of the underlying data. There are several related questions.
I know the name FIRM_1 need not be hard-coded inside this awk script, but could be given as a command line parameter. But can one go one step further and get awk to take that name sequentially from a list of names in a separate file, one per line? (If that's feasible, then taking dates from a list of dates would also be possible.)
I constructed the array argument names to hold Order quantity and Trade quantity knowing FIRM_1 and DATE_1. If we succeed in resolving 1 above, can one construct array argument names like FIRM_1_DATE_1 and FIRM_1_DATE_1 on the fly, inside awk, while it is running? Will string concatenation to help form a name be allowed?
I realize that I could use an editor macro or some such method, to combine my 2 keys, FIRM (1000 values) and DATE (15 values) into one FIRM_DATE key (15000 values) before doing any of this, in a separate step. If 2 above is feasible, I'm assuming there's no value to doing this. Would it help anyway?
In principle we are looking to hold in memory perhaps 1000 firms times 15 days times 2 variables = 30,000 cell entries in 2 arrays, ORDER_Q and TRADE_Q. Is this a lot? I use a modest Windows desktop with I think 8GB RAM.
Any suggestion or reference or example that will help reduce having to go over the original large input data several times will be very welcome. If something involves learning more not just about awk but about shell scripts, that will also be very welcome.
Use associative arrays. Assuming that
$2
contains the name of the firm and$4
the date, then:That does not give you a defined order for the companies or dates in the output. There are techniques to do that. This makes a single pass over the data accumulating the results for all the companies and all the dates all in one turn.
If you want the firms in a specific (e.g. sorted) order, sort the firm list before printing. GNU
awk
provides built-in sort functions. Otherwise, you'll have to write anawk
function to do it. (See Programming Pearls or More Programming Pearls (or both) for more information on writing sort functions inawk
.)Warning: untested code.