optimizing loop, passing parameters from external

2019-07-15 14:02发布

Am an awk newbie. Using Windows-GNU gawk in UNXUTILS.

Have 2 kinds of records arranged sequentially in date and time order in my file(s), 30-field Order records (start with "O") where quantity is the 15th field, and 18-field Trade records (start with "T") where quantity is the 8th field. The underlying research data is historical-archival Indian stock market data spanning 15 days in April 2006, about 1000 firms, and comprising in all about 100 million separate order or trade records. My test data is 500 records for 2 dates, and some 200 firms.

My objective at this point is only to compute for each firm and each date, that firm-date's cumulative order quantity and trade quantity.

The raw data IS ordered by date and time (firms obviously jumbled up, just like voters who don't usually vote in alphabetical order!). And I do now have two separate text files, one containing a list of just the distinct firm symbols; and the other, the distinct dates, one per line.

I want try to complete the computations in a way that does not require making me go thru all of the records over and over again for each of the firms and dates. The basic computations given a firm=FIRM_1 and a date=DATE_1 are easy, for e.g. what I have resembles

# For each order record with firm_symbol = FIRM_1, date = DATE_1, 
# cumulate its Order quantity ($15).

( /^O/ && $4~/FIRM_1/ ) && $2~/DATE_1/ 
            { Order_Q[FIRM_1_DATE_1]=Order_Q[FIRM_1_DATE_1]+$15] }

# For each trade record with firm_symbol = FIRM_1, date = DATE_1, 
#cumulate its Trade quantity ($8).

( /^T/ && $4~/FIRM_1/ ) && $2~/DATE_1/ 
            { Trade_Q[FIRM_1_DATE_1]=Trade_Q[FIRM_1_DATE_1]+$8] }

END { print "FIRM_1 ", "DATE_1 ", Order_Q[FIRM_1_DATE_1], Trade_Q[FIRM_1_DATE_1] }

The question is how to construct an intelligent loop over all firms and dates, given the size of the underlying data. There are several related questions.

  1. I know the name FIRM_1 need not be hard-coded inside this awk script, but could be given as a command line parameter. But can one go one step further and get awk to take that name sequentially from a list of names in a separate file, one per line? (If that's feasible, then taking dates from a list of dates would also be possible.)

  2. I constructed the array argument names to hold Order quantity and Trade quantity knowing FIRM_1 and DATE_1. If we succeed in resolving 1 above, can one construct array argument names like FIRM_1_DATE_1 and FIRM_1_DATE_1 on the fly, inside awk, while it is running? Will string concatenation to help form a name be allowed?

  3. I realize that I could use an editor macro or some such method, to combine my 2 keys, FIRM (1000 values) and DATE (15 values) into one FIRM_DATE key (15000 values) before doing any of this, in a separate step. If 2 above is feasible, I'm assuming there's no value to doing this. Would it help anyway?

  4. In principle we are looking to hold in memory perhaps 1000 firms times 15 days times 2 variables = 30,000 cell entries in 2 arrays, ORDER_Q and TRADE_Q. Is this a lot? I use a modest Windows desktop with I think 8GB RAM.

Any suggestion or reference or example that will help reduce having to go over the original large input data several times will be very welcome. If something involves learning more not just about awk but about shell scripts, that will also be very welcome.

1条回答
时光不老,我们不散
2楼-- · 2019-07-15 14:38

Use associative arrays. Assuming that $2 contains the name of the firm and $4 the date, then:

awk '/^O/ { order_qty[$2,$4] += $15 }
     /^T/ { trade_qty[$2,$4] += $8  }
     END  { for (key in order_qty) { print key, "O", order_qty[key]; }
            for (key in trade_qty) { print key, "T", trade_qty[key]; }
          }'

That does not give you a defined order for the companies or dates in the output. There are techniques to do that. This makes a single pass over the data accumulating the results for all the companies and all the dates all in one turn.

awk '     { if (date[$4]++ == 0) date_list[d++] = $4; # Dates appear in order
            if (firm[$2]++ == 0) firm_list[f++] = $2; # Firms appear out of order
          }
     /^O/ { order_qty[$2,$4] += $15 }
     /^T/ { trade_qty[$2,$4] += $8  }
     END  { for (i = 0; i < f; i++)
            {
                for (j = 0; j < d; j++)
                {
                    if ((qty = order_qty[firm_list[i],date_list[j]]) > 0)
                        print firm_list[i], date_list[j], "O", qty
                    if ((qty = trade_qty[firm_list[i],date_list[j]]) > 0)
                        print firm_list[i], date_list[j], "T", qty
                }
            }
          }'

If you want the firms in a specific (e.g. sorted) order, sort the firm list before printing. GNU awk provides built-in sort functions. Otherwise, you'll have to write an awk function to do it. (See Programming Pearls or More Programming Pearls (or both) for more information on writing sort functions in awk.)

Warning: untested code.

查看更多
登录 后发表回答