Common Lisp Compiling and execution time

Published 2019-03-25 21:18

Question:

I have a Lisp file which does lots of sampling, file I/O, and arithmetic in a loop. (I do particle filtering in Common Lisp.) I am compiling my Lisp file with compile-file. I also put (declaim (optimize (speed 3) (debug 0) (safety 0))) at the beginning of the file, as I want my results as fast as possible.
I use (time (load "/....../myfile.lisp")) and (time (load "/......./myfile.dx64fsl")) to measure speed. The issue is that compiling brings me no advantage: there is no improvement. Am I doing something wrong? Is there a way to improve things? Speed is the most important criterion, so I can sacrifice a lot in order to get a fast response. I have no idea about this kind of issue, so any help would be appreciated.
Moreover, when I increase the number of particles (each particle is a vector of size ~40) to around 10000, the code gets really slow, so there may be some memory issues too.
Thank you very much in advance.

Edit: here are the profiling results with 1000 particles and 50 iterations.

(LOAD "/.../myfile.dx64fsl") took 77,488,810 microseconds (77.488810 seconds) to run 
                    with 8 available CPU cores.
During that period, 44,925,468 microseconds (44.925470 seconds) were spent in user mode
                    32,005,440 microseconds (32.005440 seconds) were spent in system mode
2,475,291 microseconds (2.475291 seconds) was spent in GC.
 1,701,028,429 bytes of memory allocated.
 1,974 minor page faults, 0 major page faults, 0 swaps.
; Warning: Function CREATE-MY-DBN has been redefined, so times may be inaccurate.
;          MONITOR it again to record calls to the new definition.
; While executing: MONITOR::MONITOR-INFO-VALUES, in process repl-thread(10).


                                                               Cons   
                             %      %                          Per      Total     Total
Function                    Time   Cons    Calls  Sec/Call     Call     Time      Cons
------------------------------------------------------------------------------------------
SAMPLE:                    25.61   26.14  2550000  0.000005      174    13.526   443040000
DISCRETE-PARENTS:          19.66    3.12  4896000  0.000002       11    10.384    52800000
LINEAR-GAUSSIAN-MEAN:       8.86    3.12  1632000  0.000003       32     4.679    52800000
DISCRETE-PARENT-VALUES:     7.47   12.33  3264000  0.000001       64     3.946   208896000
LIST-DIFFERENCE:            6.41   25.69  6528000  0.000001       67     3.384   435392000
CONTINUOUS-PARENTS:         6.33    0.00  1632000  0.000002        0     3.343           0
PF-STEP:                    5.17    0.23       48  0.056851    80080     2.729     3843840
CONTINUOUS-PARENT-VALUES:   4.13    7.20  1632000  0.000001       75     2.184   122048000
TABLE-LOOKUP:               3.85    8.39  2197000  0.000001       65     2.035   142128000
PHI-INVERSE:                3.36    0.00  1479000  0.000001        0     1.777           0
PHI-INTEGRAL:               3.32    1.38  2958000  0.000001        8     1.755    23344000
PARENT-VALUES:              2.38   10.65  1122000  0.000001      161     1.259   180528016
CONDITIONAL-PROBABILITY:    1.41    0.00   255000  0.000003        0     0.746           0
------------------------------------------------------------------------------------------
TOTAL:                     97.96   98.24  30145048                       51.746  1664819856
Estimated monitoring overhead: 21.11 seconds
Estimated total monitoring overhead: 23.93 seconds

With 10000 particles and 50 iterations:

(LOAD "/.../myfile.dx64fsl") took 809,931,702 microseconds (809.931700 seconds) to run 
                    with 8 available CPU cores.
During that period, 476,627,937 microseconds (476.627930 seconds) were spent in user mode
                    328,716,555 microseconds (328.716550 seconds) were spent in system mode
54,274,625 microseconds (54.274624 seconds) was spent in GC.
 16,973,590,588 bytes of memory allocated.
 10,447 minor page faults, 417 major page faults, 0 swaps.
; Warning: Function CREATE-MY-DBN has been redefined, so times may be inaccurate.
;          MONITOR it again to record calls to the new definition.
; While executing: MONITOR::MONITOR-INFO-VALUES, in process repl-thread(10).


                                                               Cons    
                             %      %                          Per       Total     Total
Function                    Time   Cons    Calls  Sec/Call     Call      Time      Cons
-------------------------------------------------------------------------------------------
SAMPLE:                    25.48   26.11  25500000  0.000006       174   144.211  4430400000
DISCRETE-PARENTS:          18.41    3.11  48960000  0.000002        11   104.179   528000000
LINEAR-GAUSSIAN-MEAN:       8.61    3.11  16320000  0.000003        32    48.751   528000000
LIST-DIFFERENCE:            7.57   25.66  65280000  0.000001        67    42.823  4353920000
DISCRETE-PARENT-VALUES:     7.50   12.31  32640000  0.000001        64    42.456  2088960000
CONTINUOUS-PARENTS:         5.83    0.00  16320000  0.000002         0    32.980           0
PF-STEP:                    5.05    0.23       48  0.595564    800080    28.587    38403840
TABLE-LOOKUP:               4.52    8.38  21970000  0.000001        65    25.608  1421280000
CONTINUOUS-PARENT-VALUES:   4.25    7.19  16320000  0.000001        75    24.041  1220480000
PHI-INTEGRAL:               3.15    1.38  29580000  0.000001         8    17.849   233440000
PHI-INVERSE:                3.12    0.00  14790000  0.000001         0    17.641           0
PARENT-VALUES:              2.87   10.64  11220000  0.000001       161    16.246  1805280000
CONDITIONAL-PROBABILITY:    1.36    0.00  2550000  0.000003         0     7.682           0
-------------------------------------------------------------------------------------------
TOTAL:                     97.71   98.12  301450048                       553.053  16648163840
Estimated monitoring overhead: 211.08 seconds
Estimated total monitoring overhead: 239.13 seconds

Answer 1:

Typical arithmetic code in Common Lisp can be slow. Improving it is possible, but it needs a bit of knowledge.

Reasons:

  • Common Lisp numbers are not what the machine provides (bignums, rationals, complex numbers, ...)
  • automatic change from fixnum to bignum and back
  • generic math operations
  • tagging uses bits of the word size
  • consing of numbers

One thing you can see from the profiling output is that you generate 1.7 GB of garbage. This is a typical hint that your number operations cons. Getting rid of that is often not easy; it is only a guess on my side that number operations are the cause.
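A rough illustration of the kind of code that avoids such consing (the function, its name, and the types are invented here, not taken from the question's code): with specialized array and float declarations, the compiler can keep the double-floats unboxed instead of consing a fresh float for every intermediate result.

;; Hypothetical example: argument and accumulator types are declared,
;; so SBCL/CCL can open-code the arithmetic without allocating inside
;; the loop (only the final return value gets boxed).
(defun weighted-sum (xs ws)
  (declare (type (simple-array double-float (*)) xs ws)
           (optimize (speed 3) (safety 1)))
  (let ((acc 0.0d0))
    (declare (type double-float acc))
    (dotimes (i (length xs) acc)
      (incf acc (* (aref xs i) (aref ws i))))))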

Ken Anderson (unfortunately he died a few years ago) has some advice on his web site for improving numeric software: http://openmap.bbn.com/~kanderso/performance/

A usual solution is to give the code to an experienced Lisp developer who knows a bit about the compiler used and/or its optimizations.



Answer 2:

First of all, never ever declare (speed 3) together with (safety 0) at the top level, i.e. globally. This will sooner or later come back and bite your head off. At these levels, most Common Lisp compilers do less safety checking than C compilers. For instance, some Lisps drop checking for interrupt signals in (safety 0) code. Next, (safety 0) very rarely gives noticeable gains. I would declare (speed 3) (safety 1) (debug 1) in hot functions, possibly going to (debug 0) if that brings a noticeable gain.
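A minimal sketch of keeping the aggressive settings local to a hot function (the function name and body are placeholders, not from the question's code):

;; Local optimize declaration: only this function is compiled with
;; (speed 3); the rest of the program keeps its normal safety level.
(defun my-hot-function (x y)
  (declare (optimize (speed 3) (safety 1) (debug 1))
           (type double-float x y))
  (+ (* x x) y))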

Otherwise, without looking at some of the actual code, it is hard to come up with suggestions. Judging from the time output, the GC pressure seems quite high. Make sure that you use open-coded arithmetic in hot functions, and don't needlessly box floats or ints. Use (disassemble 'my-expensive-function) to take a close look at the code the compiler generated. SBCL provides lots of helpful output when compiling with a high priority on speed, and it can be worthwhile to try to eliminate some of those warnings.
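For example (my-hot-function is just the placeholder name from the sketch above):

;; Prints the machine code the compiler generated; calls into generic
;; arithmetic routines or unexpected allocation are a hint that the
;; declarations are not yet sufficient.
(disassemble 'my-hot-function)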

It is also important that you use a fast data structure for representing particles, using specialized arrays and macrology if needed.
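Purely as an illustration (assuming, as the question says, that each particle is a vector of about 40 numbers, and that they are double-floats), a specialized array avoids storing individually boxed numbers:

;; One particle as a specialized double-float vector of length 40,
;; and the population as a simple vector of such particles.
(defun make-particle ()
  (make-array 40 :element-type 'double-float :initial-element 0.0d0))

(defun make-population (n)
  (let ((pop (make-array n)))
    (dotimes (i n pop)
      (setf (aref pop i) (make-particle)))))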



Answer 3:

Unless the code contained in "myfile.lisp" is the part that actually does the computations, compiling that file will not noticeably improve your run time. The difference between the two cases is probably just "we compile a few loops", calling functions that are either compiled or interpreted in both cases.

To gain an improvement from compiling, you also need to compile the code that is called. You may also need to type-annotate your code in order to allow your compiler to optimize better. SBCL has quite good compiler diagnostics for missed annotations (to the point that people complain it is too verbose while compiling).
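One way to annotate across function boundaries is a global ftype proclamation; the signature below is invented for illustration, since the real argument and return types of LINEAR-GAUSSIAN-MEAN are not shown in the question:

;; Tells the compiler, at every call site, what the function accepts
;; and returns, so callers can pass and receive unboxed values where
;; the implementation supports it.
(declaim (ftype (function (double-float double-float) double-float)
                linear-gaussian-mean))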

As far as the loading time goes, it may actually be the case that loading a compiled file takes longer (there is a whole slew of essentially dynamic linking happening). If you are not changing your code frequently, but only the data you process, it may be an advantage to prepare a new core file with your particle-filtering code already in the image.
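In Clozure CL (which the .dx64fsl extension suggests you are using), preparing such an image looks roughly like this; the path and image name are placeholders:

;; Load the particle-filtering code once, then save an image that
;; already contains it; starting CCL with this image skips the load.
(load "/path/to/myfile.dx64fsl")
(ccl:save-application "pf-image" :prepend-kernel t)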



Answer 4:

Several points:

  1. Try to move file I/O out of the loop if possible; read the data into memory in one batch before iterating (see the sketch after this list). File I/O is several orders of magnitude slower than memory access.

  2. Try SBCL if execution speed is important for you.

  3. A tenfold increase in your input results in a roughly tenfold increase in execution time, which is linear, so your algorithm seems fine; you just need to work on the constant factor.

  4. Take advantage of the Lisp workflow: edit a function, compile the function, and test, instead of edit file, compile file, and test. The difference becomes pronounced when your projects get bigger (or when you try SBCL, which takes longer to analyze/optimize your programs in order to produce faster code).
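Regarding point 1, a minimal sketch (the file format, path handling, and function name are invented here) of reading all observations into memory once, before the particle-filter loop runs:

;; Read every line up front into a vector; the filtering loop then
;; works purely on in-memory data instead of touching the file on
;; every iteration.
(defun read-observations (path)
  (with-open-file (in path :direction :input)
    (loop for line = (read-line in nil nil)
          while line
          collect (read-from-string line) into rows
          finally (return (coerce rows 'vector)))))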



Answer 5:

Welcome to Clozure Common Lisp Version 1.7-r14925M  (DarwinX8664)!
? (inspect 'print)
[0]     PRINT
[1]     Type: SYMBOL
[2]     Class: #<BUILT-IN-CLASS SYMBOL>
        Function
[3]     EXTERNAL in package: #<Package "COMMON-LISP">
[4]     Print name: "PRINT"
[5]     Value: #<Unbound>
[6]     Function: #<Compiled-function PRINT #x30000011C9DF>
[7]     Arglist: (CCL::OBJECT &OPTIONAL STREAM)
[8]     Plist: NIL

Inspect> (defun test (x) (+ x 1))
TEST
Inspect> (inspect 'test)
[0]     TEST
[1]     Type: SYMBOL
[2]     Class: #<BUILT-IN-CLASS SYMBOL>
        Function
[3]     INTERNAL in package: #<Package "COMMON-LISP-USER">
[4]     Print name: "TEST"
[5]     Value: #<Unbound>
[6]     Function: #<Compiled-function TEST #x302000C5EFFF>
[7]     Arglist: (X)
[8]     Plist: NIL
Inspect> 

Note that both #'print and #'test are listed as compiled. This means that the only performance difference between loading a .lisp file and loading a compiled file is the compile time itself, which I assume is not the bottleneck in your scenario. It usually isn't, unless you are using a lot of macros and code transformation is the primary purpose of your program.

This is one of the main reasons that I never deal with compiled Lisp files. I just load all the shared libraries/packages I need into my core file, and then load a few specific .lisp functions/files on top of that when I'm working on a particular project. And, at least for SBCL and CCL in my case, everything is listed as 'compiled'.