Parsing scientific notation sensibly?

2020-05-01 07:45发布

问题:

I want to be able to write a function which receives a number in scientific notation as a string and splits out of it the coefficient and the exponent as separate items. I could just use a regular expression, but the incoming number may not be normalised and I'd prefer to be able to normalise and then break the parts out.

A colleague has got part way of an solution using VB6 but it's not quite there, as the transcript below shows.

cliVe> a = 1e6
cliVe> ? "coeff: " & o.spt(a) & " exponent: " & o.ept(a)
coeff: 10 exponent: 5 

should have been 1 and 6

cliVe> a = 1.1e6
cliVe> ? "coeff: " & o.spt(a) & " exponent: " & o.ept(a)
coeff: 1.1 exponent: 6

correct

cliVe> a = 123345.6e-7
cliVe> ? "coeff: " & o.spt(a) & " exponent: " & o.ept(a)
coeff: 1.233456 exponent: -2

correct

cliVe> a = -123345.6e-7
cliVe> ? "coeff: " & o.spt(a) & " exponent: " & o.ept(a)
coeff: 1.233456 exponent: -2

should be -1.233456 and -2

cliVe> a = -123345.6e+7
cliVe> ? "coeff: " & o.spt(a) & " exponent: " & o.ept(a)
coeff: 1.233456 exponent: 12

correct

Any ideas? By the way, Clive is a CLI based on VBScript and can be found on my weblog.

回答1:

Google on "scientific notation regexp" shows a number of matches, including this one (don't use it!!!!) which uses

*** warning: questionable ***
/[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?/

which includes cases such as -.5e7 and +00000e33 (both of which you may not want to allow).

Instead, I would highly recommend you use the syntax on Doug Crockford's JSON website which explicitly documents what constitutes a number in JSON. Here's the corresponding syntax diagram taken from that page:


(source: json.org)

If you look at line 456 of his json2.js script (safe conversion to/from JSON in javascript), you'll see this portion of a regexp:

/-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/

which, ironically, doesn't match his syntax diagram.... (looks like I should file a bug) I believe a regexp that does implement that syntax diagram is this one:

/-?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?/

and if you want to allow an initial + as well, you get:

/[+\-]?(?:0|[1-9]\d*)(?:\.\d*)?(?:[eE][+\-]?\d+)?/

Add capturing parentheses to your liking.

I would also highly recommend you flesh out a bunch of test cases, to ensure you include those possibilities you want to include (or not include), such as:

allowed:
+3
3.2e23
-4.70e+9
-.2E-4
-7.6603

not allowed:
+0003   (leading zeros)
37.e88  (dot before the e)

Good luck!



回答2:

Building off of the highest rated answer, I modified the regex slightly to be /^[+\-]?(?=.)(?:0|[1-9]\d*)?(?:\.\d*)?(?:\d[eE][+\-]?\d+)?$/.

The benefits this provides are:

  1. allows matching numbers like .9 (I made the (?:0|[1-9]\d*) optional with ?)
  2. prevents matching just the operator at the beginning and prevents matching zero-length strings (uses lookahead, (?=.))
  3. prevents matching e9 because it requires the \d before the scientific notation

My goal in this is to use it for capturing significant figures and doing significant math. So I'm also going to slice it up with capturing groups like so: /^[+\-]?(?=.)(0|[1-9]\d*)?(\.\d*)?(?:(\d)[eE][+\-]?\d+)?$/.

An explanation of how to get significant figures from this:

  1. The entire capture is the number you can hand to parseFloat()
  2. Matches 1-3 will show up as undefined or strings, so combining them (replace undefined's with '') should give the original number from which significant figures can be extracted.

This regex also prevents matching left-padded zeros, which JavaScript sometimes accepts but which I have seen cause issues and which adds nothing to significant figures, so I see preventing left-padded zeros as a benefit (especially in forms). However, I'm sure the regex could be modified to gobble up left-padded zeros.

Another problem I see with this regex is it won't match 90.e9 or other such numbers. However, I find this or similar matches highly unlikely as it is the convention in scientific notation to avoid such numbers. Though you can enter it in JavaScript, you can just as easily enter 9.0e10 and achieve the same significant figures.

UPDATE

In my testing, I also caught the error that it could match '.'. So the look-ahead should be modified to (?=\.\d|\d) which leads to the final regex:

/^[+\-]?(?=\.\d|\d)(?:0|[1-9]\d*)?(?:\.\d*)?(?:\d[eE][+\-]?\d+)?$/


回答3:

Here is some Perl code I just hacked together quickly.

my($sign,$coeffl,$coeffr,$exp) = $str =~ /^\s*([-+])?(\d+)(\.\d*)?e([-+]?\d+)\s*$/;

my $shift = length $coeffl;
$shift = 0 if $shift == 1;

my $coeff =
  substr( $coeffl, 0, 1 );

if( $shift || $coeffr ){
  $coeff .=
    '.'.
    substr( $coeffl, 1 );
}

$coeff .= substr( $coeffr, 1 ) if $coeffr;

$coeff = $sign . $coeff if $sign;

$exp += $shift;

say "coeff: $coeff exponent: $exp";