Let's say I have to read a file containing a bunch of floating-point numbers. The numbers can be like 1e+10, 5, -0.15, etc., i.e., any generic floating-point number, using decimal points (this is fixed!). However, my code is a plugin for another application, and I have no control over what the current locale is. It may be Russian, for example, and the LC_NUMERIC rules there call for a decimal comma to be used. Thus, Pi is expected to be spelled as "3,1415...", and
sscanf("3.14", "%f", &x);
returns "1", and x contains "3.0", since it refuses to parse past the '.' in the string.
I need to ignore the locale for such number-parsing tasks.
How does one do that?
I could write a parseFloat function, but this seems like a waste.
I could also save the current locale, reset it temporarily to "C", read the file, and restore the saved one. What are the performance implications of this? Could setlocale() be very slow on some OS/libc combos? What does it really do under the hood?
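For concreteness, the save/reset/restore idea might look roughly like the sketch below. The helper name parse_double_with_setlocale is made up for illustration. The string returned by setlocale() has to be copied, because a later setlocale() call may overwrite it. Also note that setlocale() changes the locale for the whole process, so in a plugin this could affect the host application's other threads while the "C" locale is active.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from any library): parse one double with
   LC_NUMERIC temporarily forced to "C". Returns 0 on success, -1 on failure. */
int parse_double_with_setlocale(const char *text, double *out)
{
    /* Query the current LC_NUMERIC setting without changing it. */
    const char *current = setlocale(LC_NUMERIC, NULL);
    if (current == NULL)
        return -1;

    /* Copy the name: the returned buffer may be reused by later calls. */
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    setlocale(LC_NUMERIC, "C");      /* process-wide change */
    char *end;
    double value = strtod(text, &end);
    setlocale(LC_NUMERIC, saved);    /* restore the previous setting */
    free(saved);

    if (end == text)                 /* nothing was parsed */
        return -1;
    *out = value;
    return 0;
}
```

In practice you would probably switch once around the whole file read rather than once per number.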
Yet another way would be to use iostreams, but again their performance isn't stellar.
So I'm puzzled. What do you guys do in such situations?
Cheers!
I am not sure how to solve it in C.
But C++ streams can each have their own locale object (set with imbue), independent of the global locale.
My personal preference is to never use LC_NUMERIC, i.e. just call setlocale with other categories, or, after calling setlocale with LC_ALL, use setlocale(LC_NUMERIC, "C"). Otherwise, you're completely out of luck if you want to use the standard library for printing or parsing numbers in a standard form for interchange.
If you're lucky enough to be on a POSIX 2008 conforming system, you can use the uselocale and *_l family of functions to make the situation somewhat better. There are at least 2 basic approaches:
1. Leave the default locale unset (at least the troublesome parts like LC_NUMERIC; LC_CTYPE should probably always be set), and pass a locale_t object for the user's locale to the appropriate *_l functions only when you want to present things to the user in a way that meets their own cultural expectations; otherwise use the default C locale.
2. Have your code that needs to work with data for interchange keep around a locale_t object for the C locale, and either switch back and forth using uselocale when you need to work with data in a standard form for interchange, or use the appropriate *_l functions (but there is no scanf_l).
Note that implementing your own floating point parser is not easy and is probably not the right solution to the problem unless you're an expert in numerical computing. Getting it right is very hard.
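As a rough sketch of the second approach, the following uses only newlocale, uselocale and freelocale from POSIX 2008. The helper name parse_double_c is an assumption for illustration. It assumes glibc-style headers; on some systems (macOS, for example) these functions are declared in <xlocale.h> instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse one double using the "C" locale for the
   calling thread only. Returns 0 on success, -1 on failure. */
int parse_double_c(const char *text, double *out)
{
    /* Build a locale object describing the plain "C" locale. */
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0)
        return -1;

    /* Switch this thread to it; other threads are not affected. */
    locale_t previous = uselocale(c_locale);

    char *end;
    double value = strtod(text, &end);

    /* Restore whatever the thread was using before, then clean up. */
    uselocale(previous);
    freelocale(c_locale);

    if (end == text)   /* nothing was parsed */
        return -1;
    *out = value;
    return 0;
}
```

Real code would likely create the locale_t once and keep it around, as described above, rather than building and freeing it on every call.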
Here's what I've done with this stuff in the past.
The goal is to use locale-dependent numeric converters with a C-locale numeric representation. The ideal, of course, would be to use non-locale-dependent converters, or not change the locale, etc., etc., but sometimes you just have to live with what you've got. Locale support is seriously broken in several ways and this is one of them.</rant>
First, extract the number as a string using something like the C grammar's simple pattern for numeric preprocessing tokens. For use with scanf, I do an even simpler one:
This could be simplified even more, depending on what else you might expect in the input stream. The only thing you need to do is to not read beyond the end of the number; as long as you don't allow numbers to be followed immediately by letters, without intervening whitespace, the above will work fine.
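To illustrate the extraction step, here is one possible scanset. It is an assumption for illustration and not necessarily the pattern referred to above; the helper name extract_number_token and the 64-byte buffer size are also made up. It is deliberately loose (it would accept a lone "e", for instance) and relies on strtod rejecting bad tokens later.

```c
#include <stdio.h>

/* Hypothetical helper: skip leading whitespace, then copy the next run of
   number-like characters (sign, digits, '.', 'e', 'E') into token.
   token must have room for 64 bytes. Returns 0 on success, -1 if the
   input does not start with any such character. */
int extract_number_token(const char *input, char token[64])
{
    /* %63[...] reads at most 63 characters and appends the terminating NUL. */
    return sscanf(input, " %63[-+0-9.eE]", token) == 1 ? 0 : -1;
}
```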
Now, get the struct lconv (man 7 locale) representing the current locale using localeconv(3). The first entry listed in that struct is char *decimal_point; replace all of the '.' characters in your string with that value. (You might also need to replace '+' and '-' characters, although most locales don't change them, and the sign fields in the lconv struct are documented as only applying to currency conversions.) Finally, feed the resulting string through strtod and see if it passes.
This is not a perfect algorithm, particularly since it's not always easy to know how locale-compliant a given library actually is, so you might want to do some autoconf stuff to configure it for the library you're actually compiling with.
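The replace-then-strtod step could be sketched like this. The helper name parse_c_number and the 128-byte buffer are assumptions for illustration; the input is expected to be a single, already extracted number token written with '.' as the decimal point. Sign characters are left untouched here.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rewrite '.' as the current locale's decimal point,
   then let the locale-dependent strtod parse the result.
   Returns 0 on success, -1 on failure. */
int parse_c_number(const char *text, double *out)
{
    const char *dp = localeconv()->decimal_point;  /* e.g. "," or "." */
    size_t dp_len = strlen(dp);
    char buf[128];
    size_t n = 0;

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '.') {
            /* The decimal point may be longer than one byte. */
            if (n + dp_len >= sizeof buf)
                return -1;
            memcpy(buf + n, dp, dp_len);
            n += dp_len;
        } else {
            if (n + 1 >= sizeof buf)
                return -1;
            buf[n++] = *p;
        }
    }
    buf[n] = '\0';

    char *end;
    double value = strtod(buf, &end);
    /* "See if it passes": the whole token must have been consumed. */
    if (end == buf || *end != '\0')
        return -1;
    *out = value;
    return 0;
}
```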