I have a dataset that abbreviates numerical values in a column. For example, 12M means 12 million and 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?
I've thought about using gsub to convert M to 000,000 etc., but that does not take the decimals into account (1.5M would then become 1.5000000).
In your case you can use gsubfn.
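The same substitution idea can be sketched in base R without gsubfn, which may be handy if you would rather avoid the extra package. This is only a minimal sketch, not the code from the answer above: the sample vector `x` is made up, and it assumes that `k` and `M` are the only suffixes present. Rewriting each suffix as an exponent sidesteps the decimal problem from the question, because "1.5M" becomes "1.5e6" rather than "1.5000000".

```r
# Hypothetical sample data; only "k" and "M" suffixes are assumed to occur.
x <- c("12M", "1.2k", "950", "1.5M", "300k")

# Rewrite each suffix as scientific notation, e.g. "1.5M" -> "1.5e6".
as_sci <- sub("k", "e3", sub("M", "e6", x, fixed = TRUE), fixed = TRUE)

# as.numeric() understands scientific notation, so decimals are handled correctly.
values <- as.numeric(as_sci)

# Sort the original labels by their numeric value, lowest to highest.
sorted <- x[order(values)]
print(sorted)
```

In a data frame you could apply the same `order(values)` index to the rows instead of to the vector.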
Then just multiply that power-of-ten by the decimal value you have.
-1*3
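As a rough illustration of the position-based lookup described here, the sketch below finds where the prefix letter sits in a string of prefixes and turns that position into a power of ten. The helper name `si_to_numeric` and the prefix string "kMGTPEY" (with a lowercase k, to match the question's data) are assumptions made for the example, and it only handles non-negative values with at most one trailing letter.

```r
# Hypothetical helper: converts strings such as "1.2k" or "12M" to numbers.
# Assumes non-negative values with at most one trailing letter from `prefixes`.
si_to_numeric <- function(x, prefixes = "kMGTPEY") {
  vapply(x, function(s) {
    unit <- sub("^[0-9.]+", "", s)                   # trailing letter, or "" if none
    mantissa <- as.numeric(sub("[A-Za-z]$", "", s))  # numeric part without the letter
    if (unit == "") return(mantissa)                 # plain number, nothing to scale
    # Position of the letter in the prefix string: 1 for k, 2 for M, and so on.
    pos <- as.integer(regexpr(unit, prefixes, fixed = TRUE))
    if (pos < 1) return(NA_real_)                    # unknown prefix
    mantissa * 10^(3 * pos)                          # each step is a factor of 10^3
  }, numeric(1), USE.NAMES = FALSE)
}

x <- c("12M", "1.2k", "950")
print(x[order(si_to_numeric(x))])
```

The anonymous function is applied one element at a time because regexpr only uses a single pattern per call.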
Now if you want to match both 'k' and 'K' case-insensitively as kilo (as computer people often write, even though it is technically an abuse of SI), you'll need to special-case it, e.g. with an if-else ladder or expression. SI prefixes are case-sensitive in general: 'M' means mega but 'm' strictly means milli, even if disk-drive users say otherwise; upper case is conventionally used for positive exponents. So for a few prefixes, @DanielV's case-specific code is better.
If you want negative SI prefixes too, use
as.integer(regexpr(u, 'zafpnum@KMGTPEY')-8)
where @ is just a throwaway character that keeps the spacing uniform; it shouldn't actually get matched. Again, if you need to handle units that are not powers of 10**3, like 'deci' or 'centi', that will require special-casing, or the general dict-based approach WeNYoBen uses.
base::regexpr is not vectorized over its pattern argument, and its performance is bad on big inputs, so if you want to vectorize and get higher performance, use stringr::str_locate.
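To make the offset arithmetic concrete, here is a small sketch of the negative-prefix lookup. It swaps in base R's `match` on the split prefix string instead of stringr::str_locate, purely so the example has no package dependency; `si_multiplier` is a hypothetical name, and the case-sensitive matching is an assumption that follows the SI convention discussed above.

```r
# Hypothetical helper: maps SI prefix letters to their multipliers.
# "@" is the throwaway placeholder at position 8, i.e. exponent 0.
si_multiplier <- function(units, prefixes = "zafpnum@KMGTPEY") {
  letters_vec <- strsplit(prefixes, "", fixed = TRUE)[[1]]
  pos <- match(units, letters_vec)   # vectorized and case-sensitive; NA if absent
  pos[units %in% "@"] <- NA          # the placeholder must never count as a unit
  10^(3 * (pos - 8))                 # e.g. "m" is position 7 -> 10^(3 * -1)
}

print(si_multiplier(c("m", "K", "M", "u")))
```

Because `match` works on whole vectors, this version needs no explicit loop over the units.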
I wrote another answer
Define function
Result
Give this a shot: