How do I remove all the non-alphabetic characters from a string?
E.g.
"Wë_1ird?!" -> "Wëird"
In Perl, I'd do this with `=~ s/[\W\d_]+//g`. In Python, I'd use
re.sub(ur'[\W\d_]+', u'', u"Wë_1ird?!", flags=re.UNICODE)
Etc.
AFAICT, `Str.regexp` does not support `\W`, `\d`, etc. (I can't tell whether it supports Unicode, but somehow I doubt it).
`Str` doesn't support Unicode. Assuming you are dealing with UTF-8 encoded data, you can use Uutf and Uucp: Uutf decodes the string, and Uucp tells you whether each decoded character is alphabetic.

I'm not an expert in regexes and UTF, but if I were in your shoes I would use the `re2` library. My first approximation works as follows.

The first three lines open libraries and bring their definitions into scope. You do not need to open a library to use it, but otherwise you need to prefix each definition with its module name. The OCaml Core library is specifically designed so that a user opens its `Std` submodule to bring all the necessary definitions into scope. The `Re2` library is from the same people and follows consistent conventions. `open Re2.Infix` brings infix (and prefix) operators into scope, namely `~/`, which creates a regex from a string. The `drop` function just ignores its argument and returns an empty string. I've prefixed the parameter with an underscore, since that is the convention for unused parameters (and one the compiler respects). You can also use a plain underscore as a wildcard instead, as in `let drop _ = ""`. Next is the `keep_alpha` function, which substitutes an empty string for any UTF symbol that doesn't match the Unicode letter class, i.e., removes it from the output.

Update
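For readers on a current release of the library, the `drop` / `keep_alpha` pair described above could be sketched roughly as follows. This is not the original code. It assumes a recent `re2` package in which `create_exn` and `replace_exn` are exposed directly by the `Re2` module (so no `open Core.Std` or `open Re2.Std` is needed), and it uses `Re2.create_exn` in place of the `~/` operator. The name `non_letters` is made up for illustration.

```ocaml
(* Sketch only: assumes a recent Jane Street re2 release, where
   create_exn and replace_exn live directly in module Re2. *)

(* \PL matches any code point that is NOT in the Unicode Letter
   category; the + groups consecutive non-letters into one match. *)
let non_letters = Re2.create_exn "\\PL+"

(* Ignore the match and return the empty replacement string. *)
let drop _match = ""

(* Replace every run of non-letters with the result of [drop]. *)
let keep_alpha s = Re2.replace_exn ~f:drop non_letters s

(* "\xc3\xab" is the UTF-8 encoding of the letter e with diaeresis. *)
let () = print_endline (keep_alpha "W\xc3\xab_1ird?!")
```

Writing the input with byte escapes keeps the example independent of the encoding of the source file.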
I've checked my code and fixed the errors. I would also like to show how to play with this code in the toplevel. You have several options, but the easiest is to use the `coretop` script that ships with the `core` library. It uses the `utop` toplevel, so make sure that you have it installed. Once that is done, you can start the toplevel with `coretop`.
Passing the `-require re2` flag to `coretop` will automatically find and load the `re2` library into your toplevel. You can load additional libraries without restarting `utop` by using the `#require` directive. In a transcript it appears with two `#` characters: the first `#` is the toplevel's prompt, which you shouldn't type, but the second is the start of the directive, so make sure that you actually type it. Every directive starts with the `#` symbol. There are other useful directives in utop as well.

The toplevel will not evaluate your code until you terminate it with the `;;` sequence. You may sometimes see this ugly `;;` in real code, but it is not needed there; it only tells the toplevel that you want it to evaluate your code right at this place and show you the result.
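To tie this back to the original question, here is a minimal sketch of the Uutf/Uucp approach suggested at the top of the answers, written as it would appear in a source file (so without `;;`). It assumes the `uutf` and `uucp` packages are installed. The name `keep_alpha` and the choice to silently drop malformed bytes are illustrative assumptions, not something either library prescribes.

```ocaml
(* Sketch only: requires the uutf and uucp packages.
   [keep_alpha] is an illustrative name, not part of either library. *)
let keep_alpha (s : string) : string =
  let b = Buffer.create (String.length s) in
  let add_alpha () _pos = function
    | `Malformed _ ->
      (* Invalid UTF-8 bytes are dropped, like any other non-letter. *)
      ()
    | `Uchar u ->
      (* Keep only code points with the Unicode Alphabetic property. *)
      if Uucp.Alpha.is_alphabetic u then Uutf.Buffer.add_utf_8 b u
  in
  (* Decode [s] as UTF-8 and feed each decoded character to [add_alpha]. *)
  Uutf.String.fold_utf_8 add_alpha () s;
  Buffer.contents b

(* "\xc3\xab" is the UTF-8 encoding of the letter e with diaeresis. *)
let () = print_endline (keep_alpha "W\xc3\xab_1ird?!") (* prints "Wëird" *)
```

To try it in `utop`, load the two packages first with `#require "uutf";;` and `#require "uucp";;`, then paste the definition and end it with `;;`.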