OCaml: How to remove all non-alphabetic characters

Posted 2019-05-24 22:55

How do I remove all the non-alphabetic characters from a string?

E.g.

"Wë_1ird?!"  ->  "Wëird"

In Perl, I'd do this with =~ s/[\W\d_]+//g. In Python, I'd use

re.sub(ur'[\W\d_]+', u'', u"Wë_1ird?!", flags=re.UNICODE)

Etc.

AFAICT, Str.regexp does not support \W, \d, etc. (I can't tell whether it supports Unicode, but somehow I doubt it).

2 answers
贼婆χ
#2 · 2019-05-24 23:31

Str doesn't support Unicode. Assuming you are dealing with UTF-8 encoded data, you can use Uutf and Uucp as follows:

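let keep_alpha s =
  let b = Buffer.create 255 in
  (* Folder called once per decoded code point; the accumulator is unit
     and the byte position is ignored. *)
  let add_alpha () _ = function
  | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep  (* invalid bytes become U+FFFD *)
  | `Uchar u -> if Uucp.Alpha.is_alphabetic u then Uutf.Buffer.add_utf_8 b u
  in
  Uutf.String.fold_utf_8 add_alpha () s;
  Buffer.contents b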

# keep_alpha "Wë_1ird?!";;
- : string = "Wëird"
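
If pulling in Uutf is not an option, roughly the same fold can be sketched with the standard library alone, assuming OCaml 4.14 or later (for String.get_utf_8_uchar). The standard library has no Unicode property database, so the alphabetic test is passed in as a parameter. The names filter_utf_8 and is_latin1_letter are hypothetical and only for illustration; is_latin1_letter covers just ASCII and Latin-1 letters, and in real code you would pass Uucp.Alpha.is_alphabetic instead.

```ocaml
(* Assumption: OCaml >= 4.14, standard library only.
   [filter_utf_8] is a hypothetical helper, not part of Uutf or Uucp. *)
let filter_utf_8 (keep : Uchar.t -> bool) (s : string) : string =
  let b = Buffer.create (String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then
        (* Mirror the answer above: invalid bytes become U+FFFD. *)
        Buffer.add_utf_8_uchar b Uchar.rep
      else begin
        let u = Uchar.utf_decode_uchar d in
        if keep u then Buffer.add_utf_8_uchar b u
      end;
      (* Advance by the number of bytes consumed by this decode. *)
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents b

(* Hypothetical stand-in predicate: ASCII letters plus Latin-1 letters
   (U+00C0..U+00FF except the multiplication and division signs). *)
let is_latin1_letter (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  (c >= 0x41 && c <= 0x5A)
  || (c >= 0x61 && c <= 0x7A)
  || (c >= 0xC0 && c <= 0xFF && c <> 0xD7 && c <> 0xF7)
```

Calling filter_utf_8 is_latin1_letter on the string from the question should keep the same letters as keep_alpha, but only because every letter in that example happens to fall inside Latin-1.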
叛逆
#3 · 2019-05-24 23:36

I'm not an expert in regexes or UTF, but if I were in your shoes I would use the re2 library. Here is my first approximation:

open Core.Std
open Re2.Std
open Re2.Infix

let drop _match = ""

let keep_alpha s = Re2.replace ~/"\\PL" ~f:drop s

The first three lines open libraries and bring their definitions into scope. You do not need to open a library to use it, but otherwise you have to prefix each definition with its module name. The OCaml Core library is deliberately designed so that a user opens the Std submodule to bring all the necessary definitions into scope. The Re2 library is from the same authors and follows the same conventions. open Re2.Infix brings infix (and prefix) operators into scope, namely ~/, which creates a regex from a string.

The drop function just ignores its argument and returns an empty string. I've prefixed the parameter with an underscore, since that is the convention for unused parameters (and the compiler respects it). You can also use a plain underscore as a wildcard instead, as in let drop _ = "". Next is the keep_alpha function, which replaces every character outside the Unicode letter class (that is what \PL matches) with an empty string, i.e., removes it from the output.
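
To make the two conventions above concrete, here is a small sketch that uses only the standard library, so it runs without Core or Re2. The function names shout_qualified, shout_opened and drop' are made up for illustration; String stands in for a library module that you can either prefix or open.

```ocaml
(* Qualified access: no open needed, but every name carries its module prefix. *)
let shout_qualified s = String.uppercase_ascii (String.trim s)

(* Local open: String's definitions are in scope inside the parentheses. *)
let shout_opened s = String.(uppercase_ascii (trim s))

(* A leading underscore marks the parameter as intentionally unused,
   so the compiler does not warn about it. *)
let drop _match = ""

(* A bare underscore is a wildcard: the argument is not bound at all. *)
let drop' _ = ""
```

Both versions of drop have type 'a -> string, so either can be passed as the ~f argument in the answer's code.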

Update

I've checked my code and fixed the errors. I would also like to show how to play with this code in the toplevel. You have several options, but the easiest is to use the coretop script that ships with the core library. It uses the utop toplevel, so make sure you have it installed:

 $ opam install -y utop

Once that is done, you can start the toplevel:

 $ coretop -require re2

The -require re2 flag will automatically find and load the re2 library into your toplevel. You can load additional libraries without restarting utop with the following directive:

 # #require "libname";;

The first # is the toplevel's prompt, which you shouldn't type; the second is the start of the directive, so make sure you actually type it. Every directive starts with a # symbol. There are other useful directives in utop, namely:

 # #use "filename.ml";;   (* will load and evaluate filename.ml      *)
 # #list;;                (* will list all available packages        *)
 # #typeof "keep_alpha";; (* will infer and print type of expression *)

The toplevel will not evaluate your code until you terminate it with the ;; sequence. You may sometimes see this ugly ;; in real code, but it is not needed there; it only tells the toplevel that you want it to evaluate your code right at this point and show you the result.
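
As a rough illustration of the last point, this is what file-style code looks like without any ;;. It is a standard-library sketch (assuming OCaml 4.07 or later for String.to_seq) with hypothetical names, and it filters ASCII letters only, so it is not a replacement for keep_alpha:

```ocaml
(* In a .ml file, top-level definitions simply follow one another. *)
let is_ascii_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* ASCII-only filter, used here just to have something to evaluate. *)
let keep_ascii_alpha s =
  String.of_seq (Seq.filter is_ascii_letter (String.to_seq s))

(* Side effects go in a let () binding instead of a bare expression
   terminated with ;;. *)
let () = print_endline (keep_ascii_alpha "W_1ird?!")
```

Pasting the same lines into utop works too, but there you would finish the input with ;; so the toplevel knows it can evaluate it.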
