I have an Erlang string which may contain characters like & " < and so on:
1> Unenc = "string & \"stuff\" <".
"string & \"stuff\" <"
Is there an Erlang function somewhere that parses the string and encodes all the needed HTML/XML entities, such as:
2> Enc = xmlencode(Unenc).
"string &amp; &quot;stuff&quot; &lt;"
?
My use case is for relatively short strings, which come from user input. The output strings of the xmlencode function will be the content of XML attributes:
<company name="Acme &amp; C." currency="&euro;" />
The final XML will be sent over the wire appropriately.
There is a function in the Erlang distribution that escapes angle brackets and ampersands, but it isn't documented, so it is probably best not to rely on it:
1> xmerl_lib:export_text("string & \"stuff\" <").
"string &amp; \"stuff\" &lt;"
If you want to build or encode XML structures (rather than just a single string), then the xmerl API would be a good option, e.g.:
2> xmerl:export_simple([{foo, [], ["string & \"stuff\" <"]}], xmerl_xml).
["<?xml version=\"1.0\"?>",
[[["<","foo",">"],
["string &amp; \"stuff\" &lt;"],
["</","foo",">"]]]]
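Since the question is about attribute values, here is a minimal sketch of the same xmerl approach applied to attributes. The module name `xmerl_attr_demo` and the function `company/2` are made up for illustration, and the sketch assumes that the `xmerl_xml` callback escapes attribute values on export; check that against your OTP version before relying on it.

```erlang
-module(xmerl_attr_demo).
-export([company/2]).

%% Build a <company .../> element from raw (unescaped) strings and
%% return the exported document as a flat string.
%% Assumption: escaping of the attribute values is done by xmerl.
company(Name, Currency) ->
    Simple = {company, [{name, Name}, {currency, Currency}], []},
    lists:flatten(xmerl:export_simple([Simple], xmerl_xml)).
```

Calling `xmerl_attr_demo:company("Acme & C.", "EUR")` should give a document in which the ampersand in the name attribute appears as `&amp;`.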
If your needs are simple, you could do this with a map over the chars in the string.
quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C) -> C.
Then you would do
1> Raw = "string & \"stuff\" <".
2> Quoted = lists:map(fun quote/1, Raw).
But Quoted would not be a flat list. That is still fine if you are going to write it to a file or send it as an HTTP reply; see Erlang's iolists.
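If a flat string is needed after all, one option is to have every clause return a list and use `lists:flatmap/2` instead of `lists:map/2`. The following is only a sketch: the module name `xml_escape` and its two functions are hypothetical, and the input is assumed to be a list of Unicode code points.

```erlang
-module(xml_escape).
-export([escape/1, escape_to_binary/1]).

%% Escape the four characters handled above and return a flat string.
escape(Str) when is_list(Str) ->
    lists:flatmap(fun quote/1, Str).

%% Same, but as a UTF-8 binary. unicode:characters_to_binary/1 is used
%% rather than iolist_to_binary/1 because the latter rejects code
%% points above 255.
escape_to_binary(Str) ->
    unicode:characters_to_binary(escape(Str)).

quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C)  -> [C].   % wrap in a list so flatmap can append it
```

With this, `xml_escape:escape("string & \"stuff\" <")` returns `"string &amp; &quot;stuff&quot; &lt;"`.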
More recent Erlang releases also provide functions for converting between multibyte UTF-8 and code-point representations; see the Erlang unicode module.
Reformatted comments, to make code examples stand out:
ettore: That's kind of what I am doing, although I do have to support multibyte characters. Here's my code:
xmlencode([], Acc) -> Acc;
xmlencode([$<|T], Acc) -> xmlencode(T, Acc ++ "&lt;");
% euro symbol
xmlencode([226,130,172|T], Acc) -> xmlencode(T, Acc ++ "&euro;");
xmlencode([OneChar|T], Acc) -> xmlencode(T, lists:flatten([Acc,OneChar])).
Although I would prefer not to reinvent the wheel if possible.
dsmith: The string that you are using would normally be a list of Unicode code points (i.e. a list of numbers), so any given byte encoding is irrelevant. You would only need to worry about specific encodings if you are working directly with binaries.
To clarify, the Unicode code-point for the euro symbol (decimal 8364) would be a single element in your list. So you would just do this:
xmlencode([8364|T], Acc) -> xmlencode(T, Acc ++ "&euro;");
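To make the byte versus code-point distinction concrete, here is a rough sketch that first decodes UTF-8 bytes into code points and then escapes them. The module name `xml_codepoints` and both function names are hypothetical. Emitting numeric character references (such as `&#8364;`) for everything above ASCII is a choice made for this sketch, not something taken from the code above; it avoids having to list named entities one by one.

```erlang
-module(xml_codepoints).
-export([from_utf8/1, encode/1]).

%% Decode UTF-8 bytes (a binary or a list of bytes) into code points.
%% The bytes are turned into a binary first, because integers inside a
%% list are treated as code points by the unicode module, not as bytes.
from_utf8(Bytes) ->
    unicode:characters_to_list(iolist_to_binary(Bytes), utf8).

%% Escape a list of code points and return a flat ASCII string.
encode(CodePoints) when is_list(CodePoints) ->
    lists:flatmap(fun enc/1, CodePoints).

enc($<) -> "&lt;";
enc($>) -> "&gt;";
enc($&) -> "&amp;";
enc($") -> "&quot;";
enc(C) when C > 127 -> "&#" ++ integer_to_list(C) ++ ";";
enc(C) -> [C].
```

For example, `xml_codepoints:from_utf8([226,130,172])` gives `[8364]`, and `xml_codepoints:encode([8364])` gives `"&#8364;"`.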
I'm not aware of one in the included OTP packages. However, Mochiweb's mochiweb_html module (mochiweb_html.erl) has an escape function that handles lists, binaries, and atoms.
For URL encoding, check out the urlescape function in the mochiweb_util module (mochiweb_util.erl).
You could use either of those libraries to get what you needed.