I have some XML content (UTF-8) that contains invalid characters. Nokogiri tells me "Line 2190, SyntaxError: PCDATA invalid Char value 15" when I try to parse the content with Nokogiri::XML(content).
The character is displayed in Sublime Text editor as a "SI":
When I try to copy the character, nothing gets copied, so I can't even look it up. When I open the file in another editor, for example Atom, the "SI" is not displayed. However, when I step through the characters with the right arrow key, I have to press it twice to get past the place where the "SI" character sits.
First, what kind of character is this? And second, is there a way in Ruby to remove such characters? I tried content.chars.select{|i| i.valid_encoding?}.join, but it doesn't remove the character.
Update
I found the character by reading the original file with Ruby. The character is \u000F, and "\u000F".ord returns the character code 15. According to http://www.fileformat.info/info/unicode/char/000f/index.htm this is the SHIFT IN character. Are there other characters like that? I could remove them with str.split("\u000F").join, but if there are other characters like this, that doesn't seem like a good approach. Any ideas?
Here is a method to remove control characters, but NOT whitespace, from UTF-8 text. Iconv first converts the string to UTF-8. The encode call lets you specify how invalid characters are treated, but it does not remove control characters. The gsub removes control characters while leaving whitespace alone. Because of regex constraints, it substitutes on "NOT (NOT control OR whitespace)" rather than the equivalent "control AND NOT whitespace". This works in Ruby 1.9.x and later; it will not work in 1.8.7 REE.
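
A rough sketch of that idea for current Rubies. Iconv has not been part of the standard library since Ruby 2.0, so this swaps in String#scrub (Ruby 2.1+) for the Iconv/encode step, and uses a negative lookahead to say "control AND NOT whitespace" directly. The method name remove_control_chars is made up for illustration, and the input is assumed to be tagged as UTF-8 already.

```ruby
# Hypothetical helper (name is ours). Assumes `text` is tagged as UTF-8.
def remove_control_chars(text)
  text
    .scrub("")                                   # drop bytes that are invalid UTF-8
    .gsub(/(?![[:space:]])[[:cntrl:]]/, "")      # drop control chars that are not whitespace
end
```

Tabs, newlines and carriage returns survive because they match [[:space:]]; something like U+000F does not, so it is removed.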
If it were byte sequences actually invalid for the encoding (UTF-8), then in Ruby 2.1+ you could use the String#scrub method. By default it replaces invalid bytes with the Unicode replacement character (usually represented as a question mark in a box), but you can also use it to remove them entirely.
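
A small sketch of both uses of scrub. The wrapper method names are invented for the example; the only real API here is String#scrub.

```ruby
# String#scrub (Ruby 2.1+) only touches byte sequences that are
# invalid for the string's encoding.
def replace_invalid_bytes(text)
  text.scrub        # each invalid sequence becomes U+FFFD
end

def drop_invalid_bytes(text)
  text.scrub("")    # each invalid sequence is removed
end

drop_invalid_bytes("abc\xFFdef")    # => "abcdef"
drop_invalid_bytes("abc\u000Fdef")  # unchanged, because U+000F is valid UTF-8
```

The last call shows why scrub alone doesn't help with the question's input.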
However, as you note, your 'weird byte' is actually valid UTF-8 representing the Unicode codepoint "\u000F", the SHIFT IN control character. (Good job figuring out the actual bytes/character involved, that's the hard part!)
So we have to be clear about what we mean by "characters like that", if we want to remove them. Characters like what?
Nokogiri is complaining that it's invalid in an XML "PCDATA" (Parsed Character Data) area. Why would it be legal unicode/UTF-8, but invalid in XML PCDATA? What is legal in XML character data? I tried to figure it out, but it gets confusing, with the spec apparently saying that some characters are 'discouraged' (what?), and making what are to my eyes contradictory statements about other things.
I'm not sure exactly which characters Nokogiri will disallow from PCDATA. We'd have to look at the Nokogiri source (or more likely the libxml source), or ask someone who knows more about the nokogiri/libxml source.
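
One starting point is the Char production in the XML 1.0 specification, which permits tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. The sketch below strips everything outside that set. That Nokogiri/libxml rejects exactly these code points is an assumption, not something verified here, and the constant and method names are invented for the example.

```ruby
# Matches any code point outside the XML 1.0 "Char" production.
# Assumption: this is the set the parser enforces for PCDATA.
XML_1_0_INVALID = /[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

# Hypothetical helper: remove characters XML 1.0 does not allow.
def strip_xml_invalid_chars(text)
  text.gsub(XML_1_0_INVALID, "")
end
```

Under that assumption, running the content through strip_xml_invalid_chars before Nokogiri::XML(content) would remove U+000F while keeping tabs and newlines.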
However, "\u000F" is a "control character", it's unlikely you want control characters in your XML character data (unless you know you do), and the XML spec seems to discourage control characters (and apparently Nokogiri/libxml actually disallows them?). So one way to interpret "characters like this" is "control characters".
You can remove all control characters from a string with this regex, for example:
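
A minimal sketch, assuming the POSIX bracket class [[:cntrl:]] is an acceptable definition of "control character" (the method name is made up):

```ruby
# Hypothetical helper: strip every control character.
def strip_control_chars(text)
  text.gsub(/[[:cntrl:]]/, "")
end
```

Be aware that tab, newline and carriage return are control characters too, so this removes them as well.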
Alternatively, we could interpret "characters like this" as any character that doesn't print. That is a wider category than "control characters", and it will include some characters that nokogiri has no problem with at all. We can try to remove a bit more than just control characters by using Ruby's support for Unicode character classes in regexes:
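
A minimal sketch using the negated [[:print:]] class (the method name is invented for the example):

```ruby
# Hypothetical helper: keep only characters Ruby considers printable.
def strip_non_printing(text)
  text.gsub(/[^[:print:]]/, "")
end
```

Ordinary spaces are kept. Depending on the Ruby version, tabs and newlines may not count as printable, so check that before running this over multi-line content.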
[[:print:]] is documented rather vaguely as "excludes control characters, and similar", so that's kind of a match for our vague spec of what we want to do. :)
So it really depends on what we mean by "characters like this". Really, "characters like this" for your case probably means "any char that Nokogiri/libxml will refuse to allow", and I'm afraid I haven't actually answered that question, because I'm not sure and was not able to easily figure it out. But for many cases, removing control chars, or even better removing chars that don't match [[:print:]], will probably do just fine, unless you have a reason to want control chars and similar to remain (if you knew you needed them as record separators, for instance).
If instead of removing them you wanted to replace them with the Unicode replacement char, which is commonly used to stand in for "byte sequence we couldn't handle":
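
A minimal sketch of the replacing variant, assuming the same [[:print:]] test as above (method name and default argument are made up):

```ruby
# Hypothetical helper: swap non-printing characters for U+FFFD
# (or any other replacement string) instead of deleting them.
def replace_non_printing(text, replacement = "\uFFFD")
  text.gsub(/[^[:print:]]/, replacement)
end
```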
If instead of removing them you want to escape them in some way so they can be reconstructed after XML parsing... ask again with that and I'll figure it out, but I haven't worked that out yet. :)
Welcome to dealing with character encoding issues, it sure does get confusing sometimes.
This same thing happened to me reading emails from an xlsx file with the Roo gem.
I never knew exactly which bytes/character was coming through in my string, but since I knew which characters I would accept, I just removed those that didn't match, like this:
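
A sketch of that whitelist approach. The accepted set here (ASCII letters, digits and a few punctuation marks common in email addresses) is only a guess for illustration, and the method name is invented; adjust the character class to whatever you actually accept.

```ruby
# Hypothetical whitelist: delete anything that is not an ASCII letter,
# digit, or one of @ . _ + -
def keep_allowed_chars(text)
  text.gsub(/[^A-Za-z0-9@._+\-]/, "")
end
```

With this, a stray control character in the middle of an address is simply dropped, whatever it turns out to be.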