As some of you make know, I'm working on XMPP (Jabber) integration for the StackOverflow chat system, as an XMPP component written in Ruby using the xmpp4r package.
I'm struggling with one issue (well, many issues, but one issue at the moment :-) I am taking the JSON feed from the chat and extracting the HTML for the messages. I am using The Ruby TidyHTML bindings to convert the HTML from the JSON fed to XHTML, so that I can send it as an XMPP message -- since XMPP messages are just XML, converting the HTML to XHTMl should make it valid XML which I can just stick into the <message>
stanza.
For most messages, it works great!
However for other messages, it completely chokes -- the XMPP server closes the stream and the script grinds to a halt. (And rchern and others in The Tavern get upset. Well, maybe not upset, but they laugh at me. This makes me sad!)
I am almost certain that what's happing is, for some reason or other, the messages are not valid XML, and so the XMPP server is closing the connection because it encounters a parse error in the XML stream from the Ruby component. Here's an example of one such message:
<message to='jeswah@smart-safe-secure.com/Token' type='groupchat' xmlns='jabber:client'><body><div class="onebox ob-message"><a class="roomname" href="/transcript/message/263372#263372"><span title="2010-11-04 19:20:23Z">1 hour ago</span></a>, by <span class="user-name">Fosco</span> <br/><div class="quote"><div class="room-mini"><div class="room-mini-header"><h3><img class="small-site-logo" title="Gaming" alt="Gaming" width="16" height="16" src="http://sstatic.net/gaming/img/favicon.ico" />&nbsp;<span class="room-name"><a href="http://chat.stackexchange.com/rooms/28/minecraft-talk">Minecraft Talk</a></span></h3><div class="room-mini-description">Everything Minecraft, including classic and survival mode</div></div><div class="room-current-user-count" title="current users">9</div><div class="mspark" style="height:25px;width:205px">
<div class="mspbar" style="width:8px;height:13px;left:0px;"></div><div class="mspbar" style="width:8px;height:9px;left:8px;"></div><div class="mspbar" style="width:8px;height:2px;left:16px;"></div><div class="mspbar" style="width:8px;height:8px;left:24px;"></div><div class="mspbar" style="width:8px;height:1px;left:32px;"></div><div class="mspbar" style="width:8px;height:1px;left:56px;"></div><div class="mspbar" style="width:8px;height:0px;left:64px;"></div><div class="mspbar" style="width:8px;height:0px;left:88px;"></div><div class="mspbar" style="width:8px;height:0px;left:96px;"></div><div class="mspbar" style="width:8px;height:11px;left:104px;"></div><div class="mspbar" style="width:8px;height:7px;left:112px;"></div><div class="mspbar" style="width:8px;height:7px;left:120px;"></div><div class="mspbar" style="width:8px;height:25px;left:128px;"></div><div class="mspbar" style="width:8px;height:14px;left:136px;"></div><div class="mspbar" style="width:8px;height:4px;left:144px;"></div><div class="mspbar" style="width:8px;height:7px;left:152px;"></div><div class="mspbar" style="width:8px;height:19px;left:160px;"></div><div class="mspbar" style="width:8px;height:19px;left:168px;"></div><div class="mspbar" style="width:8px;height:12px;left:176px;"></div><div class="mspbar" style="width:8px;height:11px;left:184px;"></div><div class="mspbar now" style="height:25px;left:154px;"></div></div>
<div class="clear-both"></div></div></div></div></body><html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><div class="onebox ob-message"><a class="roomname" href="/transcript/message/263372#263372"><span title="2010-11-04 19:20:23Z">1 hour ago</span></a>, by <span class="user-name">Fosco</span><br />
<div class="quote">
<div class="room-mini"><div class="room-mini-header">
<h3><img class="small-site-logo" title="Gaming" alt="Gaming" width="16" height="16" src="http://sstatic.net/gaming/img/favicon.ico" /> <span class="room-name"><a href="http://chat.stackexchange.com/rooms/28/minecraft-talk">Minecraft Talk</a></span></h3>
<div class="room-mini-description">Everything Minecraft, including classic and survival mode</div>
</div>
<div class="room-current-user-count" title="current users">9</div>
<div class="mspark" style="height:25px;width:205px">
<div class="mspbar" style="width:8px;height:13px;left:0px;"></div>
<div class="mspbar" style="width:8px;height:9px;left:8px;"></div>
<div class="mspbar" style="width:8px;height:2px;left:16px;"></div>
<div class="mspbar" style="width:8px;height:8px;left:24px;"></div>
<div class="mspbar" style="width:8px;height:1px;left:32px;"></div>
<div class="mspbar" style="width:8px;height:1px;left:56px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:64px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:88px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:96px;"></div>
<div class="mspbar" style="width:8px;height:11px;left:104px;"></div><div class="mspbar" style="width:8px;height:7px;left:112px;"></div><div class="mspbar" style="width:8px;height:7px;left:120px;"></div><div class="mspbar" style="width:8px;height:25px;left:128px;"></div><div class="mspbar" style="width:8px;height:14px;left:136px;"></div>
<div class="mspbar" style="width:8px;height:4px;left:144px;"></div>
<div class="mspbar" style="width:8px;height:7px;left:152px;"></div>
<div class="mspbar" style="width:8px;height:19px;left:160px;"></div>
<div class="mspbar" style="width:8px;height:19px;left:168px;"></div><div class="mspbar" style="width:8px;height:12px;left:176px;"></div>
<div class="mspbar" style="width:8px;height:11px;left:184px;"></div>
<div class="mspbar now" style="height:25px;left:154px;"></div>
</div>
<div class="clear-both"></div>
</div>
</div>
</div>
</body></html></message>
(This message happened to be a quote of a oneboxed link to a chat room)
Here was the error Ruby gave me:
IOError: stream closed
/usr/lib/ruby/1.8/xmpp4r/stream.rb:594:in `empty?'
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:153:in `empty?'
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:193:in `pull'
/usr/lib/ruby/1.8/rexml/parsers/sax2parser.rb:92:in `parse'
/usr/lib/ruby/1.8/xmpp4r/streamparser.rb:79:in `parse'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:75:in `start'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `initialize'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `new'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `start'
/usr/lib/ruby/1.8/xmpp4r/connection.rb:119:in `start'
/usr/lib/ruby/1.8/xmpp4r/component.rb:70:in `start'
/usr/lib/ruby/1.8/xmpp4r/connection.rb:77:in `connect'
/usr/lib/ruby/1.8/xmpp4r/component.rb:47:in `connect'
./classes/SOXMPP_Bridge.rb:20:in `initialize'
./soxmpp.rb:81:in `new'
./soxmpp.rb:81
Finally, my question!
Given that sending invalid XML to the XMPP server kicks me off, is there any way using Ruby I can validate (and, preferably, correct) the XML before sending it to the XMPP server? Most likely, correcting it will be a matter of my writing additional code for each case where Tidy isn't producing valid XML, but I'd at least like to stop the script from crashing. So, how can I validate the XML before sending it to the XMPP server?
Are you running on *nix? If so, I'd delegate the problem to
xmllint
, a program that is a part of libxml2. I work with a system that generates xml before sending it over the net; we validate our xml with xmllint, like so:You'll need to adapt this to your situation, of course, but the basic idea works well. If you don't have a DTD, leave out the "--schema" bit.
Maybe actually converting to XML with Nokogiri will help? You can then re-serialize for the XMPP stream. Also, if you want your stuff to scale a bit and avoid memory bloats, switch to Blather instead of XMPP4r. Also the DSL is pretty awesome!
Don't use Tidy. Use the HTML5 parser, then dump the DOM it generates to XML. If you can produce a DOM, you can produce well-formed XML from it every time. It will also have the advantage of producing roughly the same DOM that most modern browsers will give you.
The actual error in this case is your
. According to XEP-0071, section 8, point 5:So this issue is about more than just generating well-formed XML, which is a pre-requisite. You'll also want to ensure that you're only using XHTML from the approved set in section 6.
In short, you need to read XEP-0071.