Nokogiri adds characters during parsing on Heroku

Posted 2019-06-03 07:01

Question:

It seems Nokogiri has a problem with UTF-8 conversion of the nbsp character. From what I've gathered, this is an issue in libxml2: Nokogiri recommends libxml2 2.7.7, but Heroku is running 2.7.6.

Does anyone know how I can use libxml2 2.7.7 (or higher) on Heroku?

The problem is as follows --

doc = Nokogiri::HTML("<html><p>Hi Hello</p></html>")
doc.inner_html
=> "<html><body><p>Hi Hello</p></body></html>"

doc.inner_html = "<p>Hello&nbsp;World</p>"
=> "<p>Hello&nbsp;World</p>"

doc.inner_html
=> "<p>Hello World</p>"

Looks like this is related: https://github.com/sparklemotion/nokogiri/issues/306

This doesn't happen on my local machine. Rails has 'utf-8' set as the config.encoding and the page that's rendered has a utf-8 charset meta tag.
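For context, here is a minimal sketch of one way an nbsp can pick up an extra character: its UTF-8 bytes (0xC2 0xA0) get reinterpreted as ISO-8859-1 and re-encoded. This uses only Ruby's String encoding methods; the helper name misread_as_latin1 is hypothetical, and whether libxml2 2.7.6 does exactly this is an assumption, not something confirmed above.

```ruby
# Hypothetical helper: simulate a parser that treats UTF-8 bytes as
# ISO-8859-1 and then converts the result back to UTF-8.
def misread_as_latin1(str)
  str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

nbsp = "\u00A0"                      # non-breaking space
p nbsp.bytes.map { |b| "%02X" % b }  # the two UTF-8 bytes of nbsp

garbled = misread_as_latin1("Hello#{nbsp}World")
p garbled.codepoints.length          # one codepoint longer than the input
p garbled.include?("\u00C2")         # the extra character is U+00C2
```

If the output on Heroku contains U+00C2 in front of the nbsp, that would be consistent with this kind of double conversion.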

On my local machine I'm running Nokogiri 1.6 with libxml2 2.8.0, and on Heroku I'm running Nokogiri 1.6 with libxml2 2.7.6.
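A quick way to confirm which libxml2 each environment actually loads is to read Nokogiri::VERSION_INFO. This is a rough sketch: the helper libxml_older_than? is hypothetical, the 2.7.7 threshold is taken from the recommendation mentioned above, and the "loaded" key is assumed to be present (it is absent on JRuby builds).

```ruby
require "rubygems"

# Hypothetical helper: true when the loaded libxml2 is below the minimum.
def libxml_older_than?(loaded, minimum = "2.7.7")
  Gem::Version.new(loaded) < Gem::Version.new(minimum)
end

begin
  require "nokogiri"
  # VERSION_INFO is a plain Hash; "libxml" may be missing on some platforms.
  loaded = (Nokogiri::VERSION_INFO["libxml"] || {})["loaded"]
  puts "libxml2 loaded: #{loaded.inspect}"
  puts "older than 2.7.7" if loaded && libxml_older_than?(loaded)
rescue LoadError
  puts "nokogiri is not installed in this environment"
end
```

Running this in a local console and in `heroku run console` should show whether the two environments differ.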

Thanks.

Answer 1:

Unfortunately Heroku doesn't support installing additional libraries or binaries to stacks. The best workaround is to vendor these into your project. You'll need to use 64-bit Linux versions to make them work on Heroku; compiling statically can also help ensure that any dependencies needed are included. Similarly, for gems that depend on external libraries, we recommend compiling the gem statically and vendoring it into your project.

If you do wish to try to vendor your binary, library, or gem, you can use Heroku as your build environment. One of Heroku's engineers created a build server that allows you to upload source code, run the compilation step, and then download the resulting binary. You can find this project on GitHub under the name "Vulcan".

Here's a link with more instructions: https://devcenter.heroku.com/articles/buildpack-binaries