How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?

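I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

Retrieves an HTML page (via Net::HTTP)
Creates a Nokogiri document from response.body
Extracts some information and sends it back in the response. The response should be UTF-8 encoded.

I ran into this problem while trying to read sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The problem is that the result of the encoding conversion is incorrect, even though no error is thrown.

With Net::HTTP, response.body.encoding gives ASCII-8BIT, which cannot be converted to UTF-8.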
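To illustrate the ASCII-8BIT point, here is a minimal sketch using only the Ruby standard library. The helper names `direct_encode_fails?` and `transcode_labelled` are hypothetical and exist only for this example. It assumes the raw bytes really are windows-1256 text that has merely been labelled as binary.

```ruby
# Hypothetical helper: try to transcode a binary (ASCII-8BIT) string directly.
# Ruby has no mapping from arbitrary high bytes to UTF-8, so this is expected
# to raise for any byte >= 0x80.
def direct_encode_fails?(raw)
  raw.encode('UTF-8')
  false
rescue Encoding::UndefinedConversionError
  true
end

# Hypothetical helper: first relabel the bytes with their real encoding
# (force_encoding changes only the label, not the bytes), then transcode.
def transcode_labelled(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Sample data: an Arabic word encoded as windows-1256, then relabelled as
# binary to mimic what response.body looks like in this question.
original = "\u0645\u0631\u062D\u0628\u0627"
raw      = original.encode('Windows-1256').b

puts direct_encode_fails?(raw)
puts transcode_labelled(raw, 'Windows-1256') == original
```

The distinction that matters is that `force_encoding` only changes how Ruby interprets the existing bytes, while `encode` actually rewrites them.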

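If I do Nokogiri::HTML(response.body) and use CSS selectors to get specific content from the page (say, the content of the title tag), I get a string whose string.encoding returns WINDOWS-1256. I call string.encode("UTF-8") and send the response using that, but again the response is not correct.

Any suggestions or ideas about what is wrong with my approach?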

Answer 1:

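This happens because Net::HTTP does not handle encoding correctly. See http://bugs.ruby-lang.org/issues/2567

You can parse response['content-type'], which contains the charset, instead of parsing the whole response.body.

Then use force_encoding() to set the right encoding.

For example, response.body.force_encoding("UTF-8") if the site is served in UTF-8.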
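The approach in this answer could be sketched roughly as follows. The helper names `charset_from_content_type` and `body_to_utf8`, and the UTF-8 fallback, are assumptions made for this illustration; they are not part of Net::HTTP. The header value is passed in as a plain string, as you would get from response['content-type'].

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value such as "text/html; charset=windows-1256".
def charset_from_content_type(content_type)
  return nil unless content_type
  match = content_type.match(/charset\s*=\s*["']?([^\s;"']+)/i)
  match && match[1]
end

# Hypothetical helper: label the body with the declared charset, then
# transcode it to UTF-8. Falls back to `fallback` (an assumption) when the
# header has no charset or names one Ruby does not know.
def body_to_utf8(body, content_type, fallback = 'UTF-8')
  charset = charset_from_content_type(content_type) || fallback
  encoding = begin
    Encoding.find(charset)
  rescue ArgumentError
    Encoding.find(fallback)
  end
  # dup so the caller's string keeps its original label
  body.dup.force_encoding(encoding).encode('UTF-8')
end
```

In the Sinatra app this might be called as body_to_utf8(response.body, response['content-type']). Note that `encode` can still raise if the bytes do not match the declared charset, so a real application may want to rescue that case.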

Answer 2:

I found that the following code works for me now:

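# Parse the response body into a Nokogiri document, transcoding it to UTF-8
# first when the source encoding is known.
def document
  if @document.nil? && response
    @document = if document_encoding
                  # Label the raw bytes with their real encoding, then transcode to UTF-8.
                  Nokogiri::HTML(response.body.force_encoding(document_encoding).encode('utf-8'), nil, 'utf-8')
                else
                  Nokogiri::HTML(response.body)
                end
  end
  @document
end

# Determine the page encoding: first from the Content-Type response header,
# then from a <meta http-equiv="Content-Type"> tag in the raw body.
def document_encoding
  return @document_encoding if @document_encoding
  response.type_params.each_pair do |k, v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    # Left disabled: calling document here would call document_encoding again.
    #document.css("meta[http-equiv=Content-Type]").each do |n|
    #  attr = n.get_attribute("content")
    #  @document_encoding = attr.slice(/charset=[a-z1-9\-_]+/i).split("=")[1].upcase if attr
    #end
    @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["'](.*)["']/i && $1 =~ /charset=(.+)/i && $1.upcase
  end
  @document_encoding
end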
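The meta-tag fallback above can also be tried in isolation, without Nokogiri or a live response. The following is a minimal sketch, not the answer's exact logic: `sniff_meta_charset` is a hypothetical name, the regex is a simplified variant that also accepts the shorter `<meta charset="...">` form, and the 4096-byte scan limit is an arbitrary assumption.

```ruby
# Hypothetical helper: look for a charset declaration in raw HTML bytes.
# Returns the charset name upcased, or nil when none is found.
def sniff_meta_charset(html)
  # Scan a binary copy of the first bytes so the regex never trips over
  # byte sequences that are invalid in the string's current encoding.
  head = html.b[0, 4096]
  match = head.match(/<meta[^>]+charset\s*=\s*["']?([a-z0-9_\-]+)/i)
  match && match[1].upcase
end
```

The returned name can then be passed to force_encoding in the same way as the value taken from the Content-Type header.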

Source: How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?