How to get correctly-encoded HTML from the clipboard

Posted 2019-04-28 17:48

Question:

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?

For example, executing a command like this:

string s = (string) Clipboard.GetData(DataFormats.Html);

Results in stuff like:

<FONT size=-2>Â <A href="/advanced_search?hl=en">Advanced 
Search</A><BR>Â <A href="/preferences?hl=en">Preferences</A><BR>Â <A 
href="/language_tools?hl=en">Language 
Tools</A></FONT>

Not sure how Markdown will process this, but the resulting markup above contains weird characters.

It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

Answer 1:

In your example the problem is less visible than it was in my case. Today I tried to copy data from the clipboard that contained a few Unicode characters. The data I got looked as if a UTF-8 encoded file had been read as Windows-1250 (the local encoding on my Windows).

It seems your case is the same. If you save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after the Â character, and then open the file as UTF-8, you will see what should be there.

For another project I wrote a function that fixes data with a corrupted encoding.

In this case a simple conversion should be sufficient:

byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
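
One caveat worth spelling out: `Encoding.Default` is the system ANSI code page on .NET Framework, but it is always UTF-8 on .NET Core and .NET 5+, where the two lines above would leave the text unchanged. Below is a minimal sketch that names Windows-1252 explicitly. It assumes a modern .NET runtime (hence the `CodePagesEncodingProvider` registration); the class and method names `MojibakeDemo` and `Repair` are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Re-encodes the text with the code page it was wrongly decoded with,
    // then decodes the recovered bytes as UTF-8.
    public static string Repair(string garbled, int wrongCodePage = 1252)
    {
        // Needed on .NET Core / .NET 5+, where code page 1252 is not
        // available until the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding wrong = Encoding.GetEncoding(wrongCodePage);
        byte[] raw = wrong.GetBytes(garbled);
        return Encoding.UTF8.GetString(raw);
    }
}
```

For example, `MojibakeDemo.Repair("caf\u00C3\u00A9")` should give back "café", because Ã (0xC3) and © (0xA9) are the two UTF-8 bytes of é as seen through Windows-1252.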

My original function is a little more complex and contains tests to ensure that the data are not corrupted:

public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
  if (string.IsNullOrEmpty(text))
    return false;
  byte[] data = encoding.GetBytes(text);
  // there should not be any character outside source encoding
  string newStr = encoding.GetString(data);
  if (!string.Equals(text, newStr)) // if there is any character "outside"
    return false; // leave, the input is in a different encoding
  if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
    return false; // if not, can not convert to UTF-8
  text = Encoding.UTF8.GetString(data);
  return true;
}
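
The `IsValidUtf8` helper called above is not shown in the answer. One possible stand-in, assuming all that is needed is a yes/no check, is to decode with a strict `UTF8Encoding` that throws on malformed input. The names `Utf8Check` and `LooksLikeValidUtf8` are hypothetical, and this sketch returns a `bool` rather than the integer the original helper apparently returns.

```csharp
using System.Text;

public static class Utf8Check
{
    // Hypothetical stand-in for IsValidUtf8, which the answer does not show.
    // Returns true when the bytes decode as strict UTF-8 without errors.
    public static bool LooksLikeValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for malformed sequences.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this variant the check in `FixMisencodedUTF8` would read `if (!Utf8Check.LooksLikeValidUtf8(data))`. Pure ASCII input passes the check too, which is harmless because the conversion then leaves the text as it was.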

I know this is not the best (or a correct) solution, but I did not find any other way to fix the input.

EDIT: (July 20, 2017)

It seems Microsoft has already found this error, and it now works correctly. I'm not sure whether the problem exists only in some framework versions, but I know for sure that the application now uses a different framework than it did when I wrote this answer (now 4.5; previously 2.0). As a result, all my code now fails when parsing the data. That leaves another problem: how to determine the correct behaviour for an application running with the fix already applied versus without it.



Answer 2:

You have to interpret the data as UTF-8. See the related question "MS Office hyperlinks change code page?".



Answer 3:

Here's a PowerShell script for manipulating the clipboard that you could modify to fix any encoding problems.

http://www.johndcook.com/blog/2008/10/17/manipulating-the-clipboard-with-powershell/



Answer 4:

I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.

Can you specify which encoding you want the clipboard contents in?



Answer 5:

Try this:

System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);



Answer 6:

The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.

You get a lot of wrong encodings, leading to funny/bad characters such as 'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'

Full explanation here: "Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters".

Solution: create a translation dictionary and do a search and replace.
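
A rough sketch of what such a dictionary could look like, assuming the garbling is exactly "UTF-8 bytes read as Windows-1252" and a modern .NET runtime (hence the `CodePagesEncodingProvider` registration). Rather than typing the table by hand, it is generated for the code points U+0080 to U+00FF. That range and the names `MojibakeTable`, `Build` and `Apply` are choices made for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MojibakeTable
{
    // Maps each mis-decoded two-character sequence to the character it
    // should have been, for code points U+0080..U+00FF (illustrative range).
    public static Dictionary<string, string> Build()
    {
        // Needed on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);
        var table = new Dictionary<string, string>();
        for (int codePoint = 0x80; codePoint <= 0xFF; codePoint++)
        {
            string correct = ((char)codePoint).ToString();
            // Simulate the bug: UTF-8 bytes decoded as Windows-1252.
            string garbled = cp1252.GetString(Encoding.UTF8.GetBytes(correct));
            table[garbled] = correct;
        }
        return table;
    }

    // Single left-to-right pass, so text that has already been replaced
    // is never scanned again and cannot be "fixed" twice.
    public static string Apply(string text, Dictionary<string, string> table)
    {
        var sb = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            // Every key in this table is exactly two characters long.
            if (i + 1 < text.Length &&
                table.TryGetValue(text.Substring(i, 2), out string repaired))
            {
                sb.Append(repaired);
                i += 2;
            }
            else
            {
                sb.Append(text[i]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

Usage would be along the lines of `MojibakeTable.Apply(html, MojibakeTable.Build())`. Note that a table like this cannot tell garbled text from text that legitimately contains a sequence such as "Ã©", so it is only safe on input known to be mis-decoded.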