How to get correctly-encoded HTML from the clipboard

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?

For example, executing a command like this:

string s = (string) Clipboard.GetData(DataFormats.Html);

Results in stuff like:

<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced 
Search</A><BR>Â Â <A href="/preferences?hl=en">Preferences</A><BR>Â Â <A 
href="/language_tools?hl=en">Language 
Tools</A></FONT>

Not sure how Markdown will render this, but the resulting markup above contains stray characters.

It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

Tags: c# winforms encoding clipboard

6 answers

你好瞎i

#2 · 2019-04-28 17:32

I don't know what your original source document is, but be aware that Word and Outlook place several versions of the copied content on the clipboard, in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default when you're expecting the Windows-1252 one (Latin-1 + Smart Quotes)? Each non-ASCII character would then show up as several odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
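To make that effect concrete, here is a minimal sketch (not from the original answer) that encodes a no-break space as UTF-8 and then decodes the bytes with a single-byte encoding. It uses ISO-8859-1 (Latin-1) as a stand-in because it agrees with Windows-1252 for the two bytes involved; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: decode UTF-8 bytes with the wrong (single-byte) encoding.
    public static string MisreadUtf8AsLatin1(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // U+00A0 (no-break space) is the two bytes C2 A0 in UTF-8.
        string garbled = MisreadUtf8AsLatin1("\u00A0");

        // Latin-1 maps C2 to U+00C2 (the "Â" in the question) and A0 to U+00A0,
        // so one character turns into two.
        Console.WriteLine(garbled.Length);   // 2
        Console.WriteLine((int)garbled[0]);  // 194
        Console.WriteLine((int)garbled[1]);  // 160
    }
}
```

This matches the "Â" followed by what looks like a space in the question's output.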

Can you specify which encoding you want the clipboard contents in?


姐就是有狂的资本

#3 · 2019-04-28 17:39

Try this

System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);


Animai°情兽

#4 · 2019-04-28 17:41

In your case the corruption is less visible than it was in mine. Today I tried to copy data from the clipboard that contained a few Unicode characters. The data I got looked as if a UTF-8 encoded file had been read as Windows-1250 (the default encoding on my Windows installation).

It seems your case is the same. Save the HTML data as Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after each Â character. Then open the file as UTF-8 and you will see what the text should be.

For another project I wrote a function that fixes data with a corrupted encoding.

In this case a simple conversion should be sufficient:

byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);

My original function is a little more complex and contains tests to ensure that the data are not corrupted:

public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
  if (string.IsNullOrEmpty(text))
    return false;
  byte[] data = encoding.GetBytes(text);
  // there should not be any character outside source encoding
  string newStr = encoding.GetString(data);
  if (!string.Equals(text, newStr)) // if there is any character "outside"
    return false; // leave, the input is in a different encoding
  // IsValidUtf8 is a helper that is not shown in this answer; 0 means "not valid UTF-8"
  if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
    return false; // if not, cannot convert to UTF-8
  text = Encoding.UTF8.GetString(data);
  return true;
}
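The `IsValidUtf8` helper is not included in the answer. As a rough sketch of the same idea (an assumption on my part, not the author's code), the validity check can be delegated to a `UTF8Encoding` created with `throwOnInvalidBytes: true`, which throws `DecoderFallbackException` on malformed input. `Utf8Repair` and `TryFixMisencodedUtf8` are hypothetical names, and the demo assumes the text was mis-decoded as ISO-8859-1.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Strict decoder: throws DecoderFallbackException instead of emitting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical variant of FixMisencodedUTF8 with the UTF-8 check built in.
    public static bool TryFixMisencodedUtf8(string text, Encoding wrongEncoding, out string repaired)
    {
        repaired = text;
        if (string.IsNullOrEmpty(text))
            return false;

        // Recover the raw bytes; if the round trip changes the text, it contains
        // characters outside wrongEncoding and was not mis-decoded with it.
        byte[] data = wrongEncoding.GetBytes(text);
        if (wrongEncoding.GetString(data) != text)
            return false;

        try
        {
            repaired = StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not a valid UTF-8 sequence; leave the text untouched.
            return false;
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string fixedText;
        bool ok = TryFixMisencodedUtf8("\u00C2\u00A0", latin1, out fixedText);
        Console.WriteLine(ok);                 // True
        Console.WriteLine((int)fixedText[0]);  // 160
    }
}
```

Note that plain ASCII input also passes both checks and comes back unchanged, so a `true` result only means the text could be read as UTF-8, not that it was actually damaged.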

I know that this is not the best (or a correct) solution, but I did not find any other way to fix the input.

EDIT: (July 20, 2017)

It seems Microsoft has since found this error, and it now works correctly. I'm not sure which framework versions are affected, but I do know that the application now targets a different framework than it did when I wrote this answer (4.5 now; 2.0 before). As a result, all my code now fails when parsing the data, which leaves another problem: determining the correct behaviour for an application depending on whether the fix is already applied or not.


狗以群分

#5 · 2019-04-28 17:41

You have to interpret the data as UTF-8. See "MS Office hyperlinks change code page?".


趁早两清

#6 · 2019-04-28 17:46

The DataFormats.Html specification states that the data is encoded in UTF-8. However, there is a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.

You get a lot of wrongly decoded text, leading to funny/bad characters such as 'Å','â€¹','Å’','Å½','Å¡','Å“','Å¾','Å¸','Â','Â¡','Â¢','Â£','Â¤','Â¥','Â¦','Â§','Â¨','Â©'

Full explanation here: "Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters".

Solution: create a translation dictionary and do a search and replace.
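A translation dictionary along these lines might look like the sketch below. The three entries are only examples (no-break space, copyright sign, pound sign), written with escape sequences so the source file's own encoding doesn't matter; a real table would need an entry for every character you expect to meet. `MojibakeTable` and `Repair` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MojibakeTable
{
    // Illustrative entries only: mis-decoded sequence -> intended character.
    private static readonly Dictionary<string, string> Table = new Dictionary<string, string>
    {
        { "\u00C2\u00A0", "\u00A0" }, // "Â" + no-break space -> no-break space
        { "\u00C2\u00A9", "\u00A9" }, // "Â©" -> "©"
        { "\u00C2\u00A3", "\u00A3" }, // "Â£" -> "£"
    };

    // Replace every known mis-decoded sequence; string.Replace compares ordinally.
    public static string Repair(string text)
    {
        foreach (KeyValuePair<string, string> pair in Table)
            text = text.Replace(pair.Key, pair.Value);
        return text;
    }

    public static void Main()
    {
        string repaired = Repair("\u00C2\u00A9 2019");
        Console.WriteLine(repaired.Length);   // 6
        Console.WriteLine((int)repaired[0]);  // 169
    }
}
```

One limitation of this approach: text that legitimately contains "Â" followed by one of these characters would be rewritten too, so it is only safe when you know the input was mis-decoded.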


forever°为你锁心

#7 · 2019-04-28 17:52

Here's a PowerShell script that you could modify to fix any encoding problems in the clipboard contents.

http://www.johndcook.com/blog/2008/10/17/manipulating-the-clipboard-with-powershell/
