I need to convert the string HTML from a mix of Cyrillic and Latin symbols to UNICODE.
I tried the following:
Public HTML As String
Sub HTMLsearch()
GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")
MsgBox HTML
HTML = StrConv(HTML, vbUnicode)
MsgBox HTML
End Sub
Function GetHTML(URL As String) As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
HTML = .ResponseText
End With
End Function
You can see what is before and after the StrConv. If you like to get the html in a file, you can use the following code:
Public HTML As String
Sub HTMLsearch()
GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")
Dim path As String
path = ThisWorkbook.path & "\html.txt"
Open path For Output As #1
Print #1, HTML
Close #1
HTML = StrConv(HTML, vbUnicode)
path = ThisWorkbook.path & "\htmlUNICODE.txt"
Open path For Output As #1
Print #1, HTML
Close #1
End Sub
Function GetHTML(URL As String) As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.Send
HTML = .ResponseText
End With
End Function
IDEAS?
My production order data comes from many countries. this is the only vba function I could find that really works.
Example Usage:
VBA's support for Unicode is not all that great.
It is possible to handle Unicode strings, but you will not be able to see the actual characters with
Debug.Print
orMsgBox
- they will appear as?
there.You can set Control Panel > Region and Language > Administrative tab > "Current language for non-Unicode programs" to "Russian" switch to a different code page, which would allow you to see Cyrillic letters in VBA message boxes instead of question marks. But that's only a cosmetic change.
Your real problem is something else here.
The server (nfs.mobile.bg) sends the document as
Content-Type: text/html
. There is no information about character encoding. That means the receiver must figure out character encoding on its own.A browser does that by looking at the response byte stream and making guesses. In your case, a helpful
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
tag is present in the HMTL source. Therefore, the byte stream should be interpreted asWindows-1251
, which happens to be the Cyrillic ANSI code page in Windows.So, we do not even have Unicode here!
In the absence of any additional info, the
responseText
property of theXMLHTTP
object defaults tous-ascii
. The extended characters from the Cyrillic alphabet are not present in ASCII, so they will be converted to actual question marks and are lost. That's why you can't useresponseText
for anything.However, the original bytes of the response are still available, in the
responseBody
property, which is an array ofByte
.In VBA you must do the same thing a browser would do. You must interpret the byte-stream as a certain character set. The
ADODB.Stream
object can do that for you, and it's pretty straight-forward, too:I've been using MS Word here and calling
HTMLsearch()
properly writes Cyrillic characters to the page. They still do appear as?
in aMsgBox
for me, though, but now that's purely a display problem, caused by the fact that VBA-created UI cannot deal with Unicode.