C++ - How to read Unicode characters (Hindi Script) - Answers, page 2

I have a Hindi script file like this:

3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।

I have to write a program that appends a position number, in parentheses, to every word in each sentence. The numbering restarts at 1 on every line. The output should look something like this:

3.  भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

The meaning of the above sentence is:

3.  India has a long and rich history.

Note that '।' (the Hindi full stop, equivalent to '.' in English) also gets a word position, and so would other special symbols. I am working on English-Hindi word alignment (a part of Natural Language Processing, NLP), so the English full stop '.' should map to '।' in Hindi. The serial numbers stay untouched. I thought reading character by character could be a solution. Could you please help me with how to go about this in C++? Or, if it is easier in another language such as Python or Perl, could you suggest an approach there?

I am able to get word positions for my English text in C++ because I can read it character by character using ASCII values, but I don't have a clue how to do the same for the Hindi text.

The final aim of all this is to see which word position in the English text maps to which position in the Hindi text. This way I can achieve bidirectional alignment.

Thank you for your time...:)

Tags: c++ python perl utf-8 nlp

7 answers

beautiful°

#2 · 2019-03-27 11:07

The first thing to do is determine whether your input is in UNICODE (here meaning UTF-16, the wide-character encoding that Windows uses) or in UTF-8. Do this by attempting to read the input as wide characters and checking whether the result is garbled.

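// Windows (MSVC) only; needs <stdio.h> and <wchar.h>
FILE * fp = _wfopen( L"fname",L"r" );
wchar_t buf[1000];
// stop at once if the file could not be opened
while( fp && fgetws(buf,999, fp ) )   {
    // wprintf writes to stdout; fwprintf would need a FILE* as its first argument
    wprintf(L"%ls",buf);
}
if( fp ) fclose( fp );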

If the output looks right, you have a UTF-16 (UNICODE) file; if it is garbled, the file is most likely UTF-8.

If you have UTF-8, you will have to convert it to UTF-16 wide characters to make processing straightforward.
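
When the file happens to start with a byte-order mark (BOM), looking at its first bytes is a less trial-and-error way to make the same distinction. The following is a minimal sketch and not part of the original answer: `DetectBom` and `DetectFileBom` are hypothetical helper names, and a file without a BOM is simply reported as "unknown", so this check complements rather than replaces inspecting the content. A UTF-32LE BOM also begins with FF FE and would be reported as UTF-16LE here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: classify a buffer by its leading byte-order mark.
// Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "unknown" (no BOM found).
std::string DetectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    return "unknown";
}

// Hypothetical helper: read up to three bytes of a file and classify them.
std::string DetectFileBom(const char* path)
{
    unsigned char head[3] = {0, 0, 0};
    std::FILE* fp = std::fopen(path, "rb");   // binary mode: no newline translation
    if (!fp) return "unknown";
    std::size_t n = std::fread(head, 1, sizeof head, fp);
    std::fclose(fp);
    return DetectBom(head, n);
}
```

Calling `DetectFileBom("fname")` before choosing a reader avoids judging the encoding from garbled console output alone.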

// convert UTF-8 to UNICODE (UTF-16); needs <windows.h> and <string>

    void String2WString( std::wstring& ws, const std::string& s )
    {
        ws.clear();
        if (s.empty())
            return;
        // CP_UTF8 rather than CP_ACP: the input bytes are UTF-8,
        // not text in the system ANSI code page
        int nLenOfWideCharStr = MultiByteToWideChar(CP_UTF8, 0,
            s.c_str(), (int)s.length(), NULL, 0);
        if (nLenOfWideCharStr <= 0)
            return;
        // let std::wstring own the buffer, so no HeapAlloc/HeapFree is needed
        ws.resize(nLenOfWideCharStr);
        MultiByteToWideChar(CP_UTF8, 0,
            s.c_str(), (int)s.length(),
            &ws[0], nLenOfWideCharStr);
    }

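    // read UTF-8 (needs <cstdio>, <string> and <vector>)
FILE * fp = fopen( "fname","r" );
char buf[1000];
std::string aline;
std::wstring wline;
std::vector< std::wstring> vline;
// a line longer than the buffer is returned in pieces,
// which can split a multi-byte UTF-8 character
while( fp && fgets(buf,999, fp ) )    {
    aline = buf;
    String2WString( wline, aline );
    vline.push_back( wline );
}
if( fp ) fclose( fp );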

The above assumes that you are on Windows. On Unix, the same idea applies and the code is quite similar. However, I do not find it quite so straightforward, so I will let a UNIX expert provide the details.
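
For the word-position task in the question, one portable alternative is to skip the wide-character conversion and work on the UTF-8 bytes directly. This works because no byte of a multi-byte UTF-8 sequence is an ASCII character, so splitting on ASCII whitespace never cuts a Hindi character in half. The following is a minimal sketch under stated assumptions, not the original answer's method: `AnnotateLine`, `IsSerial` and `IsAsciiSpace` are hypothetical names, the input is assumed to be valid UTF-8, a leading token such as "3." is treated as the serial number, and only the danda '।' (U+0964) is split off as its own token. Other punctuation would need the same treatment.

```cpp
#include <cstddef>
#include <string>

// UTF-8 encoding of the danda (U+0964), the Hindi full stop.
static const std::string kDanda = "\xE0\xA5\xA4";

static bool IsAsciiSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// True for a serial number such as "3." (digits followed by one '.').
static bool IsSerial(const std::string& tok)
{
    if (tok.size() < 2 || tok.back() != '.') return false;
    for (std::size_t i = 0; i + 1 < tok.size(); ++i)
        if (tok[i] < '0' || tok[i] > '9') return false;
    return true;
}

// Appends "(n)" to every whitespace-delimited token of a UTF-8 line.
// Whitespace is copied through unchanged; a leading serial is left as is.
std::string AnnotateLine(const std::string& line)
{
    std::string out;
    int pos = 0;
    bool first = true;
    std::size_t i = 0;
    while (i < line.size()) {
        if (IsAsciiSpace(line[i])) {          // copy whitespace verbatim
            out += line[i++];
            continue;
        }
        std::size_t j = i;                    // collect one token
        while (j < line.size() && !IsAsciiSpace(line[j])) ++j;
        std::string tok = line.substr(i, j - i);
        i = j;
        if (first && IsSerial(tok)) {         // serial number: no position
            out += tok;
            first = false;
            continue;
        }
        first = false;
        // A danda glued to the end of a word becomes a separate token.
        bool hasDanda = tok.size() > kDanda.size() &&
            tok.compare(tok.size() - kDanda.size(), kDanda.size(), kDanda) == 0;
        if (hasDanda) tok.erase(tok.size() - kDanda.size());
        out += tok + "(" + std::to_string(++pos) + ")";
        if (hasDanda) out += " " + kDanda + "(" + std::to_string(++pos) + ")";
    }
    return out;
}
```

Reading each line with `std::getline` from a `std::ifstream` and passing it to `AnnotateLine` would produce the numbered format shown in the question, with the final danda receiving its own position.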
