Dealing with the Cyrillic encoding in Node.Js / Ex

Posted 2019-08-31 00:46

Question:

In my app, a user submits text through a form's textarea. The text is passed to the app and then processed by the jsesc library, which escapes JavaScript strings.

The problem is that when I type in text in Russian, such as

 нам #интересны наши #идеи 

what I get is

 '\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438'

I then need to pass this data to FlowDock to extract hashtags, but FlowDock just does not recognize it.

Can someone please tell me

1) Why is the text converted into that representation?

2) Does it make sense to convert it back to Cyrillic for FlowDock and for the database, or should I keep it in the escaped Unicode form and try to make FlowDock work with it?

Thanks!

UPDATE

The complete script is:

result = getField(req, field);
result = S(result).trim().collapseWhitespace().s;

// at this point result = "нам #интересны наши #идеи"
result = jsesc(result, {
             'quotes': 'double'
         });

// now I end up with the Unicode escapes shown above (\u....)

var hashtags = FlowdockText.extractHashtags(result);

FlowDock receives the result, which is

\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438

and doesn't extract any hashtags from it.

Answer 1:

These are two representations of the same string:

'нам #интересны наши #идеи' ===  '\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438'
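To make the comparison above concrete, here is a small sketch (variable names are mine, not from the question). It shows that a \uXXXX escape inside a string literal is decoded by the JavaScript parser, while text that contains real backslash characters, which is what an escaping library hands back, is a different and longer string.

```javascript
// Both literals produce the same string value; the \uXXXX form is only
// another way of writing the same characters in source code.
const plain = 'нам #интересны наши #идеи';
const escapedLiteral = '\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438';

console.log(plain === escapedLiteral); // true
console.log(plain.length);             // 25

// With doubled backslashes the string holds literal backslash characters,
// so it is no longer the Cyrillic text.
const literalBackslashes = '\\u043D\\u0430\\u043C';
console.log(literalBackslashes === 'нам'); // false
console.log(literalBackslashes.length);    // 18
```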

It looks like flowdock-text doesn't work well with non-ASCII characters.

UPD: I tried it, and it actually works well:

fdt.extractHashtags('\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438');

You shouldn't have used escaping in the first place: it gives you the string literal representation (suitable for eval, etc.), not the string itself.

UPD2: I've reduced your code to the following:

var jsesc = require('jsesc');
var fdt = require('flowdock-text');

var result = 'нам #интересны наши #идеи';

result = jsesc(result, {
             'quotes': 'double'
         });

var hashtags = fdt.extractHashtags(result);

console.log(hashtags);

As I said, the problem is with jsesc: you don't need it. It returns a JavaScript-escaped string. You need that only when you build code for eval by concatenation and want to protect against code injection, or something similar. For example, if you add result = eval('"' + result + '"');, it will work.
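If the escaped text is already stored somewhere and has to be turned back, one eval-free option is JSON.parse. The sketch below is an illustration under assumptions: the escaped text is written by hand (doubled backslashes) instead of calling jsesc, it assumes the text contains only escapes that are also valid in JSON such as \uXXXX, and findHashtags is a hypothetical stand-in, not flowdock-text's algorithm.

```javascript
// Escaped text with literal backslashes, written by hand so the sketch
// needs no packages. This mirrors the output shown in the question.
const escaped = '\\u043D\\u0430\\u043C #\\u0438\\u043D\\u0442\\u0435\\u0440\\u0435\\u0441\\u043D\\u044B \\u043D\\u0430\\u0448\\u0438 #\\u0438\\u0434\\u0435\\u0438';

// Wrap the text in double quotes and parse it as a JSON string literal.
// Assumption: every escape in the text is valid JSON (true for \uXXXX).
const restored = JSON.parse('"' + escaped + '"');
console.log(restored); // нам #интересны наши #идеи

// Hypothetical, naive hashtag finder: '#' followed by letters, digits or '_'.
const findHashtags = (text) =>
  (text.match(/#[\p{L}\p{N}_]+/gu) || []).map((tag) => tag.slice(1));

console.log(findHashtags(restored)); // [ 'интересны', 'идеи' ]
console.log(findHashtags(escaped));  // [] because '#' is followed by a backslash
```

The simplest fix is still to skip the escaping step and pass the original string to the hashtag extractor.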



Answer 2:

What is the need for converting it into that representation?

jsesc is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.

This can be used to avoid mojibake and other encoding issues, or even to avoid errors when passing JSON-formatted data (which may contain U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, or lone surrogates) to a JavaScript parser or a UTF-8 encoder, respectively.
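To illustrate what "ASCII-only output" means, here is a deliberately simplified sketch. The function escapeNonAscii is hypothetical and is not how jsesc is implemented; jsesc also handles quotes, shorter escape forms, and many options that this ignores.

```javascript
// Simplified illustration: replace every non-printable-ASCII UTF-16 code
// unit with a \uXXXX escape and double any existing backslashes.
function escapeNonAscii(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x20 || code > 0x7e) {
      out += '\\u' + code.toString(16).toUpperCase().padStart(4, '0');
    } else if (text[i] === '\\') {
      out += '\\\\';
    } else {
      out += text[i];
    }
  }
  return out;
}

console.log(escapeNonAscii('нам #идеи'));
// \u043D\u0430\u043C #\u0438\u0434\u0435\u0438

// U+2028 becomes plain ASCII, so it cannot be misread as a line break.
console.log(escapeNonAscii('a\u2028b')); // a\u2028b
```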

Sounds like in this case you don’t intend to use jsesc at all.



Answer 3:

Try this:

decodeURIComponent("\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438");
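A quick check of this suggestion, as a hedged sketch with a shortened sample string: decodeURIComponent decodes percent-encoding (%XX), not \uXXXX escapes. In the call above the JavaScript parser has already turned the escapes into Cyrillic before the function runs, so the call returns its input unchanged, and text with literal backslashes is left as it is.

```javascript
// The parser decodes \uXXXX in the literal, so there is nothing left to decode.
const literal = '\u043D\u0430\u043C #\u0438\u0434\u0435\u0438';
console.log(decodeURIComponent(literal) === literal); // true

// Text with literal backslashes (what an escaper returns) is not changed either.
const withBackslashes = '\\u043D\\u0430\\u043C #\\u0438\\u0434\\u0435\\u0438';
console.log(decodeURIComponent(withBackslashes) === withBackslashes); // true

// Percent-encoded UTF-8 is what decodeURIComponent actually handles.
console.log(decodeURIComponent('%D0%BD%D0%B0%D0%BC')); // нам
```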