I'm working on a Twitter app and just stumbled into the world of UTF-8 (and UTF-16). It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware.
I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.
function sortSurrogates(str){
    var cp = [];                     // array to hold code points
    while(str.length){               // loop till we've done the whole string
        // test whether the first code unit is a high surrogate (U+D800 - U+DBFF)
        if(/[\uD800-\uDBFF]/.test(str.substr(0,1))){
            // high surrogate found; the low surrogate follows
            cp.push(str.substr(0,2)); // push the pair onto the array
            str = str.substr(2);      // clip the pair off the string
        }else{                        // else a BMP code point
            cp.push(str.substr(0,1)); // push one code unit onto the array
            str = str.substr(1);      // clip one from the string
        }
    }                                // loop
    return cp;                       // return the array
}
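For example (𝟘 is U+1D7D8, which JavaScript stores as the surrogate pair \uD835\uDFD8):

var parts = sortSurrogates("a𝟘b"); // "a𝟘b".length is 4
alert(parts.length);               // 3 -- ["a", "𝟘", "b"]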
My question is: is there something simpler I'm missing? I see so many people reiterating that JavaScript deals with UTF-16 natively, yet my testing leads me to believe that may be the data format, but the string functions don't know it yet. Am I missing something simple?
EDIT: To help illustrate the issue:
var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "
JavaScript string iterators can give you the actual characters instead of the individual surrogate code units:
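For example (a sketch using ES6 for...of and the spread operator, both of which iterate strings by code point):

for (const ch of "a𝟘b") {
    console.log(ch); // "a", then "𝟘", then "b"
}
const chars = [..."a𝟘b"]; // ["a", "𝟘", "b"] -- length 3, not 4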
Javascript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so.
As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit.
Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. Javascript isn't good enough for that as you have discovered.
It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk.
Here are a couple of scripts that might be helpful when dealing with surrogate pairs in JavaScript:

- ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results (a short illustration follows the list).
- Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.
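For example (a sketch using the ES6 methods directly; the shims provide the same behavior on older engines):

var s = "𝟘";                          // one code point, two UTF-16 code units
alert(s.charCodeAt(0).toString(16));  // "d835" -- just the high surrogate
alert(s.codePointAt(0).toString(16)); // "1d7d8" -- the whole code point
alert(String.fromCodePoint(0x1D7D8)); // "𝟘" -- fromCharCode truncates to 16 bits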
This is along the lines of what I was looking for. It needs better support for the different string functions; as I add to it, I will update this answer.
I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points, and provides length and codePoints properties along with toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.
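Something along these lines (a minimal sketch matching the description above, not the original code; the exact internals may differ):

function UnicodeString(value){
    // accept either a JavaScript string or an array of code point integers
    var codePoints;
    if(typeof value === "string"){
        codePoints = [];
        for(var i = 0; i < value.length; i++){
            var hi = value.charCodeAt(i);
            if(hi >= 0xD800 && hi <= 0xDBFF && i + 1 < value.length){
                var lo = value.charCodeAt(i + 1);
                if(lo >= 0xDC00 && lo <= 0xDFFF){
                    // combine the surrogate pair into one code point
                    codePoints.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
                    i++; // skip the low surrogate
                    continue;
                }
            }
            codePoints.push(hi); // BMP code point (unpaired surrogates kept as-is)
        }
    }else{
        codePoints = value.slice();
    }
    return {
        length: codePoints.length,
        codePoints: codePoints,
        toString: function(){
            var s = "";
            for(var i = 0; i < codePoints.length; i++){
                var n = codePoints[i];
                if(n > 0xFFFF){
                    // re-encode astral code points as surrogate pairs
                    n -= 0x10000;
                    s += String.fromCharCode(0xD800 + (n >> 10), 0xDC00 + (n & 0x3FF));
                }else{
                    s += String.fromCharCode(n);
                }
            }
            return s;
        },
        slice: function(begin, end){
            // slice by code point, not by code unit
            return UnicodeString(codePoints.slice(begin, end));
        }
    };
}

var u = UnicodeString("a𝟘b");
alert(u.length);                 // 3, where "a𝟘b".length is 4
alert(u.slice(1, 2).toString()); // "𝟘"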