Convert a String into an Array of Characters - mul

Assume that in 2019 any solution that is not Unicode-safe is wrong. What is the best way to convert a string into an array of Unicode characters in PHP?

Obviously this means that accessing individual bytes with the brace syntax is wrong, as is using str_split:

$arr = str_split($text);
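
To make the problem concrete, here is a small sketch of what str_split does to multibyte input. The two-character input is only an illustration (it is not the sample from the question), and it is written with \u{...} escapes so the result does not depend on how the file is saved:

```php
<?php
// Illustrative input: U+00E9 (é) is two bytes in UTF-8, U+20AC (€) is three.
$text = "\u{00E9}\u{20AC}";

// str_split() works on bytes, so two characters become five one-byte strings.
$bytes = str_split($text);

echo count($bytes), "\n";      // 5
echo bin2hex($bytes[0]), "\n"; // c3 (only the first byte of "é")
```

None of the five elements is a valid UTF-8 string on its own, which is why a byte-oriented split is unusable here.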

From sample input like:

$string = '先éé€


               
                
                   
                        
Tags: php, regex, unicode, split, multibyte-characters

2 Answers

Root（大扎） · #2 · 2020-02-07 04:17

This works for me. It splits a Unicode string into an array of characters:

//
// Split at every position that is neither right after the start (^)
// nor right before the end ($), using the u (PCRE_UTF8) modifier so
// the subject is treated as UTF-8 rather than as raw bytes.
//
$arr = preg_split('/(?<!^)(?!$)/u', $text);


For example:

<?php
//
$text = "堆栈溢出";

$arr = preg_split("/(?<!^)(?!$)/u", $text);

echo '<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
</head>
<body>
';

print_r($arr);

echo '</body>
</html>
';
?>


In a browser, it produces this:

Array ( [0] => 堆 [1] => 栈 [2] => 溢 [3] => 出 )
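
One thing worth guarding against: with the u modifier, preg_split returns false when the subject is not valid UTF-8. The following is a minimal sketch of a wrapper around the same pattern; the function name split_chars is hypothetical, and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper: wraps the lookaround pattern from this answer and
// turns a PCRE failure (for example, invalid UTF-8 input) into an exception.
function split_chars(string $text): array
{
    $parts = preg_split('/(?<!^)(?!$)/u', $text);
    if ($parts === false) {
        throw new InvalidArgumentException('preg_split failed; is the input valid UTF-8?');
    }
    return $parts;
}

print_r(split_chars("堆栈溢出"));

try {
    split_chars("\xFF"); // a lone 0xFF byte is never valid UTF-8
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Whether to throw, return an empty array, or repair the input first is a design choice that depends on where the string comes from.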

    
                                                                    
                                                        
            
              
ゆ 、 Hurt° · #3 · 2020-02-07 04:28

Just pass an empty pattern with the PREG_SPLIT_NO_EMPTY flag.
Otherwise, you can write a pattern with \X (the Unicode "dot", which matches one grapheme) and \K (which resets the start of the reported match). I'll include an mb_split() call and a preg_match_all() call for completeness.

Code: (Demo)

$string='先秦兩漢';
var_export(preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K(?!$)~u', $string));
echo "\n---\n";
var_export(mb_split('\X\K(?!$)', $string));
echo "\n---\n";
var_export(preg_match_all('~\X~u', $string, $out) ? $out[0] : []);


All produce:

array (
  0 => '先',
  1 => '秦',
  2 => '兩',
  3 => '漢',
)


From https://www.regular-expressions.info/unicode.html: 


  How to Match a Single Unicode Grapheme
  
  Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X.
  
  You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
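
The variants above agree on the sample string, but the empty pattern and \X are not interchangeable once combining marks are involved. A small sketch, using an illustrative input that is not from the original demo:

```php
<?php
// "e" followed by U+0301 (combining acute accent): two code points, one grapheme.
$text = "e\u{0301}";

// Empty pattern with the u modifier: one element per code point.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// \X: one element per grapheme cluster.
$graphemes = preg_match_all('/\X/u', $text, $m) ? $m[0] : [];

echo count($codePoints), "\n"; // 2
echo count($graphemes), "\n";  // 1
```

Which result is "right" depends on whether a character means a code point or what the user perceives as a single character.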




UPDATE: Dharman has brought to my attention that mb_str_split() is available as of PHP 7.4.

The default length parameter of the new function is 1, so the length parameter can be omitted for this case.
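
A minimal sketch of how this might be used, with a fallback for older PHP versions. The helper name utf8_chars is hypothetical, and the encoding is passed explicitly on the assumption that the input is UTF-8:

```php
<?php
// Hypothetical helper: prefers mb_str_split() (PHP 7.4+) and falls back to
// the empty-pattern preg_split() call shown earlier in this answer.
function utf8_chars(string $text): array
{
    if (function_exists('mb_str_split')) {
        // The length argument defaults to 1; it is spelled out here only
        // so that the encoding can be passed as the third argument.
        return mb_str_split($text, 1, 'UTF-8');
    }
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; this sketch maps that to [].
    return $parts === false ? [] : $parts;
}

var_export(utf8_chars('先秦兩漢'));
```

Like the empty-pattern split, this produces one element per code point, not per grapheme.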

https://wiki.php.net/rfc/mb_str_split

Dharman's demo: https://3v4l.org/M85Fi/rfc#output
    
                                                                    
                                                        
            
              