I am fetching song lyrics from an API and converting the lyrics string into an array of words. I am seeing some unusual behavior from the preg_replace function. While debugging with var_dump, I noticed that it reports a length of 10 for the string "you", which tells me that something is wrong. After that, preg_replace produces an unexpected result.
This is my code:
$source = get_chart_lyrics_data("madonna", "frozen");
$pieces = explode("\n", $source);
$lyrics = array();
for ($i = 0; $i < count($pieces); $i++) {
    if ($i > 10) {
        $words = explode(" ", $pieces[$i]);
        foreach ($words as $_word) {
            if ($_word == "")
                continue;
            var_dump($_word);
            $word = strtolower($_word);
            var_dump($word);
            $word = trim($word);
            var_dump($word);
            $word = preg_replace("/[^A-Za-z ]/", '', $word);
            var_dump($word);
            $lyrics[$word]++;
        }
    }
}
These are the first 4 lines this code outputs:
string(10) "You"
string(10) "you"
string(10) "you"
string(8) "lyricyou"
Why does var_dump report a length of 10 for "you"? And why does preg_replace behave like that?
Thanks.
The likeliest answer is that the string contains more than the visible "you", such as non-printable characters or something your output medium does not display. To figure out exactly what it contains, you'll have to look at the raw bytes. Do this with echo bin2hex($word); which outputs a string like 666f6f..., where every 2 characters represent one byte in hexadecimal notation. You can make that more readable by putting a space between each pair of hex digits. Now use your favourite ASCII/Unicode table (depending on the encoding of the string) to figure out which individual characters those bytes represent and where you got them from.
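As a rough sketch of that inspection, the snippet below prints the byte length and a spaced hex dump of a string. The helper name hex_dump_bytes is made up for illustration, and the sample value "<lyric>You" is only a guess at what the API might be returning: it happens to be 10 bytes long and would explain the "lyricyou" result, since a browser would hide the tag. Substitute your real $word to see what is actually there.

```php
<?php
// Hypothetical helper: render a string's raw bytes as space-separated hex pairs.
function hex_dump_bytes(string $s): string
{
    // bin2hex() yields two hex digits per byte; str_split() cuts them into pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumed sample value, NOT taken from the real API response.
$word = "<lyric>You";

echo strlen($word), "\n";          // byte length, the number var_dump() reports: 10
echo hex_dump_bytes($word), "\n";  // 3c 6c 79 72 69 63 3e 59 6f 75
```

If the dump does show markup like this, viewing the page source or wrapping the var_dump output in htmlspecialchars() should make it visible in the browser as well.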
Perhaps your string is encoded in UTF-16, in which case you should see a telltale 00 as every other byte.