Split a text into single words

I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any idea for improvement?

Thanks in advance!

Tags: php split

6 answers

祖国的老花朵

#2 · 2020-01-27 05:17

You can also use the explode() function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);
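Note that explode() splits only on the exact delimiter string, so punctuation stays attached to the words. The following is a minimal sketch of that behaviour with one possible cleanup step; the sample sentence and the list of characters passed to trim() are assumptions for illustration, and the arrow functions require PHP 7.4 or later.

```php
<?php
// Hypothetical sample sentence, not taken from the question.
$sentence = 'Hello, world! How are you?';

// explode() splits on the literal space only, so punctuation stays attached.
$words = explode(' ', $sentence);
print_r($words);

// One possible cleanup (an assumption, not part of the answer above):
// trim common punctuation from both ends of each token and drop empty tokens.
$clean = array_values(array_filter(
    array_map(fn($w) => trim($w, ",.!?;:\"'"), $words),
    fn($w) => $w !== ''
));
print_r($clean);
```

The first array still contains tokens such as "Hello," and "you?"; only the second one holds bare words.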


Summer. ? 凉城

#3 · 2020-01-27 05:23

Use the \p{P} class, which matches any Unicode punctuation character, combined with the \s whitespace class. The u modifier makes PCRE treat the pattern and the subject as UTF-8.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

This splits on a run of one or more whitespace characters and also consumes any punctuation directly adjacent to that whitespace. It also matches punctuation at the very beginning or end of the string. As a result, the apostrophe inside "don't" is kept, while the quotes and exclamation mark in "he said 'ouch!'" are stripped.
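To make the behaviour concrete, here is a small sketch that runs the same pattern (with the u modifier) on a hypothetical input containing both an apostrophe inside a word and quotes around a word.

```php
<?php
// Hypothetical input chosen to exercise apostrophes and quotation marks.
$text = "He said 'ouch!' and I don't care.";

// Same alternatives as in the answer: leading punctuation, whitespace with
// adjacent punctuation, trailing punctuation. The u modifier enables UTF-8 mode.
$pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';

$result = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
// Yields: He, said, ouch, and, I, don't, care
```

One limitation worth knowing: punctuation that is not next to whitespace or the string boundaries is never a split point, so an input such as "a,b" stays a single token.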


我欲成王，谁敢阻挡

#4 · 2020-01-27 05:29

Tokenize with strtok():

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
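// Double quotes are required so that \n and \t are real newline and tab
// characters; in single quotes the letters "n" and "t" would become delimiters.
$delim = " \n\t,.!?:;";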

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>


老娘就宠你

#5 · 2020-01-27 05:32

Do:

str_word_count($text, 1);

Or, if you need Unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    // Cast $search to string: passing null to preg_quote() is deprecated since PHP 8.1.
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}
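As a rough illustration of what the character class in this helper accepts, the sketch below applies the same pattern directly with preg_match_all(). The sample text is a hypothetical input, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical input with accented letters, a typographic apostrophe (U+2019),
// a plain apostrophe and a hyphen inside a word.
$text = "L’été, c'est l'après-midi!";

// Same class as in the helper: letters (\p{L}), non-spacing marks (\p{Mn}),
// dash punctuation (\p{Pd}), the plain apostrophe and U+2019.
preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+~u', $text, $matches);

print_r($matches[0]);
// Yields: L’été, c'est, l'après-midi
```

Because digits are not in the class, a token such as "2020" would not be returned unless the digits are passed in through the $search parameter of the helper.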


何必那么认真

#6 · 2020-01-27 05:33

I would first convert the string to lowercase before splitting it. That makes the i modifier and the array post-processing unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use the Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would match the intended characters more specifically than \W.
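A minimal sketch of that suggestion follows. Two details are assumptions added here rather than part of the answer: \s is included in the class because \p{Z} does not cover tabs or newlines, and mb_strtolower() is used when the mbstring extension is available so that non-ASCII capitals are lowercased too. The sample text is hypothetical and the file is assumed to be UTF-8.

```php
<?php
// Hypothetical sample text with umlauts, guillemets and a newline.
$text = "Größe zählt nicht, oder?\n«Ja», manchmal!";

// Prefer the multibyte-aware function if mbstring is loaded; strtolower()
// is the fallback and only handles ASCII letters reliably.
$lower = function_exists('mb_strtolower')
    ? mb_strtolower($text, 'UTF-8')
    : strtolower($text);

// Split on runs of punctuation, separators or whitespace; u enables UTF-8 mode.
$result = preg_split('/[\p{P}\p{Z}\s]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
// Yields: größe, zählt, nicht, oder, ja, manchmal
```

Keep in mind that this class treats every punctuation character as a separator, so "don't" becomes "don" and "t", and "full-stops" becomes "full" and "stops".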


放荡不羁爱自由

#7 · 2020-01-27 05:37

You can also use PHP's strtok() function to fetch tokens from a large string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // pass strtok() your string and the delimiter characters that separate tokens; here, words are separated by a space.
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

See the PHP documentation for strtok() for more details.
