How to remove redundant spaces/whitespace from a s

Posted 2020-05-25 08:11

Question:

I was wondering how to remove:

  • All leading/trailing whitespace or new-line characters, null characters, etc.
  • Any redundant spaces within a string (ex. "hello[space][space]world" would be converted to "hello[space]world")

Is this possible with a single Regex, with unicode support for international space characters, etc.?

Answer 1:

It seems you want both the \s shorthand character class and the \p{Zs} Unicode property to match Unicode spaces, since in Go \s on its own matches only ASCII whitespace. However, the two steps cannot be done with a single regex replacement, because they need different replacement strings and ReplaceAllStringFunc passes only the whole matched string to its callback (I have no idea how to check which group matched).

Thus, I suggest using two regexps:

  • ^[\s\p{Zs}]+|[\s\p{Zs}]+$ - to match all leading/trailing whitespace
  • [\s\p{Zs}]{2,} - to match 2 or more whitespace symbols inside a string

Sample code:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    input := "   Text   More here     "
    re_leadclose_whtsp := regexp.MustCompile(`^[\s\p{Zs}]+|[\s\p{Zs}]+$`)
    re_inside_whtsp := regexp.MustCompile(`[\s\p{Zs}]{2,}`)
    final := re_leadclose_whtsp.ReplaceAllString(input, "")
    final = re_inside_whtsp.ReplaceAllString(final, " ")
    fmt.Println(final)
}
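
As a rough illustration of why the class combines \s with \p{Zs}, the sketch below first checks that \s alone does not match a no-break space (U+00A0), then applies the same two patterns through a helper. The helper name cleanSpaces and the variable names are made up for this example; the patterns are compiled once at package level, which is an assumption about how the code would be reused.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Same two patterns as in the answer, compiled once.
	edgeSpace   = regexp.MustCompile(`^[\s\p{Zs}]+|[\s\p{Zs}]+$`)
	insideSpace = regexp.MustCompile(`[\s\p{Zs}]{2,}`)
	// \s alone: ASCII whitespace only in Go's regexp syntax.
	asciiOnly = regexp.MustCompile(`^\s+$`)
)

// cleanSpaces (hypothetical name) trims the ends, then collapses
// runs of two or more whitespace characters into a single space.
func cleanSpaces(s string) string {
	s = edgeSpace.ReplaceAllString(s, "")
	return insideSpace.ReplaceAllString(s, " ")
}

func main() {
	// A no-break space is not matched by \s on its own.
	fmt.Println(asciiOnly.MatchString("\u00a0")) // false

	// U+00A0 and U+3000 (ideographic space) are both in \p{Zs}.
	fmt.Printf("%q\n", cleanSpaces("\u00a0 Text \u3000 More here \u00a0")) // "Text More here"
}
```

Note that with {2,} a single tab or no-break space inside the string is left as it is; only runs of two or more characters are replaced.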


Answer 2:

You can get quite far just using the strings package as strings.Fields does most of the work for you:

package main

import (
    "fmt"
    "strings"
)

func standardizeSpaces(s string) string {
    return strings.Join(strings.Fields(s), " ")
}

func main() {
    tests := []string{" Hello,   World  ! ", "Hello,\tWorld ! ", " \t\n\t Hello,\tWorld\n!\n\t"}
    for _, test := range tests {
        fmt.Println(standardizeSpaces(test))
    }
}
// "Hello, World !"
// "Hello, World !"
// "Hello, World !"
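
strings.Fields splits on Unicode white space as defined by unicode.IsSpace, which does not include the null character mentioned in the question. One possible way to cover that case is strings.FieldsFunc with a custom predicate. This is only a sketch: the names isSeparator and standardizeSpacesAndNulls are invented here, and treating NUL as a separator is an assumption about what the asker wants.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSeparator (hypothetical helper) reports whether r is Unicode
// white space or the null character.
func isSeparator(r rune) bool {
	return unicode.IsSpace(r) || r == 0
}

// standardizeSpacesAndNulls splits on runs of separators and joins
// the remaining fields with a single space.
func standardizeSpacesAndNulls(s string) string {
	return strings.Join(strings.FieldsFunc(s, isSeparator), " ")
}

func main() {
	in := "\x00 Hello,\u00a0\u3000World\n!\x00\n"
	fmt.Printf("%q\n", standardizeSpacesAndNulls(in)) // "Hello, World !"
}
```

One consequence of this choice: a null character in the middle of a word splits that word into two fields.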


Answer 3:

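strings.Fields() splits on any run of white space and already ignores leading and trailing white space, so the TrimSpace call below is harmless but not strictly needed: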

strings.Join(strings.Fields(strings.TrimSpace(s)), " ")


Answer 4:

To avoid a time-wasting regexp or an external library,
I chose plain Go instead of regexp, because every language has special characters that are not ASCII.

Go Golang!

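// RemoveDoubleWhiteSpace drops every ASCII space that is directly
// followed by another ASCII space. Requires the "strings" package.
func RemoveDoubleWhiteSpace(str string) string {
    var b strings.Builder
    b.Grow(len(str))
    // Range over runes so multi-byte characters are copied intact.
    for i, r := range str {
        if r == ' ' && i+1 < len(str) && str[i+1] == ' ' {
            continue
        }
        b.WriteRune(r)
    }
    return b.String()
}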

And the related test:

func TestRemoveDoubleWhiteSpace(t *testing.T) {
    data := []string{`  test`, `test  `, `te  st`}
    for _, item := range data {
        str := RemoveDoubleWhiteSpace(item)
        t.Log("Data ->|"+item+"|Found: |"+str+"| Len: ", len(str))
        if len(str) != 5 {
            t.Fail()
        }
    }
}
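
The function above only looks at the ASCII space (byte 32) and does not trim the ends. A possible regexp-free variant that also handles tabs, newlines, and Unicode spaces is sketched below. It is not the answerer's code: the name collapseWhitespace is invented, and relying on unicode.IsSpace to decide what counts as white space is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseWhitespace (hypothetical name) replaces every run of
// Unicode white space with one ASCII space and drops runs at the
// start and end of the string, in a single pass.
func collapseWhitespace(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	pendingSpace := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			// Remember the gap only if text was already written,
			// which discards leading white space.
			pendingSpace = b.Len() > 0
			continue
		}
		if pendingSpace {
			b.WriteByte(' ')
			pendingSpace = false
		}
		b.WriteRune(r)
	}
	// A gap still pending here is trailing white space; it is dropped.
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseWhitespace("  té  st \n"))      // "té st"
	fmt.Printf("%q\n", collapseWhitespace("a\u00a0\u3000b")) // "a b"
}
```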


Answer 5:

Use regexp for this.

package main

import (
    "bytes"
    "fmt"
    "regexp"
)

func main() {
    data := []byte("   Hello,   World !   ")
    re := regexp.MustCompile("  +")
    replaced := re.ReplaceAll(bytes.TrimSpace(data), []byte(" "))
    fmt.Println(string(replaced)) // Hello, World !
}

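bytes.TrimSpace already removes leading and trailing newlines along with other white space. To also trim null characters, you can use the bytes.Trim(s []byte, cutset string) function instead of bytes.TrimSpace and list the characters to strip in the cutset.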
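
A minimal sketch of the bytes.Trim variant follows. The cutset contents and the names trimCutset, multiSpace, and trimAndCollapse are assumptions chosen for illustration; adjust the cutset to whatever characters should be stripped from the ends.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// multiSpace matches runs of two or more ASCII spaces.
var multiSpace = regexp.MustCompile(` {2,}`)

// trimCutset is an assumed set of characters to strip from both
// ends: space, tab, CR, LF, and NUL.
const trimCutset = " \t\r\n\x00"

// trimAndCollapse (hypothetical name) trims the cutset characters
// from both ends, then collapses inner runs of spaces.
func trimAndCollapse(data []byte) []byte {
	return multiSpace.ReplaceAll(bytes.Trim(data, trimCutset), []byte(" "))
}

func main() {
	data := []byte("\x00\n  Hello,   World !  \n\x00")
	fmt.Printf("%q\n", trimAndCollapse(data)) // "Hello, World !"
}
```

Unlike bytes.TrimSpace, bytes.Trim removes only the characters listed in the cutset, so Unicode spaces such as U+00A0 would have to be added to it explicitly.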



Tags: regex go