I was wondering how to remove:
- All leading/trailing whitespace or new-line characters, null characters, etc.
- Any redundant spaces within a string (ex. "hello[space][space]world" would be converted to "hello[space]world")
Is this possible with a single Regex, with unicode support for international space characters, etc.?
You probably want to combine the \s shorthand character class with the \p{Zs} Unicode property, since in Go's regexp package \s only covers ASCII whitespace and \p{Zs} adds the Unicode space separators. However, the two steps cannot be done with a single regex replacement, because they need two different replacement strings, and ReplaceAllStringFunc only passes the whole match string to its callback (I have no idea how to check which group matched).
Thus, I suggest using two regexps:
- ^[\s\p{Zs}]+|[\s\p{Zs}]+$ to match all leading/trailing whitespace
- [\s\p{Zs}]{2,} to match 2 or more whitespace symbols inside a string
Sample code:
package main
import (
	"fmt"
	"regexp"
)
func main() {
	input := " Text More here "
	// Leading or trailing whitespace, including Unicode space separators.
	reLeadTrail := regexp.MustCompile(`^[\s\p{Zs}]+|[\s\p{Zs}]+$`)
	// Two or more whitespace characters in a row.
	reInside := regexp.MustCompile(`[\s\p{Zs}]{2,}`)
	final := reLeadTrail.ReplaceAllString(input, "")
	final = reInside.ReplaceAllString(final, " ")
	fmt.Println(final)
}
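As a rough check of the two patterns on non-ASCII input, the sketch below wraps them in a helper. The name normalizeSpaces and the sample string are made up for illustration. Because \s is ASCII-only in Go's regexp syntax, \p{Zs} is the part that picks up characters such as the no-break space (U+00A0) and the ideographic space (U+3000).

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Leading or trailing runs of ASCII whitespace and Unicode space separators.
	reEdges = regexp.MustCompile(`^[\s\p{Zs}]+|[\s\p{Zs}]+$`)
	// Runs of two or more such characters anywhere in the string.
	reRuns = regexp.MustCompile(`[\s\p{Zs}]{2,}`)
)

// normalizeSpaces is a hypothetical helper that applies both replacements in order.
func normalizeSpaces(s string) string {
	s = reEdges.ReplaceAllString(s, "")
	return reRuns.ReplaceAllString(s, " ")
}

func main() {
	in := "\u00a0 Hello\u3000\u3000world \t\n"
	fmt.Printf("%q\n", normalizeSpaces(in)) // "Hello world"
}
```

Note that a single non-ASCII space between two words (one U+3000, say) is left untouched, because the second pattern only fires on runs of two or more. Changing {2,} to + would turn those into a plain space as well.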
You can get quite far just using the strings package, as strings.Fields does most of the work for you:
package main
import (
	"fmt"
	"strings"
)
// standardizeSpaces splits s on runs of white space and rejoins the fields with single spaces.
func standardizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}
func main() {
	tests := []string{" Hello, World ! ", "Hello,\tWorld ! ", " \t\n\t Hello,\tWorld\n!\n\t"}
	for _, test := range tests {
		fmt.Println(standardizeSpaces(test))
	}
}
// "Hello, World !"
// "Hello, World !"
// "Hello, World !"
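strings.Fields splits on whatever unicode.IsSpace reports as white space, so it already covers characters like the no-break space, but it does not treat null characters as separators. If those should go too, one option is strings.FieldsFunc with a custom predicate. A minimal sketch, where standardizeSpacesAndNulls is a hypothetical name and treating every NUL as a separator (even in the middle of the string) is an assumption about what is wanted:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// standardizeSpacesAndNulls is a hypothetical variant of standardizeSpaces
// that also treats the null character as a field separator.
func standardizeSpacesAndNulls(s string) string {
	isSep := func(r rune) bool {
		return unicode.IsSpace(r) || r == 0
	}
	return strings.Join(strings.FieldsFunc(s, isSep), " ")
}

func main() {
	in := "\x00 \u00a0Hello,\u3000\tWorld\n!\x00"
	fmt.Printf("%q\n", standardizeSpacesAndNulls(in)) // "Hello, World !"
}
```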
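strings.Fields() splits on any amount of white space, thus:
strings.Join(strings.Fields(strings.TrimSpace(s)), " ")
The strings.TrimSpace call is optional here, since strings.Fields already drops leading and trailing white space.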
Avoiding time-wasting regexp or an external library
I chose plain Go instead of regexp, because every language has special characters that are not ASCII.
Go Golang!
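// RemoveDoubleWhiteSpace collapses each run of ASCII spaces into a single space.
// It does not trim leading or trailing spaces.
func RemoveDoubleWhiteSpace(str string) string {
	var b strings.Builder
	b.Grow(len(str))
	for i, r := range str {
		// Skip a space that is immediately followed by another space.
		if r == ' ' && i+1 < len(str) && str[i+1] == ' ' {
			continue
		}
		// Write the decoded rune, not the byte, so multi-byte characters stay intact.
		b.WriteRune(r)
	}
	return b.String()
}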
And the related test
func TestRemoveDoubleWhiteSpace(t *testing.T) {
	data := []string{` test`, `test `, `te st`}
	for _, item := range data {
		str := RemoveDoubleWhiteSpace(item)
		t.Log("Data ->|"+item+"|Found: |"+str+"| Len: ", len(str))
		if len(str) != 5 {
			t.Fail()
		}
	}
}
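The same loop-based idea can be stretched to cover any Unicode white space and to trim both ends in one pass. The sketch below is a variation, not the function above: CollapseWhitespace and the pendingSpace flag are hypothetical names, and using unicode.IsSpace as the definition of white space is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// CollapseWhitespace is a hypothetical helper: it trims leading and trailing
// white space and replaces every inner run of white space with one ASCII space.
func CollapseWhitespace(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	pendingSpace := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			// A separator is only due once some text has been written,
			// which is what drops leading white space.
			pendingSpace = b.Len() > 0
			continue
		}
		if pendingSpace {
			b.WriteByte(' ')
			pendingSpace = false
		}
		b.WriteRune(r)
	}
	// A separator still pending here belongs to trailing white space and is dropped.
	return b.String()
}

func main() {
	fmt.Printf("%q\n", CollapseWhitespace("  héllo \t wörld\u3000 ")) // "héllo wörld"
}
```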
Use regexp for this.
package main
import (
	"bytes"
	"fmt"
	"regexp"
)
func main() {
	data := []byte(" Hello, World ! ")
	re := regexp.MustCompile(" +")
	replaced := re.ReplaceAll(bytes.TrimSpace(data), []byte(" "))
	fmt.Println(string(replaced))
	// Hello, World !
}
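bytes.TrimSpace already removes leading and trailing newlines along with other white space. To also trim null characters, you can use the bytes.Trim(s []byte, cutset string) function with a cutset that includes them instead of bytes.TrimSpace.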
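A minimal sketch of that swap follows. The helper name cleanBytes and the exact cutset (space, tab, carriage return, newline, NUL) are assumptions chosen for illustration; adjust the cutset to whatever should be stripped from the ends.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// reSpaces matches one or more plain ASCII spaces.
var reSpaces = regexp.MustCompile(` +`)

// cleanBytes is a hypothetical helper: it trims the characters in the cutset
// from both ends, then collapses runs of spaces into a single space.
func cleanBytes(data []byte) []byte {
	trimmed := bytes.Trim(data, " \t\r\n\x00")
	return reSpaces.ReplaceAll(trimmed, []byte(" "))
}

func main() {
	data := []byte("\x00\n  Hello,   World !  \n\x00")
	fmt.Printf("%q\n", cleanBytes(data)) // "Hello, World !"
}
```

This only collapses plain spaces inside the data; tabs or newlines between words are left as they are.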