Remove diacritics using Go

Posted 2020-02-05 11:50

How can I remove all diacritics from a given UTF-8 encoded string using Go? E.g. transform the string "žůžo" => "zuzo". Is there a standard way?

Tags: unicode utf-8 go
2 answers
SAY GOODBYE
#2 · 2020-02-05 12:32

You can use the libraries described in Text normalization in Go.

Here's an application of those libraries:

// Example derived from: http://blog.golang.org/normalization

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

func main() {
    // NFD splits each accented rune into a base rune plus combining marks,
    // RemoveFunc drops the marks, and NFC recomposes what remains.
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    result, _, _ := transform.String(t, "žůžo")
    fmt.Println(result) // Prints "zuzo"
}
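
Note that transform.RemoveFunc is marked deprecated in current versions of golang.org/x/text in favour of the runes package. A minimal sketch of the equivalent chain, assuming a recent version of the module:

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func main() {
    // NFD splits accented runes into a base rune plus combining marks,
    // runes.Remove drops the marks (category Mn), and NFC recomposes the rest.
    t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
    result, _, err := transform.String(t, "žůžo")
    if err != nil {
        panic(err)
    }
    fmt.Println(result) // zuzo
}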
神经病院院长
#3 · 2020-02-05 12:42

To expand a bit on the existing answer:

The internet standard for comparing strings of different character sets is called "PRECIS" (Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols) and is documented in RFC7564. There is also a Go implementation at golang.org/x/text/secure/precis.
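
As a quick illustration (a minimal sketch; precis.UsernameCaseMapped is one of the predefined profiles exported by that package), the standard profiles normalize and case-map but leave combining accents in place:

package main

import (
    "fmt"

    "golang.org/x/text/secure/precis"
)

func main() {
    // UsernameCaseMapped folds width, lower-cases, and applies NFC,
    // but it does not strip combining marks, so the accents survive.
    s, err := precis.UsernameCaseMapped.String("žůžo")
    if err != nil {
        panic(err)
    }
    fmt.Println(s) // still "žůžo"
}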

None of the standard profiles will do what you want, but it would be fairly straightforward to define a new profile that does. You would want to apply Unicode Normalization Form D ("D" for "Decomposition", meaning each accent is split off into its own combining character), remove the combining characters as part of the additional mapping rule, and then recompose with the normalization rule. Something like this:

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/secure/precis"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func main() {
    loosecompare := precis.NewIdentifier(
        precis.AdditionalMapping(func() transform.Transformer {
            return transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
                return unicode.Is(unicode.Mn, r)
            }))
        }),
        precis.Norm(norm.NFC), // This is the default; be explicit though.
    )
    p, _ := loosecompare.String("žůžo")
    fmt.Println(p, loosecompare.Compare("žůžo", "zuzo"))
    // Prints "zuzo true"
}

This lets you expand the comparison with more options later (e.g. width mapping, case mapping, etc.), as sketched below.
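For instance, width and case folding could be layered onto the same profile with the package's FoldWidth and FoldCase options (a sketch; the combined profile is illustrative, not one of the standard ones):

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/secure/precis"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func main() {
    // Same accent-stripping profile as above, with width and case folding added.
    looseCompare := precis.NewIdentifier(
        precis.AdditionalMapping(func() transform.Transformer {
            return transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
                return unicode.Is(unicode.Mn, r)
            }))
        }),
        precis.FoldWidth,  // map full-width/half-width variants
        precis.FoldCase(), // case-insensitive comparison
        precis.Norm(norm.NFC),
    )
    // With case folding in place, upper-case accented input should also match.
    fmt.Println(looseCompare.Compare("ŽŮŽO", "zuzo")) // expected: true
}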

It's also worth noting that removing accents is almost never what you actually want to do when comparing strings like this; however, without knowing your use case I can't make that assertion about your project. To prevent the proliferation of PRECIS profiles, it's good to use one of the existing profiles where possible. Also note that no effort was made to optimize the example profile.
