Subject to permlink conversion in bash

Posted 2019-08-09 17:02

Question:

I'm trying to convert user input, the subject of a blog entry, into a permlink that is used as a URL and a file system path.

I managed to do it using:

echo 'This is a used input, containing junk!!!! öäü' | tr -dc '[:alnum:] ' | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-' | sed -e 's/ö/oe/' | sed -e 's/ü/ue/' | sed -e 's/ä/ae/' | sed -e 's/ß/ss/'

Output: this-is-a-used-input-containing-junk-oau

The code absolutely works!

But is there a nicer way to do this without passing the string through this many subcommands?

Tasks to perform on the string:

  • convert to lowercase
  • replace spaces with "-"
  • no repeated "-"
  • no special characters or umlauts (covering German umlauts is sufficient, covering all would be a plus)

Answer 1:

It looks like you're transliterating characters. iconv can handle this:

$ echo 'ö ä ü ß' | iconv -f utf-8 -t ascii//TRANSLIT
oe ae ue ss

This requires your locale to be set to de_DE.UTF-8 (or something similar) to get the results you expect (from your question & profile, I've made the assumption you're dealing with German text).

To set this for just the iconv command, use something like:

$ echo 'ö ä ü ß' | LC_ALL=de_DE.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT

It's also possible you're not using UTF-8 but ISO-8859-1 or ISO-8859-15; consider switching to UTF-8 if possible, or adjust the -f parameter accordingly.
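
If the input does turn out to be ISO-8859-1, one option is to re-encode it as UTF-8 first, so that the rest of the pipeline only ever sees one encoding. A minimal sketch, assuming the input really is ISO-8859-1; the helper name to_utf8 is made up for illustration:

```shell
# Hypothetical helper: re-encode ISO-8859-1 input on stdin as UTF-8.
to_utf8() {
    iconv -f iso-8859-1 -t utf-8
}

# In ISO-8859-1, 'ö' is the single byte 0xF6 (octal 366).
# After conversion it becomes the two UTF-8 bytes 0xC3 0xB6.
printf '\366\n' | to_utf8
```

The output can then be fed into iconv -f utf-8 as shown above.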

Unfortunately, GNU tr (i.e. on Linux systems) is stuck in the 7-bit ASCII days(!): it works on single bytes rather than multibyte characters, so in a UTF-8 locale it can't convert the case of anything other than a to z.
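
To see why tr has trouble with umlauts in UTF-8, it helps to count bytes: ö is encoded as two bytes, so there is no single byte that could be mapped to another. A quick illustrative check (the octal escapes are the UTF-8 encoding of ö; the variable name is arbitrary):

```shell
# UTF-8 encodes 'ö' (U+00F6) as the two bytes 0xC3 0xB6 (octal 303 266).
# wc -c counts bytes, not characters; tr -d ' ' strips padding some
# wc implementations add.
umlaut_bytes=$(printf '\303\266' | wc -c | tr -d ' ')
echo "$umlaut_bytes"
```

This prints 2, even though the input is a single character.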

Since you are converting your string to 7-bit ASCII anyway, we can run tr after iconv and it will work as expected:

echo 'ö ä ü ß' | iconv -f utf-8 -t ascii//TRANSLIT | \
    tr '[:upper:]' '[:lower:]' 

I don't see a problem with the three tr invocations; each does something different: one converts uppercase to lowercase, one deletes everything except alphanumerics and spaces, and one turns runs of spaces into a single "-".
Combining them into one "smart" command might look good now, but maybe not so good for the guy or gal who has to maintain it in three years' time :-)

Putting it all together, and adding some line breaks, we end up with:

$ echo 'ö ä ü ß' | \
    iconv -f utf-8 -t ascii//TRANSLIT | \
    tr '[:upper:]' '[:lower:]' | \
    tr -dc '[:alnum:] ' | \
    tr -s ' ' '-'
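
For repeated use, the pipeline above could be wrapped in a small function. This is only a sketch: the name permlink is made up here, and what iconv does with non-ASCII characters still depends on the iconv implementation and the active locale, so the example call sticks to ASCII input.

```shell
# Hypothetical wrapper around the pipeline above ("permlink" is an
# illustrative name, not an existing command).
permlink() {
    printf '%s' "$1" |
        iconv -f utf-8 -t ascii//TRANSLIT |   # transliterate to ASCII
        tr '[:upper:]' '[:lower:]' |          # lowercase
        tr -dc '[:alnum:] ' |                 # keep only alphanumerics and spaces
        tr -s ' ' '-'                         # spaces -> single "-"
}

permlink 'Hello,   World!!'
echo    # the pipeline emits no trailing newline, so add one for display
```

Note that leading or trailing spaces in the input would end up as a leading or trailing "-"; trimming those is left out to keep the sketch close to the pipeline above.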


Answer 2:

You can shorten the sed part by combining the substitutions into a single invocation:

echo 'This is a used input, containing junk!!!! öäü' | tr -dc '[:alnum:] ' | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-' | sed 's/ö/oe/;s/ü/ue/;s/ä/ae/;s/ß/ss/'
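
Two things to watch with this approach: without the g flag, each s command replaces only the first match on a line, and on systems where tr drops non-ASCII bytes the umlauts have to be replaced before tr -dc runs. A sketch of that reordering, assuming both the script and the input are UTF-8; the function name slug_sed and the upper-case mappings are additions for illustration, and LC_ALL=C is set so that the tools treat the data as plain bytes regardless of the user's locale:

```shell
# Hypothetical helper: replace German umlauts first, then clean up with tr.
slug_sed() {
    printf '%s' "$1" |
        LC_ALL=C sed 's/ö/oe/g;s/ü/ue/g;s/ä/ae/g;s/ß/ss/g;s/Ö/oe/g;s/Ü/ue/g;s/Ä/ae/g' |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |
        LC_ALL=C tr -dc '[:alnum:] ' |
        LC_ALL=C tr -s ' ' '-'
}

slug_sed 'Übergröße  in Öl!'
echo    # add a newline for display
```

With this ordering, any remaining non-ASCII characters are simply deleted by tr -dc.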


Answer 3:

Let's say you have the following line:

This is a used input, *%$^$^%$[]             containing junk!!!! öäü  ÄÜßÖ

Using the sed command as follows (\L lowercases the whole line; this requires GNU sed):

sed -r 's/[ ]+/-/g;s/[^[:alnum:]-]+//g;s/-+/-/g;y/üöäÄÜÖß/uoaauob/;s/.*/\L&/'

Trying the above string with this command:

echo 'This is a used input, *%$^$^%$[]             containing junk!!!! öäü  ÄÜßÖ' | sed -r 's/[ ]+/-/g;s/[^[:alnum:]-]+//g;s/-+/-/g;y/üöäÄÜÖß/uoaauob/;s/.*/\L&/'

gives:

this-is-a-used-input-containing-junk-oau-aubo

Note: all upper-case characters (including upper-case umlauts) are converted to lower case, as requested.
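
A side note on the case-conversion escapes in GNU sed, since they are easy to mix up: \l lowercases only the next character of the replacement, while \L lowercases everything up to \E or the end of the replacement. A small illustration, assuming GNU sed (these escapes are a GNU extension and may be missing from other sed implementations); the variable names are arbitrary:

```shell
# \l affects only the first character of the matched text ...
first_only=$(echo 'HELLO World' | sed 's/.*/\l&/')
# ... while \L affects the whole match.
whole_line=$(echo 'HELLO World' | sed 's/.*/\L&/')

echo "$first_only"
echo "$whole_line"
```

With GNU sed this prints "hELLO World" followed by "hello world", so \L is the one to use when the input may contain upper-case letters beyond the first character.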



Answer 4:

With Perl:

#!/usr/bin/perl
use strict; use warnings;
use utf8;
binmode $_, ":utf8" for qw/STDOUT STDIN STDERR/;
use Text::Iconv;
my $converter = Text::Iconv->new("UTF-8", "ascii//TRANSLIT");

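while (my $line = <>) {
    chomp $line;                          # drop the newline so it isn't turned into "_"
    $line = $converter->convert($line);   # transliterate to ASCII
    $line = lc $line;                     # lowercase
    $line =~ s/[[:punct:]]//g;            # strip punctuation
    $line =~ s/\s/_/g;                    # whitespace -> underscore
    print "$line\n";
}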

USAGE:

echo 'This is a used input, containing JUNK!!!! öäü' | ./script.pl

OUTPUT:

this_is_a_used_input_containing_junk_oeaeue