I'm trying to convert user input, the subject of a blog entry, into a permalink that is used as a URL and as a file system path.
I managed to do it using:
echo 'This is a used input, containing junk!!!! öäü' | tr -dc '[:alnum:] ' | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-' | sed -e 's/ö/oe/' | sed -e 's/ü/ue/' | sed -e 's/ä/ae/' | sed -e 's/ß/ss/'
Output: this-is-a-used-input-containing-junk-oau
The code absolutely works!
But is there a nicer way to do this without piping the string through this many subcommands?
Tasks to perform on the string:
- convert to lowercase
- replace space with "-"
- no multiple "-"
- no special characters, umlauts (covering German umlauts is sufficient, covering all would be a plus)
It looks like you're transliterating characters. iconv can handle this:
$ echo 'ö ä ü ß' | iconv -f utf-8 -t ascii//TRANSLIT
oe ae ue ss
This requires your locale to be set to de_DE.UTF-8 (or something similar) to get the results you expect (judging from your question and profile, I'm assuming you're dealing with German text).
To set this for just the iconv command, use something like:
$ echo 'ö ä ü ß' | LC_ALL=de_DE.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT
It's also possible you're not using UTF-8 but ISO-8859-1 or ISO-8859-15; consider switching to UTF-8 if possible, or adjust the -f parameter accordingly.
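As a minimal sketch of the "switch to UTF-8" route, assuming the input really is ISO-8859-1, the bytes can be re-encoded before any further processing. The function name latin1_to_utf8 is made up for this illustration:

```shell
# latin1_to_utf8: hypothetical helper name, used only for this illustration.
# Re-encodes ISO-8859-1 input on stdin as UTF-8 on stdout.
latin1_to_utf8() {
    iconv -f iso-8859-1 -t utf-8
}

# \366 is the single ISO-8859-1 byte for "ö"; after conversion it becomes
# the two-byte UTF-8 sequence 0xC3 0xB6.
printf '\366\n' | latin1_to_utf8
```

To transliterate directly instead, the same -f iso-8859-1 can be combined with -t ascii//TRANSLIT; the result then depends on the locale as described above.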
Unfortunately, GNU tr (i.e. the one on Linux systems) is stuck in the 7-bit ASCII days(!): it works on single bytes rather than characters, so it cannot convert the case of multibyte UTF-8 characters such as Ä, only of plain A to Z.
Since you are converting your string to 7-bit ASCII anyway, we can use tr after iconv for it to work as expected:
echo 'ö ä ü ß' | iconv -f utf-8 -t ascii//TRANSLIT | \
tr '[:upper:]' '[:lower:]'
I don't see a problem with the three tr invocations; they all do something different: one converts uppercase to lowercase, one deletes everything except alphanumerics and spaces, and one turns spaces into hyphens while squeezing repeats.
Combining it all into one "smart" command might look good now, but maybe not so good to the guy or gal who has to maintain it in three years' time :-)
Putting it all together, and adding some line breaks, we end up with:
$ echo 'ö ä ü ß' | \
iconv -f utf-8 -t ascii//TRANSLIT | \
tr '[:upper:]' '[:lower:]' | \
tr -dc '[:alnum:] ' | \
tr -s ' ' '-'
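For reuse, the pipeline above could be wrapped in a small shell function. This is only a sketch: the name slugify is hypothetical, and it assumes an iconv that understands //TRANSLIT (as GNU iconv does). How umlauts come out still depends on the locale, as noted earlier.

```shell
# slugify: hypothetical helper name, not part of the original answer.
# Takes the title as arguments and prints a permalink-style slug.
slugify() {
    printf '%s\n' "$*" |
        iconv -f utf-8 -t ascii//TRANSLIT |   # transliterate to 7-bit ASCII
        tr '[:upper:]' '[:lower:]' |          # lowercase (safe now: ASCII only)
        tr -dc '[:alnum:] ' |                 # keep only letters, digits, spaces
        tr -s ' ' '-'                         # spaces to hyphens, squeeze repeats
}
```

For example, slugify 'This is a Title, with JUNK!!!!' prints this-is-a-title-with-junk. Note that tr -dc also removes the trailing newline, and that leading or trailing spaces in the input would become leading or trailing hyphens.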
You can shorten the sed part somewhat by combining the expressions into a single invocation:
echo 'This is a used input, containing junk!!!! öäü' | tr -dc '[:alnum:] ' | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-' | sed 's/ö/oe/;s/ü/ue/;s/ä/ae/;s/ß/ss/'
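One caveat with substitutions of this form: without the g flag, each s command replaces only the first match on a line, and depending on the tr implementation and locale, tr -dc may already have removed the umlauts before sed sees them. A hedged sketch that avoids both issues is to run the substitutions first, with g, and pipe the result into the tr steps afterwards. The function name umlauts_to_ascii is invented for this example, and the uppercase mappings are an addition not present in the original command:

```shell
# umlauts_to_ascii: hypothetical helper name for illustration.
# Replaces every German umlaut and ß on stdin with its ASCII spelling.
umlauts_to_ascii() {
    sed 's/ä/ae/g; s/ö/oe/g; s/ü/ue/g; s/Ä/Ae/g; s/Ö/Oe/g; s/Ü/Ue/g; s/ß/ss/g'
}

# Run it before the tr steps so the umlauts are still present.
echo 'Größe und Öl' | umlauts_to_ascii | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-'
```

Because the patterns are literal byte sequences, this does not depend on iconv or on a German locale; it only assumes the script and the input use the same encoding.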
Let's say you have the following line:
This is a used input, *%$^$^%$[] containing junk!!!! öäü ÄÜßÖ
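You can use GNU sed as follows (this assumes a UTF-8 locale):
sed -r 's/[ ]+/-/g;s/[^[:alnum:]-]+//g;s/-+/-/g;s/ß/ss/g;y/üöäÄÜÖ/uoaauo/;s/.*/\L&/'
Trying the above string with this command:
echo 'This is a used input, *%$^$^%$[] containing junk!!!! öäü ÄÜßÖ' |sed -r 's/[ ]+/-/g;s/[^[:alnum:]-]+//g;s/-+/-/g;s/ß/ss/g;y/üöäÄÜÖ/uoaauo/;s/.*/\L&/'
results in:
this-is-a-used-input-containing-junk-oau-ausso
Note: \L lowercases everything up to the end of the replacement (\l would affect only the next character), so all upper case characters (including upper case umlauts) are converted to lower case as requested. ß is handled by a separate s command because y maps single characters only.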
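With Perl:
#!/usr/bin/perl
use strict; use warnings;
use Text::Iconv;
# Text::Iconv works on raw bytes, so input is read without an encoding layer.
my $converter = Text::Iconv->new("UTF-8", "ascii//TRANSLIT");
while (my $line = <>) {
    chomp $line;                         # keep the newline out of the \s substitution
    $line = $converter->convert($line);  # transliterate to ASCII
    next unless defined $line;           # convert() returns undef on failure
    $line = lc $line;                    # lowercase
    $line =~ s/[[:punct:]]//g;           # drop punctuation
    $line =~ s/\s+/_/g;                  # runs of whitespace become one underscore
    print "$line\n";
}
USAGE:
echo 'This is a used input, containing JUNK!!!! öäü' | ./script.pl
OUTPUT:
this_is_a_used_input_containing_junk_oeaeue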