Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs:
- Slugs in Rails
- Slugs in Django
A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").
I'm looking for a Java slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).
A trivial slug function would be something along the lines of:
return input.toLowerCase().replaceAll("[^a-z0-9-]", "");
However, this implementation would not handle internationalization and accents (ë > e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
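To make the shortcoming concrete, here is a small sketch of the one-liner above applied to an accented input. The class name `TrivialSlug` and the sample string are made up for illustration, and `Locale.ROOT` is added (it is not in the one-liner) only so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

final class TrivialSlug {
    // The trivial approach: lower-case, then delete everything outside a-z, 0-9 and '-'.
    // Locale.ROOT is an addition for reproducibility, not part of the original one-liner.
    static String slug(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9-]", "");
    }

    public static void main(String[] args) {
        // "Citroën C4": the "ë" and the space are deleted rather than mapped to "e" and "-".
        System.out.println(slug("Citro\u00ebn C4")); // citronc4
    }
}
```

The accented letter disappears entirely instead of becoming "e", which is exactly the behaviour a more general solution should avoid.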
My question:
- What is the most general/practical way to generate Django/Rails type slugs in Java?
Normalize your string using canonical decomposition:
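import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    // NFD splits accented letters into a base letter plus combining marks
    String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
    // drop everything that is not an ASCII word character or a hyphen
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
}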
This is still a fairly naive process, though. It isn't going to do anything for s-sharp (ß - used in German), or any non-Latin-based alphabet (Greek, Cyrillic, CJK, etc).
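The limitation can be seen by looking at the decomposition step in isolation. The sketch below is only an illustration: `DecompositionDemo` and `asciiOnly` are hypothetical names, and the method combines NFD normalization with the same `[^\\w-]` filter used above.

```java
import java.text.Normalizer;

final class DecompositionDemo {
    // Decompose with NFD, then keep only ASCII word characters and '-'.
    // In Java, \w is ASCII-only unless UNICODE_CHARACTER_CLASS is set.
    static String asciiOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\w-]", "");
    }

    public static void main(String[] args) {
        // "ë" decomposes into "e" + U+0308, so the base letter survives.
        System.out.println(asciiOnly("Citro\u00ebn"));                    // Citroen
        // "ß" has no canonical decomposition, so it is simply removed.
        System.out.println(asciiOnly("Stra\u00dfe"));                     // Strae
        // Greek letters are not ASCII word characters; nothing is left.
        System.out.println(asciiOnly("\u0391\u03b8\u03ae\u03bd\u03b1"));  // (empty string)
    }
}
```

Handling such inputs needs a transliteration step (a lookup table or a library) before the filter; decomposition alone cannot provide it.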
Be careful when changing the case of a string. Upper- and lower-case forms depend on the alphabet. In Turkish, the capitalization of U+0069 (i) is U+0130 (İ), not U+0049 (I), and conversely the lower-case form of U+0049 (I) is U+0131 (ı), so you risk introducing a non-Latin-1 character back into your string if you use String.toLowerCase() under a Turkish locale.
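A short sketch of the effect, using only `String.toLowerCase(Locale)` and `String.toUpperCase(Locale)` from the standard library. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.Locale;

final class TurkishCaseDemo {
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under Turkish rules "I" lower-cases to dotless "ı" (U+0131).
        System.out.println(lower("TITLE", turkish));        // tıtle
        // With an explicit English locale the result stays in ASCII.
        System.out.println(lower("TITLE", Locale.ENGLISH)); // title
        // The reverse direction produces dotted capital "İ" (U+0130).
        System.out.println(upper("title", turkish));        // TİTLE
    }
}
```

This is why the slug code passes an explicit locale instead of relying on the JVM default.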
http://search.maven.org/#search|ga|1|slugify
And here's the GitHub repository to take a look at the code and its usage:
https://github.com/slugify/slugify
Reference library for other languages: http://www.codecodex.com/wiki/Generate_a_url_slug
I've extended the answer by @McDowell to include escaping punctuation as hyphens and to remove duplicate and leading/trailing hyphens.
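import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

private static final Pattern NONLATIN = Pattern.compile("[^\\w_-]");
// whitespace or ASCII punctuation, except the hyphen itself
private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

public static String makeSlug(String input) {
    String noseparators = SEPARATORS.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(noseparators, Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    // collapse runs of hyphens, then strip one leading and one trailing hyphen
    return slug.toLowerCase(Locale.ENGLISH).replaceAll("-{2,}","-").replaceAll("^-|-$","");
}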
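The separator handling is the part that differs from the earlier answer, so here is a reduced sketch of just that step. `SeparatorDemo` and `hyphenate` are hypothetical names, and normalization and lower-casing are deliberately left out.

```java
import java.util.regex.Pattern;

final class SeparatorDemo {
    // Character-class intersection: (whitespace or ASCII punctuation) AND (not '-').
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

    static String hyphenate(String input) {
        return SEPARATORS.matcher(input).replaceAll("-")
                .replaceAll("-{2,}", "-")   // collapse runs of hyphens into one
                .replaceAll("^-|-$", "");   // strip a leading or trailing hyphen
    }

    public static void main(String[] args) {
        System.out.println(hyphenate("Hello, World!")); // Hello-World
        System.out.println(hyphenate("C++ & Java"));    // C-Java
        // '_' is part of \p{Punct}, so underscores also become hyphens.
        System.out.println(hyphenate("foo_bar"));       // foo-bar
    }
}
```

Because runs of hyphens are collapsed before the edges are trimmed, a single-hyphen edge pattern is enough in this variant.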
The proposal by McDowell almost works, but for an input such as "Hello World !! " (with a trailing space) it returns hello-world-- (note the -- at the end of the string) instead of hello-world.
A fixed version could be:
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
private static final Pattern WHITESPACE = Pattern.compile("[\\s]");
// "-+" rather than "-" so that a whole run of edge hyphens is removed, not just one
private static final Pattern EDGESDHASHES = Pattern.compile("(^-+|-+$)");

public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(nowhitespace, Normalizer.Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    slug = EDGESDHASHES.matcher(slug).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
}
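One detail worth checking when trimming the edges: a pattern such as `(^-|-$)` removes at most one hyphen from each end, while a quantified form such as `(^-+|-+$)` removes a whole run. The sketch below compares the two; the class, field and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class EdgeHyphenDemo {
    private static final Pattern SINGLE_EDGE = Pattern.compile("(^-|-$)");
    private static final Pattern EDGE_RUNS = Pattern.compile("(^-+|-+$)");

    // Removes at most one hyphen at the start and one at the end.
    static String trimOne(String slug) {
        return SINGLE_EDGE.matcher(slug).replaceAll("");
    }

    // Removes every hyphen in a leading or trailing run.
    static String trimRuns(String slug) {
        return EDGE_RUNS.matcher(slug).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(trimOne("hello-world--"));  // hello-world-
        System.out.println(trimRuns("hello-world--")); // hello-world
        System.out.println(trimRuns("--hello-world")); // hello-world
    }
}
```

Neither pattern touches hyphen runs in the middle of a slug; collapsing those needs a separate replacement such as `-{2,}` with `-`.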