-->

How do I get the set of all letters in Java/Clojur

2019-02-06 11:26发布

问题:

In Python, I can do this:

>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there any way to do something similar in Clojure (apart from copying and pasting the above characters somewhere)? I looked through both the Clojure standard library and the java standard library and couldn't find it.

回答1:

A properly non-ASCII-centric implementation:

private static String allLetters(String charsetName)
{
    CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
    StringBuilder result = new StringBuilder();
    for(char c=0; c<Character.MAX_VALUE; c++)
    {
        if(ce.canEncode(c) && Character.isLetter(c))
        {
            result.append(c);
        }
    }
    return result.toString();
}

Call this with "US-ASCII" and you'll get the desired result (except that uppercase letters come first). You could call it with Charset.defaultCharset(), but I suspect that you'd get far more than the ASCII letters on most systems, even in the USA.

Caveat: only considers the basic multilingual plane. Wouldn't be too hard to extend to the supplementary planes, but it would take a lot longer, and the utility is questionable.



回答2:

If you just want Ascii chars,

(map char (concat (range 65 91) (range 97 123)))

will yield,

(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z 
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)


回答3:

Based on Michaels imperative Java solution, this is a idiomatic (lazy sequences) Clojure solution:

(ns stackoverflow
  (:import (java.nio.charset Charset CharsetEncoder)))

(defn all-letters [charset]
  (let [encoder (. (Charset/forName charset) newEncoder)]
    (letfn [(valid-char? [c]
             (and (.canEncode encoder (char c)) (Character/isLetter c)))
        (all-letters-lazy [c]
                  (when (<= c (int Character/MAX_VALUE))
                (if (valid-char? c)
                  (lazy-seq
                   (cons (char c) (all-letters-lazy (inc c))))
                  (recur (inc c)))))]
      (all-letters-lazy 0))))

Update: Thanks cgrand for this preferable high-level solution:

(defn letters [charset-name]
  (let [ce (-> charset-name java.nio.charset.Charset/forName .newEncoder)]
    (->> (range 0 (int Character/MAX_VALUE)) (map char)
         (filter #(and (.canEncode ce %) (Character/isLetter %))))))

But the performace comparison between my first approach

user> (time (doall (stackoverflow/all-letters "ascii"))) 
"Elapsed time: 33.333336 msecs"                                                  
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z \\
a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)  

and your solution

user> (time (doall (stackoverflow/letters "ascii"))) 
"Elapsed time: 666.666654 msecs"                                                 
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z \\
a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z) 

is quite interesting.



回答4:

No, because that is just printing out the ASCII letters rather than the full set. Of course, it's trivial to print out the 26 lower case and upper case letters using two for loops but the fact is that there are many more "letters" outside of the first 127 code points. Java's "isLetter" fn on Character will be true for these and many others.



回答5:

string.letters: The concatenation of the strings lowercase and uppercase described below. The specific value is locale-dependent, and will be updated when locale.setlocale() is called.

I modified the answer from Michael Borgwardt. In my implementation there are two lists lowerCases and upperCases for two reasons:

  1. string.letters is lowercases followed by uppercases.

  2. Java Character.isLetter(char) is more than just uppercases and lowercases, so use of Character.isLetter(char) will return to much results under some charsets, for example "windows-1252"

From Api-Doc: Character.isLetter(char):

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:

* UPPERCASE_LETTER
* LOWERCASE_LETTER
* TITLECASE_LETTER
* MODIFIER_LETTER
* OTHER_LETTER 

Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.

So if string.letters should only return lowercases and uppercases, the TITLECASE_LETTER, ,MODIFIER_LETTER and OTHER_LETTER chars have to be ignored.

public static String allLetters(final Charset charset) {
    final CharsetEncoder encoder = charset.newEncoder();
    final StringBuilder lowerCases = new StringBuilder();
    final StringBuilder upperCases = new StringBuilder();
    for (char c = 0; c < Character.MAX_VALUE; c++) {
    if (encoder.canEncode(c)) {
    if (Character.isUpperCase(c)) {
    upperCases.append(c);
    } else if (Character.isLowerCase(c)) {
    lowerCases.append(c);
    }
    }
    }
    return lowerCases.append(upperCases).toString();
}

Additionally: the behaviour of string.letters changes when changing the locale. This maybe won't apply to my solution, because changing the default locale does not change the default charset. From apiDoc:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

I guess, the default charset cannot be changed within the started JVM. So the "change locale" behaviour of string.letters can not be realizied with just Locale.setDefault(Locale). But changing the default locale is anyway a bad idea:

Since changing the default locale may affect many different areas of functionality, this method should only be used if the caller is prepared to reinitialize locale-sensitive code running within the same Java Virtual Machine.



回答6:

I'm pretty sure the letters aren't available in the standard library, so you're probably left with the manual approach.



回答7:

In case you don't remember code points ranges. Brute force way :-P :

user> (require '[clojure.contrib.str-utils2 :as stru2])
nil
user> (set (stru2/replace (apply str (map char (range 0 256))) #"[^A-Za-z]" ""))
#{\A \a \B \b \C \c \D \d \E \e \F \f \G \g \H \h \I \i \J \j \K \k \L \l \M \m \N \n \O \o \P \p \Q \q \R \r \S \s \T \t \U \u \V \v \W \w \X \x \Y \y \Z \z}
user> 


回答8:

The same result as mentioned in your question would be given by the following statement that has to be manually generated in contrast to the Python solution:

public class Letters {

    public static String asString() {
        StringBuffer buffer = new StringBuffer();
        for (char c = 'a'; c <= 'z'; c++)
            buffer.append(c);
        for (char c = 'A'; c <= 'Z'; c++)
            buffer.append(c);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(Letters.asString());
    }

}