How do I get the set of all letters in Java/Clojure?

Posted 2019-02-06 11:38

In Python, I can do this:

>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there any way to do something similar in Clojure (apart from copying and pasting the above characters somewhere)? I looked through both the Clojure standard library and the Java standard library and couldn't find it.

8 answers
淡お忘
#2 · 2019-02-06 11:44

I'm pretty sure the letters aren't available in the standard library, so you're probably left with the manual approach.

贪生不怕死
#3 · 2019-02-06 11:47

No, because that prints only the ASCII letters rather than the full set. Of course, it's trivial to print the 26 lowercase and 26 uppercase letters using two for loops, but there are many more "letters" outside the first 128 code points. Java's Character.isLetter method returns true for these and many others.
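
To make the difference concrete, here is a minimal sketch that counts what Character.isLetter accepts in the ASCII range and in the whole range a Java char can hold. The class and method names (LetterCount, countLetters) are made up for this illustration.

```java
// Illustrative sketch only: LetterCount and countLetters are hypothetical names.
class LetterCount {

    // Counts the code points in [0, limit) that Character.isLetter accepts.
    static int countLetters(int limit) {
        int count = 0;
        for (int cp = 0; cp < limit; cp++) {
            if (Character.isLetter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // ASCII (code points 0..127) contains only a-z and A-Z.
        System.out.println("ASCII letters: " + countLetters(128));
        // Everything a Java char can hold (0..0xFFFF) contains far more letters;
        // the exact number depends on the Unicode version shipped with the JDK.
        System.out.println("char-range letters: " + countLetters(Character.MAX_VALUE + 1));
    }
}
```

The first line reports 52. The second number varies between JDK releases, so it is not shown here.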

欢心
#4 · 2019-02-06 11:48

Based on Michael's imperative Java solution, this is an idiomatic Clojure solution using lazy sequences:

(ns stackoverflow
  (:import (java.nio.charset Charset CharsetEncoder)))

(defn all-letters [charset]
  (let [encoder (. (Charset/forName charset) newEncoder)]
    (letfn [(valid-char? [c]
              (and (.canEncode encoder (char c)) (Character/isLetter c)))
            (all-letters-lazy [c]
              (when (<= c (int Character/MAX_VALUE))
                (if (valid-char? c)
                  (lazy-seq
                   (cons (char c) (all-letters-lazy (inc c))))
                  (recur (inc c)))))]
      (all-letters-lazy 0))))

Update: Thanks cgrand for this preferable high-level solution:

(defn letters [charset-name]
  (let [ce (-> charset-name java.nio.charset.Charset/forName .newEncoder)]
    (->> (range 0 (int Character/MAX_VALUE)) (map char)
         (filter #(and (.canEncode ce %) (Character/isLetter %))))))

But the performance comparison between my first approach

user> (time (doall (stackoverflow/all-letters "ascii")))
"Elapsed time: 33.333336 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

and your solution

user> (time (doall (stackoverflow/letters "ascii")))
"Elapsed time: 666.666654 msecs"
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

is quite interesting: in this single run, the first approach is roughly 20 times faster.

beautiful°
#5 · 2019-02-06 11:48

In case you don't remember the code point ranges, here is the brute-force way :-P

user> (require '[clojure.contrib.str-utils2 :as stru2])
nil
user> (set (stru2/replace (apply str (map char (range 0 256))) #"[^A-Za-z]" ""))
#{\A \a \B \b \C \c \D \d \E \e \F \f \G \g \H \h \I \i \J \j \K \k \L \l \M \m \N \n \O \o \P \p \Q \q \R \r \S \s \T \t \U \u \V \v \W \w \X \x \Y \y \Z \z}
user> 
Root(大扎)
#6 · 2019-02-06 11:52

The following code produces the same result as in your question. Unlike in Python, you have to build the string yourself:

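public class Letters {

    // Builds "a".."z" followed by "A".."Z", the same order as in the question.
    public static String asString() {
        // StringBuilder suffices here: the buffer never leaves this method.
        StringBuilder buffer = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++)
            buffer.append(c);
        for (char c = 'A'; c <= 'Z'; c++)
            buffer.append(c);
        return buffer.toString();
    }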

    public static void main(String[] args) {
        System.out.println(Letters.asString());
    }

}
Root(大扎)
#7 · 2019-02-06 12:05

string.letters: The concatenation of the strings lowercase and uppercase described below. The specific value is locale-dependent, and will be updated when locale.setlocale() is called.

I modified the answer from Michael Borgwardt. My implementation keeps two buffers, lowerCases and upperCases, for two reasons:

  1. string.letters lists the lowercase letters first, followed by the uppercase letters.

  2. Java's Character.isLetter(char) accepts more than just uppercase and lowercase letters, so using it returns too many results under some charsets, for example "windows-1252".

From the API documentation for Character.isLetter(char):

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:

* UPPERCASE_LETTER
* LOWERCASE_LETTER
* TITLECASE_LETTER
* MODIFIER_LETTER
* OTHER_LETTER 

Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.

So if string.letters should only return lowercase and uppercase letters, the TITLECASE_LETTER, MODIFIER_LETTER and OTHER_LETTER characters have to be ignored.
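
To illustrate the categories quoted above, here is a small sketch that prints the general category of a few hand-picked characters. The class name LetterCategories, the helper categoryName and the sample characters are my own choices for this illustration, not part of any API.

```java
// Illustrative sketch only: LetterCategories and categoryName are hypothetical names.
class LetterCategories {

    // Maps the five letter categories from the isLetter documentation to readable names.
    static String categoryName(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER: return "UPPERCASE_LETTER";
            case Character.LOWERCASE_LETTER: return "LOWERCASE_LETTER";
            case Character.TITLECASE_LETTER: return "TITLECASE_LETTER";
            case Character.MODIFIER_LETTER:  return "MODIFIER_LETTER";
            case Character.OTHER_LETTER:     return "OTHER_LETTER";
            default:                         return "NOT_A_LETTER";
        }
    }

    public static void main(String[] args) {
        // Samples picked for illustration: A, a, U+01C5, U+02B0, U+05D0 and a digit.
        char[] samples = {'A', 'a', '\u01C5', '\u02B0', '\u05D0', '1'};
        for (char c : samples) {
            System.out.printf("U+%04X %-17s upper=%b lower=%b%n",
                    (int) c, categoryName(c),
                    Character.isUpperCase(c), Character.isLowerCase(c));
        }
    }
}
```

U+01C5 (a titlecase letter) and U+05D0 (Hebrew alef, an "other" letter) are letters for which both isUpperCase and isLowerCase return false, which is why filtering on case drops them.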

public static String allLetters(final Charset charset) {
    final CharsetEncoder encoder = charset.newEncoder();
    final StringBuilder lowerCases = new StringBuilder();
    final StringBuilder upperCases = new StringBuilder();
    for (char c = 0; c < Character.MAX_VALUE; c++) {
        if (encoder.canEncode(c)) {
            if (Character.isUpperCase(c)) {
                upperCases.append(c);
            } else if (Character.isLowerCase(c)) {
                lowerCases.append(c);
            }
        }
    }
    return lowerCases.append(upperCases).toString();
}
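
As a usage sketch, the same idea can be packed into a runnable class and called with the US-ASCII charset, which should reproduce the string from the question. The class name AsciiLettersDemo and the method name casedLetters are hypothetical; the loop is a condensed variant of allLetters above, not a different technique.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: AsciiLettersDemo and casedLetters are hypothetical names.
class AsciiLettersDemo {

    // Collects the encodable lowercase letters first, then the encodable uppercase letters.
    static String casedLetters(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder lower = new StringBuilder();
        StringBuilder upper = new StringBuilder();
        for (char c = 0; c < Character.MAX_VALUE; c++) {
            if (!encoder.canEncode(c)) {
                continue; // character does not exist in this charset
            }
            if (Character.isLowerCase(c)) {
                lower.append(c);
            } else if (Character.isUpperCase(c)) {
                upper.append(c);
            }
        }
        return lower.append(upper).toString();
    }

    public static void main(String[] args) {
        // US-ASCII can only encode code points 0..127, so only a-z and A-Z remain.
        System.out.println(casedLetters(StandardCharsets.US_ASCII));
        // A larger charset yields more letters; the exact set depends on the JDK.
        System.out.println(casedLetters(StandardCharsets.ISO_8859_1).length());
    }
}
```

With US-ASCII the first line is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ, matching string.letters from the question.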

Additionally, the behaviour of string.letters changes when the locale changes. This probably does not apply to my solution, because changing the default locale does not change the default charset. From the API documentation:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

I guess the default charset cannot be changed once the JVM has started, so the "change locale" behaviour of string.letters cannot be reproduced with just Locale.setDefault(Locale). Changing the default locale is a bad idea anyway:

Since changing the default locale may affect many different areas of functionality, this method should only be used if the caller is prepared to reinitialize locale-sensitive code running within the same Java Virtual Machine.
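
The guess about the default charset can be checked with a few lines. This is only a sketch under the assumption that the JDK in use behaves as the quoted documentation describes; the class name DefaultCharsetCheck and the method charsetSurvivesLocaleChange are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Illustrative sketch only: DefaultCharsetCheck is a hypothetical name.
class DefaultCharsetCheck {

    // Returns true if switching the default locale leaves the default charset unchanged.
    // The original default locale is restored before returning.
    static boolean charsetSurvivesLocaleChange(Locale other) {
        Locale original = Locale.getDefault();
        Charset before = Charset.defaultCharset();
        try {
            Locale.setDefault(other);
            return before.equals(Charset.defaultCharset());
        } finally {
            Locale.setDefault(original);
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after Locale.setDefault: "
                + charsetSurvivesLocaleChange(Locale.GERMANY));
    }
}
```

The locale is restored in a finally block precisely because of the warning quoted above: other locale-sensitive code in the same JVM would otherwise see the changed default.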
