I have a Scala-based application (which thus has access to standard Java stuff), leveraging a PostgreSQL database, running on Linux.
I mention the database and OS because I know Postgres has some kind of dictionary for its text-search indexing, and I would think most Linux systems have some sort of baseline dictionary, at least for simple things like spell-check. Whether it would be easy or practical to leverage these, though, is another matter.
I don't need complete word definitions, but I need to be able to answer questions like the following:
- Which part of speech does a word belong to? (E.g., is word X a noun? Is it a verb?)
- Is a word plural? And if so, what's its singular form? (And vice versa.)
The solution doesn't need to be super-fast, but it would be great if it's at least usable for servicing web-requests where a caching solution is used in combination.
I know there are tons of options out there -- googling for "java dictionary" will unearth a load, but it's not at all clear which of these projects are still active, which are more usable (subjective, I know :P), nor which may be overkill for my purposes.
Also, a solution that works either (a) with the stack I already have in place, or (b) as a simple sbt dependency would be ideal!
As noted in the comments, you can use the dictionary on the Linux system. Mine has american-english installed in /usr/share/dict/american-english. This dictionary includes almost 100,000 words and might be ok for a simple spell check. If you need another language or language variant, you can install that via the package manager.
Using Scala and this dictionary, a simple spell check can be done by testing whether the given word exists in the set of words.
scala> scala.io.Source.fromFile("/usr/share/dict/american-english").getLines.toSet
//Removed some apostrophes for the mark down.
res0: scala.collection.immutable.Set[String] = Set(professed, groundbreakings, slenderized, Nickelodeons, pathogens, OCasey, metacarpals, pokeys, chary, purifies, Borgs, ...
scala> res0.contains("foo")
res1: Boolean = false
scala> res0.contains("computer")
res2: Boolean = true
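Outside the REPL, the same idea might look like the sketch below. It assumes Scala 2.13 or later (for scala.util.Using), and the object and method names (SpellCheck, loadWords, isKnown) are made up for illustration. The default path is the one from my system and may differ on yours.

```scala
import scala.io.Source
import scala.util.Using

object SpellCheck {
  // Assumed path; it varies by distribution and installed word-list package.
  val DefaultPath = "/usr/share/dict/american-english"

  /** Reads one word per line into a Set, closing the file afterwards. */
  def loadWords(path: String = DefaultPath): Set[String] =
    Using.resource(Source.fromFile(path)) { src =>
      src.getLines().map(_.trim).filter(_.nonEmpty).toSet
    }

  /** True if the word appears as written or in lower case. */
  def isKnown(words: Set[String], word: String): Boolean =
    words.contains(word) || words.contains(word.toLowerCase)
}
```

Loading the set once at startup and reusing it should be cheap enough for web requests, since each lookup is a hash-set membership test.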
dict is another Linux utility that can be used to find the part of speech and the plurality of a word. I'll borrow the description from its man page:
dict is a client for the Dictionary Server Protocol (DICT), a TCP
transaction based query/response protocol that provides access to
dictionary definitions from a set of natural language dictionary
databases.
The dict command can be run locally or against a server. The hard part is that you'll have to parse the output to get the info you want, which can be done in Scala or your text-parsing tool of choice. For example, dict run gives a couple of definitions for a noun and a verb, indicated by output lines starting with n and v respectively.
n 1: a score in baseball made by a runner touching all four bases safely; ...
v 1: move fast by using one's feet, with one foot off the ground at any given time;...
For plurality, dict goose outputs the following plural form of goose, which you would also have to parse out.
pl. {Geese}
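A rough sketch of that parsing in Scala follows. It only handles the two output shapes shown above ("n 1: ..." / "v 1: ..." lines and "pl. {Geese}"); the adj and adv tags are an assumption on my part, and the real output depends on which dictionary databases the server offers, so treat the patterns as a starting point. The names DictParse, partsOfSpeech, plural and lookup are hypothetical.

```scala
import scala.sys.process._
import scala.util.Try

object DictParse {
  // Matches lines such as "n 1: a score in baseball ...".
  // The adj/adv alternatives are assumed, not taken from the output above.
  private val PosLine = """^\s*(n|v|adj|adv)\s+\d+:.*""".r
  // Matches fragments such as "pl. {Geese}".
  private val Plural = """pl\.\s*\{([^}]+)\}""".r

  /** Part-of-speech tags found at the start of definition lines. */
  def partsOfSpeech(output: String): Set[String] =
    output.linesIterator.collect { case PosLine(tag) => tag }.toSet

  /** First plural form mentioned in the output, if any. */
  def plural(output: String): Option[String] =
    Plural.findFirstMatchIn(output).map(_.group(1))

  /** Runs the dict client; None if the command is missing or fails. */
  def lookup(word: String): Option[String] =
    Try(Seq("dict", word).!!).toOption
}
```

Usage would be along the lines of DictParse.lookup("goose").flatMap(DictParse.plural). Since each lookup spawns a process (and possibly a network call), caching the results is probably worthwhile.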