H2O recently added word2vec to its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However, even greater possibilities come from big data and big computers of the kind that software vendors like Google or H2O.ai have access to. Few end users of H2O have that kind of access, because of network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, a data science pipeline can get great value from using pretrained word vectors, built on a very large corpus, as infrastructure for specific applications. Using general-purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to reusing the generic lowest layers of a deep learning computer vision model, which learn to detect edges in photographs, while higher layers detect specific kinds of objects composed from the edges found by the layers below them.
For example, Google provides some pretrained word vectors with their word2vec package. With unsupervised learning it is often true that the more examples, the better. Further, it is sometimes practically difficult for an individual data scientist to download a giant corpus of text on which to train their own word vectors. And there is no good reason for every user to reinvent the wheel by training word vectors themselves on the same general-purpose corpora, like Wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by word embeddings.
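To make the comparison with TF-IDF concrete: one common way to reuse pretrained vectors is to average the vectors of a document's words into a fixed-length feature vector. Below is a minimal sketch in plain Python. The tiny `PRETRAINED` table and the `document_vector` helper are made up for illustration and are not part of H2O or any other library.

```python
# Hypothetical stand-in for a large pretrained embedding table (made-up values).
PRETRAINED = {
    "court": [1.0, 0.0],
    "judge": [0.0, 1.0],
    "ruling": [1.0, 1.0],
}


def document_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that appear in the embedding table.

    Tokens missing from the table are skipped. If no token is known,
    a zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    hits = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        hits += 1
        for i, value in enumerate(vector):
            total[i] += value
    if hits == 0:
        return total
    return [value / hits for value in total]


# Usage: the resulting list can serve as the feature row for one document.
features = document_vector("the judge read the ruling".split(), PRETRAINED, dim=2)
```

Unlike a TF-IDF row, the length of this feature vector depends only on the embedding dimension, not on the size of the vocabulary.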
Three questions:
1 - Does H2O currently provide any general-purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other publicly owned (government) websites, or Wikipedia, Twitter, or Craigslist, or other free or openly licensed sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors built on more specialized corpora, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?
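For question 3, the kind of conversion I have in mind is sketched below. It assumes the pretrained vectors are stored in the binary layout written by the original word2vec C tool: an ASCII header line with the vocabulary size and dimension, then for each word the word itself, a space, and the vector as 32-bit floats. The function names and the CSV layout (word in the first column, one column per component) are my own assumptions, not H2O API. Whether H2O can turn such a table into a usable word2vec model is exactly what I am asking.

```python
import csv
import io
import struct


def read_word2vec_binary(stream):
    """Yield (word, vector) pairs from a binary stream in the word2vec C format."""
    # Header line: "<vocab_size> <dim>\n", written as ASCII text.
    vocab_size, dim = (int(x) for x in stream.readline().decode("ascii").split())
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                word += ch
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended in the middle of a vector")
        # Assumption: little-endian 32-bit floats, as written on x86 machines.
        yield word.decode("utf-8", errors="replace"), struct.unpack(f"<{dim}f", raw)


def write_vectors_csv(pairs, out):
    """Write one row per word: the word, then its vector components."""
    writer = csv.writer(out)
    for word, vector in pairs:
        writer.writerow([word, *vector])


# Tiny in-memory file standing in for the real download (made-up values).
sample = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 2.0, 0.5) + b"\n"
    + b"dog " + struct.pack("<3f", -1.0, 0.25, 4.0) + b"\n"
)
vectors = list(read_word2vec_binary(io.BytesIO(sample)))
csv_text = io.StringIO()
write_vectors_csv(vectors, csv_text)
```

For a real file, the stream would be opened with `open(path, "rb")`, and the CSV could then be loaded like any other tabular data.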