I am using gensim to load a pre-trained fastText model. I downloaded the model trained on English Wikipedia from the fastText website.
Here is the code I wrote to load the pre-trained model:
from gensim.models import FastText as ft
model = ft.load_fasttext_format("wiki.en.bin")
I then check whether the following phrase exists in the vocabulary (which is unlikely, since this is a pre-trained model):
print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)
False
True
So the phrase "internal executive" is not present in the vocabulary, but a word vector is still returned for it:
model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
-0.06304111, 0.07833624, -0.17026938, -0.21922196, 0.01146349,
-0.13639058, 0.17283678, -0.09251394, -0.17875175, 0.01339212,
-0.26683623, 0.05487974, -0.11843193, -0.01982722, 0.37037706,
-0.24370994, 0.14269598, -0.16363597, 0.00328478, -0.16560239,
-0.1450972 , -0.24787527, -0.01318423, 0.03277111, 0.16175713,
-0.19367714, 0.16955379, 0.1972683 , 0.09044111, 0.01731548,
-0.0034324 , -0.04834719, 0.14321515, 0.01422525, -0.08803893,
-0.29411593, -0.1033244 , 0.06278021, 0.16452256, 0.0650492 ,
0.1506474 , -0.14194389, 0.10778475, 0.16008648, -0.07853138,
0.2183501 , -0.25451994, -0.0345991 , -0.28843886, 0.19964759,
-0.10923116, 0.26665714, -0.02544454, 0.30637854, 0.04568949,
-0.04798719, -0.05769338, 0.25762403, -0.05158515, -0.04426906,
-0.19901046, 0.00894193, -0.17269588, -0.24747233, -0.19061406,
0.14322804, -0.10804397, 0.4002605 , 0.01409482, -0.04675362,
0.10039093, 0.07260711, -0.0938239 , -0.20434211, 0.05741301,
0.07592541, -0.02921724, 0.21137556, -0.23188967, -0.23164661,
-0.4569614 , 0.07434579, 0.10841205, -0.06514647, 0.01220404,
0.02679767, 0.11840229, 0.2247431 , -0.1946325 , -0.0990666 ,
-0.02524677, 0.0801085 , 0.02437297, 0.00674876, 0.02088535,
0.21464555, -0.16240154, 0.20670174, -0.21640894, 0.03900698,
0.21772243, 0.01954809, 0.04541844, 0.18990673, 0.11806394,
-0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
0.1967368 , 0.09804545, 0.1440366 , -0.08401451, -0.03715726,
0.27826542, -0.25195453, -0.16737154, 0.3561183 , -0.15756823,
0.06724873, -0.295487 , 0.28395334, -0.04908851, 0.09448399,
0.10877471, -0.05020981, -0.24595442, -0.02822314, 0.17862654,
0.06452435, -0.15105674, -0.31911567, 0.08166212, 0.2634299 ,
0.17043628, 0.10063848, 0.0687021 , -0.12210461, 0.10803893,
0.13644943, 0.10755012, -0.09816817, 0.11873955, -0.03881042,
0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
0.05387992, 0.0497043 , 0.06922272, -0.0089245 , 0.24790663,
0.27209425, -0.04925154, -0.08621719, 0.15918174, 0.25831223,
0.01654229, -0.03617229, -0.13490392, 0.08033483, 0.34922174,
-0.01744722, -0.16894792, -0.10506647, 0.21708378, -0.22582002,
0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
-0.06613475, -0.08779443, -0.10732629, 0.05967236, -0.02455976,
0.2229451 , -0.19476262, -0.2720119 , 0.03687386, -0.01220259,
0.07704347, -0.1674307 , 0.2400516 , 0.07338555, -0.2000631 ,
0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
0.41587186, 0.04643605, 0.03352945, -0.13700874, 0.16430037,
-0.13630766, -0.18546128, -0.04692861, 0.37308362, -0.30846512,
0.5535561 , -0.11573419, 0.2332801 , -0.07236694, -0.01018955,
0.05936847, 0.25877884, -0.2959846 , -0.13610311, 0.10905041,
-0.18220575, 0.06902339, -0.10624941, 0.33002165, -0.12087796,
0.06742091, 0.20762768, -0.34141317, 0.0884434 , 0.11247049,
0.14748637, 0.13261876, -0.07357208, -0.11968047, -0.22124515,
0.12290633, 0.16602683, 0.01055585, 0.04445777, -0.11142147,
0.00004863, 0.22543314, -0.14342701, -0.23209116, -0.00003538,
0.19272381, -0.13767233, 0.04850799, -0.281997 , 0.10343244,
0.16510887, 0.08671653, -0.24125539, 0.01201926, 0.0995285 ,
0.09807415, -0.06764816, -0.0206733 , 0.04697794, 0.02000999,
0.05817033, 0.10478792, 0.0974884 , -0.01756372, -0.2466861 ,
0.02877498, 0.02499748, -0.00370895, -0.04728201, 0.00107118,
-0.21848503, 0.2033032 , -0.00076264, 0.03828803, -0.2929495 ,
-0.18218371, 0.00628893, 0.20586628, 0.2410889 , 0.02364616,
-0.05220835, -0.07040054, -0.03744286, -0.06718048, 0.19264086,
-0.06490505, 0.27364203, 0.05527219, -0.27494466, 0.22256687,
0.10330909, -0.3076979 , 0.04852265, 0.07411488, 0.23980476,
0.1590279 , -0.26712465, 0.07580928, 0.05644221, -0.18824042],
Now, my confusion is this: fastText also creates vectors for the character n-grams of a word. So for the word "internal" it creates vectors for all of its character n-grams, including the full word, and the final vector for the word is the sum of those n-gram vectors.
However, how is it still able to give me a vector for two words, or even a whole sentence? Aren't fastText vectors for a word and its n-grams? So what is this vector I am seeing for the phrase, when it is clearly two words?
From the paper Enriching Word Vectors with Subword Information:
So out-of-vocab words are represented as the sum of character ngram vectors. While the intent is to handle out-of-vocab words (unks) like "blargfizzle", it also handles phrases like your input.
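To make that concrete, here is a minimal pure-Python sketch of the idea, not gensim's or fastText's actual code. The names `char_ngrams`, `toy_bucket`, `oov_vector` and `bucket_vectors` are made up for illustration, the bucket vectors are random rather than trained, and the n-gram range of 3 to 6 characters is assumed from fastText's defaults. The point is only that the phrase is treated as one string, so its n-grams (including ones that span the space) are looked up and combined like those of any unknown word.

```python
import random

def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>', using '<' and '>' as boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def toy_bucket(ngram, num_buckets):
    """Map an n-gram to a bucket with an FNV-1a style hash (simplified stand-in)."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % num_buckets

def oov_vector(word, bucket_vectors):
    """Average the bucket vectors of all n-grams of an out-of-vocabulary string."""
    grams = char_ngrams(word)
    dim = len(bucket_vectors[0])
    total = [0.0] * dim
    for gram in grams:
        vec = bucket_vectors[toy_bucket(gram, len(bucket_vectors))]
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(grams) for t in total]

# Hypothetical "model": 1000 buckets of random 4-dimensional vectors.
rng = random.Random(0)
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]

grams = char_ngrams("internal executive")
print(grams[:5])                      # n-grams from the start of the string
print("al ex" in grams)               # an n-gram that spans the space
phrase_vec = oov_vector("internal executive", bucket_vectors)
print(phrase_vec)
```

Nothing in this lookup knows or cares that the string contains a space, which is why a phrase gets a vector even though it is "clearly two words".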
If you look at the implementation of the vectors in Gensim you can see this is indeed what it's doing (along with normalization and hashing etc) - I added some comments starting with XXX:
Note that this doesn't mean it can provide vectors for any arbitrary string - it still needs to have data for at least some of the ngrams in an unk, so a string like
xwkxwkzrw
or 天爾遠波
will probably fail to return anything if your vectors are trained on English.
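A toy illustration of that failure mode, assuming a lookup that only knows the n-grams it saw during training. The table `known_ngrams` and the function `lookup` are hypothetical, and the exact behaviour (an error versus some fallback vector) may differ between gensim versions, so treat this as a sketch of the idea rather than of the library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>', using '<' and '>' as boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

# Hypothetical table of n-gram vectors seen during training (2-d for brevity).
known_ngrams = {"<in": [0.1, 0.2], "int": [0.3, -0.1], "al>": [0.0, 0.5]}

def lookup(word, table):
    """Average the vectors of the known n-grams; fail if none are known."""
    found = [table[g] for g in char_ngrams(word) if g in table]
    if not found:
        raise KeyError("no known n-grams for %r" % word)
    dim = len(found[0])
    return [sum(vec[d] for vec in found) / len(found) for d in range(dim)]

# "internal" shares the n-grams "<in", "int" and "al>" with the table.
print(lookup("internal", known_ngrams))

# A string with no overlapping n-grams has nothing to average.
try:
    lookup("xwkxwkzrw", known_ngrams)
except KeyError as err:
    print("lookup failed:", err)
```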