The readr package helped me:
install.packages("readr")
library(readr)
guess_encoding(ft_words(model))
| | 0%
# A tibble: 2 x 2
encoding confidence
<chr> <dbl>
1 UTF-8 1
2 Shift_JIS 0.31
parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
[1] "de" "," "." "la" "y"
[6] "en" "que" "el" "</s>" "a"
[11] "los" ":" "\"" "del" "un"
[16] ")" "se" "con" "por" "las"
[21] "(" "para" "una" "es" "no"
[26] "su" "al" "como" "lo" "/"
[31] "más" "El" "o" "'" "La"
[36] "!" "|" "?" "me" "En"
[41] "..." "-" "sus" "este" "pero"
[46] "ha" "esta" ";" "“" "_"
[51] "”" "si" "sobre" "?" "fue"
[56] "son" "le" "muy" "ser" "ya"
[61] "tu" "todo" "1" "entre" "te"
[66] "mi" "Los" "%" "sin" "también"
...
instead of
[1] "de" "," "." "la"
[5] "y" "en" "que" "el"
[9] "</s>" "a" "los" ":"
[13] "\"" "del" "un" ")"
[17] "se" "con" "por" "las"
[21] "(" "para" "una" "es"
[25] "no" "su" "al" "como"
[29] "lo" "/" "m??s" "El"
[33] "o" "'" "La" "!"
[37] "|" "?" "me" "En"
[41] "..." "-" "sus" "este"
[45] "pero" "ha" "esta" ";"
[49] "a€?" "_" "a€u009d" "si"
[53] "sobre" "??" "fue" "son"
[57] "le" "muy" "ser" "ya"
However, it does not help with the functions that return nearest neighbors:
parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) :
is.character(x) is not TRUE
Calling the function directly works, but the output is garbled (notice pi?±a instead of piña):
ft_nearest_neighbors(model, "pera", k = 10L)
limonera ciruela manzana mandarina pi?±a fruta sand?-a compota sandia fresa
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940
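The parse_character() error above follows from the shape of the result: ft_nearest_neighbors() returns a named numeric vector, with the words stored as names, so there is no character vector to parse. A minimal base R sketch of one possible workaround is to mark the names as UTF-8 directly. This assumes the names already contain UTF-8 bytes that R simply has not been told about; the vector `nn` below is a stand-in built by hand rather than real model output, and `mark_names_utf8` is a hypothetical helper, not part of any package.

```r
# Stand-in for ft_nearest_neighbors() output: a named numeric vector whose
# names hold UTF-8 bytes with no declared encoding (assumption).
words <- c(
  rawToChar(as.raw(c(0x70, 0x69, 0xc3, 0xb1, 0x61))),             # "piña" in UTF-8
  rawToChar(as.raw(c(0x73, 0x61, 0x6e, 0x64, 0xc3, 0xad, 0x61)))  # "sandía" in UTF-8
)
nn <- c(0.5707002, 0.5557024)
names(nn) <- words

# Hypothetical helper: declare the names as UTF-8 without changing any bytes.
mark_names_utf8 <- function(x) {
  nms <- names(x)
  Encoding(nms) <- "UTF-8"
  names(x) <- nms
  x
}

nn_fixed <- mark_names_utf8(nn)
Encoding(names(nn_fixed))  # inspect the declared encoding of each name
```

With a real model the call would look like `mark_names_utf8(ft_nearest_neighbors(model, "pera", k = 10L))`. I have not verified this against the package, so treat it as something to try rather than a confirmed fix.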
What does help is wrapping the query word in enc2utf8, although the characters in the output still look garbled:
ft_nearest_neighbors(model, enc2utf8("piña"), k = 10L)
sand?-a papaya sandia anan?? pl??tano anan??s fruta lim?3n mandarina maracuy??
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805
enc2utf8 also helps when you want to obtain individual word vectors:
pina_vector <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
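To see what enc2utf8 does to the query word, here is a small base R sketch that needs no model. It assumes the word starts out in a single-byte encoding (latin1 here, as a stand-in for a typical Windows native encoding) and shows the bytes being converted to UTF-8. The variable names are only for illustration.

```r
# "piña" as latin1 bytes: ñ is the single byte 0xf1.
latin1_word <- rawToChar(as.raw(c(0x70, 0x69, 0xf1, 0x61)))
Encoding(latin1_word) <- "latin1"

# enc2utf8() re-encodes the string, taking the declared encoding into account.
utf8_word <- enc2utf8(latin1_word)

Encoding(utf8_word)   # declared encoding after conversion
charToRaw(utf8_word)  # ñ is now the two-byte UTF-8 sequence c3 b1
```

This is presumably why the lookup starts matching: the model's vocabulary is stored as UTF-8, so the query has to be passed as UTF-8 bytes as well.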