r - FasttextR encoding

asked Oct 7, 2021 in Technique by 深蓝 (71.8m points)

FasttextR is reading some Spanish words wrong in R (e.g., "participaci?3n" for "participación") from the pre-trained model "cc.es.300.bin" that I downloaded from the fastText website (https://fasttext.cc/docs/en/crawl-vectors.html).

I think the issue is that when I load the model I have no way to tell R that the encoding should be "UTF-8" rather than "Latin1" or something else. That is, I can load the Spanish model and get the words wrong, like this:

model <- ft_load("cc.es.300.bin")

but I cannot do this:

model <- ft_load("cc.es.300.bin", encoding="UTF-8")

as it is possible to do with xlsx files, for example:

model <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding="UTF-8")

I have tried changing the language and encoding in Windows, reopening and saving the .R file with UTF-8 encoding, and changing the locale to Spanish with Sys.setlocale("LC_ALL", "Spanish"). Nothing worked.

Any help will be much appreciated. Regards,

question from:https://stackoverflow.com/questions/65940490/fasttextr-encoding

1 Answer

深蓝 · Answer 1 · 2021-10-06T18:56:40+0000

The "readr" package helped me:

install.packages("readr")
library(readr)

guess_encoding(ft_words(model))

|                                                                                        |   0%
# A tibble: 2 x 2
  encoding  confidence
  <chr>          <dbl>
1 UTF-8           1   
2 Shift_JIS       0.31

parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
   [1] "de"              ","               "."               "la"              "y"              
   [6] "en"              "que"             "el"              "</s>"            "a"              
  [11] "los"             ":"               "\""              "del"             "un"             
  [16] ")"               "se"              "con"             "por"             "las"            
  [21] "("               "para"            "una"             "es"              "no"             
  [26] "su"              "al"              "como"            "lo"              "/"              
  [31] "más"             "El"              "o"               "'"               "La"             
  [36] "!"               "|"               "?"               "me"              "En"             
  [41] "..."             "-"               "sus"             "este"            "pero"           
  [46] "ha"              "esta"            ";"               "“"               "_"              
  [51] "”"               "si"              "sobre"           "?"               "fue"            
  [56] "son"             "le"              "muy"             "ser"             "ya"             
  [61] "tu"              "todo"            "1"               "entre"           "te"             
  [66] "mi"              "Los"             "%"               "sin"             "también"
...

instead of

   [1] "de"               ","                "."                "la"              
   [5] "y"                "en"               "que"              "el"              
   [9] "</s>"             "a"                "los"              ":"               
  [13] "\""               "del"              "un"               ")"               
  [17] "se"               "con"              "por"              "las"             
  [21] "("                "para"             "una"              "es"              
  [25] "no"               "su"               "al"               "como"            
  [29] "lo"               "/"                "m??s"             "El"              
  [33] "o"                "'"                "La"               "!"               
  [37] "|"                "?"                "me"               "En"              
  [41] "..."              "-"                "sus"              "este"            
  [45] "pero"             "ha"               "esta"             ";"               
  [49] "a€?"              "_"                "a€u009d"         "si"              
  [53] "sobre"            "??"               "fue"              "son"             
  [57] "le"               "muy"              "ser"              "ya"
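
The garbled listing is what one would expect if the model's UTF-8 bytes were being displayed as a single-byte encoding such as Latin-1. The following is a minimal base-R sketch of that mechanism, not something taken from fastTextR itself. The variable names are illustrative, and it assumes the strings carry valid UTF-8 bytes with a wrong (or missing) encoding mark.

```r
# The correctly encoded word, written with a Unicode escape so that the
# script itself does not depend on the file encoding.
word <- enc2utf8("m\u00e1s")            # "más", stored and marked as UTF-8

# Its UTF-8 byte sequence: 6d c3 a1 73 (the accented letter takes two bytes).
bytes <- charToRaw(word)

# Simulate the problem: the same bytes, but declared to be Latin-1,
# so every byte is treated as one character.
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"
nchar(misread)                          # 4 characters instead of 3

# Repair: keep the bytes and only correct the declared encoding.
repaired <- misread
Encoding(repaired) <- "UTF-8"
nchar(repaired)                         # 3
repaired == word                        # TRUE
```

This is only a change of label, not a conversion: the bytes are never rewritten, which is why it can undo the damage when the underlying data is already UTF-8.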

However, this does not help with the nearest-neighbor function, because ft_nearest_neighbors returns a named numeric vector rather than a character vector:

parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) : 
  is.character(x) is not TRUE

Calling the function directly works, but the names are garbled (notice pi?±a instead of piña):

ft_nearest_neighbors(model, "pera", k = 10L)
 limonera   ciruela   manzana mandarina     pi?±a     fruta   sand?-a   compota    sandia     fresa 
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940
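
Since the words are stored as the names of the result, one possible workaround is to re-declare the encoding of the names only. The sketch below uses base R and a simulated result so that it runs without the model. `fix_names_utf8` is a hypothetical helper, and it assumes the names really contain UTF-8 bytes, which the guess_encoding result above suggests but which has not been verified against fastTextR.

```r
# Hypothetical helper: declare the names of a vector to be UTF-8
# without touching the underlying bytes or the numeric values.
fix_names_utf8 <- function(v) {
  nm <- names(v)
  if (is.null(nm)) return(v)   # nothing to fix on an unnamed vector
  Encoding(nm) <- "UTF-8"
  names(v) <- nm
  v
}

# Simulated stand-in for a nearest-neighbor result: the first name
# holds UTF-8 bytes but carries no encoding mark.
unmarked <- rawToChar(charToRaw(enc2utf8("pi\u00f1a")))
nn <- c(0.5707002, 0.5576053)
names(nn) <- c(unmarked, "fruta")

nn_fixed <- fix_names_utf8(nn)
Encoding(names(nn_fixed))      # "UTF-8" for the accented name, "unknown" for pure ASCII
```

With the real model this would be applied as fix_names_utf8(ft_nearest_neighbors(model, "pera", k = 10L)). Whether the console then prints the names correctly still depends on the session's locale and font.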

What does help is passing the query word through enc2utf8 (the characters in the output still look garbled):

ft_nearest_neighbors(model, enc2utf8("piña"), k = 10L)
  sand?-a    papaya    sandia    anan??  pl??tano   anan??s     fruta    lim?3n mandarina maracuy?? 
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805

enc2utf8 also helps if you want to obtain individual word vectors:

piña <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
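
A plausible reason enc2utf8 helps: in a Windows session with a non-UTF-8 native encoding, a typed "piña" is stored with a single byte for ñ, whereas the model's vocabulary is stored in UTF-8, so the query does not match until it is converted. The base-R sketch below illustrates only the byte-level difference. It does not call fastTextR, and the claim about how the model compares words is an assumption.

```r
# Build a Latin-1 version of the query, as a Latin-1 session would store it.
query_latin1 <- iconv(enc2utf8("pi\u00f1a"), from = "UTF-8", to = "latin1")
charToRaw(query_latin1)    # 70 69 f1 61    (one byte for the n-tilde)

# enc2utf8() re-encodes the string, so the bytes themselves change.
query_utf8 <- enc2utf8(query_latin1)
charToRaw(query_utf8)      # 70 69 c3 b1 61 (two bytes for the n-tilde)

# R treats both as the same text, but the bytes handed to
# compiled code are different.
query_latin1 == query_utf8                                   # TRUE
identical(charToRaw(query_latin1), charToRaw(query_utf8))    # FALSE
```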
