Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - FasttextR encoding

FasttextR is reading some Spanish words incorrectly in R (e.g., "participaci?3n" instead of "participación") from the pre-trained model "cc.es.300.bin", which I downloaded from the fastText website (https://fasttext.cc/docs/en/crawl-vectors.html).

I think the issue is that when I load the model I have no way to tell R that the encoding should be "UTF-8" rather than "Latin1" or something else. That is, I can load the Spanish model, but the words come out wrong, like this:

model <- ft_load("cc.es.300.bin")  

but I cannot do this:

model <- ft_load("cc.es.300.bin", encoding="UTF-8") 

as it is possible to do with xlsx files, for example:

model <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding="UTF-8")

I have tried changing the language and encoding in Windows, reopening and saving the .R file with UTF-8 encoding, and changing the locale to Spanish with Sys.setlocale("LC_ALL", "Spanish"). Nothing worked.
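
The garbling is what one would expect if the model's UTF-8 bytes are being displayed as Latin-1, although that is an assumption about what happens on Windows rather than something confirmed for fastTextR. A minimal base-R sketch that reproduces the effect without the model (the word and all variable names are purely illustrative):

```r
# Illustration only: reproduce the garbling in base R, without fastTextR.
word <- "participaci\u00f3n"          # "participación", marked as UTF-8

# The raw UTF-8 bytes, turned back into a string with no encoding mark.
bytes    <- charToRaw(word)
unmarked <- rawToChar(bytes)

# Wrongly declare those bytes to be Latin-1: each byte becomes one character.
misread <- unmarked
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)          # two characters where "ó" should be

# Correctly declare the same bytes to be UTF-8: the word is restored.
repaired <- unmarked
Encoding(repaired) <- "UTF-8"

print(garbled)
print(repaired)
```

If this is indeed the cause, changing the locale or the encoding of the .R file would not be expected to help, because the bytes coming out of the model are already correct and only their encoding mark is missing.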

Any help will be much appreciated. Regards,

question from:https://stackoverflow.com/questions/65940490/fasttextr-encoding


1 Answer


The "readr" package helped me:

install.packages("readr")
library(readr)
guess_encoding(ft_words(model))

|                                                                                        |   0%
# A tibble: 2 x 2
  encoding  confidence
  <chr>          <dbl>
1 UTF-8           1   
2 Shift_JIS       0.31
parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
   [1] "de"              ","               "."               "la"              "y"              
   [6] "en"              "que"             "el"              "</s>"            "a"              
  [11] "los"             ":"               "\""              "del"             "un"             
  [16] ")"               "se"              "con"             "por"             "las"            
  [21] "("               "para"            "una"             "es"              "no"             
  [26] "su"              "al"              "como"            "lo"              "/"              
  [31] "más"             "El"              "o"               "'"               "La"             
  [36] "!"               "|"               "?"               "me"              "En"             
  [41] "..."             "-"               "sus"             "este"            "pero"           
  [46] "ha"              "esta"            ";"               "“"               "_"              
  [51] "”"               "si"              "sobre"           "?"               "fue"            
  [56] "son"             "le"              "muy"             "ser"             "ya"             
  [61] "tu"              "todo"            "1"               "entre"           "te"             
  [66] "mi"              "Los"             "%"               "sin"             "también"
...

instead of

 [1] "de"               ","                "."                "la"              
   [5] "y"                "en"               "que"              "el"              
   [9] "</s>"             "a"                "los"              ":"               
  [13] "\""               "del"              "un"               ")"               
  [17] "se"               "con"              "por"              "las"             
  [21] "("                "para"             "una"              "es"              
  [25] "no"               "su"               "al"               "como"            
  [29] "lo"               "/"                "m??s"             "El"              
  [33] "o"                "'"                "La"               "!"               
  [37] "|"                "?"                "me"               "En"              
  [41] "..."              "-"                "sus"              "este"            
  [45] "pero"             "ha"               "esta"             ";"               
  [49] "a€?"              "_"                "a€u009d"         "si"              
  [53] "sobre"            "??"               "fue"              "son"             
  [57] "le"               "muy"              "ser"              "ya"   

However, it does not help with the functions that return nearest neighbors:

parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) : 
  is.character(x) is not TRUE

The plain call works, but the words come back garbled (notice pi?±a instead of piña):

ft_nearest_neighbors(model, "pera", k = 10L)
 limonera   ciruela   manzana mandarina     pi?±a     fruta   sand?-a   compota    sandia     fresa 
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940 
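
The parse_character error is consistent with the printed result above: ft_nearest_neighbors appears to return a numeric vector with the words stored as its names, so there is no character vector to parse. A minimal sketch of marking those names as UTF-8 instead, assuming the names hold UTF-8 bytes that R has not marked as such. The vector below is built by hand to stand in for the real result, and mark_names_utf8 is a hypothetical helper, not part of fastTextR:

```r
# Stand-in for the result of ft_nearest_neighbors(): a numeric vector whose
# names are words stored as UTF-8 bytes without an encoding mark.
# (Hand-built here so the example runs without the model.)
utf8_bytes <- list(
  as.raw(c(0x70, 0x69, 0xc3, 0xb1, 0x61)),             # "piña"
  as.raw(c(0x73, 0x61, 0x6e, 0x64, 0xc3, 0xad, 0x61))  # "sandía"
)
nn <- c(0.5707002, 0.5557024)
names(nn) <- vapply(utf8_bytes, rawToChar, character(1))

# Hypothetical helper: mark the names as UTF-8 without changing any bytes.
mark_names_utf8 <- function(x) {
  nms <- names(x)
  Encoding(nms) <- "UTF-8"
  names(x) <- nms
  x
}

nn_fixed <- mark_names_utf8(nn)
print(names(nn_fixed))
```

With a loaded model, the same helper could be tried as mark_names_utf8(ft_nearest_neighbors(model, "pera", k = 10L)). Whether it fixes the display depends on the names really being unmarked UTF-8.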

What does help is enc2utf8, although the characters in the output still look wrong:

ft_nearest_neighbors(model, enc2utf8("piña"), k = 10L)
  sand?-a    papaya    sandia    anan??  pl??tano   anan??s     fruta    lim?3n mandarina maracuy?? 
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805

enc2utf8 also helps if you want to obtain individual word vectors:

piña <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
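
A likely reason enc2utf8 helps, offered as an assumption rather than documented fastTextR behaviour: in a Latin-1 session the query is passed as Latin-1 bytes, which cannot match the UTF-8 bytes of the words stored in the model. A minimal base-R sketch showing how the bytes of the query change (the variable names are illustrative):

```r
# Illustration only: what enc2utf8() does to a Latin-1 encoded query.
query <- "pi\u00f1a"                                     # "piña"

# Simulate how the string may be held in a Latin-1 Windows session.
latin1_query <- iconv(query, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1_query)                  # "ñ" is the single byte f1

# enc2utf8() re-encodes it, so "ñ" becomes the two bytes c3 b1.
utf8_query <- enc2utf8(latin1_query)
utf8_bytes <- charToRaw(utf8_query)

print(latin1_bytes)
print(utf8_bytes)
```

The two strings still compare as equal in R, but they differ at the byte level, and a lookup that works on raw bytes would see them as different words.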

