I have a table in SQLite and I’d like to open it with dplyr. I use SQLite Expert Version 35.58.2478, R Studio Version 0.98.1062 on a PC with Win 7.
After connecting to the database with src_sqlite() and reading with tbl() I get the table. But the character enconding is wrong. Reading the same table from a csv-file just works by adding encoding = “utf-8” to the function read.csv but in this case another error in the first column name occurs (please consider the minimal example below).
Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.
I tried to change the encoding in R Studio options with no success. Also changing the region in windows or in r doesn’t have any effect.
Is there any solution of getting the characters in the table correctly into r using dplyr?
Minimal Example
library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
##
## KH_ID Einschätzung Prozess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
You see the wrong encoding in the name of the second column. This issue occures as well in the colums with ä, ö, ü etc.
The name of the second column is displayed correctly, but the first column is wrong:
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
## X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
I had the same problem. I solved it like below. However, I do not guarantee that the solution is rock solid. Give it a try:
library(dplyr)
library(sqldf)
# Modifying built-in mtcars dataset
mtcars$test <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
mtcars$češćžä <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
names(mtcars) <-
iconv(names(mtcars), "cp1250", "utf-8")
# Connecting to sqlite database
my_db <- src_sqlite("my_db.sqlite3", create = T)
# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)
# dbSendQuery(my_db$con, "drop table mtcars")
# getting data from sqlite database
my_mtcars_from_db <-
collect(tbl(my_db, "mtcars"))
# disconnecting from database
dbDisconnect(my_db$con)
convert_to_encoding() function
# a function that encodes
# column names and values in character columns
# with specified encodings
convert_to_encoding <-
function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){
# names of columns are encoded in specified encoding
my_names <-
iconv(names(x), from_encoding, to_encoding)
# if any column name is NA, leave the names
# otherwise replace them with new names
if(any(is.na(my_names))){
names(x)
} else {
names(x) <- my_names
}
# get column classes
x_char_columns <- sapply(x, class)
# identify character columns
x_cols <- names(x_char_columns[x_char_columns == "character"])
# convert all string values in character columns to
# specified encoding
x <-
x %>%
mutate_each_(funs(iconv(., from_encoding, to_encoding)),
x_cols)
# return x
return(x)
}
# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Results
# before conversion
my_mtcars_from_db
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb češćžä test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ÄŤ ÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Ĺľ Ĺľ
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ÄŤ ÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Ĺľ Ĺľ
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb test češćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 č č
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ž ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 č č
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ž ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
Session information
devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, mingw32
ui RStudio (0.99.441)
language (EN)
collate Slovenian_Slovenia.1250
tz Europe/Prague
Packages -----------------------------------------------------------------------
package * version date source
assertthat * 0.1 2013-12-06 CRAN (R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN (R 3.2.0)
dplyr 0.4.1 2015-01-14 CRAN (R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN (R 3.2.0)
lazyeval * 0.1.10 2015-01-02 CRAN (R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
R6 * 2.0.1 2014-10-29 CRAN (R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN (R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)