# Character Encoding on Windows

## Investigation

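The statement in question was:

    tbl[SECTOR %in% input$sector & DATE >= as.Date("2020-10-01") & DATE <= as.Date("2020-11-01")]

I quit Shiny and ran it directly in the Console, replacing `input$sector` with the literal string "科创板" (the STAR Market). Sure enough, it finished instantly (0.01s). But when I put a `browser()` call in the Shiny app, paused there, and ran the very same statement, it took about 15 seconds. This drove me up the wall. Was there something about Shiny's `input` that I didn't understand? Was `input$sector` somehow clashing with data.table? What frustrated me most was that as soon as I rewrote it as the two statements below, the speed went straight back to what I expected:

    tbl2 <- tbl[SECTOR %in% input$sector]
    tbl2[DATE >= as.Date("2020-10-01") & DATE <= as.Date("2020-11-01")]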


## Root cause

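1. In Shiny, `input$sector` is UTF-8 encoded, whereas the `SECTOR` column in `tbl` is in the native encoding. In R, when the two sides of `%in%` are in different encodings, and especially when one side is a very long character vector, execution becomes extremely slow. The reason is probably that R re-encodes both vectors. I think this is a problem in R, because it is slow no matter whether the long side is in the native encoding or in UTF-8. In principle, R should only need to re-encode the short side; there is no need to re-encode both. So that evening I filed a report on R's Bugzilla.

2. Why did splitting it into two statements make it fast? Because data.table optimizes a standalone `%in%` query and does not go through base R's `%in%` code. As I understand it, though, data.table ought to optimize a non-standalone `%in%` query as well; for some reason the optimization was not triggered. So I filed a report with data.table too.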
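To make the encoding mismatch in point 1 concrete, here is a minimal sketch in base R of how one might inspect the encoding marks on both sides and normalize them before calling `%in%`. The names `sector_col` and `selected` are hypothetical stand-ins for `tbl$SECTOR` and `input$sector`, and I have not measured whether this removes the slowdown in every R version; it only shows the idea of making both sides carry the same encoding.

```r
# Hypothetical stand-ins for tbl$SECTOR and input$sector from the post.
# The \u escapes spell "科创板" and "主板", so the script itself stays ASCII.
sector_col <- rep(c("\u79d1\u521b\u677f", "\u4e3b\u677f"), times = 5)
selected   <- enc2utf8("\u79d1\u521b\u677f")

# Encoding() reports the declared mark of each element:
# "UTF-8", "latin1", "bytes", or "unknown" (native encoding or plain ASCII).
print(table(Encoding(sector_col)))
print(Encoding(selected))

# Convert both sides to UTF-8 once, up front, so %in% compares like with like.
sector_utf8   <- enc2utf8(sector_col)
selected_utf8 <- enc2utf8(selected)

hits <- sector_utf8 %in% selected_utf8
print(sum(hits))
```

In the Shiny app, the equivalent would be to convert either the column once when the table is loaded (`enc2utf8`) or the input value each time it is read (`enc2native`), so that the two sides never differ.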

## Aside: why is an R build that uses UTF-8 as the default encoding on Windows only being prepared now?

Windows 10 (November 2019 release and newer) allows applications to use UTF-8 as their native encoding when interfacing both with the C library (needs to be UCRT) and with the operating system. This new Windows feature, present in Unix systems for many years, finally allows R on Windows to work reliably with all Unicode characters.

Applications that already worked reliably with all Unicode characters on Windows before used proprietary Windows API and wide-character strings, which required implementing and maintaining a lot of Windows-specific code. R did not go that route completely, except for RGui and particularly Windows-specific code interfacing with the file system / operating system (in some cases on Windows this is also needed for other reasons than character encoding). Today, Windows can’t even encode all Unicode characters using one wide character (wide characters are 16-bit, UTF16-LE is used, and hence two wide characters are needed to represent some Unicode characters), so the old Windows way to support Unicode in addition does not seem to have any technical advantage. The new way, via UTF-8, will instead allow to eventually phase out some Windows-specific code from R.
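As a small illustration of what "UTF-8 as the native encoding" means for a running session, the sketch below queries the current locale with `l10n_info()`. Which values you see depends on your R build and operating system, so no output is shown here.

```r
# Query the localization settings of the current R session.
info <- l10n_info()

# TRUE when the session's native encoding is UTF-8.
print(info[["UTF-8"]])

# TRUE when the native encoding is a multi-byte character set.
print(info[["MBCS"]])

# On Windows the list also reports the active code page;
# 65001 is the code page Windows uses for UTF-8.
if (.Platform$OS.type == "windows") {
  print(info[["codepage"]])
}
```

When the first value is `TRUE`, strings in the native encoding and strings marked as UTF-8 are the same bytes, so the kind of mismatch described above should not arise.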