Officedown: Bookmark \@ref fails to work for multibyte strings

Created on 27 Aug 2020 · 7 Comments · Source: davidgohel/officedown

Suppose I have a .Rmd file like below:

```````
---
title: "Untitled"
output:
  officedown::rdocx_document:
    default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Chapter1 {#ch1}

# Chapter2 {#ch2}

Refer to \@ref(ch1).
```````

When \@ref(ch1) is adjacent to multibyte characters (e.g., Chinese characters), the reference may be rendered incorrectly or the document may fail to compile.

- Pure multibyte + ref
  - Example: 上下\@ref(ch1)
  - Result: correct
- Mixed multibyte/single-byte + ref
  - Example: 上a下\@ref(ch1)
  - Result: incorrect (上a下@ref(ch1))
- ref + multibyte
  - Example: \@ref(ch1)。
  - Result: compile failed
Error in nchar(u, itype) : invalid multibyte string, element 1

Calls: ... regmatches<- -> regmatches -> Map -> mapply ->

Can you please look into this issue? Thanks.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 20180)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] officer_0.3.12 officedown_0.2.0 flextable_0.5.10
[4] ggplot2_3.3.2 tidyr_1.1.1 knitr_1.29
[7] dplyr_1.0.2 reticulate_1.16

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1
[4] sysfonts_0.8.1 ps_1.3.4 utf8_1.1.4
[7] rprojroot_1.3-2 assertthat_0.2.1 digest_0.6.25
[10] R6_2.4.1 backports_1.1.9 evaluate_0.14
[13] pillar_1.4.6 gdtools_0.2.2 rlang_0.4.7
[16] curl_4.3 uuid_0.1-4 data.table_1.13.0
[19] callr_3.4.3 Matrix_1.2-18 rmarkdown_2.3
[22] desc_1.2.0 labeling_0.3 devtools_2.3.1
[25] stringr_1.4.0 munsell_0.5.0 tinytex_0.25
[28] compiler_4.0.2 xfun_0.16 pkgconfig_2.0.3
[31] systemfonts_0.2.3 base64enc_0.1-3 pkgbuild_1.1.0
[34] rvg_0.2.5 htmltools_0.5.0 tidyselect_1.1.0
[37] tibble_3.0.3 bookdown_0.20 fansi_0.4.1
[40] crayon_1.3.4 showtextdb_3.0 withr_2.2.0
[43] grid_4.0.2 jsonlite_1.7.0 gtable_0.3.0
[46] lifecycle_0.2.0 magrittr_1.5 scales_1.1.1
[49] zip_2.1.0 cli_2.0.2 stringi_1.4.6
[52] farver_2.0.3 fs_1.5.0 remotes_2.2.0
[55] testthat_2.3.2 xml2_1.3.2 ellipsis_0.3.1
[58] generics_0.0.2 vctrs_0.3.2 tools_4.0.2
[61] showtext_0.9 glue_1.4.1 purrr_0.3.4
[64] processx_3.4.3 pkgload_1.1.0 yaml_2.2.1
[67] colorspace_1.4-1 sessioninfo_1.1.1 memoise_1.1.0
[70] usethis_1.6.1

bug

madlogos

All 7 comments

```````
---
title: "Untitled"
output:
  officedown::rdocx_document:
    default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Chapter1 {#ch1}

# Chapter2 {#ch2}

Refer to \@ref(ch1).

When \@ref(ch1) is surrounded by multibyte strings (e.g., Chinese characters), it would possibly encounter errors.

Pure multibyte + ref: 上下\@ref(ch1)
Example: 上a下\@ref(ch1)
ref + multibyte: \@ref(ch1)。
```````

Your issue is related to the fact that you are not working with a UTF-8 encoded file.

R, R Markdown and Windows do not work well together when the encoding is not UTF-8.
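
To illustrate the kind of mismatch described above, here is a minimal sketch in base R. It is not a trace of officedown's internals; it only shows, under the assumption that GBK-encoded bytes end up being treated as UTF-8, how `nchar()` can raise an "invalid multibyte string" error like the one in the report. The variable names are illustrative.

```r
# Bytes 0xA1 0xA3 encode the full stop "。" in GBK; they are not valid UTF-8.
gbk_bytes <- as.raw(c(0xa1, 0xa3))
x <- rawToChar(gbk_bytes)

# Simulate the mismatch: declare the string to be UTF-8 although its bytes are GBK.
Encoding(x) <- "UTF-8"

# validUTF8() reports that the bytes do not form valid UTF-8.
is_valid <- validUTF8(x)

# With allowNA = TRUE, nchar() returns NA instead of stopping.
n_chars <- nchar(x, allowNA = TRUE)

# With the default allowNA = FALSE, nchar() signals an error, which we capture here.
res <- tryCatch(nchar(x), error = function(e) e)
```

Inspecting `conditionMessage(res)` should show the error text for comparison with the one reported in the issue.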

[Screenshot, 2020-08-27 10:53:27]

[Attachment: Untitled.docx]

davidgohel on 27 Aug 2020

Yes, @davidgohel, you are right. Although the .Rmd file is in UTF-8, the OS is running with GBK encoding. When I change to bookdown::word_document2, knitr manages to compile the file, but I still get ?? where the bookmark is supposed to appear.

madlogos on 28 Aug 2020

You don't need to try new output format functions.

The result I posted was produced on Windows with a French locale, but I made sure the file was encoded as UTF-8. (I check with readr::guess_encoding(); if the file is not UTF-8 encoded, I convert it to UTF-8 with fpeek::peek_iconv().)

Could you show the result of

readr::guess_encoding("your/rmd/file")
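
If installing readr or fpeek is not convenient, a rough equivalent can be sketched with base R only. The helpers below are hypothetical (`is_utf8_file()` and `reencode_to_utf8()` are names made up for this example, not part of any package), and the source encoding must be supplied by the caller rather than guessed.

```r
# Hypothetical helper: TRUE if the file's bytes form valid UTF-8.
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Hypothetical helper: re-encode a text file from `from` to UTF-8.
# By default the file is overwritten in place; pass `out` to write elsewhere.
reencode_to_utf8 <- function(path, from, out = path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- iconv(rawToChar(bytes), from = from, to = "UTF-8")
  if (is.na(text)) {
    stop("could not convert '", path, "' from ", from, " to UTF-8")
  }
  # Write the UTF-8 bytes back unchanged, independent of the session locale.
  writeBin(charToRaw(text), out)
  invisible(out)
}

# Small demonstration on a temporary file containing "café" in Latin-1.
demo_file <- tempfile(fileext = ".Rmd")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), demo_file)
before <- is_utf8_file(demo_file)
reencode_to_utf8(demo_file, from = "latin1")
after <- is_utf8_file(demo_file)
```

For a file saved in the Windows Chinese code page, the call would presumably be `reencode_to_utf8("your/rmd/file", from = "GBK")`, assuming the local iconv build supports that encoding name.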

davidgohel on 28 Aug 2020

The results are

no | encoding | confidence
---|-------------|-----------:
1 | UTF-8 | 1
2 | windows-1252 | 0.28

madlogos on 29 Aug 2020

Hi @madlogos,

I am also a Chinese user, and the multibyte problem has bothered me for a long time too. Here is my workaround:

1. Write \@ref as usual;
2. Save the Rmd file and read it with readr::read_lines();
3. Find the lines that match the pattern "\\\\@ref\\([^\\)]+\\)";
4. Split those lines so that each "\\\\@ref\\([^\\)]+\\)" match sits on a line of its own;
5. Save the character vector to a new Rmd file and render it with the format you like. Done!

For example, 请参考表\@ref(tab: coco)中的数据 should be split into
[line 1] 请参考表
[line 2] \@ref(tab: coco)
[line 3] 中的数据

Well, I am not sure if this is an effective solution but it works for me. 😄
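
The steps above could be sketched roughly as follows. This is only an illustration of the described workaround, not the commenter's actual script: `split_refs()` is a hypothetical helper, and base R's `gregexpr()`/`regmatches()` stand in for whatever string functions were really used. It assumes the lines are already valid UTF-8.

```r
# Hypothetical helper: put every \@ref(...) on a line of its own.
split_refs <- function(lines) {
  pattern <- "\\\\@ref\\([^\\)]+\\)"
  pieces <- lapply(lines, function(line) {
    m <- gregexpr(pattern, line, perl = TRUE)
    if (m[[1]][1] == -1) {
      return(line)  # no reference on this line: keep it untouched
    }
    refs <- regmatches(line, m)[[1]]                 # the \@ref(...) matches
    rest <- regmatches(line, m, invert = TRUE)[[1]]  # the text around them
    # Interleave: text, ref, text, ref, ..., text
    parts <- character(0)
    for (i in seq_along(rest)) {
      parts <- c(parts, rest[i])
      if (i <= length(refs)) parts <- c(parts, refs[i])
    }
    parts[nzchar(parts)]  # drop empty fragments
  })
  unlist(pieces, use.names = FALSE)
}

# The example sentence from above, written with \u escapes so the string is
# UTF-8 regardless of how this script file itself is encoded.
input <- "\u8bf7\u53c2\u8003\u8868\\@ref(tab: coco)\u4e2d\u7684\u6570\u636e"
output <- split_refs(input)

# To apply it to a file (paths are placeholders):
# lines <- readLines("input.Rmd", encoding = "UTF-8")
# writeLines(split_refs(lines), "output.Rmd", useBytes = TRUE)
```

Lines without a reference, including blank lines, pass through unchanged, so the rest of the document keeps its structure.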