How to replace square brackets with curly brackets

Posted 2020-03-26 08:27

Question:

Because I am converting between pandoc-citeproc and LaTeX, I'd like to replace this

[@Fotheringham1981]

with this

\cite{Fotheringham1981} .

The issue with treating each bracket separately is illustrated in the reproducible example below.

x <- c("[@Fotheringham1981]", "df[1,2]")
x1 <- gsub("\\[@", "\\\\cite{", x)
x2 <- gsub("\\]", "\\}", x1)

x2[1] # good
## [1] "\\cite{Fotheringham1981}"

x2[2] # bad
## [1] "df[1,2}"

I have seen a similar issue solved for C#, but not with R's Perl-like regex. Any ideas?

Edit:

It should be able to handle long documents, e.g.

old_rmd <- "$p = \alpha e^{\beta d}$ [@Wilson1971] and $p = \alpha d^{\beta}$
[@Fotheringham1981]."
new_rmd1 <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", old_rmd, perl = TRUE)
new_rmd2 <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", old_rmd)

new_rmd1
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."

new_rmd2
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."

Answer 1:

You can use

gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)

Regex breakdown:

  • \\[@ - a literal [@ symbol sequence
  • ([^]]*) - capture group 1, matching zero or more occurrences of any symbol other than ] (note that a ] placed at the very start of a character class, directly after [ or [^, does not need escaping)
  • ] - a literal ] symbol

You do not need perl = TRUE with this pattern because the ] inside the character class is not escaped. With an escaped ] in the class, as in [^\\]], that option would be required.
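As a quick check of the breakdown above, here is a minimal sketch that applies the pattern to the sample vector from the question. The variable names `pattern` and `result` are only illustrative.

```r
# Sample input from the question: one citation and one unrelated index expression
x <- c("[@Fotheringham1981]", "df[1,2]")

# \\[@      literal "[@"
# ([^]]*)   capture group 1: any run of characters other than "]"
# ]         literal "]"
pattern <- "\\[@([^]]*)]"

# Default (TRE) engine, so no perl = TRUE is passed
result <- gsub(pattern, "\\\\cite{\\1}", x)

# cat() shows the actual characters rather than R's escaped representation
cat(result, sep = "\n")
## \cite{Fotheringham1981}
## df[1,2]
```

Only the element that starts with [@ is rewritten; df[1,2] is left alone because its [ is not followed by @.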

Also, I believe we should only escape what must be escaped. If there is a way to avoid backslash hell, we should. Thus, you can even use

gsub("[[]@([^]]*)]", "\\\\cite{\\1}", x)

Why TRE-based regex works better than the PCRE one:

In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine. According to the library's author, matching time grows linearly with the length of the input text, while memory requirements stay almost constant (tens of kilobytes). TRE is also said to have predictable and modest memory consumption and a worst-case time that is quadratic in the length of the regular expression used. That is why it seems best to rely on TRE rather than on PCRE when dealing with larger documents.
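If you want to check this on your own input rather than take it on faith, a rough timing sketch along these lines may help. The synthetic text and the repetition count are arbitrary assumptions, and the timings will vary with the machine and R version, so no particular outcome is implied.

```r
# Build a long synthetic document by repeating one sentence
# (text and repetition count are arbitrary, chosen only for illustration)
sentence <- "Distance decay is discussed in [@Fotheringham1981] and in df[1,2]. "
long_doc <- paste(rep(sentence, 5000), collapse = "")

pattern <- "\\[@([^]]*)]"
replacement <- "\\\\cite{\\1}"

# Time the same substitution with the default TRE engine and with PCRE
tre_time  <- system.time(out_tre  <- gsub(pattern, replacement, long_doc))
pcre_time <- system.time(out_pcre <- gsub(pattern, replacement, long_doc, perl = TRUE))

# Both engines should produce the same text; only the elapsed time may differ
stopifnot(identical(out_tre, out_pcre))
print(c(TRE = tre_time[["elapsed"]], PCRE = pcre_time[["elapsed"]]))
```

Running it against a real document (for example one read with readLines()) gives a more meaningful comparison than the repeated sentence used here.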



Answer 2:

You need to use a capturing group.

x <- c("[@Fotheringham1981]", "df[1,2]")
gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", x, perl = TRUE)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"

or

gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", x)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
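Why the group has to be lazy (.*?) or a negated class ([^]]*) rather than a plain greedy .* only shows up when a line holds more than one citation. The sketch below uses a made-up sentence `y` to illustrate the difference.

```r
# Two citations on one line (hypothetical example text)
y <- "See [@Wilson1971] and [@Fotheringham1981]."

# Greedy .* runs to the LAST "]" and swallows everything in between
greedy <- gsub("\\[@(.*)\\]", "\\\\cite{\\1}", y)

# Lazy .*? stops at the first "]" after "[@"
lazy <- gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", y)

# The negated class [^]]* cannot cross a "]" at all
negated <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", y)

cat(greedy, lazy, negated, sep = "\n")
## See \cite{Wilson1971] and [@Fotheringham1981}.
## See \cite{Wilson1971} and \cite{Fotheringham1981}.
## See \cite{Wilson1971} and \cite{Fotheringham1981}.
```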


Answer 3:

This matches [@, then sets up a capture group (everything within (...)) in which .*? matches the shortest string up to the next ]. The @ has to sit outside the group, otherwise it is copied into the \cite{} argument:

gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", x)
## [1] "\\cite{Fotheringham1981}" "df[1,2]"

Without R's string escaping, the regular expression is:

\[@(.*?)\]