Due to conversions between pandoc-citeproc and LaTeX, I'd like to replace this
[@Fotheringham1981]
with this
\cite{Fotheringham1981}
The issue with treating each bracket separately is illustrated in the reproducible example below.
x <- c("[@Fotheringham1981]", "df[1,2]")
x1 <- gsub("\\[@", "\\\\cite{", x)
x2 <- gsub("\\]", "\\}", x1)
x2[1] # good
## [1] "\\cite{Fotheringham1981}"
x2[2] # bad
## [1] "df[1,2}"
I have seen a similar issue solved for C#, but not with R's Perl-style regex. Any ideas?
Edit:
It should be able to handle long documents, e.g.
old_rmd <- "$p = \alpha e^{\beta d}$ [@Wilson1971] and $p = \alpha d^{\beta}$
[@Fotheringham1981]."
new_rmd1 <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", old_rmd, perl = T)
new_rmd2 <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", old_rmd)
new_rmd1
## "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n \\cite{Fotheringham1981}."
new_rmd2
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."
You can use
gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)
Regex breakdown:
\\[@ - a literal [@ symbol sequence
([^]]*) - a capture group 1 that matches 0 or more occurrences of any symbol but a ] (note that if ] appears at the beginning of a character class, it does not need escaping)
] - a literal ] symbol
You do not need to use perl=T with this one because the ] inside the character class is not escaped. If it were escaped, as in [^\\]], that option would be required.
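As a minimal sketch of the breakdown above, the pattern can be wrapped in a small helper. The function name cite_to_latex is hypothetical (it is not from any package), and the sketch assumes single-key citations of the form [@key].

```r
# Hypothetical helper (the name is an assumption, not part of any package):
# converts pandoc-style [@key] citations to LaTeX \cite{key}.
cite_to_latex <- function(text) {
  # \\[@     literal "[@"
  # ([^]]*)  group 1: any run of characters other than "]"
  # ]        literal "]"
  gsub("\\[@([^]]*)]", "\\\\cite{\\1}", text)
}

x <- c("[@Fotheringham1981]", "df[1,2]")
cite_to_latex(x)
# Plain indexing such as df[1,2] is left alone because it lacks the "@".

# The escaped form of the class needs the PCRE engine; with perl = TRUE
# it gives the same result as the unescaped form under the default engine.
identical(cite_to_latex(x),
          gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", x, perl = TRUE))
```

Because gsub is vectorised, the same helper works on a character vector of lines or on one long string read from a file.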
Also, I believe we should only escape what must be escaped. If there is a way to avoid backslash hell, we should. Thus, you can even use
gsub("[[]@([^]]*)]", "\\\\cite{\\1}", x)
Why TRE-based regex works better than the PCRE one:
In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine. The library's author states that matching time grows linearly with the length of the input text, while memory requirements are almost constant (tens of kilobytes). TRE is also said to have predictable and modest memory consumption and a quadratic worst-case time in the length of the regular expression used. That is why it seems best to rely on TRE rather than PCRE when dealing with larger documents.
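One way to check this claim on your own data is to time both engines on a long document. The sketch below is only illustrative: the document is synthetic (a made-up line repeated many times), the repetition count is arbitrary, and timings vary by machine and R version, so no particular result is implied.

```r
# Build a long synthetic document (an assumption for illustration only).
chunk <- "Some text [@Wilson1971] and more text df[1,2] [@Fotheringham1981].\n"
big_doc <- paste(rep(chunk, 2000), collapse = "")

# Default (TRE) engine with the unescaped "]" in the character class.
t_tre <- system.time(
  out_tre <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", big_doc)
)

# PCRE engine with the escaped form of the same pattern.
t_pcre <- system.time(
  out_pcre <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", big_doc, perl = TRUE)
)

# Both engines should produce the same text; only the timings may differ.
identical(out_tre, out_pcre)
print(t_tre)
print(t_pcre)
```

If the two timings differ noticeably on your real documents, that is a better guide than the general argument above.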
You need to use a capturing group.
x <- c("[@Fotheringham1981]", "df[1,2]")
gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", x, perl=T)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
or
gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", x)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
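The lazy quantifier matters once a line holds more than one citation. The short comparison below uses a made-up sentence (the variable names y, greedy and lazy are only for illustration) to show how a greedy .* behaves next to the lazy .*? used above.

```r
y <- "See [@Wilson1971] and [@Fotheringham1981]."

# Greedy: .* runs to the last "]" on the line, swallowing both citations.
greedy <- gsub("\\[@(.*)\\]", "\\\\cite{\\1}", y)
greedy
# [1] "See \\cite{Wilson1971] and [@Fotheringham1981}."

# Lazy: .*? stops at the first "]", so each citation is converted separately.
lazy <- gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", y)
lazy
# [1] "See \\cite{Wilson1971} and \\cite{Fotheringham1981}."
```

The negated class [^]]* reaches the same result as .*? without relying on lazy matching, since it can never run past a ].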
This matches [@, then sets up a capture group (everything within (...)), in which .*? matches the shortest string up to the next ]:
gsub("\\[@(.*?)\\]", "\\\\cite{\\1}", x)
## [1] "\\cite{Fotheringham1981}" "df[1,2]"
The regular expression itself, without R's string escaping, is:
\[@(.*?)\]