Question:
I want to generate random strings of the form ABCDE1234E, i.e. each string contains 5 letters, 4 digits, then 1 letter.
I figured out a way to create these using the following code:
library(random)
string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=TRUE, check=TRUE))
number_4 <- as.vector(randomNumbers(n=5000, min=1111, max=9999, col=5, base=10, check=TRUE))
string_1 <- as.vector(randomStrings(n=5000, len=1, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=FALSE, check=TRUE))
PAN.Number <- paste(string_5,number_4,string_1,sep = "")
But these functions take a long time, and the random package needs a network connection.
> system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
+ loweralpha=FALSE, unique=TRUE, check=TRUE)))
user system elapsed
0.07 0.00 3.18
Is there any method that I could try to reduce the execution time?
I also tried using sample() but I couldn't figure it out.
Answer 1:
Using "stringi" as suggested by @akrun will be faster, but the following is also very fast and does not require any additional packages:
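myFun <- function(n = 5000) {
  # Five random capital letters per string, pasted together element-wise
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  # Four zero-padded digits (0000-9999), then one more capital letter
  paste0(a, sprintf("%04d", sample(0:9999, n, TRUE)), sample(LETTERS, n, TRUE))
}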
Example output:
myFun(10)
## [1] "BZHOF3737P" "EPOWI0674X" "YYWEB2825M" "HQIXJ5187K" "IYIMB2578R"
## [6] "YSGBG6609I" "OBLBL6409Q" "PUMAL5632D" "ABRAT4481L" "FNVEN7870Q"
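The question asked for unique values (unique=TRUE), which independent sampling does not guarantee. A minimal sketch of one way to collect n distinct strings in base R, by de-duplicating and topping up until enough remain; the names make_ids and unique_ids are hypothetical and not part of the answer above:

```r
# Hypothetical helper: n strings of the form AAAAA0000A (not guaranteed unique).
make_ids <- function(n) {
  letters5 <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), simplify = FALSE))
  paste0(letters5, sprintf("%04d", sample(0:9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

# Hypothetical helper: keep drawing until n distinct strings are collected.
unique_ids <- function(n) {
  out <- character(0)
  while (length(out) < n) {
    # Draw only as many as are still missing, then drop any duplicates.
    out <- unique(c(out, make_ids(n - length(out))))
  }
  out
}

ids <- unique_ids(5000)
```

With roughly 26^6 * 10^4 possible strings, collisions among 5000 draws are rare, so the loop will usually finish after a single pass.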
Answer 2:
We can use stri_rand_strings from stringi:
library(stringi)
sprintf("%s%s%s", stri_rand_strings(5, 5, '[A-Z]'),
stri_rand_strings(5, 4, '[0-9]'), stri_rand_strings(5, 1, '[A-Z]'))
Or more compactly
do.call(paste0, Map(stri_rand_strings, n=5, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
Benchmarks
system.time({
do.call(paste0, Map(stri_rand_strings, n=5000, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]')))
})
# user system elapsed
# 0 0 0
I was able to reproduce the OP's slow timings, even when generating only one part of the expected output with the OP's method:
system.time(string_5 <- as.vector(randomStrings(n=5000, len=5, digits=FALSE, upperalpha=TRUE,
loweralpha=FALSE, unique=TRUE, check=TRUE)))
# user system elapsed
# 0.86 0.24 5.52
Answer 3:
You can directly perform what you want:
- Sample 5 random capital letters
- Sample 4 random digits
- Sample 1 random capital letter
digits = 0:9
createRandString <- function() {
  v = c(sample(LETTERS, 5, replace = TRUE),
        sample(digits, 4, replace = TRUE),
        sample(LETTERS, 1, replace = TRUE))
  return(paste0(v, collapse = ""))
}
This will be more easily controlled, and won't take as long.
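The function above returns one string per call. A minimal sketch of how it might be called repeatedly to get many strings, with a regular-expression check of the format; the use of vapply and the variable names n and strings are illustrative choices, not part of the answer:

```r
digits <- 0:9

# Same approach as the function above: one string per call.
createRandString <- function() {
  v <- c(sample(LETTERS, 5, replace = TRUE),
         sample(digits, 4, replace = TRUE),
         sample(LETTERS, 1, replace = TRUE))
  paste0(v, collapse = "")
}

# Call it n times; vapply checks that each result is a single string.
n <- 5000
strings <- vapply(seq_len(n), function(i) createRandString(), character(1))

# Check that every result is 5 capital letters, 4 digits, then 1 capital letter.
all(grepl("^[A-Z]{5}[0-9]{4}[A-Z]$", strings))
```

Calling the function once per string is simple to follow, though a fully vectorized version that samples all n values at once will generally be faster for large n.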
Answer 4:
Your performance problem comes from using the random package in the first place: it's understandable that you could find the random::randomStrings() function in an internet search and think it's a good way to generate random strings for use in a program, but the random package is not intended for general-purpose programming. It works by querying the RANDOM.ORG server, which is intrinsically slower than R's built-in pseudo-random number generators.
From one of the vignettes from the random package:
There are a number of situations in which it is desirable to use non-deterministically determined random numbers. Examples include
- to seed distributed computing on different nodes with truly independent seeds;
- to obtain portable initializations for RNGs that do not depend on particular operating system or hardware features;
- to validate simulation results using non-deterministic random numbers;
- to provide indeterministic seeds used for lottery drawings or games ...
Note that most of these examples are about seeding or initializing (these are synonyms) R's built-in pseudo-random number generators, rather than replacing them ...
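As a rough illustration of that distinction: a seed initializes R's built-in generator once, and every later draw comes from the local generator with no network access. The seed value below is an arbitrary placeholder; in the vignette's use cases it would come from a non-deterministic source instead.

```r
# Arbitrary placeholder seed; any integer works.
seed <- 20240101L

# Initialize the built-in pseudo-random number generator, then draw locally.
set.seed(seed)
first <- sample(LETTERS, 5, replace = TRUE)

# Re-seeding with the same value reproduces the same draws.
set.seed(seed)
second <- sample(LETTERS, 5, replace = TRUE)

identical(first, second)
```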