Working with log files in R

Posted 2019-09-28 05:30

I have a .log file with an inconsistent data format.

The data looks like this and is stored as "Little-endian UTF-16 Unicode" text:

2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
     [XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
     [CASE] IN: [ID: 1010]33
     [CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS

323133.....238813   76378    989899 000000000000

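I have multiple log files that follow this pattern. I have tried the scan() and read.table() functions, but neither returns the data in the format I want.

The format I am expecting looks like this: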

Date                          String
2017-06-21 00:00:30.483       START THIS THING

However, lines like these appear many times in my log files:

 [CASE] IN: [ID: 1010]33
 [CASE] IN: [ID: 2010]8

and this one:

323133.....238813   76378    989899 000000000000

What is the best way to handle this? Thanks!

Answer 1:

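Just a raw sketch using base R, without any performance optimizations (such as data.table::fread or the lubridate package). It ignores the time part of your timestamps and the column names: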

log.data <- "2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
     [XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
     [CASE] IN: [ID: 1010]33
     [CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS

323133.....238813   76378    989899 000000000000"

# Read every line as a one-column row, then parse the leading date (NA where a line has none)
log <- read.csv(text = log.data, sep = "\n", header = FALSE)
log$timestamp <- as.Date(log[,1])

This results in:

> log
                                                 V1  timestamp
1          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
2    2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
3                                   [XYZ 1000 T1]:1       <NA>
4          2017-06-22 01:15:17.945 NOTHING 'D': 989 2017-06-22
5                           [CASE] IN: [ID: 1010]33       <NA>
6                            [CASE] IN: [ID: 2010]8       <NA>
7          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
8    2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
9          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
10   2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
11         2017-06-21 00:00:30.483 START THIS THING 2017-06-21
12   2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
13 323133.....238813   76378    989899 000000000000       <NA>
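
The data frame above still keeps each raw line in a single column. To get closer to the Date / String layout asked for in the question, one possible next step is sketched below. It is only a sketch under a few assumptions: every event line starts with a timestamp of the form YYYY-MM-DD HH:MM:SS.mmm, indented lines belong to the event line before them, and anything else (such as the trailing number block) can be dropped. The names ts_pattern, is_event, is_detail, events and the extra Details column are made up for this example.

```r
# Shortened sample; in practice `lines` would come from readLines() on the log file.
lines <- c(
  "2017-06-21 00:00:30.483 START THIS THING",
  "2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS",
  "     [XYZ 1000 T1]:1",
  "2017-06-22 01:15:17.945 NOTHING 'D': 989",
  "     [CASE] IN: [ID: 1010]33",
  "     [CASE] IN: [ID: 2010]8",
  "",
  "323133.....238813   76378    989899 000000000000"
)

# Assumed layout of an event line: timestamp, one space, free text.
ts_pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) (.*)$"
is_event  <- grepl(ts_pattern, lines)
# Indented, non-empty lines are treated as details of the preceding event.
is_detail <- grepl("^[[:space:]]+[^[:space:]]", lines)

events <- data.frame(
  Date   = as.POSIXct(sub(ts_pattern, "\\1", lines[is_event]),
                      format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"),
  String = sub(ts_pattern, "\\2", lines[is_event]),
  stringsAsFactors = FALSE
)

# cumsum() gives each line the index of the most recent event line above it.
event_id <- cumsum(is_event)
events$Details <- vapply(seq_len(nrow(events)), function(i) {
  paste(trimws(lines[is_detail & event_id == i]), collapse = "; ")
}, character(1))

print(events)
```

Lines that are neither events nor indented details are silently discarded here. If those number blocks matter, they would need their own rule. The time zone "UTC" is also just a placeholder, since the log does not say which zone it uses.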

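Update 1:

Since you found out that your log file uses the UTF-16 little-endian file encoding (checked with the file terminal command on Linux / OSX), you have to pass the file encoding to read.csv so that R converts the file content correctly while reading: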

log <- read.csv(file = "my.log", sep = "\n", header = FALSE, fileEncoding = "UTF-16LE", encoding = "UTF-8")
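
If you would rather have the raw lines as a character vector (for example to filter them with regular expressions before building a data frame), the same encoding can be declared on a file connection and read with readLines. The following is a minimal sketch, not a tested recipe for your files: read_log_lines is a hypothetical helper name, and the demo writes its own small UTF-16LE file to a temporary path so that it runs without my.log.

```r
# Hypothetical helper: read one log file into a character vector,
# letting the connection convert from the given encoding.
read_log_lines <- function(path, encoding = "UTF-16LE") {
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Self-contained demo: write a small UTF-16LE file, then read it back.
sample_lines <- c("2017-06-21 00:00:30.483 START THIS THING",
                  "     [XYZ 1000 T1]:1")
tmp <- tempfile(fileext = ".log")
out <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(sample_lines, out)
close(out)

log_lines <- read_log_lines(tmp)
print(log_lines)

# For several files, assuming they sit in a directory called "logs" (placeholder):
# files     <- list.files("logs", pattern = "\\.log$", full.names = TRUE)
# all_lines <- lapply(files, read_log_lines)
```

One thing to watch for: if the file starts with a byte order mark, reading it as "UTF-16LE" may leave that mark as an invisible first character of the first line. In that case trying encoding = "UTF-16" instead may help, but check this against your own files.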


Source: Working with log files in R
Tags: r encoding