I have a .log file with an inconsistent data format.
The data looks like this, and is stored as "Little-endian UTF-16 Unicode text":
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
[XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
323133.....238813 76378 989899 000000000000
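I have multiple log files that follow this pattern. I have tried the scan() and read.table() functions, and neither returns the data in the format I want.
The data format I am expecting is as follows: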
Date String
2017-06-21 00:00:30.483 START THIS THING
However, lines like these appear multiple times in the log file:
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
and this,
323133.....238813 76378 989899 000000000000
What is the best way to solve this? Thanks!
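Here is just a raw draft using base R, without any performance optimization (such as using data.table::fread and the lubridate package). It ignores the time part of your timestamps and the column names: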
log.data <- "2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
[XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
323133.....238813 76378 989899 000000000000"
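# Using "\n" as the separator keeps every line as a single field in one column
log <- read.csv(text = log.data, sep = "\n", header = FALSE, stringsAsFactors = FALSE)
# Parse only the leading date; lines that do not start with a date become NA
log$timestamp <- as.Date(log[, 1], format = "%Y-%m-%d")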
This results in:
> log
                                               V1  timestamp
1        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
2  2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
3                                 [XYZ 1000 T1]:1       <NA>
4        2017-06-22 01:15:17.945 NOTHING 'D': 989 2017-06-22
5                         [CASE] IN: [ID: 1010]33       <NA>
6                          [CASE] IN: [ID: 2010]8       <NA>
7        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
8  2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
9        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
10 2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
11       2017-06-21 00:00:30.483 START THIS THING 2017-06-21
12 2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
13    323133.....238813 76378 989899 000000000000       <NA>
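The question asks for a Date column and a String column, while the draft above keeps the whole line in one column and drops the time of day. The following is only a sketch of one way to split the lines with a regular expression in base R. The helper name `parse_log_lines`, the column names, and the `tz = "UTC"` setting are assumptions made for illustration; the log itself does not say which time zone it uses.

```r
# Hypothetical helper (not part of the original answer): split raw log lines
# into a timestamp column and a free-text column.
parse_log_lines <- function(lines) {
  # A timestamped line starts with "YYYY-MM-DD HH:MM:SS.mmm " followed by text.
  pattern <- paste0(
    "^([0-9]{4}-[0-9]{2}-[0-9]{2} ",
    "[0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) (.*)$"
  )
  has_ts <- grepl(pattern, lines)

  # Lines without a timestamp get NA as date and keep their full text.
  ts_text <- ifelse(has_ts, sub(pattern, "\\1", lines), NA_character_)
  msg     <- ifelse(has_ts, sub(pattern, "\\2", lines), lines)

  data.frame(
    # "%OS" reads seconds including the fractional part.
    # tz = "UTC" is an assumption; adjust it to the zone the logger used.
    Date   = as.POSIXct(ts_text, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"),
    String = msg,
    stringsAsFactors = FALSE
  )
}

lines <- c(
  "2017-06-21 00:00:30.483 START THIS THING",
  "[XYZ 1000 T1]:1",
  "2017-06-22 01:15:17.945 NOTHING 'D': 989"
)
parsed <- parse_log_lines(lines)
```

Rows such as `[XYZ 1000 T1]:1` end up with `NA` in `Date`, so they can be filtered out with `parsed[!is.na(parsed$Date), ]` or handled separately, depending on what those lines mean in your logs.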
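Update 1:
Since you found out that your log file uses the UTF-16 little-endian file encoding (e.g. by checking it with the file
terminal command on Linux / OSX), you have to add the file encoding to read.csv
so that R converts the file content correctly while reading:
log <- read.csv(file = "my.log", sep = "\n", header = FALSE, fileEncoding = "UTF-16LE", encoding = "UTF-8")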
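Since the question mentions several log files with the same layout, the same encoding setting could be applied to each file in turn. Below is a minimal sketch that assumes all files sit in one directory and end in `.log`. The helper name `read_utf16_log` and the temporary demo file are hypothetical and only there so the example runs on its own.

```r
# Hypothetical helper: read one UTF-16LE log file into a character vector,
# one element per line.
read_utf16_log <- function(path) {
  con <- file(path, open = "r", encoding = "UTF-16LE")
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Demo setup only: write a small UTF-16LE file into a temporary directory.
demo_dir <- tempfile("logs")
dir.create(demo_dir)
demo_con <- file(file.path(demo_dir, "a.log"), open = "w", encoding = "UTF-16LE")
writeLines(
  c("2017-06-21 00:00:30.483 START THIS THING", "[XYZ 1000 T1]:1"),
  demo_con
)
close(demo_con)

# Read every *.log file in the directory; the result is a list named by file.
paths <- list.files(demo_dir, pattern = "\\.log$", full.names = TRUE)
logs  <- lapply(paths, read_utf16_log)
names(logs) <- basename(paths)
```

Each list element is a plain character vector, so it can be passed on to whatever per-line parsing you settle on, and the results can be combined afterwards with `do.call(rbind, ...)` if they are data frames.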