Reading a binary map file in r

2020-07-31 04:02发布

问题:

I am trying to read a binary file in R containing a simple 2D array of 360x180 values. For reference, the binary file can be found here:

http://transcom.project.asu.edu/download/transcom03/smoothmap.fix.2.bin

Here is what the readme for this .bin says:

The file 'smoothmap.fix.2.bin' contains a single real, binary array dimensioned 360 x 180. The array contains the numbers 1 through 22, denoting each of the 22 basis functions in the TransCom 3 experiment. This file was written on an SGI Origin 2000 hosting UNIX.

And my code:

to.read <- file("smoothmap.fix.2.bin", "rb")
raw.transcom <- readBin(to.read, integer(), n = 360*180, size = 4, endian = "big")
transcom <- matrix(raw.transcom, 180, 360, byrow = F)

Now raw.transcom contains only junk values:

unique(raw.transcom)
 [1]     259200          0 1101004800 1082130432 1092616192 1097859072 1100480512 1102053376 1086324736
[10] 1077936128 1101529088 1095761920 1096810496 1099956224 1091567616 1084227584 1090519040 1094713344
[19] 1099431936 1073741824 1093664768 1088421888 1065353216 1098907648

Why would that be?

I've been looking at this for an hour now and I'm stumped. Played around with endian-ness settings and the 'size' in readBin, but that did not help.

How can I read in this file correctly?

回答1:

Well, I didn't have time to poke at the "R" way to do this, but I do have access to GDL and found this, so I threw together:

Data  = read_binary('smoothmap.fix.2.bin',DATA_TYPE=4,ENDIAN='big');
Data = Data[1:64800]
Data = reform(Data,[360,180])

openw,unit,'testfile.dat',/get_lun
printf,unit,Data
free_lun,unit

and managed to generate: http://rud.is/dl/testfile.dat.gz

If you grab that and do:

x <- as.numeric(scan("testfile.dat.gz", "numeric"))

length(x)
## [1] 64800

table(x)
##   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22 
## 7951 1643 1189  796  868 1688  864 2345 2487  509  733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950 

It definitely looks like it's got the right values for the definition you specified and you can turn that into a matrix.

Check back, though, as I now need to figure out how to do this in R :-)


UPDATE

GOT IT!

I'm rly kinda glad I found the IDL code to verify the R results.

x <- readBin("smoothmap.fix.2.bin", "raw", file.size("smoothmap.fix.2.bin"))
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]

table(readBin(rawConnection(x), "numeric", 360*180, 4, endian="big"))
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22 
## 7951 1643 1189  796  868 1688  864 2345 2487  509  733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950 

Ideally, we'd check for first 4 and last 4 bytes being equal, but this hack shld get you through.


Putting it all together

Added validation bits of code…

#' Read in a binary array, likely written with IDL
#' 
#' @param x path to file (auto-expanded & tested for existence)
#' @param n number of `float` elements to read in
#' @param endian endian-ness (default `big`)
#' @return numeric vector of length `n`
read_binary_float <- function(x, n, endian="big") {

  x <- normalizePath(path.expand(x))

  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4] # extract front bits
  last4 <- x[(length(x)-3):length(x)] # extract back bits

  # convert both to long ints      

  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)

  l4c <- rawConnection(last4)      
  on.exit(close(l4c), add=TRUE)      
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)

  # validation

  stopifnot(f4 == l4) # check front/back are equal
  stopifnot(f4 == n*4) # check if `n` matches expected record count

  # strip off front and back bits

  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  # slurp it all in

  rc <- rawConnection(x)      
  on.exit(close(rc), add=TRUE)

  readBin(con = rc, what = "numeric", n = n, size = 4L, endian=endian)

}

Quick example:

library(magrittr)

read_binary_float("smoothmap.fix.2.bin", 360*180) %>% 
  matrix(nrow = 360, ncol = 180) %>% 
  image()

This file seems to conform to the Fortran "unformatted I/O" spec : https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc4/index.html : which confirmed the

"# records" | record | record | … | record | "# records"

we saw. So the function could be generalized to support more than just float conversion:

read_binary_array <- function(x, type=c("byte", "integer", "float"), endian="big") {

  type <- match.arg(trimws(tolower(type)), c("byte", "integer", "float"))
  type_size <- unname(c("byte"=1, "integer"=4, "float"=4)[type])

  x <- normalizePath(path.expand(x))

  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4]
  last4 <- x[(length(x)-3):length(x)]

  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)

  l4c <- rawConnection(last4)
  on.exit(close(l4c), add=TRUE)
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)

  stopifnot(f4 == l4) # check front/back are equal
  stopifnot((f4 %% type_size == 0)) # shld have nothing left over

  n_rec <- f4 / type_size
  message(sprintf("Reading in %s records...", scales::comma(n_rec)))

  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  rc <- rawConnection(x)
  on.exit(close(rc), add=TRUE)

  what <- switch(type, byte="raw", integer="integer", float="numeric")
  dat <- readBin(con = rc, what = what, n = n_rec, size = type_size, endian=endian)

  dat

}


回答2:

This is incomplete, posted for progress.

There might be an undocumented "feature" in the data file, in that the first eight bytes are not part of the data. (The file is 259208, but 360*180*4==259200.) However, I do find something else interesting:

d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180, size = 4, endian = "big")

head(d)
# [1] 259200      0      0      0      0      0

I'm going to infer that the first 4-byte integer (259200) indicates the size of the data, so I suggest we can discard it. You might argue that you have the proper length of a vector here, but that's because you forced readBin to stop loading data. From ?readBin:

   n: integer.  The (maximal) number of records to be read.  You
      can use an over-estimate here, but not too large as storage
      is reserved for 'n' items.

So it should be safe to read beyond your expected file size, it will handle EOF by itself. I'll increase by 10, arbitrarily:

length(d)
# [1] 64800
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180+10, size = 4, endian = "big")
length(d)
# [1] 64802
tail(d)
# [1] 1098907648 1098907648 1098907648 1098907648 1098907648     259200

(Note that even though I suggested reading another 10 bytes, only two more were available. So you know, the rationale for the n argument is for pre-allocation of memory, nothing more.) That 259200 is there again, I'll infer that confirms the end of data, so we should be able to safely discard those two (first/last) numbers.

d <- d[-c(1, length(d))]

The first non-zero number is:

head(which(d>0))
# [1] 4321 4322 4323 4324 4325 4326
d[4321]
# [1] 1101004800

and looking at the bits:

intToBits(d[4321])
#  [1] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 01
# [26] 00 00 00 00 00 01 00

So if you infer a direct binary interpretation, that value is 2820, which doesn't match the smoothmap.readme description of available values. Furthermore, what we are expecting to see:

intToBits(22)
#  [1] 00 01 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

So it appears that your bits are ... not in the right order, or something like that. If you intToBits all unique values, you'll notice that all bits 1-19 (the least-significant bits) are zero.

From here, I'm at a loss ...

sapply(unique(d), function(a) packBits(rev(intToBits(a)), type="integer"))
#  [1]    0 1410  258 1154 3714 6530 3458  770  514 5506 2690 1666 2434 2178 1282  130  642 4482    2 3202 1794  508  386