how to write R to loop each worksheet of every fil

2019-08-11 08:59发布

I have a scripte here and it works fine to get a number in a certain column. Now i want to collect not only on the 1st sheet of every files in the directory, but each sheets of each files.

now the .csv file R has written shows 2 column, column A is filename, and B is the number R grabbed.

What modification should I add to the below script to have a csv output that shows 3 column, A is filename, B is sheetnames, C is the number?

require(xlsx)
#setwd
setwd("D:\\Transferred Files\\")
files <- (Sys.glob("*.xls"))
f<-length(files)

DF <- data.frame(txt=rep("", f),num=rep(NA, f),stringsAsFactors=FALSE)

# files loop
for(i in 1:f)
{
  A<-read.xlsx(file=files[i],1,startColumn=1, endColumn=20, startRow=1, endRow=60)
  #Find price
  B<-as.data.frame.matrix(A)
  P<-B[which(apply(B, 1, function(x) any(grepl("P", x)))),which(apply(B, 2, function(x) any(grepl("P", 

x))))+6]

  #fill price DF
  DF[i, ] <-c(files[i],P)
}
write.csv(DF, "prices.csv", row.names=FALSE)

I tried XLconnet, but can't really make it work into this.

标签: r excel csv
1条回答
你好瞎i
2楼-- · 2019-08-11 09:25

You have a good start, but you are asking how to add into the loop the worksheets within a file. If you read ?read.xlsx, you'll see two arguments that you are glossing over (well, using one, ignoring the other) in your code:

Usage:

     read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
       startRow=NULL, endRow=NULL, colIndex=NULL,
       as.data.frame=TRUE, header=TRUE, colClasses=NA,
       keepFormulas=FALSE, encoding="unknown", ...)

Arguments:

    file: the path to the file to read.

sheetIndex: a number representing the sheet index in the workbook.

sheetName: a character string with the sheet name.

You should only need to provide one of the two.

You might ask "how do I know how many sheets there are in a worksheet?" (for sheetIndex) or even "what are the sheet names?" (for sheetName). ?getSheets to the rescue:

Usage:

     getSheets(wb)

Arguments:

      wb: a workbook object as returned by 'createWorksheet' or
          'loadWorksheet'.

Value:

     'getSheets' returns a list of java object references each pointing
     to an worksheet.  The list is named with the sheet names.

You'll need to use loadWorkbook(file) instead of read.xlsx in order to get the sheet names, but a little reading of the manuals will provide you the information you need to switch over. (You can use something like getSheets(loadWorkbook(file)), but in my experience I try to avoid opening the same file multiple times in the same script, regardless of auto-closing.)

As an alternative, Hadley's readxl package shows promise in its simplicity, speed, and stability. It has excel_sheets() and read_excel() that should fill your needs. (In fact, that's all it has ... simplicity is "A Good Thing (tm)".)

Edit:

library(XLConnect)
## Loading required package: XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
##     Codec),
##   Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com
## Attaching package: 'XLConnect'
## The following objects are masked from 'package:xlsx':
##     createFreezePane, createSheet, createSplitPane, getCellStyle, getSheets, loadWorkbook, removeSheet, saveWorkbook, setCellStyle, setColumnWidth, setRowHeight

wb1 <- loadWorkbook('Book1.xlsx')
shts1 <- getSheets(wb1)
shts1
## [1] "Orig"   "Sheet2" "Sheet8" "Sheet3" "Sheet4" "Sheet5" "Sheet6" "Sheet7"
for (ws in shts1) {
    message(ws)                              # just announcing myself
    dat <- readWorksheet(wb1, ws)
    message(paste(dim(dat), collapse=' x ')) # do something meaningful, not this
}
## Orig
## 128 x 11
## Sheet2
## 128 x 11
## Sheet8
## 128 x 19
## Sheet3
## 17 x 11
## Sheet4
## 128 x 11
## Sheet5
## 128 x 11
## Sheet6
## 128 x 11
## Sheet7
## 128 x 11

Edit #2:

As a more detailed iterative example:

library(XLConnect)
for (fn in list.files(pattern="*.xlsx")) {
    message('Opening: ', fn)
    wb <- loadWorkbook(fn)
    shts <- getSheets(wb)
    message(sprintf('    %d Sheets: %s', length(shts),
                    paste(shts, collapse=', ')))
    for (sh in shts) {
        dat <- readWorksheet(wb, sh)
        ## do something meaningful with the data
    }        
}

I'm not really sure what you are doing with your code (since you never said what was contained within any of the spreadsheets), but an alternative approach (that I would use in place of the previous double-for example) is to enclose everything within lists:

dat <- sapply(list.files(pattern='*.xlsx'), function(fn) {
    wb <- loadWorkbook(fn)
    sapply(getSheets(wb), function(sh) readWorksheet(wb, sh))
})

str(dat, list.len=2)
## List of 4
##  $ Book1.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##  $ Book2.xlsx:List of 8
##   ..$ Orig  :'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   ..$ Sheet2:'data.frame':   128 obs. of  11 variables:
##   .. ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   .. .. [list output truncated]
##   .. [list output truncated]
##   [list output truncated]

If you don't care about differentiating from which workbook a particular sheet came from -- and subsequently simplify processing the data -- then you can "flatten" the nested lists into a single list:

flatdat <- unlist(dat, recur=FALSE)
str(flatdat, list.len=3)
## List of 555
##  $ Book1.xlsx.Orig  :'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet2:'data.frame':   128 obs. of  11 variables:
##   ..$ i   : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x   : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1  : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##  $ Book1.xlsx.Sheet8:'data.frame':   128 obs. of  19 variables:
##   ..$ i    : num [1:128] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x    : num [1:128] 1606527 7484 437881 1601729 1341668 ...
##   ..$ c1   : num [1:128] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]
##   [list output truncated]

Now, processing your data is perhaps simpler. Your code for looking for "P"s is a bit flawed in that you are assigning a data.frame to a cell within another data.frame, typically frowned upon.

This will likely turn in to another question for you. For that, I strongly urge you to provide a better detailed question, including what a sample worksheet looks like and what you expect your output to look like.

查看更多
登录 后发表回答