Subset data with dynamic conditions in R

2019-02-25 00:07发布

问题:

I have a dataset of 2500 rows which are all bank loans. Each bank loan has an outstanding amount and collateral type. (Real estate, Machine tools.. etc)

I need to draw a random selection out of this dataset where for example the sum of outstanding amount = 2.5Million +-5% and maximum 25% loans with the same asset class.

I found the function optim, but this asks for a function and looks to be constructed for optimization a portfolio of stocks, which is much more complex. I would say that there is an easy way of achieving this?

I created a sample data set which could illustrate my question better:

dataset <- data.frame(balance=c(25000,50000,35000,40000,65000,10000,5000,2000,2500,5000)
                      ,Collateral=c("Real estate","Aeroplanes","Machine tools","Auto Vehicles","Real estate",
                                    "Machine tools","Office equipment","Machine tools","Real estate","Auto Vehicles"))

If I want for example 5 loans out of this dataset which sum of outstanding balance = 200.000 (with 10% margin) and not more than 40% is allowed to be the same collateral type. (so maximum 2 out of 5 in this example)

Please let me know if additional information is necessary. Many thanks, Tim

回答1:

This function I made works:

pick_records <- function(df,size,bal,collat,max.it) {
  i <- 1
  j <- 1
  while ( i == 1 ) {
    s_index <- sample(1:nrow(df) , size)
    print(s_index)
    output <- df[s_index,]
    out_num <- lapply(output,as.numeric)
    tot.col <- sum(as.numeric(out_num$Collateral))
    if (sum(out_num$balance) < (bal*1.1) &
          sum(out_num$balance) > (bal*0.9) &
          all(  table(out_num$Collateral)/size  <= collat)   ) {
      return(output)
      break
    }
    print(j)
    j <- j + 1
    if ( j == max.it+1) {
      print('No solution found')
      break}     
  }
} 

> a <- pick_records(dataset,5,200000,0.4,20)
> a
  balance       Collateral
3   35000    Machine tools
7    5000 Office equipment
4   40000    Auto Vehicles
5   65000      Real estate
2   50000       Aeroplanes

Where df is your dataframe, size is the number of records you want and max.it the number of maximum iterations to find a solution before returning a no solution found error, bal is the limit for balance and collat the same for Collateral. You can change those as you please.

Let me know if you don't get any part of it.