How to extract where clause as array in spark sql?

2020-06-29 04:38发布

问题:

I am trying to extract where clause from SQL query. Multiple conditions in where clause should be in form array. Please help me.

Sample Input String:

select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null 

Output Expected:

Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")

Thanks in advance.

EDIT:

My question here is like... I would like to split all the conditions as separate items... let's say my query is like

select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null

The output I'm expecting is like

List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")

The thing is the query may have multiple levels of conditions like

select * from table where col1=1 and (col2 =2 or(col3 between 1 and 10 or col3 is between 190 and 200)) and col4='xyz'

in output each condition should be a separate item

List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")

回答1:

I wouldn't use Regex for this. Here's an alternative way to extract your conditions based on Catalyst's Logical Plan :

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
    f.condition.productIterator.flatMap{
      case And(l,r) => Seq(l,r)
      case o:Predicate => Seq(o)
    }
}.toList.flatten

println(predicates)

Output :

List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))

Here the predicates are still Expressions and hold information (tree representation).

EDIT : As asked in comment, here's a String (user friendly I hope) representation of the predicates :)

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
    f.condition.productIterator.flatMap{
      case o:Predicate => Seq(o)
    }
}.toList.flatten

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match{
    case And(l,r) => (l,r) match {
      case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
    }
    case Or(l,r) => Seq(Seq(l,r).flatMap(stringifyExpressions).mkString("(",") OR (", ")"))
    case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}

val stringRepresentation = predicates.flatMap{stringifyExpressions}

println(stringRepresentation)

New Output :

List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)

You can keep playing with the recursive stringifyExpressions method if you want to customize the output.

EDIT 2 : In response to your own edit :

You can change the Or / EqualTo cases to the following

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match{
    case And(l,r) => (l,r) match {
      case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
    }
    case Or(l,r) => Seq(l,r).flatMap(stringifyExpressions)
    case EqualTo(l,r) =>
      val prettyLeft = if(l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
      val prettyRight = if(r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
      Seq(s"$prettyLeft=$prettyRight")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}

This gives the 4 elements List :

List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)

For the second example :

select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'

You'd get this output (List[String] with 5 elements) :

List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')

Additional note: If you want to print the attribute names without the starting quote, you can handle it by printing this instead of toString :

node.asInstanceOf[UnresolvedAttribute].name