Best way to extract segments / values from VARCHAR

2019-03-01 15:19发布

问题:

Take the following example data:

SELECT 'HelpDesk Call Reference F0012345, Call Update, 40111' AS [Subject]
UNION ALL
SELECT 'HelpDesk Call Reference F0012346, Call Resolved, 40112' AS [Subject]
UNION ALL
SELECT 'HelpDesk Call Reference F0012347, New call logged, 40113' AS [Subject]

What i would like to do is extract this data as follows:

As you can see, i need to extract the Ref, Type & OurRef as seperate columns to ensure efficient set based SQL when processing the resulting emails.

Usually for this scenario i would use a function such as this:

CREATE FUNCTION dbo.fnParseString (
    @Section SMALLINT ,
    @Delimiter CHAR ,
    @Text VARCHAR(MAX)
)
RETURNS VARCHAR(8000)
AS 
    BEGIN
        DECLARE @NextPos SMALLINT;
        DECLARE @LastPos SMALLINT;
        DECLARE @Found SMALLINT;

        SELECT  @NextPos = CHARINDEX(@Delimiter, @Text, 1) ,
                @LastPos = 0 ,
                @Found = 1

        WHILE @NextPos > 0
            AND ABS(@Section) <> @Found 
            SELECT  @LastPos = @NextPos ,
                    @NextPos = CHARINDEX(@Delimiter, @Text, @NextPos + 1) ,
                    @Found = @Found + 1

        RETURN LTRIM(RTRIM(CASE
            WHEN @Found <> ABS(@Section) OR @Section = 0 THEN NULL
            WHEN @Section > 0 THEN SUBSTRING(@Text, @LastPos + 1, CASE WHEN @NextPos = 0 THEN DATALENGTH(@Text) - @LastPos ELSE @NextPos - @LastPos - 1 END)
            ELSE SUBSTRING(@Text, @LastPos + 1, CASE WHEN @NextPos = 0 THEN DATALENGTH(@Text) - @LastPos ELSE @NextPos - @LastPos - 1 END)
        END))
    END

For example i then replace the white space before the ref to include a comma and split as follows:

WITH    ExampleData
          AS ( SELECT   'HelpDesk Call Reference F0012345, Call Update, 40111' AS [Subject]
               UNION ALL
               SELECT   'HelpDesk Call Reference F0012346, Call Resolved, 40112'
               UNION ALL
               SELECT   'HelpDesk Call Reference F0012347, New call logged, 40113'
             )
    SELECT  dbo.fnParseString(2, ',', REPLACE([Subject], 'HelpDesk Call Reference ', 'HelpDesk Call Reference, ')) AS [Ref] ,
            dbo.fnParseString(3, ',', REPLACE([Subject], 'HelpDesk Call Reference ', 'HelpDesk Call Reference, ')) AS [Type] ,
            dbo.fnParseString(4, ',', REPLACE([Subject], 'HelpDesk Call Reference ', 'HelpDesk Call Reference, ')) AS [OurRef]
    FROM    ExampleData

As you can see, i have a solution that gets the end-result i'm after, but the use of a messy udf isnt ideal & i was wondering if sql-server has a better way of doing stuff like this - perhaps inline regular expressions? I.e. i think PATINDEX() accepts regular expressions as a search string - this in conjunction with SUBSTRING() could do what i need but i dont really know where to start?

Edit: Please note that this is a simplified example, the subject is variable and i'll also be adapting the same technique to parse the body, the body will have 8 items of data that i need to parse out using a variety of delimiters, so this rules out the use of ParseName() as it only allows 4 parts, and i can’t use fixed length (i.e. substring()) as the length will be very varied (especially if different helpdesks are involved (which they are) - this is why i was thinking along the lines of PATINDEX() & SUBSTRING()

回答1:

I suggest to use this:

;WITH CTE
AS
(
SELECT 'HelpDesk Call Reference F0012345, Call Update, 40111' AS [Subject]
UNION ALL
SELECT 'HelpDesk Call Reference F0012346, Call Resolved, 40112' AS [Subject]
UNION ALL
SELECT 'HelpDesk Call Reference F0012347, New call logged, 40113' AS [Subject]
)
, CTEPart
as
(
SELECT [Subject], REPLACE(SUBSTRING([Subject], 25, 1000), ', ', '.') Part
FROM CTE
)
SELECT
    [Subject],
    PARSENAME(Part, 1) AS [Ref],
    PARSENAME(Part, 2) AS [Type],
    PARSENAME(Part, 3) AS [OurRef]
FROM CTEPart


回答2:

After additional work we decided not to use the approach in Art's answer (even though it worked).

We needed a far more powerfull way of validating and extracting the substrings, So I went the Regular Expression via CLR route (Thanks to Pondlife for pointing me in the right direction).

The approach i took was as follows:

First I compiled the following CLR: (Converted to VB from a C# example Here)

Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Text.RegularExpressions
Imports System.Text

Partial Public Class UserDefinedFunctions

    Public Shared ReadOnly Options As RegexOptions = RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline

    <SqlFunction()> _
    Public Shared Function RegexMatch(ByVal input As SqlChars, ByVal pattern As SqlString) As SqlBoolean
        Dim regex As New Regex(pattern.Value, Options)
        Return regex.IsMatch(New String(input.Value))
    End Function

    <SqlFunction()> _
    Public Shared Function RegexReplace(ByVal expression As SqlString, ByVal pattern As SqlString, ByVal replace As SqlString) As SqlString
        If expression.IsNull OrElse pattern.IsNull OrElse replace.IsNull Then
            Return SqlString.Null
        End If

        Dim r As New Regex(pattern.ToString())
        Return New SqlString(r.Replace(expression.ToString(), replace.ToString()))
    End Function

    ' returns the matching string. Results are separated by 3rd parameter
    <SqlFunction()> _
    Public Shared Function RegexSelectAll(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal matchDelimiter As SqlString) As SqlString
        Dim regex As New Regex(pattern.Value, Options)
        Dim results As Match = regex.Match(New String(input.Value))

        Dim sb As New StringBuilder()
        While results.Success
            sb.Append(results.Value)

            results = results.NextMatch()

            ' separate the results with newline|newline
            If results.Success Then
                sb.Append(matchDelimiter.Value)
            End If
        End While

        Return New SqlString(sb.ToString())

    End Function

    ' returns the matching string
    ' matchIndex is the zero-based index of the results. 0 for the 1st match, 1, for 2nd match, etc
    <SqlFunction()> _
    Public Shared Function RegexSelectOne(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal matchIndex As SqlInt32) As SqlString
        Dim regex As New Regex(pattern.Value, Options)
        Dim results As Match = regex.Match(New String(input.Value))

        Dim resultStr As String = ""
        Dim index As Integer = 0

        While results.Success
            If index = matchIndex Then
                resultStr = results.Value.ToString()
            End If

            results = results.NextMatch()

            index += 1
        End While

        Return New SqlString(resultStr)

    End Function

End Class

I installed this CLR as follows:

EXEC sp_configure 
    'clr enabled' ,
    '1'

GO

RECONFIGURE
USE [db_Utility]

GO
CREATE ASSEMBLY SQL_CLR_RegExp FROM 'D:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\SQL_CLR_RegExp.dll' WITH
PERMISSION_SET = SAFE

GO
-- =============================================
-- Returns 1 or 0 if input matches pattern
-- VB function: RegexMatch(ByVal input As SqlChars, ByVal pattern As SqlString) As SqlBoolean
-- =============================================
CREATE FUNCTION [dbo].[RegexMatch]
    (
      @input [nvarchar](MAX) ,
      @pattern [nvarchar](MAX)
    )
RETURNS [bit]
    WITH EXECUTE AS CALLER
AS EXTERNAL NAME 
    [SQL_CLR_RegExp].[SQL_CLR_RegExp.UserDefinedFunctions].[RegexMatch]
GO

-- =============================================
-- Returns a comma separated string of found objects
-- VB function: RegexReplace(ByVal expression As SqlString, ByVal pattern As SqlString, ByVal replace As SqlString) As SqlString
-- =============================================
CREATE FUNCTION [dbo].[RegexReplace]
    (
      @expression [nvarchar](MAX) ,
      @pattern [nvarchar](MAX) ,
      @replace [nvarchar](MAX)
    )
RETURNS [nvarchar](MAX)
    WITH EXECUTE AS CALLER
AS EXTERNAL NAME 
    [SQL_CLR_RegExp].[SQL_CLR_RegExp.UserDefinedFunctions].[RegexReplace]
GO
-- =============================================
-- Returns a comma separated string of found objects
-- VB function: RegexSelectAll(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal matchDelimiter As SqlString) As SqlString
-- =============================================
CREATE FUNCTION [dbo].[RegexSelectAll]
    (
      @input [nvarchar](MAX) ,
      @pattern [nvarchar](MAX) ,
      @matchDelimiter [nvarchar](MAX)
    )
RETURNS [nvarchar](MAX)
    WITH EXECUTE AS CALLER
AS EXTERNAL NAME 
    [SQL_CLR_RegExp].[SQL_CLR_RegExp.UserDefinedFunctions].[RegexSelectAll]
GO
-- =============================================
-- Returns finding matchIndex of a zero based index
-- RegexSelectOne(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal matchIndex As SqlInt32) As SqlString
-- =============================================
CREATE FUNCTION [dbo].[RegexSelectOne]
    (
      @input [nvarchar](MAX) ,
      @pattern [nvarchar](MAX) ,
      @matchIndex [int]
    )
RETURNS [nvarchar](MAX)
    WITH EXECUTE AS CALLER
AS EXTERNAL NAME 
    [SQL_CLR_RegExp].[SQL_CLR_RegExp.UserDefinedFunctions].[RegexSelectOne]
GO 

I then wrote the following wrapping function to simplify use:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author:      <Jordon Pilling>
-- Create date: <30/01/2013>
-- Description: <Calls RegexSelectOne with start and end text and cleans the result>
-- =============================================
CREATE FUNCTION [dbo].[RegexSelectOneWithScrub]
(
    @Haystack VARCHAR(MAX),
    @StartNeedle VARCHAR(MAX),
    @EndNeedle VARCHAR(MAX)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @ReturnStr VARCHAR(MAX)

    --#### Extract text from HayStack using Start and End Needles
    SET @ReturnStr = dbo.RegexSelectOne(@Haystack, REPLACE(@StartNeedle, ' ','\s') + '((.|\n)+?)' + REPLACE(@EndNeedle, ' ','\s'), 0)

    --#### Remove the Needles
    SET @ReturnStr = REPLACE(@ReturnStr, @StartNeedle, '')
    SET @ReturnStr = REPLACE(@ReturnStr, @EndNeedle, '')

    --#### Trim White Space
    SET @ReturnStr = LTRIM(RTRIM(@ReturnStr))

    --#### Trim Line Breaks and Carriage Returns
    SET @ReturnStr = dbo.SuperTrim(@ReturnStr)

    RETURN @ReturnStr

END
GO

This allowed usage as follows:

DECLARE @Subject VARCHAR(250) = 'HelpDesk Call Reference F0012345, Call Update, 40111' 
DECLARE @Ref VARCHAR(250) = NULL

IF dbo.RegexMatch(@Subject, '^HelpDesk\sCall\sReference\sF[0-9]{7},\s(Call\sResolved|Call\sUpdate|New\scall\slogged),(|\s+)([0-9]+|unknown)$') = 1
    SET @Ref = ISNULL(dbo.RegexSelectOneWithScrub(@Subject, 'HelpDesk Call Reference', ','), 'Invalid (#1)')
ELSE
    SET @Ref = 'Invalid (#2)'

SELECT @Ref

This was far quicker to use for multiple searches, and far more powerfull when dealing with lots of text with differeent start and end phrases etc.



回答3:

This example is Oracle query. All functions used are ANSI SQL standard which will work in any SQL. This example cuts only REF part of the string. You simply have to repeat all steps for Type, OutRef, etc... This example assumes that your ref will always contain 0-zero, and there will be always ',' after ref, which can be replaced with whitespace or any other character. THe NVL() cna be used: INSTR(str, NVL(',', ' ')...). I think this approach is more generic then hardcoding values into SUBSTR...:

SELECT str, SUBSTR(str, ref_start_pos, ref_end_pos) final_ref
 FROM
 (
  SELECT str, ref_start_pos, INSTR(str, ',', ref_start_pos)-ref_start_pos AS ref_end_pos
    FROM
    (
     SELECT str, INSTR(str, '0')-1 AS ref_start_pos
       FROM
       (
        SELECT 'HelpDesk Call Reference F0012345, Call Update, 40111' AS str
          FROM dual
        UNION ALL
        SELECT 'HelpDesk Call Reference F0012346, Call Resolved, 40112' 
          FROM dual
       )
     )
   )
  /

  SQL>

  STR                                                    |  FINAL_REF
  ------------------------------------------------------------------------
  HelpDesk Call Reference F0012345, Call Update, 40111   |  F0012345
  HelpDesk Call Reference F0012346, Call Resolved, 40112 |  F0012346

SQL Server Version (Added by OP):

SELECT  [str] ,
        SUBSTRING([str], ref_start_pos, ref_end_pos) AS final_ref
FROM    ( SELECT    [str] ,
                    ref_start_pos ,
                    CHARINDEX(',', [str], ref_start_pos) - ref_start_pos AS ref_end_pos
          FROM      ( SELECT    [str] ,
                                CHARINDEX('Reference', [str]) + 10 AS ref_start_pos
                      FROM      ( SELECT    'HelpDesk Call Reference F0012345, Call Update, 40111' AS [str]
                                  UNION ALL
                                  SELECT    'HelpDesk Call Reference F0012346, Call Resolved, 40112' AS [str]
                                ) AS T1
                    ) AS T2
        ) AS T3