I need to do the following modifications to a varchar(20) field:
- substitute accents with normal letters (like è to e)
- after (1) remove all the chars not in a..z
for example
'aèàç=.32s df'
must become
'aeacsdf'
are there special stored functions to achieve this easily?
UPDATE: please provide a T-SQL not CLR solution. This is the workaround I temporarly did because it temporarly suits my needs, anyway using a more elegant approach would be better.
CREATE FUNCTION sf_RemoveExtraChars (@NAME nvarchar(50))
RETURNS nvarchar(50)
AS
BEGIN
declare @TempString nvarchar(100)
set @TempString = @NAME
set @TempString = LOWER(@TempString)
set @TempString = replace(@TempString,' ', '')
set @TempString = replace(@TempString,'à', 'a')
set @TempString = replace(@TempString,'è', 'e')
set @TempString = replace(@TempString,'é', 'e')
set @TempString = replace(@TempString,'ì', 'i')
set @TempString = replace(@TempString,'ò', 'o')
set @TempString = replace(@TempString,'ù', 'u')
set @TempString = replace(@TempString,'ç', 'c')
set @TempString = replace(@TempString,'''', '')
set @TempString = replace(@TempString,'`', '')
set @TempString = replace(@TempString,'-', '')
return @TempString
END
GO
What you are looking for is something to remove Diacritics from individual characters. I'm afraid the solution you have is going to be almost as good as you can get, at least with pure SQL. dotNet/CLR does provide a simple method for doing this, though. Sorry, I know you are wanting to avoid another CLR solution, but Microsoft SQL Server doesn't provide a T-SQL equivalent for this.
If you're lucky, you have the collation set in your database as "SQL_Latin1_General_CP1_CI_AS" or any variant starting with "SQL_Latin1_General". This is equivalent to Windows-1252 which is very well documented. You'll be able to "translate" each character to an English equivalent by reviewing the characters and mapping an equivalent using a SQL CASE statement like you have been.
I do have one quick correction for your code, though. You are going to want to use varchar in your variables and parameters. It creates additional overhead performing data type conversion back and forth and has the potential for introducing unicode characters that only exist as unicode into the mix. Plus, a security related reason specific to your situation can be found on Bruce Schneier's blog.
Update Some great information on Diacritics and Windows internationalization can be found on Michael S Kaplan's blog.
With two auxiliary tables in your schema, you can transform your data using just a
SELECT
statement.First declare a view to implement a numbers table, from 1 to 65536. The following technique is due to Itzik Ben-Gan:
Next declare a mapping between characters with diacritics and and their undiacritic equivlants. This example data is not complete mapping, but serves as an example:
Finally declare a table with a column containing test data. The data is sourced from the question and from DForck42's answer:
With these objects in place, the following
SELECT
statement processes the test data in a relational way, save for the the non-relational hack to concatenate the characters back together. The 'blackbox XML method' is due to Anith Sen:The query produces the following result set:
Let me clarify something first: the accented characters you show are not actually Unicode (as one answer implies); these are 8-bit ASCII characters. One thing to keep in mind: you see characters like è and à simply because this is how your code page (the code page used by your OS and/or SQL Server [I'm not sure which one]) displays them. In a different code page, these characters would be represented by totally different symbols (e.g. if you use a Cyrillic or Turkish code page).
Anyway, say you want to replace these 8-bit chars with the closest US/Latin character equivalent for your default code page [I assume these are characters from some variation of a Latin character set]. This is how I approached a similar problem (disclaimer: this is not a very elegant solution, but I could not think of anything better at the time):
Create a UDF to translate an 8-bit ASCII character to a 7-bit printable ASCII equivalent, such as:
The code above is general-purpose, so you can adjust the character mappings to remove all non-alphabetic characters, e.g. you can use code like this in the match for printable 7-bit ASCII character (this assumes case-insensitive collation):
To see if your character mapping for 8-bit ASCII symbols works correctly, run the following code:
Now you can create a UDF to process a string:
The best way to achieve this is very simple and efficient :
which outputs 'aeeohello!'
The string must not be unicode. If you have a nvarchar just cast it to varchar before using the collate.
Here is a function that answers the OP needs :
Then, the command:
outputs
You can avoid hard-coded
REPLACE
statements by using aCOLLATE
clause with an accent-insensitive collation to compare the accented alphabetic characters to non-alphabetic ones: