Wednesday, April 19, 2006

Practical Uses of PatIndex() ... Or Why CharIndex() is not enough

PatIndex vs. CharIndex
One question that I am frequently asked by developers new to T-SQL is whether they should use CharIndex() or PatIndex() and what the difference between the two is. SQL Server Books Online explains that the difference between the two is that PatIndex can use wildcard characters. This is often unclear to many developers because they associate the term wildcard with the percent sign (%) only. What Books Online does not make clear is that PatIndex() can make use of the full spectrum of wildcard characters. This gives it power well beyond that of CharIndex().

PatIndex is Like Like
To really see the full capability of PatIndex(), you must take a look at the Like command in SQL Server Books Online. Books Online describes the available wildcards characters as follows:

Wildcard character Description Example
% Any string of zero or more characters. WHERE title LIKE '%computer%' finds all book titles with the word 'computer' anywhere in the book title.
_ (underscore) Any single character. WHERE au_fname LIKE '_ean' finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ] Any single character within the specified range ([a-f]) or set ([abcdef]). WHERE au_lname LIKE '[C-P]arsen' finds author last names ending with arsen and beginning with any single character between C and P, for example Carsen, Larsen, Karsen, and so on.
[^] Any single character not within the specified range ([^a-f]) or set ([^abcdef]). WHERE au_lname LIKE 'de[^l]%' all author last names beginning with de and where the following letter is not l.


If we adapt this table for PatIndex(), it would look something like this:

Wildcard character Description Example
% Any string of zero or more characters. WHERE PatIndex('%computer%', title) > 0 finds all book titles with the word 'computer' anywhere in the book title.
_ (underscore) Any single character. WHERE PatIndex('_ean', au_fname) > 0 finds all four-letter first names that end with ean (Dean, Sean, and so on).
[ ] Any single character within the specified range ([a-f]) or set ([abcdef]). WHERE PatIndex('[C-P]arsen', au_lname) > 0 finds author last names ending with arsen and beginning with any single character between C and P (Carsen, Larsen, Karsen, and so on).
[^] Any single character not within the specified range ([^a-f]) or set ([^abcdef]). WHERE PatIndex('de[^l]%', au_lname) > 0 finds all author last names beginning with de and where the following letter is not l (finds Derry but not Delton).


Let's Get Practical
Let's take a real world example. If we have a varchar field that may contain non-numeric data and we want to Select the records and include the data in this field if it contains numbers only and nothing else, we can use PatIndex().

Background:
As the DBA for my company, my team supports all database needs for our company's applications including a web application. A key data field within this application is the customer's internal company ID for their employees. Since this ID is defined by the client, we must allow the use of alpha-numeric codes. A separate customer driven project calls for us to export their roster data to send to a third party. One of the requirements of the project is that if an employee's company ID contains data other than a number, we should assume that the value is incorrect and not include it.

Solution:
I decided to use PatIndex() in a Case statement with the not within a range wildcards. The range in this case would be 0 to 9 ( [0-9] ), and I would express not within this range by using the expression [^0-9]. My resulting Case statement looks like:

CompanyID = Case When PatIndex('%[^0-9]%', IsNull(CompanyID, '*')) > 0 Then Null Else CompanyID End

Going a little further:
If I want to go a little further with this concept, I could write a User-Defined Function ( named fnIsInt()) that accepts a varchar value and returns a 0 or a 1 indicating if the value could be converted to an integer data type as is without any other string manipulation. This function would work much like the built-in IsNumeric() function. The difference between the two functions is that IsNumeric() allows additional characters such as the decimal point, currency symbols, commas.

Since this article is focusing on practical uses of PatIndex, I will only be checking to see if the string value contains digits only with the exception of allowing it to begin with a negative sign. I will not include checking to see if the value is within the allowable range of the Int data type.



Create Function dbo.fnIsInt(
@value varchar(11))
Returns int
As
Begin
Declare @IsInt int
Set @IsInt = 0

If PatIndex('%[^0-9]%', @value) > 0 Or PatIndex('-%[^0-9]%', @value) > 0
Begin
Set @IsInt = 0
End
Else
Begin
Set @IsInt = 1
End

Return @IsInt
End


Testing the function:
Select IsNumeric('4'), PatIndex('%[^0-9]%', '4'), dbo.fnIsInt('4')



----------- ----------- -----------
1 0 1

Select IsNumeric('4.2'), PatIndex('%[^0-9]%', '4.2'), dbo.fnIsInt('4.2')


----------- ----------- -----------
1 2 0

Select IsNumeric('4-2'), PatIndex('%[^0-9]%', '4-2'), dbo.fnIsInt('4-2')


----------- ----------- -----------
0 2 0

Select IsNumeric('-4'), PatIndex('%[^0-9]%', '-4'), dbo.fnIsInt('-4')


----------- ----------- -----------
1 1 0

Select IsNumeric('4,2'), PatIndex('%[^0-9]%', '4,2'), dbo.fnIsInt('4,2')

----------- ----------- -----------
1 2 0

Select IsNumeric('$42'), PatIndex('%[^0-9]%', '$42'), dbo.fnIsInt('$42')

----------- ----------- -----------
1 1 0

The Big Deal?
Okay, I admit that I could have done the above Case statement just as easily using Like commands. Since I did not need to know the position of the offending characters, PatIndex() was more than was needed for that situation. If the project had required that I remove the non-digit characters rather than simply returning Null, PatIndex() would have been the perfect solution.

I can easily leverage the full power of PatIndex() by creating a User-Defined Function that accepts a varchar value and returns a varchar value with all non-digit characters removed.


Create Function dbo.fnDigitsOnly(
@value varchar(50))
Returns varchar(50)
As
Begin
If PatIndex('%[^0-9]%', @value) > 0
Begin
While PatIndex('%[^0-9]%', @value) > 0
Begin
Set @value = Stuff(@value, PatIndex('%[^0-9]%', @value), 1, '')
End
End

Return @value
End


The Big Finish
So, what's the difference between CharIndex() and PatIndex()? Well, like Books Online says, PatIndex() can use wildcards. To put it as simple as I can, PatIndex() combines the capabilities of the CharIndex() function and the Like command. 99.99% of the time, CharIndex() and PatIndex() are used interchangeably. That 0.01% of the time that you need something more, you'll be glad to have PatIndex() in your T-SQL toolbox.

No comments: