Spyglass MTG Blog

Implement Pattern Recognition in Your Database Management

Written by Eliza Schuh | Apr 24, 2024 2:15:00 PM

For most people, the use of pattern-matching starts and ends with the Find tool, better known as Control-F on Windows and Linux and Command-F on Mac. The Find tool matches exact patterns and words on a webpage or pdf, showing the user where and how many times a word occurs. However, that kind of exact pattern recognition has severe limitations. By introducing abstractions, pattern recognition transforms from a timesaver into a powerful tool for data engineering. Using SQL, Python 3.11, and PySpark, we will demonstrate how to match and mutate strings and databases.

Symbol

Description

Example Wildcard

Example Matches

%

Percent

Represents zero or more characters

S%

S, SpyglassMTG, Spyg1a$$

_

Underscore

Represents a single character

M_G

MTG, M9G, M#G

[]

Brackets

Represents a single character enclosed within the brackets

[AbCZ6]zure

Azure, bzure, 6zure

-  

Dash

Used within brackets. Represents a range of single characters potentially represented by the brackets. Both ends are inclusive.

P[x-z]Spark

Pxspark, PySpark, PzSpark

Microsoft[1-3]65

Microsoft165, Microsoft265, Microsoft365

TABLE 1

With a single exact pattern match, there’s no way to select all emails from Figure 1 that end with “com.” In order to perform this abstraction, we must employ metacharacters, non-literal character types that have a special meaning to the interpreter. Luckily, SQL includes a type of metacharacter called a wildcard (Table 1), a special character that can act as an abstract stand-in for another character. With SQL wildcards, users get a first level of abstraction. Instead of asking for each exact phrase, such as “SELECT * WHERE email = ‘kateblack@student.com’ OR ‘briannawhitman@student.com’”, now the searcher can ask for “SELECT * WHERE email LIKE %student.com”. In addition, wildcards can allow for multiple search criteria. For example, SELECT * WHERE email LIKE ‘a%co_’ will select only rows with emails that begin with the letter ‘a’ and end in ‘co_’ with ‘_’ representing any single character.

Example Database

johnsmith@student.net

andrewwilliams@student.com

annejones@student.gov

avajameson@student.cow

kateblack@student.com

briannawhitman@student.com

FIGURE 1

‘%student.com’

andrewwilliams@student.com

kateblack@student.com

briannawhitman@student.com

FIGURE 2

‘a%co_’

andrewwilliams@student.com

avajameson@student.cow

FIGURE 3

Alas, even SQL wildcards have their limits. SQL also supports Regular Expressions, or RegEx, using the RLIKE function. RegEx gets a bad reputation. It is relatively unknown outside of the tech industry, and inside of it, it is mostly known as a pain to read. But RegEx provides users with much more powerful metacharacters. Instead of multiple general-purpose wildcard characters, regex has only one: “.”. The period is used to represent any single character, much like the underscore in SQL. RegEx brackets work similarly to SQL wildcard brackets. RegEx also has more specific wildcards, as well as negation. The more specific whitespace characters have capitalized opposites. For example, \s represents a whitespace character, but \S represents a non-whitespace character.

Symbol

Description

Example

Example Matches

.

Period

Matches any character

S.napse

Synapse, S3napse, S~napse, S napse,

^

Negation

Matches any character except the following, used inside brackets

S^tudent

Spudent, Srudent, Squdent

\s

Whitespace

Represents a single whitespace character

Pattern\sRecognition

Pattern Recognition

\d

Digit

Represents a single digit character

GPT\d

GPT3, GPT4

\w

Word

Represents a single word character (alphanumeric or underscore)

Cl\wud

Cloud, Cl0ud, Cl_ud

TABLE 2

This seems limiting, but RegEx allows users to adjust the repetition of a character using quantifiers.

Symbol

Description

Example

Example Matches

*

Asterisk

Matches 0 or more times

D.*TA

DaaaaaaaaTA, DTA, D8d s4gTA

+

Plus

Matches 1 or more times

Engin.+r

EngineEr, Enginer, Engin84 r

?

Question Mark

Matches 0 or 1 time

Reg.?x

RegEx, RegX, Reg X

{n}

Curly braces

Matches exactly n times

Pa.{2}ern

Pattern, Pa 1ern

{n,m}

Range

Matches between n and m times (inclusive)

S.{2,5}L

SQQL, SWELL, SQQ QQL

{n,}

At least

Matches at least n times

(inclusive)

Quanti.{1,}ier

Quantifier, Quanti5f5ier

TABLE 3

RegEx allows users another degree of specificity in their patterns; Anchors. Anchors offer the ability to match not only symbols but the position of those symbols within context.

Symbol

Description

Example and example string

Example Matches

^

Beginning of Line Caret

Following string must occur at the beginning of line. Not used inside brackets.

^.{3}

“abc def ghi”

abc

$

End of Line

Following string must occur at the end of line. Not used inside brackets.

$.{3}

“abc def ghi”

ghi

\b

Boundary

Following string must occur at a word boundary, meaning beginning or end of a word.

\b.

“Wildcard”

“W”, “d”

\B

Non-boundary

The opposite of \b

\B.*

“Wildcard”

“ildcar”

TABLE 4            

RegEx includes many more special characters than can be listed here, creating a powerful searching function for both extracting information from and adding context to strings. But how can it be used? For a simple task on a single string, we should use the built in Regex library for python. This library is imported as re. This library comes with four main functions, listed below.

Function

Description

search

Searches the string and returns a match object which includes the match position and the matching pattern. Returns NoneType if no match is found.

findall

Searches the string and returns all non-overlapping matches of the RegEx pattern supplied. An empty list is returned if no match is found.

split

Like Python’s split() function, this splits a string into a list of its substrings, using a RegEx pattern as a delimiter. Returns list containing full string if no match is found.

sub

Like Python’s replace() function, this function replaces all instances of a pattern-matched substring. Returns unchanged string if no match is found.

TABLE 5

Python’s re library works nicely will single strings, but PySpark is needed to perform RegEx functions on a database. For this, we must import pyspark.sql.SparkSession and pyspark.sql.functions. RegEx can be performed through SQL on PySpark databases in a SparkSession using a RLIKE function. RegEx can also be used as a filter condition through rlike(). RLIKE in SQL and rlike() and pyspark.sql.functions.regexp_extract() in PySpark will only return the leftmost match. Unlike re.findall, this function allows for pattern grouping and indexing. In this way, regexp_extract() can return only what comes before or after a given substring. The indexing is 1-based. For example, if you wanted only the substring before the first word that started with case-insensitive ‘b’, one could query a dataframe column that included the string "The quick brown fox jumps over the lazy dog" like so:

              df.withColumn(“match”, regexp_replace(col(“mycol”, “(.*)(\s[b|B])(.*)”, 1 )

This would return “The quick." The first and third groups (.*) represent 0 or more of any character. The second group, (\s[b|B]), represents a whitespace character followed by a ‘b’ or a ‘B’.

PySpark requires a workaround to repeat the functionality of re.findall() and return a list of multiple matching substrings. PySpark supports pandas, which has a findall() method that returns a pandas Series of lists containing the matching substrings. This is imported as pyspark.pandas.Series.str.findall. Pyspark’s regexp_replace() function acts as though performing re.sub() on an entire column, replacing all matching substrings with a provided alternative.

Function

Description

SQL: RLIKE

Returns a Boolean if a match is found. Functionally, similar to re.search().

PySpark: rlike()

regexp_extract()

Returns the leftmost substring matching the pattern. Allows for grouping substrings by pattern and returning only a specific index. The indexing is 1-based.

regexp_replace()

Like re.sub(), replaces all matching substrings.

pyspark.pandas.Series.str.findall()

Returns a pandas series including a list of all matches in a given column.

TABLE 6

If you are interested in learning more or implementing pattern recognition in your company’s database managementlet us know!

 

Sources:

https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/

https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference 

https://docs.python.org/3/library/re.html#regular-expression-objects

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.regexp_extract.html