Implement Pattern Recognition in Your Database Management

By Eliza Schuh on April 24, 2024

For most people, the use of pattern matching starts and ends with the Find tool, better known as Control-F on Windows and Linux and Command-F on Mac. The Find tool matches exact patterns and words on a webpage or PDF, showing the user where and how many times a word occurs. However, that kind of exact pattern recognition has severe limitations. By introducing abstractions, pattern recognition transforms from a timesaver into a powerful tool for data engineering. Using SQL, Python 3.11, and PySpark, we will demonstrate how to match and mutate strings and databases.

Symbol	Name	Description	Example Wildcard	Example Matches
%	Percent	Represents zero or more characters	S%	S, SpyglassMTG, Spyg1a$$
_	Underscore	Represents a single character	M_G	MTG, M9G, M#G
[]	Brackets	Represents a single character from the set enclosed within the brackets	[AbCZ6]zure	Azure, bzure, 6zure
-	Dash	Used within brackets. Represents a range of single characters potentially represented by the brackets. Both ends are inclusive.	P[x-z]Spark	PxSpark, PySpark, PzSpark
-	Dash	(as above)	Microsoft[1-3]65	Microsoft165, Microsoft265, Microsoft365

TABLE 1

With a single exact pattern match, there’s no way to select all emails from Figure 1 that end with “com”. In order to perform this abstraction, we must employ metacharacters, non-literal character types that have a special meaning to the interpreter. Luckily, SQL includes a type of metacharacter called a wildcard (Table 1), a special character that can act as an abstract stand-in for another character. With SQL wildcards, users get a first level of abstraction. Instead of listing each exact value in the filter, such as “WHERE email = ‘kateblack@student.com’ OR email = ‘briannawhitman@student.com’”, the searcher can now write “WHERE email LIKE ‘%student.com’” (Figure 2). In addition, wildcards can be combined to express multiple search criteria. For example, “WHERE email LIKE ‘a%co_’” will select only rows with emails that begin with the letter ‘a’ and end in ‘co_’, with ‘_’ representing any single character (Figure 3).

Example Database

johnsmith@student.net

andrewwilliams@student.com

annejones@student.gov

avajameson@student.cow

kateblack@student.com

briannawhitman@student.com

FIGURE 1

‘%student.com’

andrewwilliams@student.com

kateblack@student.com

briannawhitman@student.com

FIGURE 2

‘a%co_’

andrewwilliams@student.com

avajameson@student.cow

FIGURE 3
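
The two wildcard queries above can be tried out with a minimal sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in database; the table name students and the helper like_matches are hypothetical names chosen for illustration. SQLite supports the % and _ wildcards but not the bracket wildcards from Table 1, so only the first two are shown.

```python
import sqlite3

# Emails from Figure 1. The table name "students" is a hypothetical stand-in.
EMAILS = [
    "johnsmith@student.net",
    "andrewwilliams@student.com",
    "annejones@student.gov",
    "avajameson@student.cow",
    "kateblack@student.com",
    "briannawhitman@student.com",
]


def like_matches(pattern):
    """Return the emails matching a SQL LIKE pattern, in insertion order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE students (email TEXT)")
        conn.executemany(
            "INSERT INTO students VALUES (?)", [(e,) for e in EMAILS]
        )
        rows = conn.execute(
            "SELECT email FROM students WHERE email LIKE ? ORDER BY rowid",
            (pattern,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


print(like_matches("%student.com"))  # the rows shown in Figure 2
print(like_matches("a%co_"))         # the rows shown in Figure 3
```

Passing the pattern as a bound parameter (the ? placeholder) rather than pasting it into the SQL text keeps the sketch safe to reuse with user-supplied patterns.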

Alas, even SQL wildcards have their limits. Several SQL dialects, including Spark SQL and MySQL, also support Regular Expressions, or RegEx, through the RLIKE operator. RegEx gets a bad reputation. It is relatively unknown outside of the tech industry, and inside of it, it is mostly known as a pain to read. But RegEx provides users with much more powerful metacharacters. Instead of multiple general-purpose wildcard characters, RegEx has only one: “.”. The period is used to represent any single character, much like the underscore in SQL. RegEx brackets work similarly to SQL wildcard brackets. RegEx also has more specific character classes, as well as negation. The more specific character classes have capitalized opposites. For example, \s represents a whitespace character, but \S represents a non-whitespace character.

Symbol	Name	Description	Example	Example Matches
.	Period	Matches any single character (by default, any character except a newline)	S.napse	Synapse, S3napse, S~napse, S napse
^	Negation	Used as the first character inside brackets. Matches any single character except those listed in the brackets.	S[^t]udent	Spudent, Srudent, Squdent
\s	Whitespace	Represents a single whitespace character	Pattern\sRecognition	Pattern Recognition
\d	Digit	Represents a single digit character	GPT\d	GPT3, GPT4
\w	Word	Represents a single word character (alphanumeric or underscore)	Cl\wud	Cloud, Cl0ud, Cl_ud

TABLE 2
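
As a quick check of the character classes in Table 2, the sketch below runs each pattern against its example strings with Python's re.fullmatch. The negation is written inside brackets as [^t], and the helper name matches_fully is an assumption made for illustration.

```python
import re

# Each pattern from Table 2 paired with strings it should match in full.
EXAMPLES = {
    r"S.napse": ["Synapse", "S3napse", "S~napse", "S napse"],
    r"S[^t]udent": ["Spudent", "Srudent", "Squdent"],
    r"Pattern\sRecognition": ["Pattern Recognition"],
    r"GPT\d": ["GPT3", "GPT4"],
    r"Cl\wud": ["Cloud", "Cl0ud", "Cl_ud"],
}


def matches_fully(pattern, text):
    """True when the whole string, not just a part of it, fits the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, texts in EXAMPLES.items():
    for text in texts:
        print(pattern, repr(text), matches_fully(pattern, text))

# Negation excludes the listed character, and \S is the opposite of \s.
print(matches_fully(r"S[^t]udent", "Student"))
print(matches_fully(r"Pattern\SRecognition", "Pattern Recognition"))
```

The last two calls print False: “Student” contains the excluded ‘t’, and the space in “Pattern Recognition” is not a non-whitespace character.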

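Matching one character at a time seems limiting, but RegEx allows users to adjust how many times a character or character class repeats using quantifiers. A quantifier applies to the element immediately before it.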

Symbol	Name	Description	Example	Example Matches
*	Asterisk	Matches 0 or more times	D.*TA	DaaaaaaaaTA, DTA, D8d s4gTA
+	Plus	Matches 1 or more times	Engin.+r	EngineEr, Enginer, Engin84 r
?	Question Mark	Matches 0 or 1 time	Reg.?x	RegEx, Regx, Reg x
{n}	Curly braces	Matches exactly n times	Pa.{2}ern	Pattern, Pa 1ern
{n,m}	Range	Matches between n and m times (inclusive)	S.{2,5}L	SQQL, SWELL, SQQ QQL
{n,}	At least	Matches at least n times (inclusive)	Quanti.{1,}ier	Quantifier, Quanti5f5ier

TABLE 3
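
The quantifiers in Table 3 can be checked the same way. This is a minimal sketch using Python's re module; each quantifier is attached to the period it repeats (for example D.*TA), and the dictionary names are hypothetical.

```python
import re

# Patterns from Table 3 with strings that should match in full.
SHOULD_MATCH = {
    r"D.*TA": ["DaaaaaaaaTA", "DTA", "D8d s4gTA"],
    r"Engin.+r": ["EngineEr", "Enginer", "Engin84 r"],
    r"Reg.?x": ["RegEx", "Regx", "Reg x"],
    r"Pa.{2}ern": ["Pattern", "Pa 1ern"],
    r"S.{2,5}L": ["SQQL", "SWELL", "SQQ QQL"],
    r"Quanti.{1,}ier": ["Quantifier", "Quanti5f5ier"],
}

# Near misses: too few repetitions, or the wrong letter case.
SHOULD_NOT_MATCH = {
    r"Engin.+r": ["Enginr"],     # + needs at least one character
    r"Reg.?x": ["RegX"],         # patterns are case-sensitive by default
    r"Pa.{2}ern": ["Patern"],    # {2} needs exactly two characters
    r"S.{2,5}L": ["SQL"],        # only one character between S and L
}

for pattern, texts in SHOULD_MATCH.items():
    for text in texts:
        print(pattern, repr(text), bool(re.fullmatch(pattern, text)))

for pattern, texts in SHOULD_NOT_MATCH.items():
    for text in texts:
        print(pattern, repr(text), bool(re.fullmatch(pattern, text)))
```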

RegEx allows users another degree of specificity in their patterns: anchors. Anchors offer the ability to match not only symbols but also the position of those symbols within their context.

Symbol	Name	Description	Example	Example String	Example Matches
^	Beginning of Line Caret	The pattern that follows must occur at the beginning of the line. This meaning applies only outside brackets.	^.{3}	“abc def ghi”	abc
$	End of Line	The pattern that precedes it must occur at the end of the line. Not used inside brackets.	.{3}$	“abc def ghi”	ghi
\b	Boundary	The adjacent pattern must occur at a word boundary, meaning the beginning or end of a word.	\b. and .\b	“Wildcard”	“W” and “d”, respectively
\B	Non-boundary	The opposite of \b	\B.*\B	“Wildcard”	“ildcar”

TABLE 4
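
A short sketch of the anchors in Table 4, using Python's re module on the same example strings. Note that the end-of-line anchor comes after the pattern it constrains (.{3}$), and that \b before the period finds the first letter of a word while \b after it finds the last. The variable names are chosen for illustration only.

```python
import re

line = "abc def ghi"
word = "Wildcard"

# ^ pins the match to the start of the line, $ to the end.
first_three = re.search(r"^.{3}", line).group()
last_three = re.search(r".{3}$", line).group()

# \b marks the edge of a word; its position in the pattern matters.
word_start = re.findall(r"\b.", word)   # character right after a boundary
word_end = re.findall(r".\b", word)     # character right before a boundary
initials = re.findall(r"\b\w", line)    # first letter of every word

# \B requires that both ends of the match sit inside the word.
inner = re.search(r"\B.*\B", word).group()

print(first_three, last_three)
print(word_start, word_end, initials)
print(inner)
```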

RegEx includes many more special characters than can be listed here, creating a powerful searching function for both extracting information from and adding context to strings. But how can it be used? For a simple task on a single string, we should use Python’s built-in RegEx library, which is imported as re. This library comes with four main functions, listed below.

Function	Description
search	Searches the string and returns a match object for the first match, which includes the match position and the matching text. Returns None if no match is found.
findall	Searches the string and returns a list of all non-overlapping matches of the RegEx pattern supplied. An empty list is returned if no match is found.
split	Like Python’s str.split() method, this splits a string into a list of its substrings, using a RegEx pattern as a delimiter. Returns a list containing the full string if no match is found.
sub	Like Python’s str.replace() method, this function replaces all instances of a pattern-matched substring. Returns the unchanged string if no match is found.

TABLE 5
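
The four functions in Table 5 can be sketched on a few sample strings as follows. The sample strings and variable names are illustrative assumptions. Each function is shown once with a match and once without, to confirm the no-match behavior described in the table.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog"

# search: first match only, with its position; None when nothing matches.
found = re.search(r"\s[bB]\w*", sentence)
missing = re.search(r"\d", sentence)

# findall: every non-overlapping match; an empty list when nothing matches.
t_words = re.findall(r"\b[tT]\w*", sentence)
no_digits = re.findall(r"\d", sentence)

# split: the pattern acts as the delimiter.
parts = re.split(r"\s+", "one  two\tthree")
unsplit = re.split(r"\d", "abc")

# sub: replace every match; the string is unchanged when nothing matches.
masked = re.sub(r"[aeiou]", "*", "regex")
untouched = re.sub(r"\d", "#", "regex")

print(found.span(), repr(found.group()), missing)
print(t_words, no_digits)
print(parts, unsplit)
print(masked, untouched)
```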

Python’s re library works nicely with single strings, but PySpark is needed to perform RegEx functions on a database. For this, we must import SparkSession from pyspark.sql, along with the functions in pyspark.sql.functions. RegEx can be used through SQL on PySpark DataFrames in a SparkSession with the RLIKE operator, or as a filter condition through the column method rlike(). Both return only a Boolean indicating whether a match exists. To pull out the matching text, use pyspark.sql.functions.regexp_extract(), which returns only the leftmost match. Unlike re.findall, it returns a single capture group chosen by index rather than a list of every match. In this way, regexp_extract() can return only what comes before or after a given substring. Groups are numbered from 1, and index 0 returns the entire match. For example, if you wanted only the substring before the first word that starts with a case-insensitive ‘b’, you could query a DataFrame column that included the string "The quick brown fox jumps over the lazy dog" like so:

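from pyspark.sql.functions import col, regexp_extract
df.withColumn("match", regexp_extract(col("mycol"), r"(.*?)(\s[bB])(.*)", 1))

This would return “The quick”. The first group, (.*?), lazily matches as few characters as possible, so it stops just before the first word that qualifies. The second group, (\s[bB]), represents a whitespace character followed by a ‘b’ or a ‘B’. No “|” is needed inside brackets; [b|B] would also match a literal “|”. The third group, (.*), represents 0 or more of any character.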
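
The grouping logic can be tried on a single string without a Spark cluster. The sketch below uses Python's re module to imitate the behavior described for regexp_extract(); the helper extract_group is a hypothetical name, and returning an empty string on no match is an assumption based on Spark's documented behavior. Python and Spark (Java) regular expressions differ in some details, so treat this as an approximation rather than a drop-in equivalent.

```python
import re


def extract_group(text, pattern, idx):
    """Return capture group `idx` of the leftmost match, or "" if none.

    Group 0 is the entire match; capture groups are numbered from 1.
    """
    match = re.search(pattern, text)
    if match is None:
        return ""
    return match.group(idx) or ""


sentence = "The quick brown fox jumps over the lazy dog"
lazy = r"(.*?)(\s[bB])(.*)"
greedy = r"(.*)(\s[bB])(.*)"

print(repr(extract_group(sentence, lazy, 1)))   # text before the b-word
print(repr(extract_group(sentence, lazy, 3)))   # text after the 'b'
print(repr(extract_group(sentence, lazy, 0)))   # the entire match

# With several b-words, lazy stops at the first and greedy at the last.
bears = "The big brown bear"
print(repr(extract_group(bears, lazy, 1)))
print(repr(extract_group(bears, greedy, 1)))
```

On the example sentence both patterns return the same first group, because only one word starts with ‘b’. The difference only appears on strings such as “The big brown bear”.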

PySpark requires a workaround to repeat the functionality of re.findall() and return a list of multiple matching substrings. PySpark includes a pandas API, pyspark.pandas, whose Series objects have a str.findall() method that returns a Series of lists containing the matching substrings. It is available as pyspark.pandas.Series.str.findall. PySpark’s regexp_replace() function acts as though performing re.sub() on an entire column, replacing all matching substrings with a provided alternative.

Function	Description
SQL: RLIKE; PySpark: rlike()	Returns a Boolean indicating whether a match is found. Functionally similar to checking whether re.search() finds a match.
regexp_extract()	Returns the leftmost substring matching the pattern. Allows for grouping substrings by pattern and returning only a specific group. Groups are numbered from 1, and index 0 returns the entire match.
regexp_replace()	Like re.sub(), replaces all matching substrings.
pyspark.pandas.Series.str.findall()	Returns a Series in which each element is a list of all matches found in the corresponding value of the column.

TABLE 6

If you are interested in learning more or implementing pattern recognition in your company’s database management, let us know!

Sources:

https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/

https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

https://docs.python.org/3/library/re.html#regular-expression-objects

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.regexp_extract.html

Eliza Schuh

Data Engineer
