For most people, the use of pattern-matching starts and ends with the Find tool, better known as Control-F on Windows and Linux and Command-F on Mac. The Find tool matches exact patterns and words on a webpage or pdf, showing the user where and how many times a word occurs. However, that kind of exact pattern recognition has severe limitations. By introducing abstractions, pattern recognition transforms from a timesaver into a powerful tool for data engineering. Using SQL, Python 3.11, and PySpark, we will demonstrate how to match and mutate strings and databases.
Symbol |
Description |
Example Wildcard |
Example Matches |
|
% |
Percent |
Represents zero or more characters |
S% |
S, SpyglassMTG, Spyg1a$$ |
_ |
Underscore |
Represents a single character |
M_G |
MTG, M9G, M#G |
[] |
Brackets |
Represents a single character enclosed within the brackets |
[AbCZ6]zure |
Azure, bzure, 6zure |
- |
Dash |
Used within brackets. Represents a range of single characters potentially represented by the brackets. Both ends are inclusive. |
P[x-z]Spark |
Pxspark, PySpark, PzSpark |
Microsoft[1-3]65 |
Microsoft165, Microsoft265, Microsoft365 |
TABLE 1
With a single exact pattern match, there’s no way to select all emails from Figure 1 that end with “com.” In order to perform this abstraction, we must employ metacharacters, non-literal character types that have a special meaning to the interpreter. Luckily, SQL includes a type of metacharacter called a wildcard (Table 1), a special character that can act as an abstract stand-in for another character. With SQL wildcards, users get a first level of abstraction. Instead of asking for each exact phrase, such as “SELECT * WHERE email = ‘kateblack@student.com’ OR ‘briannawhitman@student.com’”, now the searcher can ask for “SELECT * WHERE email LIKE %student.com”. In addition, wildcards can allow for multiple search criteria. For example, SELECT * WHERE email LIKE ‘a%co_’ will select only rows with emails that begin with the letter ‘a’ and end in ‘co_’ with ‘_’ representing any single character.
Example Database |
johnsmith@student.net |
andrewwilliams@student.com |
annejones@student.gov |
avajameson@student.cow |
kateblack@student.com |
briannawhitman@student.com |
FIGURE 1
‘%student.com’ |
andrewwilliams@student.com |
kateblack@student.com |
briannawhitman@student.com |
FIGURE 2
‘a%co_’ |
andrewwilliams@student.com |
avajameson@student.cow |
FIGURE 3
Alas, even SQL wildcards have their limits. SQL also supports Regular Expressions, or RegEx, using the RLIKE function. RegEx gets a bad reputation. It is relatively unknown outside of the tech industry, and inside of it, it is mostly known as a pain to read. But RegEx provides users with much more powerful metacharacters. Instead of multiple general-purpose wildcard characters, regex has only one: “.”. The period is used to represent any single character, much like the underscore in SQL. RegEx brackets work similarly to SQL wildcard brackets. RegEx also has more specific wildcards, as well as negation. The more specific whitespace characters have capitalized opposites. For example, \s represents a whitespace character, but \S represents a non-whitespace character.
Symbol |
Description |
Example |
Example Matches |
|
. |
Period |
Matches any character |
S.napse |
Synapse, S3napse, S~napse, S napse, |
^ |
Negation |
Matches any character except the following, used inside brackets |
S^tudent |
Spudent, Srudent, Squdent |
\s |
Whitespace |
Represents a single whitespace character |
Pattern\sRecognition |
Pattern Recognition |
\d |
Digit |
Represents a single digit character |
GPT\d |
GPT3, GPT4 |
\w |
Word |
Represents a single word character (alphanumeric or underscore) |
Cl\wud |
Cloud, Cl0ud, Cl_ud |
TABLE 2
This seems limiting, but RegEx allows users to adjust the repetition of a character using quantifiers.
Symbol |
Description |
Example |
Example Matches |
|
* |
Asterisk |
Matches 0 or more times |
D.*TA |
DaaaaaaaaTA, DTA, D8d s4gTA |
+ |
Plus |
Matches 1 or more times |
Engin.+r |
EngineEr, Enginer, Engin84 r |
? |
Question Mark |
Matches 0 or 1 time |
Reg.?x |
RegEx, RegX, Reg X |
{n} |
Curly braces |
Matches exactly n times |
Pa.{2}ern |
Pattern, Pa 1ern |
{n,m} |
Range |
Matches between n and m times (inclusive) |
S.{2,5}L |
SQQL, SWELL, SQQ QQL |
{n,} |
At least |
Matches at least n times (inclusive) |
Quanti.{1,}ier |
Quantifier, Quanti5f5ier |
TABLE 3
RegEx allows users another degree of specificity in their patterns; Anchors. Anchors offer the ability to match not only symbols but the position of those symbols within context.
Symbol |
Description |
Example and example string |
Example Matches |
||
^ |
Beginning of Line Caret |
Following string must occur at the beginning of line. Not used inside brackets. |
^.{3} |
“abc def ghi” |
abc |
$ |
End of Line |
Following string must occur at the end of line. Not used inside brackets. |
$.{3} |
“abc def ghi” |
ghi |
\b |
Boundary |
Following string must occur at a word boundary, meaning beginning or end of a word. |
\b. |
“Wildcard” |
“W”, “d” |
\B |
Non-boundary |
The opposite of \b |
\B.* |
“Wildcard” |
“ildcar” |
TABLE 4
RegEx includes many more special characters than can be listed here, creating a powerful searching function for both extracting information from and adding context to strings. But how can it be used? For a simple task on a single string, we should use the built in Regex library for python. This library is imported as re. This library comes with four main functions, listed below.
Function |
Description |
search |
Searches the string and returns a match object which includes the match position and the matching pattern. Returns NoneType if no match is found. |
findall |
Searches the string and returns all non-overlapping matches of the RegEx pattern supplied. An empty list is returned if no match is found. |
split |
Like Python’s split() function, this splits a string into a list of its substrings, using a RegEx pattern as a delimiter. Returns list containing full string if no match is found. |
sub |
Like Python’s replace() function, this function replaces all instances of a pattern-matched substring. Returns unchanged string if no match is found. |
TABLE 5
Python’s re library works nicely will single strings, but PySpark is needed to perform RegEx functions on a database. For this, we must import pyspark.sql.SparkSession and pyspark.sql.functions. RegEx can be performed through SQL on PySpark databases in a SparkSession using a RLIKE function. RegEx can also be used as a filter condition through rlike(). RLIKE in SQL and rlike() and pyspark.sql.functions.regexp_extract() in PySpark will only return the leftmost match. Unlike re.findall, this function allows for pattern grouping and indexing. In this way, regexp_extract() can return only what comes before or after a given substring. The indexing is 1-based. For example, if you wanted only the substring before the first word that started with case-insensitive ‘b’, one could query a dataframe column that included the string "The quick brown fox jumps over the lazy dog" like so:
df.withColumn(“match”, regexp_replace(col(“mycol”, “(.*)(\s[b|B])(.*)”, 1 )
This would return “The quick." The first and third groups (.*) represent 0 or more of any character. The second group, (\s[b|B]), represents a whitespace character followed by a ‘b’ or a ‘B’.
PySpark requires a workaround to repeat the functionality of re.findall() and return a list of multiple matching substrings. PySpark supports pandas, which has a findall() method that returns a pandas Series of lists containing the matching substrings. This is imported as pyspark.pandas.Series.str.findall. Pyspark’s regexp_replace() function acts as though performing re.sub() on an entire column, replacing all matching substrings with a provided alternative.
Function |
Description |
SQL: RLIKE |
Returns a Boolean if a match is found. Functionally, similar to re.search(). |
PySpark: rlike() |
|
regexp_extract() |
Returns the leftmost substring matching the pattern. Allows for grouping substrings by pattern and returning only a specific index. The indexing is 1-based. |
regexp_replace() |
Like re.sub(), replaces all matching substrings. |
pyspark.pandas.Series.str.findall() |
Returns a pandas series including a list of all matches in a given column. |
TABLE 6
If you are interested in learning more or implementing pattern recognition in your company’s database management, let us know!
Sources:
https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/
https://docs.python.org/3/library/re.html#regular-expression-objects