MrClean

Mr. Clean swept his way into American homes in 1958 after Proctor and Gamble bought the rights from its creator, Linwood Burton. Interestingly, Regular Expressions were first described in the 50’s by Stephen Cole Kleene. Coincidence? I think not.

However, today’s post is not about household cleaning products, nor is it a history lesson in Computer Science. It is about scrubbing data in VBA using Regular Expressions (RegExp).

Substitute ()

I see many folks using multiple nested levels of the Substitute() function to try to clean their data. This can be time consuming and can lead to some inaccuracies. Instead, we can use Regular Expressions (RegExp) to increase efficiency and accuracy as well as handle complex strings and large data sets quickly and efficiently.

User Defined Functions

I also see folks create User Defined Functions (UDF’s) to manipulate strings. Here is a nice example the other day from Doug Jenkins over at Newton Excel Bach. (btw, If you are not following Doug’s blog, you should be) But is there a better way? Enter Regular Expressions (RegExp).

Regular Expressions (RegExp)

A Regular Expression is a sequence of characters that create a pattern. The sequence could be something complicated like <([A-Z][A-Z0-9]*)\b[^>]*>(.*?) to something simple like \d . Regular Expressions are very useful in VBA for working with many different scenarios of strings and introducing automation for transforming your data before loading to target databases for OLAP such as Essbase, Power Pivot or SSAS.

Regular Expression – Methods

Regular Expressions in VBA offers 3 Methods:

  1. Test
  2. Replace
  3. Execute

For today, I will focus on the Execute Method and look at the other Methods in future posts.

An example

Here’s a silly nonsensical string, 12kj$%23fg^&*34950…345. Let’s say I’m only interested in the numbers in the string so I want to return 122334950345. To return just the numbers from the string, I’ll use the pattern [0-9].

Option Explicit

Sub RegExFoo()
    
    'Author: Winston Snyder
    'Date: 12/11/2013
    'Extract a pattern of interest from an input string

    'Declare variables
        Dim wb As Workbook
        Dim ws As Worksheet
        Dim RE As Object
        Dim Match As Object
        Dim Matches As Object
        Dim strNumber As String
        Dim i As Long
    
    'Initialize variables
        Set wb = ThisWorkbook
        Set ws = wb.Worksheets("Sheet1")
        Set RE = CreateObject("VBScript.RegExp")
    
    'Criteria for Regular Expression
        With RE
            .Pattern = "[0-9]"
            .Global = True
            Set Matches = .Execute(ws.Range("A1").Value)
        End With
        
    'Loop Matches collection to build string of all numbers in the sample string
        strNumber = ""
        For i = 0 To Matches.Count - 1
            strNumber = strNumber + Matches(i)
        Next i
        
    'Output
        MsgBox strNumber
    
    'Tidy up
        'Destroy objects
            Set Matches = Nothing
            Set RE = Nothing
            Set ws = Nothing
            Set wb = Nothing
    
End Sub

Output : 122334950345

Great! The code returned exactly what I was looking for.

Case Study: RegExp

The sample is a bit obscure and seems somewhat unlikely, but who knows – might be just the thing someone is looking for. Let’s take a look as something a bit more realistic.

Let’s say we receive a load file from Financial Analysis and Planning (FP&A) that they would like loaded to a cube as a forecast scenario. For our sample, we’ll use Customer Codes and Total Revenues. We receive the file, but we immediately see that there is a problem.

DataRegEx3

There is no delimiter between the customer code and the total revenue amount. Additionally, both sets of substrings are of varying length meaning we cannot use text to columns without some manual cleanup work. Let’s look at one way we might split these strings using Regular Expressions.

I’m going to:

  1. Read the strings into an array
  2. Loop the array
  3. Split the string into substrings – load the substrings into new arrays
  4. Output the contents of the new arrays to a worksheet

A Quick Note – Early / Late Binding

A discourse on Early / Late Binding is beyond the scope of this post. Suffice to say, I am using Late Binding as demonstrated:

 Dim RegEx                               As Object
'Create a regular epression object
        Set RegEx = GetRegEx
Private Function GetRegEx() As Object

    On Error Resume Next
    Set GetRegEx = CreateObject("VBScript.RegExp")
    
End Function

There are additional links at the bottom of the post for more information on Early / Late Binding.

The Complete Code

You can open a new workbook, launch the Visual Basic Editor (VBE) add a new module and paste the code below into the module. I broke the Subs() and Functions() into separate snippets to improve readability. Alternatively, you can download the workbook, the link is at the bottom of the post.

 Option Explicit
 Sub SplitStringNoDelimiter()
    
    'Author:        Winston Snyder
    'Date:          12/15/2013
    'Purpose:       Split string into text and value components
    'Comments:      No delimiter
    '               Loop array for output
    '---------------------------------------------------------------------------------------------------------------------------------------------
    
    Dim RegEx                               As Object
    Dim wb                                  As Workbook
    Dim wsInput                             As Worksheet
    Dim wsOutput                            As Worksheet
    
    Dim rngInput                            As Range
    Dim rngOutputDescriptions               As Range
    Dim rngOutputValues                     As Range
    
    Dim arrInput()                          As Variant
    Dim arrOutputDescriptions()             As Variant
    Dim arrOutputValues()                   As Variant
    
    Dim i                                   As Long
    Dim lngRowsData                         As Long
    
    Const strPatternDescriptions            As String = "\D+"
    Const strPatternValues                  As String = "\d+(\.\d{1,2})?"
    Const lngColumnDescriptions             As Long = 1
    Const lngColumnValues                   As Long = 2
    
    'Excel enrionment - speed things up
        With Application
            .ScreenUpdating = False
            .DisplayAlerts = False
            .EnableEvents = False
            .Calculation = xlCalculationManual
        End With
    
    'Initialize
        Set wb = ThisWorkbook
        With wb
            Set wsInput = .Worksheets("Input")
            Set wsOutput = .Worksheets("Output")
        End With
        
    'Clear previous output
        wsOutput.UsedRange.ClearContents
        
    'Input range without header
        With wsInput
            lngRowsData = GetRow(ws:=wsInput)
            Set rngInput = .Range(.Cells(2, 1), .Cells(lngRowsData, 1))
        End With
        
    'Transfer input range to input array
        arrInput = rngInput
        
    'Dimension output arrays
        ReDim arrOutputDescriptions(LBound(arrInput) To UBound(arrInput))
        ReDim arrOutputValues(LBound(arrInput) To UBound(arrInput))
        
    'Create a regular epression object
        Set RegEx = GetRegEx
        
    'Loop through each string in the input array
        For i = LBound(arrInput) To UBound(arrInput)
            
            'Pass the string to regular expression function to return the descriptive portion of the string
                arrOutputDescriptions(i) = GetSubString(objRegEx:=RegEx, _
                                                        strString:=CStr(arrInput(i, 1)), _
                                                        strPattern:=strPatternDescriptions)
                                                      
            'Pass the string to regualr expressions functions to return the value portion of the string
                arrOutputValues(i) = GetSubString(objRegEx:=RegEx, _
                                                  strString:=CStr(arrInput(i, 1)), _
                                                  strPattern:=strPatternValues)
        Next i
        
    'Output all elements of each array to an output range
    'Description in Column 1, Values in Column 2
        
        'Descriptions
            Call OutputArray(ws:=wsOutput, _
                             vTmpArray:=arrOutputDescriptions, _
                             lngColumn:=lngColumnDescriptions)
        
        'Values
            Call OutputArray(ws:=wsOutput, _
                             vTmpArray:=arrOutputValues, _
                             lngColumn:=lngColumnValues)
                             
    'Add a header to the data
        With wsOutput
            .Range("A1").EntireRow.Insert shift:=xlDown
            .Cells(1, 1) = "Descriptions"
            .Cells(1, 2) = "Values"
        End With

    'Tidy up
        'Erase arrays
            Erase arrInput
            Erase arrOutputDescriptions
            Erase arrOutputValues
            
        'Destroy objects
            Set RegEx = Nothing
            Set rngInput = Nothing
            Set wsInput = Nothing
            Set wsOutput = Nothing
            Set wb = Nothing
            
        'Restore Excel environment
            With Application
                .ScreenUpdating = True
                .DisplayAlerts = True
                .EnableEvents = True
                .Calculation = xlCalculationAutomatic
            End With
End Sub

 

Private Sub OutputArray(ws As Worksheet, _
                        vTmpArray() As Variant, _
                        lngColumn As Long)

    Dim j As Long
    
    For j = LBound(vTmpArray) To UBound(vTmpArray)
        ws.Cells(j, lngColumn).Value = vTmpArray(j)
    Next j
                      
End Sub

 

Private Function GetRow(ws As Worksheet) As Long

    GetRow = ws.Cells(Rows.Count, 1).End(xlUp).Row

End Function

 

Private Function GetRange(ws As Worksheet, _
                          lngRowsStart As Long, _
                          lngRowsEnd As Long, _
                          lngColumn As Long) As Range

    Dim rng As Range

    With ws
        Set rng = .Range(.Cells(lngRowsStart, lngColumn), .Cells(lngRowsEnd, lngColumn))
    End With
    
    Set GetRange = rng
    
End Function

 

Private Function GetRegEx() As Object

    On Error Resume Next
    Set GetRegEx = CreateObject("VBScript.RegExp")

End Function

 

Private Function GetSubString(objRegEx As Object, _
                              strString As String, _
                              strPattern As String) As String
                              
    Dim reMatches As Object
    Dim strResult As String
    
    strResult = "No Match"
    
    With objRegEx
        .Pattern = strPattern
        .Global = True

        Set reMatches = .Execute(strString)
        
        If reMatches.Count <> 0 Then
            strResult = reMatches.Item(0)
        End If
    End With
    
    GetSubString = strResult

End Function

The Results

REOutput
Great! Exactly what I was looking for.

Tidy up

    Final Thoughts

    Regular Expressions are found in nearly all programming languages and much like Duct Tape, they have a Million and one uses. You are truly only limited by your imagination and ability to concoct the correct patterns. I use Regular Expressions as I Extract data from various Data Silos to Transform and normalize data prior to Loading to target reporting databases. Let us know how you use Regular Expressions in the comments.

    Downloads

    Download the workbook from OneDrive.

    Additional Resources