Transforming Data with Regular Expressions


Did you know TIBCO Spotfire Professional has built in regular expression capabilities that can be used to validate and clean your data either before or after it has been loaded?

Data quality is a big issue with all customers.  Whether the data quality is an issue with customers merging data from different sources that use different standards, or whether the data is coming straight from an instrument or a public forum, like a survey, the data needs to be 'standardized' before it is analyzed in TIBCO Spotfire.

This standardization can be accomplished in TIBCO Spotfire by using the RxReplace function inside an expression (either in the Calculated Column UI, or the Custom Expression UI).

RxReplace is a Regular Expression Replace function which allows you to search for a pattern in your data, and then replace it with another pattern.  For example, search for all dates (in the form dd/mm/yyyy) and replace them with the pattern mm/dd/yyyy)

We will see this example and a few other common ones below.

Removing New Line Characters

The first example is one that happens all the time in training. We allow people to load their own data and they will typically load it from Excel.  Many Excel data cells have carriage returns in them , as shown below.



When importing this into Spotfire it imports the carriage returns as an ASCII special character, which makes it hard to read (as shown below in both the header of a table visualization and in a properties dialog).



 

If the columns are string based, it will not affect the analysis too much, but it can be an annoyance to look at in general. To fix it, when loading data, we can apply a 'replace column' transformation.

We can use the RxReplace function, passing in the original column, searching for the carriage return (identified as  '\n'), and replacing it with an empty string.

 

This will input the values in an easier to view format.



Converting Date Formats

Assume you are loading data from Europe that will be merged with data from US. The date formats are different. One is DD/MM/YYYY and the other is MM/DD/YYYY. What do you do? Assuming your data is in the format show below, DD/MM/YYYY, you can use the RxReplace method shown below to transform it to MM/DD/YYYY.



Without getting into too many details on the Regular Expression syntax, the pattern we are looking for is a string that starts with 2 digits (note that the \ character needs to be escaped in Spotfire's expression language) following by a / character, then 2 more digits, followed by another / character, and then 4 digits at the end of the string. When found, it will be replaced with the month ($2), followed by the day ($1), and then the year ($3), all separated by the / character.

And the result is just what we want. Below you can see the original column and then the updated column after the expression has been executed.

 



         
Converting Name Formats

A third example would be one I got from Brian Prather.  Assume you are in a situation where one data table had a 'Names' column that was formatted as <Last name, First name> and a second data table had a 'Names' column that was formatted as <First name Last name>. If you need to merge these into Spotfire, you will need to perform a regular expression transformation on one of them to standardize on one format.

Assuming we want to format into <First name Last name>, we could use the following expression using the RxReplace method:


 

In this expression, we want to search for a string that starts with any amount of characters followed by a comma and then any amount of characters. This will match any names in the format <last name, first name>.  When found, it will be replaced with the last name ($2) followed by the first name ($1).  This is since grouping was used inside the pattern.


And the result is just what we want. Below you can see the original column and then the updated column after the expression has been executed.




This tip is not meant to be a tip teaching regular expression, you can spend months on that. There are plenty of good resources on the internet to help you out. This tip was meant to show you the power of using them to transform, clean, and standardize your data either before you load it in Spotfire, either as a new column in the data table, or as a replacement for an original column in the data table (as shown below).

 



If you are intersted in learning more about regular expressions and other types of data transformations, please have a look at our Data Cleaning and Transformation techniques TIBCO Spotfire Professional webinar.

Have any other cool Regular expressions to share? Reply to this post with them.