Adding Categories to your Data


TIBCO Spotfire Professional provides a variety of ways to add categories into your data above those that are included in the raw data table. Four methods we will discuss are Columns from Calculations, Tagging, Columns from Binning, and Filter Out/To.

Tagging and Calculated Columns can both be used to create additional categories in your data. For example, I can create a tag to place rows into a new category with 3 values; low, normal, and high. The benefit of this is I now have a new filter and column created inside my data table that I can use to color by, size by, and perform other operations on.

In the example below, we create a new category in our data table called 'Utilization Type' by tagging.

 




 

Similarly, I can create either a calculated column or custom expression to do the same using either an if statement or a case statement (we will see an example in a bit).

Which one should I use? Well it depends on your data and what your purpose is.

Calculated Columns

If you plan on replacing data, and you can easily define your 'categories' using an algorithm, then I would choose to use a calculated column or a custom expression. Calculated columns and expressions both get re-calculated by default when the original data table is replaced automatically. This makes this option more efficient, so that you do not have to repeat performing the same tagging sequence.

The following is an example of an expression created to do the same things as the tagging above:

case when ([Utilization Type])=("Vacation") then "Vacation" when ([Utilization Type])=("Production Task") then "Production Task" when ([Utilization Type])=("Self Dev Task") then "Self Dev Task" when ([Utilization Type])=("Travel") then "Travel" when ([Utilization Type])=("Support") then "Support" when ([Utilization Type])=("Holiday") then "Holiday" else "Teaching" end

NOTE: You will need to remember if you add more values in the category, that you will need to update your expression, otherwise it will be incorrect and new values will be added to the 'Teaching' value as that is after the else statement in the expression.

Tagging

Tagging is commonly used when you need to place your data into additional categories, that may not be easily computed using an algorithm. For example, you may be running through your analysis of stock data and you want to tag records as 'buy, 'hold', and 'sell'. Unlike the Utilization Type, your analysis for putting records into 'buy', 'sell', and 'hold' is more complex and involves some ad-hoc analysis, so tagging is the most efficient method. Since this analysis is used in more of an ad-hoc fashion, the tags are saved with the analysis but not automatically re-applied when new data is replaced or reloaded.

You can still enable them to work when you reload data, but you need to specify a key columns to uniquely identify the records. This topic is described at the following URL:

http://stn.spotfire.com/spotfire_client_help/data/data_details_on_select_key_columns.htm

Binned Columns

Similar to Calculated Columns, you can also define new categories by using the New Column by Binning dialog. (available from Insert > Binned Column'). From this dialog, you can add a new column that places your data into predefined bins (categories), using a variety of methods.



Methods include bins by specific limits, even intervals, even distribution of unique values, standard deviations, and substrings. The type of methods available are dependent on the data type of the source column.

This capability is typically used when you have a specific algorithm you wish to apply to create your categories, but it is more advanced than simple if or case statements as shown in the earlier calculated column example. It is also very useful when you want to take continuous data and convert it to categorical data to perform further analysis.

NOTE: You can still create bins using the expression language in a calculated column (see image below), but the Insert Binned Column dialog is an easier to use interface rather than entering the code manually.

 

Filter to and Filter out

Finally, if you are trying to create a column that contains only two categories, it is possible to do this by first marking the records of interest, and then right-clicking , and selecting 'Marked Rows > Filter To' or 'Marked Rows > Filter Out'.


Both options will create a new Filter. The first one will filter out every row except the ones marked, and the second one will filter out the marked rows only.




This functionality can be used in a variety of ad-hoc analysis techniques, including removing outliers or other data points to see how the rest of the analysis changes. It can also be used, as we mentioned earlier, to create a column with 2 categories. The default name of the filter can be changed to something more descriptive by right-clicking in the filter and selecting 'rename' or by going to 'Edit > Column Properties'.

What to extend it even more?

While the new columns are stored with the analysis file, you may want to persist this information outside the file so that it can be leveraged in other applications and other analysis files. One such solution is to use a combination of Information Services and our SDK to create a customized tool that will write back the new columns into the source database. Information Services can be setup on the Spotfire Server to run stored procedures, so you can setup a stored procedure which accepts values and performs an insert into the database. You would then need to use the SDK to create a tool , which the user can launch to specify the column(s) they wish to write back. The tool would grab the columns, and pass them into the stored procedure as input parameters. Alternately , you could use a Script Control inside a specific Spotfire file rather than writing a Custom Tool.

Summary

To summarize, there are a variety of techniques for creating additional categories in your data table. While many of them can be interchanged, each has its own strength and weakness depending on the scope of the analysis (a quick ad hoc analysis or a persistent guided analysis), the number of categories required, and the algorithms used to create the categories.

Any additional thoughts or information on creating categories, reply to this post with your thoughts.