Originally published at Code-free Visual Path Analysis: Watch Now.
Marketers need to visually analyze customer paths. IT professionals should be able to visually analyze server logs. Healthcare professionals want to visually analyze treatment paths.
There is no reason any of these tasks should require advanced coding skills.
Check out these demo videos we recently put together for the Teradata Path Analysis Guided Analytics Interface. You’ll see how easy it is to visually explore paths without writing any code. You can export lists of customers (or servers, or patients) who have completed paths or are on specific paths. And you can investigate text associated with events on these paths. All you need to be able to do is specify a few parameters in the interface and click a few buttons.
In this demo, we use the predictive paths capabilities of the Path Analysis Interface to identify two sets of customers. One set of customers is at risk of churn. The other group is prospects we may be able to push across the line to conversion.
In this video, we look at “cart abandonment” scenarios with an online banking data set and an eCommerce data set. Also, we showcase the “Add Drops” feature that makes it visually apparent where prospects and customers drop off paths within the Path Analysis Interface.
The text analytics capabilities of the Path Analysis Interface are unique and powerful. In this demo, we use text to provide context around complaints within a multi-channel banking data set.
Here, we are looking at healthcare billing data. We want to make it apparent that path analysis use cases are about much more than marketing. Healthcare professionals may also want to look at paths to certain procedures, paths around treatment and recoveries, or paths to specific diagnoses.
If you’re interested in visually exploring paths and patterns, please contact your Teradata account executive or send me a note at email@example.com. We can have you up and running with the Teradata Path Analysis Guided Analytics Interface on Teradata, Aster, or the Teradata Analytics Platform in no time!
XGBoost has gotten a lot of attention recently as the algorithm has been very successful in machine learning competitions. We in Aster engineering have been getting a lot of requests to provide this function to our customers. In AA 7.0, we’ve released an XGBoost/Gradient Boosting function.
The techniques of XGBoost can be used to improve the performance of any classifier. Most often, it’s used with decision trees, which is how we’ve built it in Aster.
Decision trees are a supervised learning technique that tries to develop rules (“decisions”) to predict the outcome associated with an observation. Each rule is a binary choice based on the value of a single predictor: the next binary choice depends on the value of that predictor, and so on, until a prediction can be made. The rules can be easily summarized and visualized as a tree, as shown below.
In this tree, the outcome is 0, 1, 2, 3, or 4, where 0 indicates no heart disease, and 1 through 4 represent increasing severity of heart disease. The first “rule” is based on the value of the “Thal” column. If it is anything other than 6 or 7, the predicted outcome is 0. If the value in the Thal column is 6 or 7, the next step is to look at the value in the STDep column. If it is less than 0.7, the next step is to look at the value in the Ca column; if it is greater than or equal to 0.7, the next step depends on the value in the ChestPain column. To make a prediction for an observation, follow the rules down the tree until you reach a leaf node. The number at the leaf node is the predicted result for that observation.
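As a rough illustration of how rules are followed down a tree, here is a minimal Python sketch of only the top two rules described above. The function name and the return format are hypothetical. The branches below the Ca and ChestPain checks are not spelled out in the text, so the sketch stops there rather than guessing them.

```python
def follow_top_rules(obs):
    """Apply the first two rules of the heart-disease tree described above.

    Returns ("leaf", outcome) when a prediction is reached, or
    ("next", column) naming the column the tree would check next.
    The deeper branches are intentionally not modeled.
    """
    # Rule 1: any Thal value other than 6 or 7 predicts outcome 0.
    if obs["Thal"] not in (6, 7):
        return ("leaf", 0)
    # Rule 2: STDep below 0.7 leads to the Ca check, otherwise ChestPain.
    if obs["STDep"] < 0.7:
        return ("next", "Ca")
    return ("next", "ChestPain")


# Made-up observations, used only to exercise the rules.
healthy = follow_top_rules({"Thal": 3, "STDep": 1.2})
low_stdep = follow_top_rules({"Thal": 7, "STDep": 0.2})
high_stdep = follow_top_rules({"Thal": 6, "STDep": 0.7})
```

A full tree would keep applying rules like these until every path ends in a leaf.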
A couple of techniques that can significantly improve the performance of decision trees are bagging and boosting. Bagging stands for “bootstrap aggregation”. Bootstrapping is a statistical technique where multiple datasets are created from a single dataset by taking repeated random samples, with replacement, from the original dataset. In this way you create a large number of slightly different datasets. Bagging starts by bootstrapping a large number of datasets and creating a decision tree for each one. Then, combine the trees by either majority vote (for classification problems) or averaging (for regression problems).
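The bagging recipe above can be sketched in a few lines of Python. This is a toy illustration, not the Aster implementation: the base learner is a one-split tree ("stump") on a single numeric predictor, and the data set and all function names are made up for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows):
    """Toy one-split tree on (x, label) rows: keep the threshold with the
    fewest training errors."""
    best = None
    for threshold, _ in rows:
        right_label = majority([y for x, y in rows if x >= threshold])
        left = [y for x, y in rows if x < threshold]
        left_label = majority(left) if left else right_label
        errors = sum(
            (left_label if x < threshold else right_label) != y for x, y in rows
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x < threshold else right_label


def bagged_predict(stumps, x):
    """Classification: combine the trees by majority vote."""
    return majority([predict_stump(s, x) for s in stumps])


rng = random.Random(0)
data = [(x, int(x >= 10)) for x in range(20)]  # made-up toy data
# One tree per bootstrapped data set.
stumps = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

For a regression problem, `bagged_predict` would average the individual predictions instead of taking a vote.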
Random forest is a very popular variant of bagging. With random forests, you use bootstrapping to create new datasets as you do with bagging, but at each split, you only consider a subset of the predictors. This forces the algorithm to consider a wider range of predictors, creating a more diverse set of trees and a more robust model.
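The only change random forest makes to the tree-building step is which predictors each split is allowed to look at. A minimal sketch of that step follows, assuming the commonly used default of roughly the square root of the number of predictors. The function name is hypothetical, and the column names are borrowed from the heart-disease tree above.

```python
import math
import random


def candidate_predictors(all_predictors, rng, k=None):
    """Pick the random subset of predictors considered at one split."""
    if k is None:
        # Common default for classification: about sqrt(number of predictors).
        k = max(1, int(math.sqrt(len(all_predictors))))
    return rng.sample(all_predictors, k)


rng = random.Random(42)
predictors = ["Thal", "STDep", "Ca", "ChestPain"]
# Each split draws its own subset, so different splits see different columns.
subset_for_split_1 = candidate_predictors(predictors, rng)
subset_for_split_2 = candidate_predictors(predictors, rng)
```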
Boosting is a different approach. With boosting, you build trees sequentially. Each tree focuses specifically on the errors made by the previous tree. The idea is to gradually build a better model by improving the performance of the model at each step. This is different from bagging and random forest because at each stage you try to improve the model by specifically looking at the points that the previous model didn’t predict correctly, instead of just creating a bunch of models and averaging them all together.
There are several approaches to boosting. XGBoost is based on gradient boosting.
The gradient boosting process starts by creating a decision tree to fit the data. Then, you use this tree to make a prediction for each observation and calculate the error for each prediction. Even though you’re predicting the same data that you used to build the tree, the tree is not a perfect model, so there will be some error. In the next iteration, this set of prediction errors becomes the new dataset. That is, each data point in the data set is replaced by the delta between the actual result and the predicted result. At each iteration, you replace the dataset with the errors made by the previous iteration. Then, you build a tree that tries to fit this new dataset of the deltas, make new predictions, and so on. When you add these trees together, the result should be closer to the original actual value that you were trying to fit, because you’re adding a model of the error. This process is repeated for a specified number of iterations.
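The loop described above can be sketched in plain Python for a regression problem. This is a minimal illustration under stated assumptions, not the Aster function: each "tree" is a one-split regression stump on a single predictor, the data is made up, and all names are hypothetical.

```python
def fit_regression_stump(xs, targets):
    """One-split regression tree: choose the threshold with the lowest
    squared error and predict the mean target on each side."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the smallest x so both sides are non-empty
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_trees):
    trees = []
    residuals = list(ys)  # the first tree fits the data itself
    for _ in range(n_trees):
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        # The errors of this tree become the data set for the next one.
        residuals = [r - stump_predict(tree, x) for x, r in zip(xs, residuals)]
    return trees


def boosted_predict(trees, x):
    """The final prediction is the sum of all the trees."""
    return sum(stump_predict(t, x) for t in trees)


def mse(trees, xs, ys):
    return sum((y - boosted_predict(trees, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = list(range(10))
ys = [float(x * x) for x in xs]  # made-up data
one_tree = gradient_boost(xs, ys, 1)
twenty_trees = gradient_boost(xs, ys, 20)
```

Comparing `mse(one_tree, xs, ys)` with `mse(twenty_trees, xs, ys)` shows the training error shrinking as trees that model the remaining error are added.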
Gradient boosting and XGBoost use a number of other optimizations to further improve performance.
Regularization is a common technique in machine learning. It refers to penalizing the number or the magnitude of the model parameters. It’s a way to prevent overfitting, or building a model that fits the training data so closely that it becomes inflexible and doesn’t perform well on different data.
When working with decision trees, regularization can be used to control the complexity of the tree, either by reducing the number of leaf nodes or the values assigned to each leaf node.
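For reference, the XGBoost paper by Chen and Guestrin writes this penalty for a single tree as shown below. Whether the Aster function uses exactly this form is not stated here, so treat it as an illustration of the idea rather than a description of the implementation.

```latex
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

Here T is the number of leaf nodes in the tree, w_j is the value assigned to leaf j, and γ and λ are the penalty weights. A larger γ discourages trees with many leaves, and a larger λ pulls the leaf values toward zero.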
Typically in gradient boosting, when you add the trees together, each tree is multiplied by a number less than 1 to slow the learning process down (boosting is often described as a way to “learn slowly”). The idea is that moving gradually toward an optimal solution is better than taking large steps, which might lead you to overshoot the optimal result.
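To isolate the effect of that multiplier, here is a small Python sketch. It deliberately replaces the tree with a trivial base learner (the mean of the current errors), which is an assumption made only to keep the example short. The function name, the data, and the rate of 0.1 are all hypothetical.

```python
def boost_with_learning_rate(ys, n_steps, learning_rate):
    """Boosting loop with a trivial base learner (the mean of the errors),
    used only to show how the learning rate scales each step."""
    prediction = 0.0
    for _ in range(n_steps):
        residuals = [y - prediction for y in ys]
        step = sum(residuals) / len(residuals)  # stand-in for a tree's output
        prediction += learning_rate * step      # shrink the step before adding it
    return prediction


ys = [10.0, 20.0, 30.0]  # made-up targets with mean 20
after_one_step = boost_with_learning_rate(ys, 1, 0.1)
after_fifty_steps = boost_with_learning_rate(ys, 50, 0.1)
```

With a rate of 0.1, each step closes only a tenth of the remaining gap, so the prediction approaches the target gradually instead of jumping to it in one step.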
Subsampling is also a common technique in machine learning. It refers to building trees using only a subset of the rows or columns. The idea is to force the process to consider a more diverse set of observations (rows) or predictors (columns), so that it builds a more robust model.
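A minimal sketch of row and column subsampling before building one tree is shown below. The function name, the fractions, and the toy rows are assumptions for illustration, and rows are drawn without replacement here, unlike the bootstrap samples used in bagging.

```python
import random


def subsample(rows, columns, row_fraction, col_fraction, rng):
    """Return a random subset of the rows, restricted to a random subset
    of the columns, for use when building a single tree."""
    n_rows = max(1, int(len(rows) * row_fraction))
    n_cols = max(1, int(len(columns) * col_fraction))
    chosen_cols = rng.sample(columns, n_cols)
    chosen_rows = rng.sample(rows, n_rows)  # without replacement
    reduced = [{c: row[c] for c in chosen_cols} for row in chosen_rows]
    return reduced, chosen_cols


rng = random.Random(7)
columns = ["a", "b", "c", "d"]  # hypothetical predictor names
rows = [{c: i for c in columns} for i in range(10)]  # made-up data
sample_rows, sample_cols = subsample(rows, columns, 0.5, 0.5, rng)
```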
The Aster XGBoost function also boosts trees in parallel. This is a form of row subsampling, where each vworker gets assigned a subset of the rows, and creates a set of boosted trees based on that data.
Stopping criteria are another important factor when building decision trees. In the Aster XGBoost function, you specify the exact number of boosting steps. The function also has stopping criteria that control the size of each tree; these arguments are analogous to those used in the other Aster decision tree functions Single_Tree_Drive, Forest_Drive, and AdaBoost_Drive.
Here’s the syntax of XGBoost_Drive. Refer to the Aster Analytics Foundation User Guide (Release 7.00.02, September 2017) for more information about the function arguments.
Here’s an example. The dataset is available from the UCI Machine Learning Repository. It’s a set of fetal monitoring observations classified into 3 categories. There are 2126 observations and 21 numeric attributes. The first few rows are shown below.
As usual when training a model, we divide the dataset into training and test sets, and use the training set to build the model. Here’s a sample function call:
The function displays a message when it finishes:
We can use the XGBoost_Predict function to try out the model on the test dataset:
Here are the first few rows of the output:
select id, nsp, prediction from ctg_predict;
To conclude, we’re very excited to make this algorithm available to our customers. Try it out!
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Available at: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf
Friedman, J. “Greedy Function Approximation: A Gradient Boosting Machine.” IMS 1999 Reitz Lecture. Available at: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
Chen, T., Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” KDD ’16, August 13-17, 2016, San Francisco, CA. Available at: http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
Dataset used in example:
Cardiotocography Data Set. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Does your advanced analytics platform have what it takes to access the same data, combine all of the analytics, and run in-database? Check out Sri Raghavan's Blog
Using a popular analytic technique to understand behaviors and patterns, data scientists reveal a subtle but critical network of influence and competition, giving this gaming company the ability to attract and retain gamers in this $109B industry.
This is insight that can only be found when you combine multiple sources of data with analytics. With Teradata Aster® Analytics, users apply cFilter, a function tailor-made for understanding behaviors and opinions.
Looking into the data, amazing patterns emerge.
Understanding the relationships that drive user behavior can help developers create better games to attract users, prevent churn, and determine how gamers influence each other.
Understanding relationships and influence with a collaborative filter helps multiple industries.
They can understand their customers' behaviors, then influence the customers who, in turn, influence their network.
Do you know which customers are likely to churn? Which prospects are likely to convert?
Historical path analysis is a critical factor in such predictions. The problem is path analysis is hard. And even when companies have such capabilities, they often reside in the hands of a few specialists – or vendor consultants.
The business analysts, marketers and customer support professionals who could ultimately act on these predictive insights to improve customers’ and prospects’ journeys are effectively left out in the cold. Even the specialists are ultimately confined to the limits of their tools.
Ask anyone who has used a traditional business intelligence tool to understand customer paths. It requires significant time and patience to shoehorn this type of analysis into a tool that was not designed for it. To begin with, just manipulating the data to build an event table for a BI tool is a significant hurdle. And even at the end of such a project, organizations end up with a static, inflexible report on historical data that does little to help businesses prevent future churn or accelerate future conversions. (This is hardly a criticism of BI tools, as their benefits and value are well documented. I’m only pointing out that path analysis historically is not one of their strong suits.)
Other advanced approaches leverage statistical tools like R and programming languages like Python. They may incorporate sophisticated analysis techniques like Naïve Bayes text classification and Support Vector Machine (SVM) modeling. But, at the end of the day, these are not tools or techniques for businesspeople.
And at the end of the day, what matters is providing your business teams the opportunity to influence the customer experience in a manner that is positive for your business.
The solution is to bring path analysis – including predictive path analysis – to the business. For such a solution to succeed, it must be:
The new Predictive Paths capability in the Teradata Path Analysis Guided Analytics Interface makes this interface a solution to consider.
Using the interface, marketers and analysts use a simple form to specify an event of interest (a churn event or conversion event, for example) and whether they want to see paths to or from that event. The interface returns results in the form of several visualizations, including tree, sigma, Sankey and sunburst diagrams, as well as a traditional bar chart.
Within the tree diagram, users can select partial paths to their event of interest and create a list of users who have completed that partial path but not yet completed the final event. For example, if you are looking at an online banking data set and see that a path of “fee complaint, to fee reversal, to funds transfer” precedes a large number of churn events, in three clicks you can generate a list of customers who have completed the path “fee complaint, to fee reversal, to funds transfer” but not yet churned. Thus, you have just used Predictive Paths to identify potential churners without writing a line of code.
This video demo shows how marketers and business analysts can predict next steps for customers with the Path Analysis Guided Analytics Interface.
Watch this short video to see how Predictive Paths works within the Path Analysis interface. If you’re interested in bringing these capabilities to your business teams, please contact your Teradata account executive today.
A slide video from the #TDPARTNERS17 presentation on "Machine Learning Vs Rules Based Systems". Adjust the YouTube playback speed (1.5x, 0.5x, etc.) to follow along easily.
The original blog post is here: Data Science - Machine Learning vs Rules Based Systems
I spent last week at the Anaheim Convention Center helping out with the Aster demo on the Expo floor, participating in the Advanced Analytics session, co-presenting a business session on Sunday, and presenting my data science session on Tuesday. Below I have assembled all the resources available online about these events.
Together with my dear friends and fellow data scientists Michael.Riordan@Teradata.com, firstname.lastname@example.org, and email@example.com, I presented several use cases in various applications of analytics on Aster. Mine were a couple of demos:
John Carlile prepared an excellent session on text analytics use cases that I helped him with on one of our POCs, Rumpelstiltskin Analytics - turning text documents into insight gold, with me as a co-presenter. Please contact John at firstname.lastname@example.org for more details (pdf attached). The session covered analysis of user reviews of a major hotel operator across many chains, "fake" review detection, and unsupervised and supervised techniques such as LDA and logistic regression.
My presentation Building Big Data Analytic Pipelines with Teradata Aster and R (morning block) contained two parts:
General session catalog for PARTNERS is available here.
Harnessing an analytical technique known as text clustering, companies in multiple industries can analyze customer call center data to find key word trends and phrases that may quickly alert them to potential customer service problems, manufacturing defects or negative sentiment.
Video featuring Karthik.Guruswamy - Principal Consultant & Data Scientist
Safety Cloud: a transformation of multiple types of text data through analytics, and a visualization leading to significant innovation. Applying natural language processing to these analytical techniques allows for sentiment analysis, giving businesses insight without looking at every document the dots represent.
The Star – lines thick and thin, seemingly simple but revealing critical insights and behaviors hidden amongst the data only discovered with analytics.
Using an analytical technique perfect for time-series data, Data Scientists used Hidden Markov Models to find hidden states.
Michelle Tanco, Data Scientist
Trial/Error and Fail Fast culture doesn't mean data scientists are unwilling to try time tested methods. It only means they are willing to take a lot of 'quickfire' risks for better results!
Using an agile approach, a cross-functional team of Doctors, Cancer Researchers, Data Scientists, Data Visualization Experts, and Technologists set out on a mission to understand over 1,000 genetic patterns of cancer in order to develop personalized medical treatments aligned to the genetic makeup of humans.
Decoding the human genome is the next frontier in science and discovery in medicine. Today, the combination of data, analytics, and visualization tools is driving cutting-edge innovation in life sciences. View the video below.
Genome World Window - Stephen Brobst and Andrew Cardno
Combining the collaborative expertise of data scientists, geophysicists and data visualization experts, an integrated oil company developed a new understanding of complex reservoir management with data and analytics. This business case easily transcends multiple industries focused on asset utilization and optimization.