Kaggle.com, Your Home for Data Science. Kaggle® has created a fun environment for Data Scientists to share ideas, compete against each other, get jobs, post jobs and hone their skills. The competitions run the gamut from Sports to Investing to Health Care to Marketing, and at times offer awesome cash prizes that make them worth more than just bragging rights.
Our team has a lot of fun tackling these problems and decided to use one of our favorite tools, Alteryx®, exclusively to produce a competitive submission to a competition on Kaggle®. To my knowledge, no one has used Alteryx® to take on a competition like this. Alteryx® not only has the ability to collect, clean, and prep your data, but also has some interesting predictive capabilities that we use often in our data investigation, modeling, and predictive analysis. So why not submit to one of the competitions, with the added challenge of only using Alteryx®!
Disclaimer: Teams of people spend a lot of time on these competitions with huge computing power and much more advanced statistical methods to win. While winning would be great, that’s not the goal of this exercise, and I’m sure we could do even more with this to win!
The competition we are choosing to tackle is one that mirrors an issue many companies big and small face every day: Google Analytics Customer Revenue Prediction. The goal of this competition is to analyze a store's customer set and predict the revenue per customer. This problem happens in every business every day. Google Analytics is easy to implement and marketing dollars are easy to spend, but how do you know if the spend makes sense, or how can you tell the Lifetime Value of the customers your campaign brought you?
A side note: RStudio, another NuView favorite, has partnered with Kaggle® on this project to offer $45K worth of prizes, with some prizes specifically for using R. Our team will be tackling this using R as well, so stay tuned!
To begin, we need to get an understanding of what we are looking at. Our training data set is made up of individual online sessions (sessionId), each for a specific visitor (FullVisitorId) and visit (visitId). For each record, the data points we get include the Channel Grouping, which is an often-analyzed component of a team's marketing efforts: are we getting strong Direct Traffic, how does our Organic Traffic convert, etc. We also receive information on the Device used, Geography, Social Engagement, and Traffic Source. The traffic source includes the campaign details. These details are essential in determining the effectiveness of a campaign: what source drove the traffic, and what medium did the visitor use? These details are often overlooked when setting up a campaign, but it’s important to understand what data points you can collect, and how you want to collect them, so you can make the right decisions about your campaigns.
We also have a data category for Totals, which includes the visits, hits, pageviews, bounces and revenue (the target value).
Each of these categories of data is stored in the training set as a JSON object. Lucky for us, Alteryx® has a quick JSON Parse tool that can take these categories and transform them into a column format that we can work with in the rest of our workflow.
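For readers who want to see what that parsing step amounts to outside of Alteryx®, here is a minimal Python sketch. It is not the workflow we built, just an illustration of turning JSON categories into flat columns. The sample rows, the column names (device, totals and so on) and the flatten_row helper are assumptions made up for the example.

```python
import json

# Hypothetical rows shaped like the training file: some columns hold JSON strings.
rows = [
    {
        "fullVisitorId": "1001",
        "channelGrouping": "Organic Search",
        "device": '{"browser": "Chrome", "isMobile": false}',
        "totals": '{"visits": "1", "hits": "4", "pageviews": "3"}',
    },
    {
        "fullVisitorId": "2002",
        "channelGrouping": "Direct",
        "device": '{"browser": "Safari", "isMobile": true}',
        "totals": '{"visits": "1", "hits": "1", "bounces": "1"}',
    },
]

# Assumed names of the columns that contain JSON objects.
JSON_COLUMNS = ("device", "totals")


def flatten_row(row, json_columns=JSON_COLUMNS):
    """Expand each JSON column into one 'column.key' entry per key."""
    flat = {}
    for name, value in row.items():
        if name in json_columns:
            for key, inner in json.loads(value).items():
                flat[f"{name}.{key}"] = inner
        else:
            flat[name] = value
    return flat


flat_rows = [flatten_row(row) for row in rows]
for flat in flat_rows:
    print(flat)
```

Note that the two rows end up with different keys (one has pageviews, the other has bounces), so a real workflow would also need to decide how to fill the missing columns.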
After parsing the JSON, we can take a look at what data we have and what we want to work with. We then need to roll up the transactions to the visitor ID to make the prediction, since ultimately we need a customer-level prediction.
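As a rough illustration of the roll-up, the Python sketch below sums session-level values up to one record per visitor. The field names, the sample sessions and the choice to treat a missing revenue value as zero are assumptions for the example, not a description of the actual Alteryx® workflow.

```python
from collections import defaultdict

# Hypothetical flattened sessions; missing revenue is assumed to mean "no sale".
sessions = [
    {"fullVisitorId": "1001", "totals.pageviews": "3", "totals.transactionRevenue": None},
    {"fullVisitorId": "1001", "totals.pageviews": "7", "totals.transactionRevenue": "25000000"},
    {"fullVisitorId": "2002", "totals.pageviews": "1", "totals.transactionRevenue": None},
]


def to_number(value):
    """Convert a string field to float, treating missing values as zero."""
    return float(value) if value not in (None, "") else 0.0


def roll_up(sessions):
    """Aggregate session rows into one summary record per visitor."""
    visitors = defaultdict(lambda: {"sessions": 0, "pageviews": 0.0, "revenue": 0.0})
    for session in sessions:
        summary = visitors[session["fullVisitorId"]]
        summary["sessions"] += 1
        summary["pageviews"] += to_number(session.get("totals.pageviews"))
        summary["revenue"] += to_number(session.get("totals.transactionRevenue"))
    return dict(visitors)


by_visitor = roll_up(sessions)
for visitor_id, summary in by_visitor.items():
    print(visitor_id, summary)
```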
So now let’s take a look at the predictive tools that are readily at our disposal.
Alteryx® has quite a few different model options, and we can test them side by side and find the best option as needed. I’m going to work with two and run them side by side: the Linear Regression and the Decision Tree. With Alteryx®, I was able to easily select the variables that I wanted to include in each model.
After running the models side by side, I can set up each one to deliver a predicted score for our revenue variable for each of the test visitor IDs. This allows us to create an output for each model and submit each one to Kaggle® to check our results.
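To make the side-by-side idea concrete, here is a deliberately tiny Python sketch. It is not what the Alteryx® tools do internally: it assumes a single input variable, fits an ordinary least-squares line, and stands in for the decision tree with a one-split tree (a "stump"). The data and the choice of RMSE as the comparison score are also assumptions for illustration.

```python
import math


def fit_linear(xs, ys):
    """Least-squares line through (x, y) pairs; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_stump(xs, ys):
    """One-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so predict the overall mean
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def rmse(model, xs, ys):
    """Root mean squared error of a prediction function on held-out data."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical visitor-level data: pageviews as input, revenue as target.
train_x, train_y = [1, 2, 3, 8, 12, 15], [0.0, 0.0, 0.0, 20.0, 35.0, 50.0]
test_x, test_y = [2, 10, 14], [0.0, 30.0, 45.0]

models = {"linear": fit_linear(train_x, train_y), "stump": fit_stump(train_x, train_y)}
for name, model in models.items():
    print(name, round(rmse(model, test_x, test_y), 3))
```

Scoring both models on rows they were not trained on is the same idea as the leaderboard, just done locally before submitting.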
There are definitely more in-depth methods for analyzing the effectiveness of each of these models, but isn’t it fun to see them competing on an actual leaderboard! Let’s check and see how each of them did.
Well, there we can see that in this runoff between the linear model and the decision tree model, the linear model performed better on the leaderboard, 1.615 vs 1.660. Obviously we aren’t winning any awards with these simple models, but that doesn’t mean there’s not a lot more we can do with them, all within Alteryx®.
While these results produced competitive scores, we aren’t taking home any hardware yet. Some additional changes that could be made would be to split the data set a little bit. One option would be to break the prediction into a 2-step problem where we first predict sale or no sale (revenue or no revenue), since a large portion of the data set results in no revenue. This would allow us to predict a binary result first, and then predict a revenue value only for those visitors where we predict a sale. This concept is much more important in practice. For example, what indicators do we have that a visit will end in a sale? Many companies will see single-digit conversion rates, or sales rates, on their website from a given visit. Can we isolate the indicators of sale or no sale and adjust our marketing plan to improve this probability? Then, if we expect a sale, how much will they spend?
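Here is a small Python sketch of the 2-step idea, under heavy simplifying assumptions. Instead of a real classifier and a real regression, it estimates the probability of a sale and the average revenue given a sale for each segment (channel, in this made-up data), then multiplies the two to get an expected revenue. The segment names, the numbers and the function names are hypothetical.

```python
from collections import defaultdict


def fit_two_step(visits):
    """visits: (segment, revenue) pairs, where revenue 0 means no sale.

    Returns {segment: (probability_of_sale, average_revenue_given_sale)}.
    """
    counts = defaultdict(int)
    sales = defaultdict(int)
    revenue = defaultdict(float)
    for segment, amount in visits:
        counts[segment] += 1
        if amount > 0:
            sales[segment] += 1
            revenue[segment] += amount
    model = {}
    for segment in counts:
        p_sale = sales[segment] / counts[segment]                  # step 1: sale or no sale
        avg_revenue = revenue[segment] / sales[segment] if sales[segment] else 0.0  # step 2
        model[segment] = (p_sale, avg_revenue)
    return model


def predict_revenue(model, segment):
    """Expected revenue = P(sale) * average revenue when a sale happens."""
    p_sale, avg_revenue = model.get(segment, (0.0, 0.0))
    return p_sale * avg_revenue


# Hypothetical visits: most of them produce no revenue.
visits = [
    ("Organic Search", 0.0), ("Organic Search", 0.0),
    ("Organic Search", 0.0), ("Organic Search", 40.0),
    ("Direct", 0.0), ("Direct", 100.0),
]

two_step = fit_two_step(visits)
for segment in two_step:
    print(segment, two_step[segment], predict_revenue(two_step, segment))
```

Keeping the two steps separate is what lets you ask the two business questions separately: what drives the chance of a sale, and what drives the size of it.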
Another way to split this problem that is very pertinent in practice is the idea of marketing attribution. In practice, we see that selling a product on a website does not usually come from a single visit to the site. Often, the customer is informed of a company’s product through one channel and visit, is reminded of the product through another visit, shops around and reviews competitors, and only then makes a purchase. From a single-visit standpoint, seeing that the purchase came from a direct channel may lead us to believe that direct is the only way we make sales, when it was actually the organic search that brought the client to us first, followed by the display ads and paid search ads that got the customer to make the decision. The direct visit is just the end of a sale process that lasted much longer. This type of split would involve understanding the paths taken by a user to a particular purchase. We can tackle this type of problem as well, but we’ll leave that for next time.
Alteryx® is a great example of the emerging tools in data science that help make data prediction processes more efficient and more accurate. At NuView, we not only want to use the top-tier tools but also keep an eye on process efficiency and transferability for our clients. The best part about this process is that it could be passed to our clients and repeated (consistently) in their own environments. We are big believers that the best models are the ones that are iterated upon consistently.
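To show how much the attribution rule matters, here is a short Python sketch that credits the same purchase three different ways: to the last channel, to the first channel, or evenly across every channel on the path. It assumes each visitor makes at most one purchase at the end of their path. The visit tuples, channel names and the attribute function are invented for the example.

```python
from collections import defaultdict


def attribute(visits, rule="last"):
    """Credit each visitor's revenue to channels on their path.

    visits: (visitor_id, timestamp, channel, revenue) tuples.
    rule: "last", "first", or "linear" (equal share per visit on the path).
    """
    paths = defaultdict(list)
    for visitor, _, channel, revenue in sorted(visits, key=lambda v: (v[0], v[1])):
        paths[visitor].append((channel, revenue))

    credit = defaultdict(float)
    for path in paths.values():
        total = sum(revenue for _, revenue in path)
        if total == 0:
            continue  # no purchase on this path, nothing to attribute
        channels = [channel for channel, _ in path]
        if rule == "last":
            credit[channels[-1]] += total
        elif rule == "first":
            credit[channels[0]] += total
        elif rule == "linear":
            share = total / len(channels)
            for channel in channels:
                credit[channel] += share
        else:
            raise ValueError(f"unknown rule: {rule}")
    return dict(credit)


# Hypothetical path: found us through search, nudged by ads, bought on a direct visit.
visits = [
    ("A", 1, "Organic Search", 0.0),
    ("A", 2, "Display", 0.0),
    ("A", 3, "Paid Search", 0.0),
    ("A", 4, "Direct", 120.0),
    ("B", 1, "Direct", 0.0),
]

for rule in ("last", "first", "linear"):
    print(rule, attribute(visits, rule))
```

With the single-visit view (the "last" rule), Direct gets all of the credit, which is exactly the misleading picture described above.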