Market Basket Analysis to Classify Shopping Trips

Market basket analysis is a modelling technique based upon the theory that if you buy a certain set of items, you are more likely to also buy another set of corresponding items. For example, in the famous "beer and diaper" story, store owners reportedly found that male shoppers who bought diapers often also bought beer. By placing these items next to each other in the store, the owners are said to have increased sales dramatically.

A useful related analysis is to determine what type of trip a customer is on, based upon the contents of their shopping cart. Whether they're on a last-minute run for school supplies or picking up their monthly prescriptions, classifying trips into types enables stores to better target their customers.

For example, linking shopping trip types with customer information can open up promotional opportunities through tailored coupons and advertisements. If a group of customers only uses a store such as Walmart for grocery shopping, it may be advantageous to send them more food-related promotions. By knowing a customer's shopping behavior, stores can better reinforce existing habits or even steer them toward more desirable ones.

Walmart recently hosted a Kaggle competition with the aim of improving a shopper's experience by segmenting their store visits into different trip types. They created 38 distinct categories using only a transactional dataset of the items customers have purchased. The task was then to categorize new shopping trips into one of these types.

To begin, we load in the dataset and look at the first few rows:

We see that there are 647,054 rows and seven data fields, described as follows:

  • TripType - a categorical id representing the type of shopping trip the customer made. This is the ground truth that we are predicting. TripType_999 is an "other" category.
  • VisitNumber - an id corresponding to a single trip by a single customer
  • Weekday - the weekday of the trip
  • Upc - the UPC number of the product purchased
  • ScanCount - the number of the given item that was purchased. A negative value indicates a product return.
  • DepartmentDescription - a high-level description of the item's department
  • FinelineNumber - a more refined category for each of the products, created by Walmart
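
The field list above maps naturally onto a typed record. The following is a minimal sketch using only Python's standard library. The `LineItem` class, the `parse_rows` helper, and the three sample rows are hypothetical illustrations, not rows from the competition file, and treating empty Upc and FinelineNumber cells as missing is an assumption.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class LineItem:
    """One row of the transactional data (hypothetical record type)."""
    trip_type: int
    visit_number: int
    weekday: str
    upc: Optional[str]        # kept as text so leading zeros are not lost
    scan_count: int
    department: str
    fineline: Optional[str]

    @property
    def is_return(self) -> bool:
        # A negative ScanCount indicates a product return.
        return self.scan_count < 0


def parse_rows(lines: Iterable[str]) -> List[LineItem]:
    """Parse CSV text with the seven column names listed above."""
    items = []
    for row in csv.DictReader(lines):
        items.append(LineItem(
            trip_type=int(row["TripType"]),
            visit_number=int(row["VisitNumber"]),
            weekday=row["Weekday"],
            upc=row["Upc"] or None,              # assumption: empty means missing
            scan_count=int(row["ScanCount"]),
            department=row["DepartmentDescription"],
            fineline=row["FinelineNumber"] or None,
        ))
    return items


# Made-up sample rows, for illustration only.
SAMPLE_CSV = """TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
999,5,Friday,1111111111,-1,FINANCIAL SERVICES,1000
30,7,Friday,2222222222,1,SHOES,8931
30,7,Friday,,2,PERSONAL CARE,
"""

items = parse_rows(io.StringIO(SAMPLE_CSV))
for item in items:
    print(item)
```

With a real file, the same helper could be given an open file handle instead of the in-memory sample.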

To get a better understanding of the data, we can look at how many unique values exist per data field:
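
The counting step itself can be sketched in a few lines of standard-library Python. The `unique_counts` helper and the sample rows are hypothetical, so the numbers it prints are not those of the full dataset.

```python
from typing import Dict, List


def unique_counts(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Count the distinct values seen in each field."""
    seen: Dict[str, set] = {}
    for row in rows:
        for field, value in row.items():
            seen.setdefault(field, set()).add(value)
    return {field: len(values) for field, values in seen.items()}


# Made-up rows, for illustration only.
rows = [
    {"TripType": 999, "VisitNumber": 5, "Weekday": "Friday",
     "ScanCount": -1, "DepartmentDescription": "FINANCIAL SERVICES"},
    {"TripType": 30, "VisitNumber": 7, "Weekday": "Friday",
     "ScanCount": 1, "DepartmentDescription": "SHOES"},
    {"TripType": 30, "VisitNumber": 7, "Weekday": "Friday",
     "ScanCount": 2, "DepartmentDescription": "PERSONAL CARE"},
]

counts = unique_counts(rows)
print(counts)
```

Applied to every row of the full file, the same function would yield the per-field totals discussed next.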

We see that across the 38 different TripTypes there are a total of 95,675 unique VisitNumbers, spanning all seven days of the week, 68 different DepartmentDescriptions, and 39 distinct ScanCount values. Below, we take a closer look at the distributions of TripType, ScanCount, and DepartmentDescription:

We see that TripTypes 39 and 40 make up a large proportion of the total, and that most shopping trips consist of either buying one or two items or returning a single item. Furthermore, the top four departments visited are for purchasing groceries. This is interesting since Walmart is not primarily thought of as a grocery store.

In a more thorough analysis, we would begin to create new features based on these plots, but for now we will focus only on the data provided above.


Given the dataset, we can now begin to classify each customer visit into a trip type. The XGBoost (eXtreme Gradient Boosting) library is well suited to this problem. It is designed and optimized for boosted tree algorithms and has been shown to be highly scalable, portable, and accurate for classification problems. For more details on the algorithm and its parameters, please refer to the XGBoost documentation.
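
Because each visit spans several rows, a classifier needs one feature row per VisitNumber. The text does not say how this was done, so the following is only one plausible sketch: it assumes the features are net ScanCount totals per department. The `build_visit_features` helper and the sample line items are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (VisitNumber, DepartmentDescription, ScanCount)
LineItem = Tuple[int, str, int]


def build_visit_features(
    line_items: List[LineItem],
) -> Tuple[List[str], Dict[int, List[int]]]:
    """Collapse line items into one vector of department totals per visit."""
    departments = sorted({dept for _, dept, _ in line_items})
    column = {dept: i for i, dept in enumerate(departments)}
    totals: Dict[int, List[int]] = defaultdict(lambda: [0] * len(departments))
    for visit, dept, count in line_items:
        # Returns carry a negative ScanCount, so they reduce the total.
        totals[visit][column[dept]] += count
    return departments, dict(totals)


# Made-up line items, for illustration only.
line_items = [
    (7, "SHOES", 1),
    (7, "PERSONAL CARE", 2),
    (8, "DAIRY", 1),
    (8, "DAIRY", 3),
    (8, "SHOES", -1),
]

departments, features = build_visit_features(line_items)
print(departments)
print(features)
```

Each visit's vector has one column per department, in the order given by `departments`.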

To begin, we train the algorithm on a set of visits with known trip types. The inputs to the algorithm are a set of tuning parameters, the training set, the number of iterations to run, and a few other variables that can be found in the XGBoost documentation. In this example we run the algorithm for 300 iterations with the goal of minimizing the multiclass log loss between the predicted class probabilities and the true trip types. Shown below are the first ten rounds of training.
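
The exact parameters used are not listed here, so the setup below is a sketch under stated assumptions rather than the configuration behind the results. It shows two details that a multiclass XGBoost run typically needs: class labels remapped to consecutive integers starting at 0 (TripType ids such as 999 cannot be used directly), and a parameter dictionary naming the objective and evaluation metric. The `encode_labels` helper, the sample trip types, and the choice of `multi:softprob` are assumptions.

```python
from typing import Dict, List, Tuple


def encode_labels(trip_types: List[int]) -> Tuple[List[int], Dict[int, int]]:
    """Map raw TripType ids to class indices 0..num_class-1."""
    classes = sorted(set(trip_types))
    to_index = {trip_type: i for i, trip_type in enumerate(classes)}
    return [to_index[t] for t in trip_types], to_index


# Made-up trip types, for illustration only.
trip_types = [999, 30, 30, 39, 40, 39]
labels, to_index = encode_labels(trip_types)

# Assumed settings: the number of rounds and the log-loss metric come from
# the text, the objective is a guess at a reasonable choice.
params = {
    "objective": "multi:softprob",   # output one probability per class
    "eval_metric": "mlogloss",       # multiclass log loss
    "num_class": len(to_index),      # 38 on the full dataset
}
NUM_ROUNDS = 300

# With the xgboost package installed, training would typically look like:
#   dtrain = xgboost.DMatrix(feature_rows, label=labels)
#   model = xgboost.train(params, dtrain, num_boost_round=NUM_ROUNDS,
#                         evals=[(dtrain, "train")])
print(labels, params)
```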

Once the algorithm has been trained to completion, we can predict the trip type for a new set of test visits.
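
If the model outputs one probability per class for each visit (an assumption, as the output format is not shown here), turning those probabilities into a predicted trip type means taking the most probable class and mapping its index back to the original TripType id. The `decode_predictions` helper and the numbers below are hypothetical.

```python
from typing import List


def decode_predictions(
    probabilities: List[List[float]], classes: List[int]
) -> List[int]:
    """Pick the most probable class per visit and return its TripType id."""
    predicted = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        predicted.append(classes[best])
    return predicted


# Made-up class list and probabilities, for illustration only.
classes = [30, 39, 40, 999]
probabilities = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.70, 0.10, 0.10, 0.10],
]

predicted_trip_types = decode_predictions(probabilities, classes)
print(predicted_trip_types)
```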

For this set of test visits, we predict the trip type with nearly 70% accuracy. There is considerable room for improvement through parameter tuning and new features, but this brief introduction already shows the power of the algorithm and the analysis.
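
Both metrics mentioned in this walkthrough, accuracy and multiclass log loss, are simple to compute from predicted class probabilities. The sketch below uses hypothetical helpers and made-up numbers, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math
from typing import List


def accuracy(true_idx: List[int], probs: List[List[float]]) -> float:
    """Fraction of visits whose most probable class is the true class."""
    correct = 0
    for y, row in zip(true_idx, probs):
        if max(range(len(row)), key=row.__getitem__) == y:
            correct += 1
    return correct / len(true_idx)


def multiclass_log_loss(
    true_idx: List[int], probs: List[List[float]], eps: float = 1e-15
) -> float:
    """Mean negative log probability assigned to the true class."""
    total = 0.0
    for y, row in zip(true_idx, probs):
        p = min(max(row[y], eps), 1.0 - eps)  # clip to keep log finite
        total -= math.log(p)
    return total / len(true_idx)


# Made-up true class indices and predicted probabilities.
true_idx = [1, 3, 2]
probs = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.70, 0.10, 0.10, 0.10],
]

acc = accuracy(true_idx, probs)
loss = multiclass_log_loss(true_idx, probs)
print(acc, loss)
```

Accuracy only checks whether the top class is right, whereas log loss also penalizes confident wrong predictions, which is why the two can move differently during tuning.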

Final Thoughts

Market basket analyses are a great way to increase revenue by knowing which types of items are bought together. In our example, we looked at an alternate approach of using these purchased items to infer what type of shopping trip a customer may be on. With this knowledge in hand, a store can optimize a customer's shopping experience, thus increasing revenue, customer retention, and loyalty.