Cours Stats

Back in 2005, I joined the IRI-France ad-hoc research team at Chambourcy as a PhD intern in quantitative marketing. It was a great honour, not to mention a huge opportunity, to work for one of the most data-intensive companies in the field of physical retailing (my PhD was about assortment planning in supermarkets…).

I had access to massive datasets (dozens of gigabytes, which was already kind of ‘big’ at the time), and I spent a good deal of money on an 80 GB external hard drive in order to be able to handle the data.

At the time, I used SAS v8, which was able to read the data from the physical disk even though it exceeded the 250 MB of RAM I had on my new computer… Only after my PhD did I decide to switch to R.

I remember having to deal with two major issues with these store panel datasets: their hierarchical structure (I even wrote an entire section on that matter), and the visualisation and smoothing of densities.

For a first try at data visualisation using R, I naturally decided to go back to those first problems of mine.

1- Hierarchical structure of panel data

Retailer panel data record the sales (in volume, units, and euros) of each product, identified by its EAN (European Article Numbering).

  • Products are classified into categories (in this case, pasta), sub-categories (here, fresh pasta, as opposed to dried pasta), and sub-sub-categories (tagliatelle, spaghetti…).
  • For each product, we know the manufacturer and the brand (a given manufacturer can have several brands, although it’s not always the case).
  • There is also information about the stores where the products are sold, namely the chain name, the square footage, and the ‘circuit’ (1 to 4, from small supermarket to big hypermarket).

Of course, not all products are in all stores. Some manufacturers may have some brands in a given store and other brands in other stores. Store chains have stores in different circuits… In a word, it’s a mess. And the first challenge in analysing retail sales is to get a good idea of which factors drive sales in this complex hierarchy.
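
To make this hierarchy concrete, here is a purely invented extract of what such a dataset could look like; the column names mirror the factors listed further below (ean, sstype, fabr, marque, ensnum, circuit, store), but the values are made up for illustration only.

```r
library(tibble)

# Purely invented rows, only meant to illustrate the structure:
# one row per product (EAN) x store x week, with the hierarchy attached.
pasta <- tribble(
  ~ean,    ~sstype,       ~fabr,    ~marque,     ~store, ~ensnum,     ~circuit, ~week, ~sales,
  "30001", "tagliatelle", "Fabr_A", "Marque_A1", 101,    "Carrefour", 4,        1,     315.0,
  "30001", "tagliatelle", "Fabr_A", "Marque_A1", 101,    "Carrefour", 4,        2,     257.5,
  "30002", "spaghetti",   "Fabr_B", "Marque_B1", 212,    "Auchan",    3,        1,      80.1
)
```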

The way I dealt with this issue was to make dozens of plots comparing weekly sales among products, among sub-categories, among stores, chains, or circuits. Now I’ve realised that a Shiny App is a far more efficient way to go. And here it is!

This Shiny App presents a collection of spaghetti plots of pasta sales, broken down by different explanatory factors that you can choose from in the contextual menu above the graph:

The list of factors is as follows:

  • store
  • product (defined by its EAN)
  • sstype (sub-sub-category, in this case: tagliatelle, spaghetti…)
  • fabr (short for Fabricant, which means Manufacturer)
  • marque (brand)
  • ensnum (the retail chain, for example Carrefour, Auchan…)
  • circuit (from 1: small supermarkets, to 4: big hypermarkets)

(Allow some time for computation: the data is first processed with dplyr, and only then is the plot generated.)
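
For reference, here is a minimal sketch of how such an app can be wired with shiny, dplyr, and ggplot2. It assumes a data frame like the `pasta` sketch above, with numeric `sales` and `week` columns; this is not the actual code behind the app, just the general idea.

```r
library(shiny)
library(dplyr)
library(ggplot2)

# Sketch: 'pasta' is assumed to hold one row per EAN x store x week,
# with a numeric 'sales' column and the factor columns listed above.
ui <- fluidPage(
  selectInput("factor", "Group sales by:",
              choices = c("store", "ean", "sstype", "fabr",
                          "marque", "ensnum", "circuit")),
  plotOutput("spaghetti")
)

server <- function(input, output) {
  output$spaghetti <- renderPlot({
    pasta %>%
      # aggregate weekly sales at the level chosen in the menu
      group_by(.data[[input$factor]], week) %>%
      summarise(sales = sum(sales), .groups = "drop") %>%
      # one line per level of the chosen factor: the 'spaghetti' plot
      ggplot(aes(x = week, y = sales,
                 group = .data[[input$factor]],
                 colour = factor(.data[[input$factor]]))) +
      geom_line(show.legend = FALSE)
  })
}

shinyApp(ui, server)
```

The only trick is the `.data[[input$factor]]` pronoun, which lets dplyr and ggplot2 group and colour by whichever column the user picks in the menu.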

2- Visualisation and smoothing of densities

The second issue I had to deal with was tightly linked to the core of my PhD. I wanted to see whether the distribution of prices within assortments had an impact on sales (for the record, it does…). But the first step was to design an indicator of the shape of this distribution, which is, by nature, multimodal (maybe I’ll explain in another post how I constructed it).

Anyhow, I first needed to draw the density of prices, and had to deal with bandwidth adjustment. The lower the adjustment parameter, the less smooth (and the more jagged) the estimated density.

So I needed to find the “best” value of this parameter. Of course, I made plenty of calculations to justify my choice, but the truth is I started by making plenty of graphs with different values of the parameter in order to get an idea of what I was looking for.
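
As a (hypothetical) illustration of that exploratory phase, the same kind of comparison takes only a few lines of base R; `prices` stands here for an invented vector of shelf prices, not my original data.

```r
# Invented, roughly bimodal vector of prices (entry-price vs premium products)
set.seed(1)
prices <- c(rnorm(200, mean = 1.2, sd = 0.15),
            rnorm(100, mean = 2.1, sd = 0.25))

# Same kernel, four bandwidth adjustments:
# the smaller 'adjust' is, the rougher the estimated density.
par(mfrow = c(2, 2))
for (a in c(0.25, 0.5, 1, 2)) {
  plot(density(prices, adjust = a, kernel = "epanechnikov"),
       main = paste("adjust =", a))
}
```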

And Shiny would have helped a lot at that time…

Below is a very simple App that shows how sensitive the estimated density is to the bandwidth adjustment and to the kernel shape (I chose the Epanechnikov kernel for my research, as it has interesting properties near the ends of the distribution):

(This App should run faster as it only uses the data from a single store)
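
For the curious, here is a minimal sketch of what such a density App boils down to, assuming a numeric vector `prices` holding the prices observed in that single store:

```r
library(shiny)

ui <- fluidPage(
  sliderInput("adjust", "Bandwidth adjustment:",
              min = 0.1, max = 3, value = 1, step = 0.1),
  selectInput("kernel", "Kernel shape:",
              choices = c("gaussian", "epanechnikov", "rectangular",
                          "triangular", "biweight", "cosine")),
  plotOutput("dens")
)

server <- function(input, output) {
  output$dens <- renderPlot({
    # 'prices' is assumed to hold the prices observed in a single store
    plot(density(prices, adjust = input$adjust, kernel = input$kernel),
         main = paste("adjust =", input$adjust, "-", input$kernel, "kernel"))
  })
}

shinyApp(ui, server)
```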

That’s all

I would like to say that all these new tools for data exploration are really helping researchers and practitioners. But I wouldn’t want to sound like those old professors of mine who used to say they had to spend so much time in the library because they didn’t have tools like the web.

Have a good day.