By Rijul Nanda, Youchan Lee, Austin Chin, Rainey Shah
This blog represents a report of our final project in EE 461P. For our term project, we chose to participate in a Kaggle competition that aimed to predict the total number of several items sold for 57 companies in Russia while being scored on the RMSE metric. The competition itself presented a regression problem in which data for more than two million transactions over the course of 33 months was provided. The link to the Kaggle competition is here. The best scores in the competition were nearly 0.75, providing a basis for comparison as we performed our analysis.
Our methodology for predictions involved preprocessing the data to remove outliers and fix IDs for shops that were misrepresented. We then ran our first models, resulting in very poor performing RMSE values for all models. These scores were generally greater than 2.100, which indicated very weak performance as compared to the top performers on the leaderboard. We then performed feature engineering, creating features representing names of products and mean value of products sold, as well as a feature for the GDP per capita. At different points, we wrote models to determine the effect of each feature on our scores. Each iteration of feature inclusion into the models resulted in a slightly improved score, reaching a minimum of 0.940. We then finally wrote models testing our feature engineering only the last three months of data, resulting in a final minimum RMSE of 0.938.
Predicting future sales for certain items, stores, and regions on a time series basis is a common practice in many business settings. Effective predictions can lead to valuable business insights regarding profit margins and risks. With the emergence of the field of data science in the 21st century, many data scientists and machine learning experts have constructed a variety of models to produce such insights. For our final project, we chose to compete in a Kaggle competition to predict the number of items sold in 57 Russian stores in November 2015, being able to utilize their sales data from January 2013 to October 2015. We wrote several models, namely those based on XGBoost, Linear Regression, Random Forests, and Light GBM. By the end, we were able to compare such models in terms of their relative performances and on the basis of important data science model characteristics such as interpretability, reproducibility, and computing efficiency. Our inspiration for choosing this competition was to get a glimpse of certain data science practices that can always be improved upon and understand how such practices can affect decision making in business settings.
Data and Visualizations
Our dataset was provided by 1C Company, a large Russian software firm that kindly allowed their dataset to be used as a Kaggle competition. The data consisted of 2,935,849 individual transactions and detailed the date, shop, item, price, and quantity of each.
Some supplemental information was also provided regarding the item categories, item names, and shop names, though this data was unfortunately provided in Russian. This was expected to be a major hurdle, but due to the existence of a translated dataset and the items already being categorized, it ended up being fairly trivial to deal with.
Visualizing the data immediately resulted in several interesting insights. Firstly, aggregating sales by month resulted in the graphic below:
Here, there’s clearly a moderately strong annual trend. Sales in December (Date Block 11 and 23) are significantly higher than any other month, and sales tend to lull in the middle of the year. This made month an obvious feature to add, allowing our models to understand the connection between date blocks 12 apart. There’s also a notable downward trend overall, indicating that either the shops sampled in the dataset are faring poorly or that some shops’ transactions stopped being recorded at all.
The distribution of individual items sold over the period was also fairly peculiar. Despite the most popular item selling 187,642 times, the median sales for each item was only 33. Considering this was over a 33 month period, the typical item in our dataset was only sold a single time each month across all shops. As such, we knew our final predictions for sales of an item in a given shop for the final month were going to be very low values made on a very limited amount of data.
Data preprocessing is often the most crucial step when attempting to predict on datasets. This process involves cleaning the data to reduce the amount of noise present from missing data and outliers.
For our purposes, the preprocessing steps involved visualizing the data and noticing anomalies. First we fixed shop IDs. This is because some shops were repeated in the dataset, specifically shops 0,1 and 10. These shops had duplicate entries in the dataset with slight differences in punctuation, perhaps indicating online shopping orders, but were ultimately the same shop. We thus set these shops to have the same shop ID.
Next, to reduce the skewing of data by outliers, we removed any items whose price was greater than 50,000 or less than 0. This was decided by considering the statistical description of the data: with an average price of 890.8 Russian Rubles, and a standard deviation of 1729.8 rubles, an item price of 50,000 is well outside of a reasonable price for an item. We also made sure to remove any individual item counts that were less than 0 or greater than 1000. Items that sold more than 1000 total times were considered to be outliers by once again considering the statistical descriptions — an average value of 1.24 items/day and a standard deviation of 2.618 items/day leads to removal of items that sold more than 1000 times. Additionally, negative item counts are considered outliers, as they do not aid in analysis. These negative item counts may represent returns or stolen items, and are easier to understand as outliers.
With these few steps, the models would be able to look at an overall cleaner dataset where they would not have to account for certain abnormalities in the dataset. Outlier removal and other preprocessing of the data significantly improves the accuracy of models by limiting noise and removing noisy data. Further, cleaning the dataset allows us to ensure that predictions are accurate. This is most evident in the fixing of the shop IDs — without this, our results would likely not represent the true values for those three shops.
After our initial preprocessing, we performed Feature Engineering.
The first feature we constructed, gdp_pc_usd, indicated the regional GDP per capita for different Russian stores. We came up with this feature by consulting IMF’s World Economic Outlook Database to match the regional GDP with the respective shop id’s and geographical regions of the stores in our dataset. The inspiration for pursuing this feature was the positive relationship that exists between GDP per capita and spending in typical consumer economies.
Name and Revenue Feature
Another feature that we included is the name feature. This specific feature engineering technique let us divide the items sold by the shops by even more defined categories based on item type so that our models could utilize more information based on what kinds of items were being sold. A revenue feature was also created and was calculated by multiplying the number of items in a day and multiplying it by the item price. These two features were derived from the data given to us by Kaggle. Lastly, we provided a feature based on the specifics in the wording of the items being sold and filled all the N/A categories with zero so that it would hold a specific value as opposed to none.
Lag and Mean Value Features
One of the last two features we gave our model was the lag feature. This provided our models a more cleanly defined way to make predictions on past iterations of a certain metric. For example, model shop average count lag — 1 was one of the features created in this process and that would provide the models a shop average count 1 month back, in our case, being October. We could go as many iterations back as possible and this data would let our models make inferences based on T-1, T-2, T-21, etc. categories and come up with certain trends based on past data. The last feature we personally utilized was the mean value feature and through this feature, we were able to make a category for the means of any metric, including the features that we had just created. Overall, through the utilization of many different feature engineering techniques, our models were given more information they could learn from to strengthen their predictions.
Model Selection and Analysis
After performing all of the above preprocessing and feature engineering, we began the process of model selection and writing. The models we chose were all regression models, as this was the nature of our problem. That is, we must predict the number of each item sold across all stores for November 2015. To this end, we wrote models for each iteration of our processing of the data — immediately before preprocessing, immediately after preprocessing (thus excluding the feature engineering), immediately after the initial feature engineering (including the name features, but no others), and then after all feature engineering steps. This was done to more explicitly see the effects of each step, and to confirm that each step does not decrease our overall predictive accuracy. Of note, our models were scored on the basis of the root mean squared error (RMSE) of the predictions in comparison to the true values. Since we could submit on Kaggle, we were able to compare our results to others who have participated in this competition. In this case, the best score was around 0.75, providing us with a strong score to compare our performance to.
Before writing our models to predict, we first created a final training set. This was done in order to have a final set that could be augmented with all of our feature engineering steps, but that also featured an item_count_month feature. This would be the feature that we predict on, as the task is to predict the amount of each item that is sold during one specific month. We additionally clipped the final predictions to be between 0 and 20, as the final item count per month was specified by the competition description to be between these values. We then performed a test-train split manually on the dataset in order to create a training set, validation set, and utilize the test set for our submissions.
In the following explanations, the entire dataset was utilized.
Simple Custom Decision Tree
Before utilizing known libraries and models, we first wrote our own decision tree. This model performed a binary process: predict 0 if the total items sold across all months was 0, or predict the average value for the previous two months if the item count was greater than 0. This model was simple and interpretable and was fairly robust. This model performed quite well given its simplicity, scoring a RMSE value of 1.14103. Considering the clipping technique mentioned above, the model can be represented by the following flowchart.
Modeling after DecisionTree
The first models we wrote utilizing known techniques were those with nearly no preprocessing outside of fixing the shop ID. These models did not boast strong results. SImply fixing the shop ID resulted in the following scores for DecisionTreeRegressor, RandomForestsRegressor, and LogisticRegresion are shown below:
These results did not present very strong results, and performed significantly worse than the simple decision tree model we previously wrote. Given the results we noticed from the above models, we did not attempt to predict utilizing XGBoost. This is because the results did not display promising results that would improve in a significant manner with another model.
The next models we wrote featured the first feature engineering steps. Specifically, the models utilized the name features. The results from DecisionTreeRegressor, RandomForestsRegressor, LinearRegression, and XGBoostRegressor are shown below:
These results display a significant improvement upon the previous model, displaying the fact that this feature engineering and outlier removal provides an improvement upon the previous models. This basic feature engineering therefore presents features that are significant for our analysis.
We then wrote models that featured all of the feature engineering steps. These models utilized the name features, lag features, mean value feature, and GDP feature. The results are displayed below for DecisionTreeRegressor, RandomForestsRegressor, LinearRegression, and XGBoostRegressor models:
The improvement from these models is less significant than the improvement seen from including only the name features. This is likely due to the fact that the first models were very rudimentary, as they only featured some preprocessing, thus indicating a need for feature engineering.
After the above modeling, we then repeated the modeling for the data only utilizing the previous three months of data. The following results display the RMSE scores with all of our preprocessing and feature engineering steps, but only utilizing the previous three months of data, rather than all previous months:
These results display the fact that some of the data from all previous months likely provided some noisy information. Additionally, utilizing only the previous three months of data reduced the time it took all models to run. A previous problem we faced was a long runtime for our models, as they were predicting on the entire dataset. Utilizing only the previous three months allows for models with stronger predictive accuracy, and a significantly lower runtime (17 hours in comparison to 3 hours).
Overall, our modeling indicates that the best model is a Light GBM model, boasting a final score of 0.93892. This value was achieved after including all of the preprocessing and feature engineering steps, and represents the fact that gradient boosting provides the strongest results for this problem out of all models that we wrote. LinearRegression seems to provide generally weak results, and XGBoost, another gradient boosting algorithm, provides the second best result overall.
Our process of predicting item counts per month followed a relatively standard process: we first visualized the data, then preprocessed the data, performed some feature engineering to aid in the prediction, and then wrote models to finally predict on all of the data we were provided with and created.
The data visualization aided in understanding the scope of the data presented. Visualizing the data allowed us to determine several features to add and peculiarities to watch out for. For example, by creating various graphs it became obvious that we should add a monthly feature to deal with annual trends, that we would have to figure out how to deal with extremely high outliers, and that most items had a very low number of sales. All of this information was useful moving forward and helped streamline the rest of the process.
The preprocessing resulted in the removal of outliers and redundant information, as some shop IDs were repeated, and outliers presented noise in the data that would influence the results in negative ways. This was confirmed by our initial modeling, as these models boasted very weak results without any outlier removal.
The feature engineering, combined with our outlier removal, resulted in the strongest results overall. Specifically, adding name features, a lag feature, and a trend feature overall resulted in a dramatic increase in our scores, eventually culminating with our highest score overall with a LightGBM model of 0.93892.
The following table highlights the different models we used and their relative scores on important characteristics. For each of the models’ following characteristics, a relative score of 1–10 was assigned, with 10 being the most desirable score. The test set performance characteristic refers to how well the model performed on the test set. The reproducibility characteristic indicates how easy it would be to duplicate the model and its predicted outcomes. The interpretability characteristic indicates how easy it would be to understand a certain model. Finally, the computing efficiency characteristic indicates the required efficiency to train a certain model in terms of its space and time complexity. While each score was assigned based on our group’s qualitative observations, we recommend a more objective and quantitative analysis for future projects that actually measures the aforementioned characteristics.
For future work, these models can likely be improved by further tuning. Utilizing the GridSearchCV library, the most optimal set of hyperparameters for each model can be achieved, likely resulting in stronger performance overall. Additionally, further feature engineering would likely improve our scores. Adding features that highlight item popularity over time, or consumer spending power over time, would likely add important information to the dataset that would improve predictive power of our models. Finally, given the data was in a time series format, more analysis could be done to elucidate certain relationships such as items being sold more frequently at certain parts of the year, or according to other unique intervals.
Russian Federation and the IMF. IMF. (2019, October). https://www.imf.org/en/Countries/RUS#countrydata.
1C Company. Predict Future Sales. Retrieved May 10, 2021 from https://www.kaggle.com/c/competitive-data-science-predict-future-sales/discussion.
Remisharon. Predict Future Sales- Translated Dataset. Retrieved May 10, 2021 from https://www.kaggle.com/remisharoon/predict-future-sales-translated-dataset.