This is preliminary research for the 29th Asia-Pacific Software Engineering Conference (APSEC 2022) paper "Why Visualize Competitive Code? Preliminary Categories for Jupyter Notebooks".
Data visualization has become a crucial component of data analytics, especially for data exploration, understanding, and analysis. Effective data visualization impacts decision-making and aids in discovering and understanding relationships. It benefits data-intensive software development tasks, e.g., feature engineering in machine learning-based software projects. However, it is unknown how visualizations are used in competitive programming. The idea of this paper is to report early results on which visualizations are prevalent in competitive programming. Grandmaster is the highest level a competitor can reach in competitions (among levels such as novice, expert, master, and grandmaster). By analyzing the visualizations of 7 high-ranking competitors (i.e., Grandmasters) on Kaggle, we identify and present a catalog of visualizations used both to tell a story from the data and to explain the processes and pipelines behind their coding solutions. Our taxonomy includes nine types derived from over 821 visualizations in 68 Jupyter notebooks. Furthermore, the most used visualizations are those for data analysis of distribution (DA Distribution) and frequency (DA Frequency). We envision that this catalog can be useful for better understanding the different situations in which to employ these visualizations.
Kaggle is an online community platform for people interested in data science and machine learning. It allows users to collaborate and compete with others in data science competitions to solve challenges. Users are ranked based on their contributions in four categories:
- Competitions
- Notebooks
- Datasets
- Discussion
The Kaggle progression system determines competitor ranking and consists of 5 performance tiers:
- Novice
- Contributor
- Expert
- Master
- Grandmaster
Target Competitor: In this study, we focus on the Notebook Grandmaster tier, which requires achieving 15 notebook gold medals. A gold medal is awarded to a popular notebook, where popularity is measured by the number of upvotes; a gold medal requires 50 upvotes. We chose to focus on Notebook Grandmasters because their notebooks mostly include data visualizations for data exploration, and because they can be ranked based on the upvotes of their notebooks.
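The selection criterion above can be written as a small check. This is a minimal sketch, not Kaggle's actual implementation: the function names and the constants `GOLD_UPVOTES` and `GRANDMASTER_GOLDS` are hypothetical, and the thresholds are simply the ones stated above (50 upvotes per gold medal, 15 gold medals for the tier).

```python
# Thresholds as stated in this document; all names are illustrative,
# not part of any Kaggle API.
GOLD_UPVOTES = 50        # upvotes a notebook needs for a gold medal
GRANDMASTER_GOLDS = 15   # gold medals needed for Notebook Grandmaster


def count_gold_medals(upvotes_per_notebook):
    """Count notebooks whose upvotes reach the gold-medal threshold."""
    return sum(1 for votes in upvotes_per_notebook if votes >= GOLD_UPVOTES)


def is_notebook_grandmaster(upvotes_per_notebook):
    """Return True if the competitor has enough gold-medal notebooks."""
    return count_gold_medals(upvotes_per_notebook) >= GRANDMASTER_GOLDS


if __name__ == "__main__":
    # Hypothetical competitor: 15 notebooks at 60 upvotes, 3 below the threshold.
    upvotes = [60] * 15 + [10, 25, 49]
    print(count_gold_medals(upvotes))
    print(is_notebook_grandmaster(upvotes))
```

The sketch treats upvotes as the only input; any additional eligibility rules the platform may apply are outside its scope.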
The purposes of visualization can be categorized into (1) data analysis (DA) and (2) machine learning model interpretation (ML).
Visualizations that show the data distribution. Graph types in this category are histograms, box plots, and violin plots.
Visualizations that display the frequency of the data for data analysis. Graph types in this category are count plots, bar plots, and word clouds.
Visualizations that display data on a geographical map to demonstrate spatial relationships in the data.
Visualizations that show the percentage of data.
Visualizations of statistical data, with the keyword "correlation" or "confusion matrix" in the title.
Visualizations whose purpose is model interpretation in the training process, e.g., a line graph that shows the training accuracy.
Visualizations that show model interpretation or results in the testing process, with "submission" or "prediction" as a keyword.
Visualizations that show the images used in the model.
Visualizations of the feature importance of the model.
Visualizations that do not fall into any of the other types.
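Some of the category descriptions above name concrete graph types or title keywords, so those parts can be sketched as a rule-based lookup. This is only an illustration under stated assumptions, not the procedure used in the study: the labels "DA Distribution" and "DA Frequency" come from the abstract, while "DA Statistics" and "ML Testing" are hypothetical placeholder labels, and the function `classify` and both rule tables are invented for this example. Checking title keywords before graph types is also an assumption. Categories described without a keyword or graph-type rule (maps, percentages, training curves, images, feature importance) are not covered and return `None`, meaning they would need manual labeling.

```python
# Graph types named in the category descriptions above.
GRAPH_TYPE_RULES = {
    "histogram": "DA Distribution",
    "box plot": "DA Distribution",
    "violin plot": "DA Distribution",
    "count plot": "DA Frequency",
    "bar plot": "DA Frequency",
    "word cloud": "DA Frequency",
}

# Title keywords named in the category descriptions above.
# "DA Statistics" and "ML Testing" are hypothetical placeholder labels.
TITLE_KEYWORD_RULES = [
    ("confusion matrix", "DA Statistics"),
    ("correlation", "DA Statistics"),
    ("submission", "ML Testing"),
    ("prediction", "ML Testing"),
]


def classify(graph_type, title=""):
    """Return a category label, or None if no rule in this sketch applies."""
    title_lower = title.lower()
    # Assumption: a title keyword takes precedence over the graph type.
    for keyword, label in TITLE_KEYWORD_RULES:
        if keyword in title_lower:
            return label
    return GRAPH_TYPE_RULES.get(graph_type.lower().strip())


if __name__ == "__main__":
    print(classify("histogram", "Age of passengers"))
    print(classify("heatmap", "Correlation between features"))
    print(classify("line plot", "Loss per epoch"))
```

A lookup like this can only pre-sort visualizations whose titles and graph types are explicit; anything it leaves as `None` still requires a human decision.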
- Tasha Settewong
- Natanon Ritta
- Raula Gaikovina Kula
- Chaiyong Ragkhitwetsagul
- Thanwadee Sunetnanta
- Kenichi Matsumoto