Pokemon: EDA and Battle Modeling

Exploratory data analysis and models for determining which Pokemon tend to have the biggest pure advantages in battle given no items in play.

[Image: Charizard]

Thanks for visiting my blog today!

Context

Pokémon is celebrating its 25th anniversary in 2021, and I think this is the perfect time to reflect on the franchise from a statistical perspective.

Introduction

If you're around my age (or younger), then I hope you're familiar with Pokémon. It's the quintessential Game Boy game we all used to play as kids before we started playing "grown-up" games on more advanced consoles. I remember how excited I used to get when new games were released, and I remember equally well how dejected I was when I lost my copy of FireRed.

Pokémon has followed a basic formula through each iteration of the game, and it's pretty simple. It revolves around creatures known as "Pokémon." You start with a "starter" Pokémon who usually has high potential but very little power, and from there you can catch Pokémon, train Pokémon, buy items, beat gym leaders, find "legendary" Pokémon (I suppose the word "Pokémon" works as both singular and plural), create a secret base, trade Pokémon with other Game Boy owners, and of course beat the Elite 4 like 500 times with different parties of Pokémon.

With each new set of games (sets were usually labeled by a color: silver & gold and green & red, as two examples) came a new group of Pokémon. The first generation was pretty bland. There were some cool ones like Charizard, Alakazam, and Gyarados, but there were far more weird and boring ones like Slowpoke (which has got to be the laziest name ever), Pidgey (second laziest?), and Arbok ("cobra" backwards, and yes, it's a snake). As the generations progressed, there was a lot more diversity and creativity. Some generations, like Gen 2 and Gen 3, really stuck out in my head. Examples include Skarmory and a special red (not blue) Gyarados in Gen 2, and Blaziken and Mightyena in Gen 3.

The most important part of the Pokémon games is battling. Pokémon grow in skill, and sometimes even transform in appearance, through battling and gaining experience. You also earn badges and win prize money through battles. Battles are thrilling and require planning and tactics. This is expressed in many ways: the order of the Pokémon in your party, the types you bring to counter your potential opponents, the items you use in battle, when you choose to use certain moves (each move can only be used a limited number of times), switching Pokémon mid-battle, sacrificing Pokémon, whittling a "wild" Pokémon's health down to the point where it doesn't faint but is weak enough to be caught, and plenty more.

So here's the point of this blog: in the exploratory data analysis (EDA) section I'll look for interesting trends and findings through visualizations, and later I'll examine the battle aspect and look for key statistics that indicate when a Pokémon is likely to win or lose a battle.

Let’s get started!

Data

My data comes from Kaggle and spans about 800 Pokémon across the first 6 generations of the series. I’d like to quickly note that this includes mega-Pokémon which are basically just more powerful versions of certain regular Pokémon. Let me show you what I mean; we have Charizard at the top of this post and below we have mega-Charizard.

[Image: Mega Charizard]

Yeah, it's pretty cool. Anyway, we also have legendary Pokémon in this dataset; while legendary Pokémon may have mega forms, they are not a fancy copy of another Pokémon. I'll get more into how I filtered out mega Pokémon in the data prep section. One last note on Pokémon classes: there are no "shiny" Pokémon in this dataset. Shiny Pokémon are usually no more powerful than their regular versions, but they are more valuable in the lexicon of rare Pokémon.

Next, we have two index features (why not just one?). Every Pokémon has at least one "type," which usually connects to an element of nature. Types include grass, poison, and fire (among others), but also less explicit, less nature-related terms such as dragon, fairy, and psychic. A Pokémon can have up to two types, with one usually being primary and the other secondary. Type is a critical part of battles, as certain types have inherent advantages and disadvantages against others (fire beats grass but water beats fire, for example). Unfortunately, creating a dataset containing battle data between every possible pair of Pokémon would be a mess, so the modeling section will look at what kinds of Pokémon are strongest in general, not situationally.

Our next features relate to Pokémon stats. HP, or hit points, represents health. After a Pokémon gets struck with enough attack power (usually over a couple of turns), its HP runs out and it faints. That takes me to the next four features: attack, defense, special attack, and special defense. Attack and defense are the baseline for a Pokémon's, well... attack and defense. Certain moves may nullify these stats, though; a great example is a one-hit knockout move, where attack doesn't matter at all. Special attack and special defense are the same idea but refer to moves where there is no contact between Pokémon. So a move like "body slam" is attack-based, while a move like "flamethrower" is special-attack-based. Total is the sum of HP, attack, defense, special attack, special defense, and speed. Speed essentially determines who strikes first in battle; occasionally some moves take precedence, but in their absence the turn-based battle begins with the attack of the faster Pokémon.

The generation feature records which generation, 1 through 6, the Pokémon in question first appeared in. Pokémon often reappear in later generations once introduced, and this is sometimes the case for legendary Pokémon as well. That takes us to the next feature, a binary legendary indicator. Legendaries are Pokémon that appear once per game, usually in one specific spot, and are generally quite powerful (and cool-looking). There are "roaming legendaries" that are unique but appear at random locations until caught, as opposed to being confined to one special location; most legendaries, though, are delegated to specific areas at specific points in a game's storyline. Latios and Latias, from Pokémon Ruby and Sapphire, are roaming and are incredibly hard to locate and subsequently catch, while Groudon and Kyogre are easier to find and catch and play a key role in the storyline of their respective games.

Finally, our last feature is win rate, which is continuous, and that is the target feature. I added some features to the dataset which will be discussed below. The only one I'll mention now is that I binned win rate into three groups in order to have a discrete model as well.
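As a quick sanity check on the raw file, here is a minimal sketch of loading the data and verifying that Total really is the sum of the six base stats. The column names are assumptions based on the common Kaggle version of this dataset, and the win-rate target is not shown here.

```python
import pandas as pd

# Column names below are assumptions based on the common Kaggle Pokemon stats file.
df = pd.read_csv("Pokemon.csv")

# Sanity check: Total should equal the sum of the six base stats.
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
assert (df["Total"] == df[stat_cols].sum(axis=1)).all()

print(df[["Name", "Type 1", "Type 2", "Total", "Generation", "Legendary"]].head())
```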

Data Preparation

This section outlines how I manipulated my data for modeling purposes. I started by dropping extra columns like the index feature. Next, I created a binary feature representing whether a Pokémon has one or two types. I also scanned the data for the word "mega" in Pokémon names and added a feature indicating whether a Pokémon is a mega form. In addition to the collective data, I separated the data into common, mega, and legendary Pokémon datasets while aiming for no overlapping Pokémon. I had enough data that I could comfortably delete all rows with null values. I added another feature to measure the ratio of special attack to attack and did the same for defense; I originally thought special attack meant something different than what it actually means and ended up calling these features "attack bonus" and "defense bonus."

Next, I filtered out rows containing numerical outliers, defined as values more than 2.5 standard deviations from the mean of a feature. This reduced my data by about 15%. I then applied max-absolute scaling to gauge the value of each data point relative to the other values in its feature (this puts every result somewhere between 0 for low and 1 for high). I also noticed through hypothesis tests that speed might have an outsized impact on battle outcome (more on this later), so I created an extra feature called "speed ratio" to reflect how much of a Pokémon's total comes from speed alone. To deal with categorical data, I leveraged target encoding; I published an article on Towards Data Science which can be found here if you would like to learn more about it. That was the end of my main data prep, aside from a quick train-test split, which pertains more to modeling but is technically an element of data prep.
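To make those steps concrete, here is a condensed sketch of that preparation. It assumes the Kaggle column names from earlier plus a hypothetical win_rate column for the target, and it simplifies or omits the exact thresholds and the target-encoding step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

STATS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Total"]

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["two_types"] = df["Type 2"].notna().astype(int)            # one type vs. two
    df["is_mega"] = df["Name"].str.contains("Mega").astype(int)   # flag mega forms
    df["attack_bonus"] = df["Sp. Atk"] / df["Attack"]             # special attack / attack
    df["defense_bonus"] = df["Sp. Def"] / df["Defense"]           # special defense / defense
    df["speed_ratio"] = df["Speed"] / df["Total"]                 # share of Total coming from speed

    # Drop rows with missing values, then filter outliers beyond 2.5 standard deviations
    df = df.dropna(subset=STATS + ["win_rate"])
    z = (df[STATS] - df[STATS].mean()) / df[STATS].std()
    df = df[(z.abs() <= 2.5).all(axis=1)]

    # Max-absolute scaling: every stat ends up on a 0-1 scale
    df[STATS] = df[STATS] / df[STATS].abs().max()
    return df

# pokemon = prepare(pd.read_csv("Pokemon.csv"))
# X = pokemon.drop(columns=["Name", "win_rate"])   # target encoding of Type 1 / Type 2 omitted here
# y = pokemon["win_rate"]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```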

EDA

This section contains a sampling of graphs describing what Pokémon has looked like over the years.

The following graph shows the distributions of win rate and total stats respectively segmented by Pokémon class.

The following graph shows how many Pokémon appear for each type:

The graph below shows the difference between stats of legendary Pokémon and regular ones.

And next…

This next set of visuals shows trends over the years in common, legendary, and mega Pokémon respectively.

The next series of charts shows the distribution of stats by type with common Pokémon:

Next, we have scatter plots to compare attack and defense.

The next chart shows the distribution of total Pokémon across 6 generations.

Let’s look at legendary and mega Pokémon.

Let’s look at how these legendaries are distributed by type.

What about common Pokémon?

The next group of visuals shows distribution of stats by generation among legendary Pokémon.

The next set of visuals shows which Pokémon have the highest ratio of special attack to attack

The next set of visuals shows which Pokémon have the highest ratio of special defense to defense

This next set of charts shows where mega Pokémon come from.

Next, we have a cumulative rise in total Pokémon over the generations.

Models

As I mentioned above, I ran continuous models and discrete models. I ran models for 3 classes of Pokémon: common, legendary, and mega. I also tried various groupings of input features to find the best model while not having any single model dominated by one feature completely. We’ll soon see that some features have outsized impacts.
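The post doesn't spell out which algorithm produced the feature importances below, so as an illustrative stand-in, here is a sketch using a random forest regressor on the continuous win-rate target, scoring on held-out data and ranking the inputs by importance.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_and_rank(X_train, y_train, X_test, y_test):
    """Fit a regressor on win rate and rank the inputs by feature importance."""
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)   # R^2 on held-out data
    ranks = pd.Series(model.feature_importances_, index=X_train.columns)
    return score, ranks.sort_values(ascending=False)

# score, ranks = fit_and_rank(X_train, y_train, X_test, y_test)
# print(round(score, 2))
# print(ranks.head(10))
```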

Continuous Models

The first and most basic model yielded the following feature importance while having 90% accuracy:

Speed and speed ratio were too powerful, so after removing them my accuracy dropped by 26% (you read that right). This means I need either speed or speed ratio, and I chose to keep speed ratio. I also removed the type 1, type 2, generation, and "two-type" features based on p-values. That left me with 77% accuracy and the following feature ranks.

I’ll skip some steps here, but the following feature importances, along with 75% accuracy, seemed most satisfactory:

Interestingly, having special attack/defense higher than regular attack/defense is not good. I suppose you want a Pokémon's special stats to be in line with its regular ones.

I’ll skip some steps again, but here are my results for mega and legendaries within the continuous model:

Legendary with 60% accuracy:

Mega with 91% accuracy:

Discrete Models

First things first: I needed to upsample the data. My data was pretty evenly balanced but needed minor adjustments. If you don't know what I mean by target feature imbalance, check out my blog (on Towards Data Science) over here. Anyway, here comes my best model (notice the features I chose to include):
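As a sketch of what that minor rebalancing could look like, here is a simple upsampling helper built on scikit-learn's resample; the win_rate_bin column name is a stand-in for the binned target.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minorities(df: pd.DataFrame, target: str = "win_rate_bin") -> pd.DataFrame:
    """Resample every class up to the size of the largest class, then shuffle."""
    largest = df[target].value_counts().max()
    balanced = [
        resample(group, replace=True, n_samples=largest, random_state=42)
        for _, group in df.groupby(target)
    ]
    return pd.concat(balanced).sample(frac=1, random_state=42)

# train_balanced = upsample_minorities(train_df)
# print(train_balanced["win_rate_bin"].value_counts())
```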

This shows strong accuracy at 80%, along with a good confusion matrix and interesting feature importances. For more information on what the metrics above, such as recall, actually mean, check out my blog over here.

Next Steps

One-Hot Encoding. I think it would be interesting to learn more about which specific types are the strongest. I plan to update this blog after that analysis so stay tuned.

Neural Network. I would love to run a neural network for both the continuous and discrete case.

Polynomial Model. I would also love to see if polynomial features improve accuracy (and possibly help delete the speed-related features).

Conclusion

For all you Pokémon fans and anyone else who enjoys this type of analysis, I hope you liked reading this post. I’m looking forward to the next 25 years of Pokémon.

Was Gretzky Really THAT Great?

(I used to genuinely question Gretzky’s greatness and the data helped me make my decision)

A unique approach to evaluate NHL players by accounting for era effects

[Image: Wayne Gretzky, "The Great One"]

Introduction

Thank you for visiting my blog!

Let's get right into this blog, as there isn't a whole lot to be said in the introduction other than that we are going to build an exciting and unique analysis of every NHL player from every NHL season (up until 2018-19).

Project Goal

A while back, I set out to develop a system to evaluate NHL and NBA players (as in everyone ever to play a second at the professional level) and see how they measure against their peers. I worked on data engineering, creating a database, and designing an optimal way to evaluate players. For today's blog, I'm going to share part of that project with you in an interactive, visual-driven way. I have created a Tableau Public project that allows me to share some insights with you, the reader, in a way driven by you, the reader. My project lets users pick their own filters and ways to view my data.

All you need to know for now is this: when you see that a good player's goal statistic is, I don't know, 3.513, it means that the player in question was 3.513 times better than league average in goals that year. If the average NHL player scores 5.5 goals per year, the player in question probably scored ~19 goals. That doesn't sound like a lot of goals to score in an 82-game season to the casual fan, but it's actually not that bad. Just keep in mind while you view my project that some stats were not recorded until more recent years. If you look up how many hits Maurice Richard or Eddie Shore had in any season of their careers, you will likely find no information.

To see the different pages of visuals, scroll down to where it says "Metadata" and click on different sheets. Also, as you adjust your filter preferences, keep in mind that you do not want to overload the system. Looking up every name of every player who played in the 2006 season, for example, is a lot more annoying than looking at every Blackhawks player in 2006, or every player who had at least 4 times the league average of goals in 2006.

I really like the system of comparing actual stats to league averages because it allows you to understand everything in context. If someone scores 200 goals in a season, for example, you may assume he's better than any current player, since no one has come close to even 70 goals in recent years (the most goals ever in a season is 92, I think, so this is obviously just a test case). However, if that season saw almost every player reach at least 8,000 goals (again, just a test case), then the player with 200 goals kind of sucks. With relative scores, we don't have this problem. It's similar to feature scaling in data science ($100 is a lot to pay for a candy bar but rather cheap for a functional yacht). My system is susceptible to the occasional glitch, especially if a player was traded late in the season, but overall it works quite well.
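The underlying idea, a stat expressed as a multiple of that season's league average, is easy to sketch in pandas; the season and goals column names below are hypothetical, not the ones from my actual database.

```python
import pandas as pd

def relative_to_league(skaters: pd.DataFrame, stat: str = "goals") -> pd.Series:
    """Express a stat as a multiple of that season's league average."""
    season_avg = skaters.groupby("season")[stat].transform("mean")
    return skaters[stat] / season_avg

# skaters["rel_goals"] = relative_to_league(skaters, "goals")
# A rel_goals of 3.513 means 3.513x the league-average goal total for that season.
```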

The Gretzky Question

To follow up on my earlier comment about Gretzky, I'd like to share one visual users can navigate to on their own: go to the "Player Goal Assists" tab, filter for Edmonton Oilers stats from 1980 to 1989, and only include players who averaged at least the league average in goals and assists across those 10 seasons. The red bars indicate how much better a player was than the league in assists, and the teal represents goals. If you know hockey, you see guys like Jari Kurri, Mark Messier, and Paul Coffey, who are all superstars. Paul Coffey is actually one of my favorite all-time players (and favorite defensemen) in the pool of players I never saw play. However, none of the players mentioned above (all Hall of Famers who have each won at least one individual trophy and multiple Stanley Cups) compare to Wayne Gretzky (second from the right). So, yeah, he was pretty darn good, even after accounting for era effects and taking into account his superstar teammates.

Link

So now that I have talked about the project and how to navigate your way around it, here comes the link…

But first… keep in mind that you guide this project using the interactive filters on the right hand side of the page – use those filters!

https://public.tableau.com/profile/joseph.cohen2401#!/vizhome/relative_hockey/Sheet0

Conclusion

Tableau is great! I only know the basics, though, and am still getting my feet wet. I definitely see the possibility of updating the Tableau project linked above and diving deeper into more interesting and exciting visualizations. If you, the reader, liked my data visualization system, I recommend you also learn some of the Tableau basics, as it's fairly simple software. I hope you found some interesting stats and stories in today's inquiry.

Thanks for Reading!


Devin Booker vs. Gregg Popovich (vs. Patrick McCaw)

Developing a process to predict which NBA teams and players will end up in the playoffs.

[GIF: Jim Mora (NFL)]

Introduction

Hello! Thanks for visiting my blog.

After spending the summer being reminded of just how incredible Michael Jordan was as a competitor and leader, the NBA is finally making a comeback and the [NBA] "bubble" is in full effect in Orlando. Therefore, I figured today would be the perfect time to discuss the NBA playoffs (I know the gif is for the NFL, but it is just too good not to use). Specifically, I'd like to answer the following question: if one were to look at some of the more basic NBA statistics recorded by an NBA player or team on a per-year basis, with what accuracy could you predict whether or not that player will find themselves in the NBA playoffs?

Since the NBA has changed so much over time and each iteration has brought new and exciting skills to the forefront of strategy, I decided to stay up to date and look at the 2018-19 season for individual statistics (the three-point era). However, only having 30 data points to survey for team success would be insufficient for a meaningful analysis, so I expanded my survey of teams to include data back to and including the 1979-80 NBA season (when Magic Johnson was still a rookie). I web-scraped basketball-reference.com to gather my data, and it contains every basic stat and a couple of advanced ones, such as effective field goal percentage (I don't know how that is calculated, but Wikipedia does: https://en.wikipedia.org/wiki/Effective_field_goal_percentage#:~:text=In%20basketball%2C%20effective%20field%20goal,only%20count%20for%20two%20points.), from every NBA season ever. As you go back in time, you do begin to lose data from statistical categories that weren't recorded yet. To give two examples: blocks were not recorded until after the retirement of Bill Russell (widely considered the greatest defender ever to play the game), and three-pointers were first introduced into the NBA in the late 1970s.

So just to recap: if we look at fundamental statistics recorded by individuals or in a larger team setting, can we predict who will find themselves in the playoffs?

Before we get started, I need to address the name of this blog. Gregg Popovich has successfully coached the San Antonio Spurs to the playoffs every year since NBA legend Tim Duncan was a rookie. They are known as a team that runs on good teamwork as opposed to outstanding individual play, though this is not to say they haven't had superstar efforts. Devin Booker has been setting the league on fire, but his organization, the Phoenix Suns, has not positioned itself to be a playoff contender. (McCaw just got lucky and was a role player for three straight NBA championships.) This divide is the type of motivation that led me to pursue this project.

Playoff P" Art Print by cobyshimabukuro | Redbubble

Plan

I would first like to mention that the main difficulty in this project was developing a good web-scraping function. I want to be transparent here, though: I worked hard developing that function a while back and have now realized this would be a great use of that data. Anyhow, in my code I go through the basic data science process. In this blog, however, I will try to stick to the more exciting observations and conclusions I reached. (Here's a link to the GitHub repo: https://github.com/ArielJosephCohen/postseason_prediction.)
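My scraping function isn't reproduced here, but as a rough sketch, basketball-reference's per-game tables can be pulled with pandas.read_html. The URL pattern below is an assumption based on the site's layout, and this is not the function I actually used.

```python
import pandas as pd

def season_per_game(year: int) -> pd.DataFrame:
    """Pull the per-game player table for one season from basketball-reference."""
    url = f"https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html"
    table = pd.read_html(url)[0]
    return table[table["Player"] != "Player"]   # the page repeats its header row periodically

# players_2019 = season_per_game(2019)   # the 2018-19 season
```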

The Data

First things first, let's discuss what my data looked like. In the 2018-19 NBA season, there were 530 total players who appeared in at least one NBA game. My features included: name, position, age, team, games played, games started, minutes played, field goals made, field goals attempted, field goal percent, three-pointers made, three-pointers attempted, three-point percent, two-pointers made, two-pointers attempted, two-point percent, effective field goal percent, free throws made, free throws attempted, free throw percent, offensive rebounds per game, defensive rebounds per game, total rebounds per game, assists per game, steals per game, blocks per game, points per game, turnovers per game, year (not really important; it only makes the most marginal difference in the team setting), college (or wherever the player was before the NBA; many values were null and were filled as unknown), and draft selection (un-drafted players were assigned the statistical mean of 34, which is later than I expected; keep in mind the draft used to exceed 2 rounds). My survey of teams grouped all the numerical values (with the exception of year) by each team and season using the statistical mean of all its players that season. In total, there were 1,104 rows of team data, some of which included teams like Seattle that no longer exist in their original form. My target feature was a binary 0 or 1, with 0 representing a failure to qualify for the playoffs and 1 representing a team that successfully earned a playoff spot.
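Here is a rough sketch of how that team-level grouping could be done, assuming a player-level frame with team and year columns (the column names are stand-ins, not necessarily the ones in my code):

```python
import pandas as pd

def team_season_averages(players: pd.DataFrame) -> pd.DataFrame:
    """Collapse player rows into one row per team-season using the mean of each numeric stat."""
    numeric_cols = players.select_dtypes("number").columns.difference(["year"])
    return (players
            .groupby(["team", "year"])[list(numeric_cols)]
            .mean()
            .reset_index())

# team_df = team_season_averages(player_df)   # roughly one row per team per season (~1,104 here)
```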

One limitation of this model is that it accounts for trades or other methods of player movement by assigning a player's entire season stats to the team he ended the season with, regardless of how many games were actually played on his final team. In addition, this model doesn't account for more advanced metrics like screen assists or defensive rating. Another major limitation is something I alluded to earlier: the NBA changes, and so does strategy. This makes this more of a study of transcendent influences that remain constant over time, as opposed to what worked well in, say, the 2015-16 NBA season (on a team level, that is). Also, my model focuses on recent data for the individual player model, not on which individual statistics were of high value in different basketball eras. A great example of this is the so-called "death of the big man." Basketball used to be focused on big and powerful centers and power forwards who would control and dominate the paint. Now the game has moved much more outside, mid-range twos have been shunned as the worst shot in basketball, and even centers must develop a shooting range. Let me show you what I mean:

[Image: NBA shot charts showing the dramatic change in shot selection over time]

Now let's look at a "big guy": 7-foot-tall Brook Lopez.

In a short period of time, he has drastically changed his primary shot selection. I have one more visual from my data to demonstrate this trend.

Exploratory Data Analysis

Before I get into what I learned from my model, I'd like to share some interesting observations from my data. I'll start with two groups of histograms:

Team data histograms
Individual data histograms
Every year – playoff teams tend to have older players
Trends of various shooting percentages over time

Here, I have another team vs. individual comparison testing various correlations. In terms of my expectations, I hypothesized that 3-point percent would correlate negatively with rebounds, assists would correlate positively with 3-point percent, blocks would correlate positively with steals, and assists would correlate positively with turnovers.

Team data
Individual data

I seem to have been most wrong about assists and 3-point percent, while moderately correct in terms of the team data.

The following graph displays the common differences over ~40 years between playoff teams and non-playoff teams in the five main statistical categories in basketball. However, since the means and extremes of these categories are all expressed in differing magnitudes, I applied scaling to allow for a more accurate comparison.

It appears that the biggest difference between playoff teams and regular teams is in assists, while the smallest difference is in rebounding.

I’ll look at some individual stats now, starting with some basic sorting.

Here’s a graph to see the breakdown of current NBA players by position.

In terms of players by draft selection (with 34 = un-drafted basically):

In terms of how many players actually make the playoffs:

Here’s a look at free throw shooting by position in the 2018-19 NBA season (with an added filter for playoff teams):

I had a hypothesis that players drafted earlier tend to have higher scoring averages (note that in my graph there is a large cluster of points hovering around x=34; this is because I used 34 as a mean value to fill nulls for un-drafted players).

It seems like I was right – being picked early in the draft corresponds to higher scoring. I would guess the high point is James Harden at X=3.

Finally I’d like to share the distribution of statistical averages by each of the five major statistics sorted by position:

Model Data

I ran a random forest classifier to get a basic model and then applied scaling and feature selection to improve my models. Let's see what happened.
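A minimal version of that baseline might look like the sketch below, where player_df, "made_playoffs", and the dropped columns are stand-ins for whatever the real frame uses.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# player_df is the prepared 2018-19 player frame; column names are stand-ins.
X = player_df.drop(columns=["made_playoffs", "name", "team"])
y = player_df["made_playoffs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f"Accuracy: {rf.score(X_test, y_test):.2f}")
```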

2018-19 NBA players:

The above is an encouraging confusion matrix, with rows representing predicted data and columns representing actual data. The brighter and more muted regions in the top left and bottom right correspond to the vertical color bar adjacent to the matrix, where brighter colors represent higher values (in this color palette). This means that my model had far more predictions matching their actual placement than incorrect predictions. The accuracy of this basic model sits around 61%, which is a good start. The precision score represents the percentage of correct predictions among the bottom-left and bottom-right quadrants (in this particular depiction), while recall represents the percentage of correct predictions among the bottom-right and top-right quadrants. In other words, recall covers the cases correctly predicted given that we only look at the pool of teams that actually made the playoffs. Precision looks at a similar pool, but with a slight difference: it considers all the teams that were predicted to have made the playoffs, and the precision score represents how many of those predictions were correct.
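For readers who prefer code to prose, the same numbers can be pulled straight from the predictions. Note that scikit-learn's confusion_matrix puts actual classes on the rows and predictions on the columns, the transpose of the orientation described above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred = rf.predict(X_test)   # rf / X_test / y_test from the baseline sketch above

# scikit-learn convention: rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))

# Precision: of everything predicted to make the playoffs, how much was correct?
print(f"Precision: {precision_score(y_test, y_pred):.2f}")

# Recall: of everything that actually made the playoffs, how much did we catch?
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
```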

Next, I applied feature scaling, which is a process of removing the impact of differences in raw magnitude between variables. For example: $40 is a lot to pay for a can of soda but quite little to pay for a functional yacht. In order to compare sodas to yachts, it's better to apply some sort of scaling that might, for example, place their costs in a 0-1 (or 0%-100%) range representing where each cost falls relative to other sodas or other yachts. A $40 soda would be close to 1 and a $40 functional yacht would be close to 0. Similarly, an $18 billion yacht and an $18 billion soda would both be classified around 1, and conversely a 10-cent soda or yacht would both be classified around 0. A $1 soda would be somewhere in between. I have no idea how much the average yacht costs.
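The post doesn't name a specific scaler, but the soda-vs-yacht idea maps naturally onto min-max scaling, where each column is rescaled to its own 0-1 range; the prices below are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up prices: the same dollar amount means very different things per column.
prices = pd.DataFrame({"soda": [0.10, 1.00, 40.00],
                       "yacht": [40.00, 2_000_000.00, 18_000_000_000.00]})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(prices), columns=prices.columns)
print(scaled)   # each column now runs from 0 (its cheapest value) to 1 (its priciest)
```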

Next, I wanted to see how many features I needed for optimal accuracy. I used recursive feature elimination, which is a process of designing a model and then using that model to look for which features may be removed to improve it.
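Here is a sketch of that step with scikit-learn's RFE, keeping the 20 features settled on below; the estimator choice mirrors the baseline sketch above and is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# X_train / y_train from the baseline sketch above
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=42),
               n_features_to_select=20)
selector.fit(X_train, y_train)

kept = X_train.columns[selector.support_]   # boolean mask of retained features
print(list(kept))
```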

20 features seems right

After feature selection, here were my results:

64% accuracy is not bad. Considering that a little over 50% of all NBA players make the playoffs every year, I was able to create a model that, without any team context at all, can predict to some degree which players will earn a trip to the playoffs. Let's look at which features have the most impact (don't give too much attention to the vertical axis values). I encourage you to keep these features in mind for later to see if they differ from the driving influences of the larger-scale, team-oriented model.

If I take out that one feature at the beginning that is high in magnitude for feature importance, we get the following visual:

I will now run through the same process with team data. I will skip the scaling step as it didn’t accomplish a whole lot.

First model:

Feature selection.

Results:

78% accuracy while technically being a precision model

Feature importance:

Feature importance excluding age:

I think one interesting observation here is how much age matters in the team context over time, but less in the 2018-19 NBA season. Effective field goal percent showed the opposite relationship.

Conclusion

I'm excited for sports to be back. I'm very curious to see the NBA playoffs unfold and would love to see if certain teams emerge as dark-horse contenders. Long breaks from basketball can either be terrible or terrific; I'm sure a lot of players improved while others regressed. I'm also excited to see that predicting playoff chances, on both the individual and team levels, can be done with an acceptable degree of accuracy.

I'd like to bring this project full circle. Let's look at the NBA champion San Antonio Spurs from 2013-14 (considered a strong team-effort-oriented team) and Devin Booker from 2018-19. I don't want to spend too much time here, but let's do a brief demo of the model using feature importance as a guide. The average age of the Spurs was two years above the NBA average, with the youngest members of the team being 22, 22, and 25. The Spurs led the NBA in assists that year and were second to last in fouls per game. They were also top ten in fewest turnovers per game and in free throw shooting. Now, for Devin Booker. Well, you're probably expecting me to explain why he is a bad team player and doesn't deserve a spot in the playoffs. That's not the case. By every leading indicator in feature importance, Devin seems like he belongs in the playoffs. However, let's remember two things: my individual player model was less accurate, and Booker sure seems like an anomaly.

I think there is a bigger story here, though. Basketball is a team sport. Having LeBron James as your lone star can only take you so far (it's totally different for Michael Jordan because Jordan is ten times better, and I have no bias at all as a Chicago native). That's why people love teams like the San Antonio Spurs. They appear to be declining now, as leaders such as Tim Duncan and Kawhi Leonard have recently left the team, but basketball is still largely a team sport, and the team prediction model was fairly accurate. Further, talent alone does not seem to be enough either. Every year, the teams that make the playoffs tend to have older players, while talent and athleticism are generally skewed toward younger players. Given these insights, I'm excited to have some fresh eyes with which to follow the NBA playoffs.

Thanks for reading.

Have a great day!


Method to the Madness

Imposing Structure and Guidelines in Exploratory Data Analysis


Introduction

I like to start off my blogs with the inspiration that led me to write them, and this blog is no different. I recently spent time working on a project. Part of my general process when performing data science inquiries and building models is to perform exploratory data analysis (EDA from here on). While I think modeling is more important than EDA, EDA tends to be the most fun part of a project and is very easy to share with a non-technical audience who doesn't know what multiple linear regression or machine learning classification algorithms are. EDA is often where you explore and manipulate your data to generate descriptive visuals. This helps you get a feel for your data and notice trends and anomalies. Generally, what makes this process exciting is the number of unique types of visuals and customizations one can use to design said visuals.

So now let's get back to the beginning of the paragraph: I was working on a project and was up to the EDA step. I got really engrossed in this process and spent a lot of effort trying to explore every possible interesting relationship. Unfortunately, this soon became stressful, as I was just randomly looking around and didn't really know what I had already covered or at what point I would feel satisfied and stop. It's not reasonable to explore every possible relationship in every way. Creating many unique custom data frames, for example, is a headache, even if you get some exciting visuals as a result (and the payoff is not worth the effort you put in). After doing this, I was really uneasy about continuing to perform EDA in my projects.

At this point, the most obvious thing in the entire world occurred to me, and I felt pretty silly for not having thought of it before: I needed to make an outline and a process for all my future EDA. This may sound counter-intuitive, as some people may argue that EDA shouldn't be so rigorous and should be more free-flowing. Despite that opinion, I have the mindset that imposing a structure and plan would be very helpful. Doing this would save me a whole lot of time and stress while also allowing me to keep track of what I had done and capture some of the most important relationships in my data. (I realize some people may not agree with my approach to EDA. My response to them is that if their project is mainly focused on EDA, as opposed to developing a model, then that is not the scenario I have in mind. In addition, one can still impose structure while performing more comprehensive EDA in the scenario where a project does in fact center on EDA.)

Plan

I’m going to provide my process for EDA at a high level. How you decide to interpret my guidelines based on unique scenarios and which visuals you feel best describe and represent your data are up to you. I’d like to also note that I consider hypothesis testing to be part of EDA. I will not discuss hypothesis testing here, but understand that my process can extend to hypothesis testing to a certain degree.

Step 1 – Surveying Data And Binning Variables

I am of the opinion that EDA should be performed before you transform your data using methods such as feature scaling (unless you want to compare features of different average magnitudes). However, just like you would preprocess data for modeling, I believe there is an analogous way to "preprocess" data for EDA. In particular, I would like to focus on binning features. In case you don't know, feature binning is a method of transforming old variables or generating new variables by splitting numeric values into intervals. Say we have the data points 1, 4, 7, 8, 3, 5, 2, 2, 5, 9, 7, and 9. We could bin this as [1,5] and (5,9] and assign the labels 0 and 1 to each respective interval. So if we see the number 2, it becomes a 0; if we see the number 8, it becomes a 1. I think it's a pretty simple idea. It is so effective because you will often have large gaps in your data or imbalanced features. I talk more about this in another blog about categorical feature encoding, but I would like to provide a quick example. The number of people who buy cars between $300k and $2M is probably about the same as, if not smaller than, the number of people who buy cars between $10k and $50k. That's all conjecture and may not in fact be true, but you get the idea. The first interval is significantly bigger than the second in terms of car price range, but it is certainly logical to group each of those two sets of buyers together. You can easily cut your data into bins (and assign labels), and even cut at quantiles, in pandas (pd.cut, pd.qcut). You can also feature engineer before you bin your data: transformations first, binning second.

So this is all great, and it will make your models better if you have large gaps, and it even gives you the option to switch from regression to classification, which can also be good. But why do we need it for EDA? I think there is a simple answer: you can more efficiently add filters and hues to your data visualizations. Let me give you an example. Say you want to find the GDP across all 7 continents (Antarctica must have like a zero GDP) in 2019 and add an extra filter to show large vs. small countries. If you added a filter for each unique country size, what a mess! Instead, split countries along a certain population threshold into big and small. Now you have 14 data points instead of 200+ (14 = 7 continents * (large-population country + small-population country)). I'm sure some readers can think of other ways to deal with this issue, but I'm just trying to prove a point here, not re-invent the wheel. When I do EDA, I like to first look at violin plots and histograms to determine how to split my data and how many groups (2 bins, 3 bins...) I want, but I also keep my old variables, as they are also valuable to EDA, as I will allude to later on.
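Here is a minimal pandas sketch of both flavors of binning, using made-up car prices in the spirit of the example above:

```python
import pandas as pd

prices = pd.Series([12_000, 18_500, 30_000, 45_000, 300_000, 750_000, 2_000_000])

# Explicit edges: an "everyday" bracket and a "luxury" bracket
bracket = pd.cut(prices, bins=[10_000, 50_000, 2_000_000], labels=["everyday", "luxury"])

# Or let quantiles pick the edges for roughly equal-sized bins
tercile = pd.qcut(prices, q=3, labels=[0, 1, 2])

print(pd.DataFrame({"price": prices, "bracket": bracket, "tercile": tercile}))
```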

Step 2 – Ask Questions Without Diving Too Deep Into Your Data

This idea will be developed further later on and honestly doesn't need much explanation right now. Once you know what your data is describing, you should begin to ask legitimate questions you would want to know the answers to, decide which relationships you would like to explore given the context of your data, and write them down. I talk a lot about relationships later on; when you first write down your questions, don't get too caught up in that, just think about the things in your data you would like to explore using visuals.

Step 3 – Continue Your EDA By Looking At The Target Feature

I would always start the EDA process by looking at the most important and crucial piece of your data: the target feature (assuming it's labeled data). Go through each of your predictive variables individually and compare them with the target variable, with any filters you may like. Once you finish exploring the relationship between the target variable and variable number one, don't include variable number one in any more visualizations. I'll give an example. Say we have the following features: wingspan, height, speed, and weight, and we want to predict the likelihood that any given person played professional basketball. Start with variable number one, wingspan. You could look at the average wingspan of regular folks versus professional players. However, you could also add other filters, such as height, to see if the average wingspan of tall professional players differs from that of tall regular people. Here, we would have three variables in one simple bar graph: height, wingspan, and the target of whether a person plays professionally or not. You could continue doing this until you have exhausted all the relationships between wingspan, the target variable, and anything else (if you so choose). Since you have explored all those relationships (primarily using bar graphs, probably), you can now cross off one variable from your list, although it will come back later, as we will soon see. Rinse and repeat this process until you go through every variable. The outcome is that you have a really good picture of how your target variable moves around with different features as the guiding inputs. This will also give you some insight to reflect on when you look at feature importances or coefficients later on. I would like to add one last warning, though: don't let this process get out of hand. Keep just a couple of visuals for each feature's relationship with the target variable, and don't be afraid to sacrifice some variables to keep the more interesting ones. (I realize that this part of the process would need a modification if you have more than a certain number of features.)
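Here is a tiny illustration of that wingspan example using made-up numbers, with seaborn's hue argument doing the work of the extra height filter:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data for the wingspan/height example (all values are made up).
people = pd.DataFrame({
    "wingspan_cm": [180, 185, 210, 215, 190, 220, 175, 230],
    "tall":        ["no", "no", "yes", "yes", "no", "yes", "no", "yes"],  # binned height
    "pro":         [0, 0, 1, 0, 0, 1, 0, 1],                              # target feature
})

# Three variables in one bar chart: wingspan vs. the target, split by the height bin
sns.barplot(data=people, x="pro", y="wingspan_cm", hue="tall")
plt.show()
```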

Step 4 – Explore Relationships Between Non-Target Features

The next step I usually take is to look at all the data (again), but without including the target feature. Similar to the process described above, if I have 8 non-target features, I will think of all the ways I can compare variable 1 with variables 2 through 8 (with filters). I then look at ways to compare feature 2 with features 3 through 8, and so on, until I reach the comparison between 7 and 8. Now, obviously not every pair of variables may be efficient to compare, easy to compare, or even worth comparing. Here, I tend to use a lot of different types of graphs, since I am not focused on the target feature; you can get more creative than you might be able to when comparing features to the target in a binary classification problem. There are a lot of articles on how to leverage Seaborn to make exciting visuals, and this step of the process is probably a great application of all of them. I described an example above where we had height, wingspan, weight, and speed (along with a target feature). In terms of the relationships between the predictors, one interesting inquiry could be the relationship between height and weight in the form of a scatter plot, regression plot, or even joint plot. Developing visuals to demonstrate relationships can also affect your mindset when you later filter out correlated features or do feature engineering.

I'd like to add a Step 4A at this point. When you look at your variables and start writing down all your questions and tracking the relationships you want to explore, I highly recommend that you assign certain features as filters only. What I mean is the following. Say you have a new example: feature 1 is normally distributed and has 1,000 unique values, and feature 2 (non-target) has 3 unique values (or distinct intervals). I highly recommend you don't spend a lot of time exploring all the possible relationships that feature 2 has with other features. Instead, use it as a filter or "hue" (as it's called in Seaborn). If you want to find the distribution of feature 1, which has many unique values, you could easily run a histogram or violin plot, but you can tell a whole new story if you split those data points into three categories dictated by feature 2. So when you write out your EDA plan, figure out what works best as a filter, and that will save you some more time.
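As a quick illustration of this step (and of Step 4A), here is a joint plot of two continuous predictors with a low-cardinality column used purely as a hue. It uses seaborn's built-in penguins data rather than the basketball example, just to keep it self-contained.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Two continuous predictors compared directly, with a low-cardinality column as the hue.
penguins = sns.load_dataset("penguins")
sns.jointplot(data=penguins, x="flipper_length_mm", y="body_mass_g", hue="species")
plt.show()
```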

Step 5 – Keep Track Of Your Work

Not much to say about this step. It’s very important though. You could end up wasting a lot of time if you don’t keep track of your work.

Step 6 – Move On With Your Results In Mind

When I am done with EDA, I often switch to hypothesis testing, which I like to think of as a more equation-y form of EDA. Like EDA, you should have a plan before you start running tests. However, now that you've done EDA and gotten a real feel for your data, you should keep your results in mind for further investigation when you go into hypothesis testing. Say you find a visual that tells a puzzling story – then you should investigate further! Also, as alluded to earlier, your EDA could impact your feature engineering. Really, there's too much to list in terms of why EDA is such a powerful process even beyond the initial visuals.

Conclusion:

I’ll be very direct about something here: there’s a strong possibility that I was in the minority as someone who often didn’t plan much before going into EDA. Making visuals is exciting and visualizations tend to be easy to share with others when compared to things like a regression summary. Due to this excitement, I used to dive right in without much of a plan beforehand. It’s certainly possible that everyone else already made plans. However, I think that this blog is valuable and important for two reasons. Number 1: Hopefully I helped someone who also felt that EDA was a mess. Number 2: I provided some structure that I think would be beneficial even to those who plan well. EDA is a very exciting part of the data science process and carefully carving out an attack plan will make it even better.

Thanks for reading and I hope you learned something today.
