Pokémon: EDA and Battle Modeling

Exploratory data analysis and models for determining which Pokémon tend to have the biggest pure advantages in battle, given no items in play.

Charizard

Thanks for visiting my blog today!

Context

Pokémon is celebrating its 25th anniversary in 2021, and I think this is the perfect time to reflect on the franchise from a statistical perspective.

Introduction

If you’re around my age (or younger), then I hope you’re familiar with Pokémon. It’s the quintessential Game Boy game we all used to play as kids before we started playing “grown-up” games on more advanced consoles. I remember how excited I used to get when new games were released, and I remember equally well how dejected I was when I lost my copy of “FireRed.”

Pokémon has followed a basic formula through each iteration of the game, and it’s pretty simple. It revolves around creatures known as “Pokémon.” You start with a “starter” Pokémon, who usually has high potential but very little power at first, and from there you can catch Pokémon, train Pokémon, buy items, beat gym leaders, find “legendary” Pokémon (I suppose the word “Pokémon” works as singular and plural), create a secret base, trade Pokémon with other Game Boy owners, and of course beat the Elite Four like 500 times with different parties of Pokémon. With each new set of games (games were released in sets usually labeled by a color – Silver & Gold and Green & Red, as two examples) came a new group of Pokémon. The first generation was pretty bland. There were some cool ones like Charizard, Alakazam, and Gyarados, but there were far more weird and boring ones like Slowpoke (Slowpoke has got to be the laziest name ever), Pidgey (second laziest name?), and Arbok (Cobra backwards – yes, it’s a snake). As the generations progressed, there was a lot more diversity and creativity. Some generations, like Gen 2 and Gen 3, really stuck out in my head. Examples include Skarmory and a special red (not blue) Gyarados in Gen 2, and Blaziken and Mightyena in Gen 3.

The most important part of the Pokémon games is battles. Pokémon grow in skill, and sometimes even transform in appearance, through battling and gaining experience. You also earn badges and win prize money through battles. Battles are thrilling and require planning and tactics. This is expressed in many ways: the order of your Pokémon in your party, the types of Pokémon you bring to counter your potential opponents, the items you use to aid you in battle, when you choose to use certain moves (there is a limit to the number of times each move can be used), switching Pokémon mid-battle, sacrificing Pokémon, whittling down a “wild” Pokémon’s health to the point where it doesn’t faint but is weak enough to be caught, and plenty more I could probably add.

So, the point of this blog is to look for interesting trends and findings through visualizations in the exploratory data analysis (EDA) section; later, I will examine the battle aspect and look for key statistics that indicate when a Pokémon is likely to win or lose a battle.

Let’s get started!

Data

My data comes from Kaggle and spans about 800 Pokémon across the first six generations of the series. I’d like to quickly note that this includes mega Pokémon, which are basically just more powerful versions of certain regular Pokémon. Let me show you what I mean: we have Charizard at the top of this post, and below we have Mega Charizard.

Mega Charizard X

Yeah, it’s pretty cool. Anyway, we also have legendary Pokémon in this dataset, and while legendary Pokémon may have mega forms, they are not just a fancy copy of another Pokémon. I’ll get more into how I filtered out mega Pokémon in the data prep section. One last note on Pokémon classes – there are no “shiny” Pokémon in this dataset. Shiny Pokémon are usually not more powerful than their regular versions, but they are more valuable in the lexicon of rare Pokémon.

Next, we have two index features (why not just one?). Every Pokémon has at least one “type.” Type usually connects to an element of nature. Types include grass, poison, and fire (among others), but also less nature-related terms such as dragon, fairy, and psychic. A Pokémon can have up to two types, with one usually being primary and the other secondary. Type is a critical part of battles, as certain types have inherent advantages and disadvantages against other types (fire beats grass but water beats fire, for example). Unfortunately, creating a dataset containing battle data between every possible pairing of Pokémon would be a mess, so we are going to use the modeling section to see what kinds of Pokémon are strongest in general, not situationally.

Our next features relate to Pokémon stats. HP – or hit points – represents health. After a Pokémon gets struck with enough attack power (usually after a couple of turns), its HP runs out and it faints. That takes me to the next four features: attack, defense, special attack, and special defense. Attack and defense are the baseline for a Pokémon’s, well… attack and defense. Certain moves may nullify these stats, though – a one-hit knockout move is a great example; attack doesn’t matter at all. Special attack and special defense are the same idea but refer to moves where there is no contact between Pokémon, so a move like “body slam” is attack-based while a move like “flamethrower” is special-attack-based. Speed basically determines who strikes first in battle: occasionally some moves take precedence, but in their absence the turn-based battle begins with the attack of the faster Pokémon. Total is the sum of HP, attack, defense, special attack, special defense, and speed.

The generation feature refers to which generation, from 1 to 6, the Pokémon in question first appeared in. Pokémon often make appearances in later generations once introduced, and this is sometimes the case for legendary Pokémon as well. That takes us to our next feature, a binary legendary indicator. Legendaries are Pokémon that appear once per game, usually in one specific spot, and are generally quite powerful (and cool-looking). There are “roaming legendaries” that are unique but appear at random locations until caught, as opposed to being confined to one special location; most legendaries, though, are relegated to specific areas at specific points in a game’s storyline. Latios and Latias, from Pokémon Ruby and Sapphire, are roaming and incredibly hard to locate and subsequently catch, while Groudon and Kyogre are easier to find and catch and play a key role in the storyline of their respective games.

Finally, our last feature – and the target – is win rate, which is continuous. I added some features to the dataset, which will be discussed below. The only one I’ll mention now is that I binned win rate into three groups in order to have a discrete model as well; a small sketch of that step follows.
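Since win rate is continuous, here is a minimal sketch of how that three-way binning might look in pandas. The file name and column names are assumptions, and I used equal-width bins purely for illustration.

```python
import pandas as pd

# Hypothetical file and column names for the Pokémon data with win rates
df = pd.read_csv("pokemon.csv")

# Cut the continuous win rate into three classes for the discrete models
df["win_rate_class"] = pd.cut(
    df["win_rate"],
    bins=3,
    labels=["low", "medium", "high"],
)
print(df["win_rate_class"].value_counts())
```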

Data Preparation

This section will outline how I manipulated my data for modeling purposes. I started by dropping extra columns like the index feature. Next, I created a binary feature representing whether a Pokémon has one or two types. I also scanned the data to see if the word “mega” appears in any names and added a feature indicating whether a Pokémon is a mega form. In addition to the collective data, I separated out datasets of common, mega, and legendary Pokémon, aiming for no overlap between them. I had enough data that I could comfortably delete all rows with null values.

I added another feature to measure the ratio of special attack to attack, and did the same with defense. I originally thought special attack meant something different than what it actually means and ended up calling these features “attack bonus” and “defense bonus.” Next, I filtered out rows containing numerical outliers, defined as values more than 2.5 standard deviations from the mean of a feature; this reduced my data by about 15%. I then applied max-absolute scaling to gauge each value relative to the others in its feature (for these non-negative features, this puts every result between 0 for low and 1 for high). I also noticed through hypothesis tests that speed might have an outsized impact on battle outcome (more on this later), so I created an extra feature called “speed ratio” to reflect how much of a Pokémon’s total comes from speed alone. To deal with categorical data, I leveraged target encoding; I published an article on Towards Data Science which can be found here if you would like to learn more about it. That was the end of my main data prep, aside from a quick train-test split, which pertains more to modeling but is technically an element of data prep. A condensed sketch of these steps is below.
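The sketch below mirrors the steps described above. The column names follow the common Kaggle Pokémon columns but should be treated as assumptions, and the target-encoding step is left out for brevity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

df = df.drop(columns=["#"])                                    # drop the extra index column
df["two_types"] = df["Type 2"].notna().astype(int)             # one type vs. two types
df["mega"] = df["Name"].str.lower().str.contains("mega").astype(int)
df = df.dropna()                                               # enough data to drop null rows

df["attack_bonus"] = df["Sp. Atk"] / df["Attack"]              # special attack relative to attack
df["defense_bonus"] = df["Sp. Def"] / df["Defense"]            # special defense relative to defense
df["speed_ratio"] = df["Speed"] / df["Total"]                  # share of total coming from speed

# Drop rows more than 2.5 standard deviations from the mean in any numeric feature
num_cols = [c for c in df.select_dtypes(include=np.number).columns if c != "win_rate"]
z_scores = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
df = df[(z_scores.abs() < 2.5).all(axis=1)]

# Max-absolute scaling: each non-negative feature ends up between 0 and 1
df[num_cols] = df[num_cols] / df[num_cols].abs().max()

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["win_rate", "win_rate_class", "Name"]),
    df["win_rate"],
    test_size=0.25,
    random_state=42,
)
```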

EDA

This section contains a sampling of the more exciting graphs describing what Pokémon has looked like over the years.

The following graph shows the distributions of win rate and total stats respectively segmented by Pokémon class.

The following graph shows how many Pokémon appear for each type:

The graph below shows the difference between stats of legendary Pokémon and regular ones.

And next…

This next set of visuals shows trends over the years in common, legendary, and mega Pokémon respectively.

The next series of charts shows the distribution of stats by type with common Pokémon:

Next, we have scatter plots to compare attack and defense.

The next chart shows the distribution of total Pokémon across 6 generations.

Let’s look at legendary and mega Pokémon.

Let’s look at how these legendaries are distributed by type.

What about common Pokémon?

The next group of visuals shows distribution of stats by generation among legendary Pokémon.

The next set of visuals shows which Pokémon have the highest ratio of special attack to attack.

The next set of visuals shows which Pokémon have the highest ratio of special defense to defense.

This next set of charts shows where mega Pokémon come from.

Next, we have the cumulative rise in the total number of Pokémon over the generations.

Models

As I mentioned above, I ran continuous models and discrete models. I ran models for 3 classes of Pokémon: common, legendary, and mega. I also tried various groupings of input features to find the best model while not having any single model dominated by one feature completely. We’ll soon see that some features have outsized impacts.

Continuous Models

The first and most basic model yielded the following feature importance while having 90% accuracy:

Speed and speed ratio were too powerful, so after removing them my accuracy dropped 26% (you read that right). This means I need either speed or speed ratio, and I chose to keep speed ratio. I also removed the type 1, type 2, generation, and “two-type” features based on p-values. Now I had 77% accuracy with the following feature rankings.
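The model code itself isn’t shown in this post, so here is a hedged sketch of the kind of fit-and-inspect loop being described: a random forest regressor on a reduced feature set, with its score and feature importances printed. The choice of regressor and the exact feature list are my assumptions.

```python
from sklearn.ensemble import RandomForestRegressor

features = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def",
            "speed_ratio", "attack_bonus", "defense_bonus", "Legendary"]

reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(X_train[features], y_train)

print("Test score:", reg.score(X_test[features], y_test))   # R^2 on held-out data
for name, importance in sorted(zip(features, reg.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name:>15}: {importance:.3f}")
```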

I’ll skip some steps here, but the following feature importances, along with 75% accuracy, seemed most satisfactory:

Interestingly, having special attack/defense higher than regular attack/defense is not good. I suppose you want a Pokémon’s special stats to be in line with its regular ones.

I’ll skip some steps again, but here are my results for mega and legendaries within the continuous model:

Legendary with 60% accuracy:

Mega with 91% accuracy:

Discrete Models

First things first – I needed to upsample the data. My data was pretty evenly balanced but needed minor adjustments. If you don’t know what I mean by target feature imbalance, check out my blog (on Towards Data Science) over here. Anyway, here comes my best model (notice the features I chose to include):

This shows strong accuracy at 80%, along with a good confusion matrix and interesting feature importances. For more information on what the metrics above, such as recall, actually mean, check out my blog over here.

Next Steps

One-Hot Encoding. I think it would be interesting to learn more about which specific types are the strongest. I plan to update this blog after that analysis so stay tuned.

Neural Network. I would love to run a neural network for both the continuous and discrete case.

Polynomial Model. I would also love to see if polynomial features improve accuracy (and possibly help delete the speed-related features).

Conclusion

For all you Pokémon fans and anyone else who enjoys this type of analysis, I hope you liked reading this post. I’m looking forward to the next 25 years of Pokémon.

Was Gretzky Really THAT Great?

(I used to genuinely question Gretzky’s greatness and the data helped me make my decision)

A unique approach to evaluate NHL players by accounting for era effects

The Great One

Introduction

Thank you for visiting my blog!

Let’s get right into this blog, as there isn’t a whole lot to be said in the introduction other than that we are going to build an exciting and unique analysis of every NHL player from every NHL season (up until 2018-19).

Project Goal

A while back, I set out to develop a system to evaluate NHL and NBA players (as in, everyone ever to play a second at the professional level) and see how they measure against their peers. I worked on data engineering, creating a database, and designing an optimal way to evaluate players. For today’s blog, I’m going to share part of that project with you in an interactive, visual-driven way. I have created a Tableau Public project that lets me share some insights in a way driven by you, the reader: you can pick your own filters and ways to view my data.

All you need to know for now is this: when you see that a good player’s goal statistic is, I don’t know, 3.513, it means that the player in question was 3.513 times better than league average in goals that year. If the average NHL player scores 5.5 goals per year, the player in question probably scored ~19 goals. That may not sound like a lot of goals in an 82-game season to the casual fan, but it’s actually not bad. Just keep in mind while you view the project that some stats were not recorded until more recent years; if you look up how many hits Maurice Richard or Eddie Shore had in any season of their careers, you will likely find no information. To see the various visuals, scroll down to where it says “Metadata” and click on different sheets. Also, as you adjust your filter preferences, keep in mind that you do not want to overload the system. Looking up every name of every player who played in the 2006 season, for example, is a lot more annoying than looking at every Blackhawks player in 2006 or every player who had at least 4 times the league average of goals in 2006.

I really like the system of comparing actual stats to league averages because it allows you to understand everything in context. If someone scores 200 goals in a season, for example, you may assume he’s better than any current player, as no one has even come close to 70 goals in recent years (the most goals ever in a season is 92, I think – so this is obviously just a test case). However, if the season in question saw almost every player reach at least 8,000 goals (again, just a test case), then the player with 200 goals kind of sucks. With relative scores, we don’t have this problem. It’s similar to feature scaling in data science ($100 is a lot to pay for a candy bar but rather cheap for a functional yacht). My system is susceptible to the occasional glitch, especially if a player was traded late in the season, but overall it works quite well.
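To make the “relative to league average” idea concrete, here is a minimal sketch with a toy frame; the real project uses a full player-season database, and the column names and numbers here are purely illustrative.

```python
import pandas as pd

# Toy player-season data: one row per player per season
skaters = pd.DataFrame({
    "season": [1982, 1982, 1982],
    "player": ["Superstar", "Average Joe", "Grinder"],
    "goals":  [92, 22, 5],
})

# Divide each player's goals by that season's league average
league_avg = skaters.groupby("season")["goals"].transform("mean")
skaters["relative_goals"] = skaters["goals"] / league_avg
print(skaters)
```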

The Gretzky Question

To follow up on my earlier comment about Gretzky, I’d like to share one visual users can navigate to on their own: go to the “Player Goal Assists” tab, filter for Edmonton Oilers stats from 1980 to 1989, and only include players who averaged at least the league average in goals and assists across those 10 seasons. The red bars indicate how much better a player was than the league in assists, and the teal represents goals. If you know hockey, you see guys like Jari Kurri, Mark Messier, and Paul Coffey, who are all superstars. Paul Coffey is actually one of my favorite all-time players (and favorite defensemen) in the pool of players I never saw play. However, none of the players mentioned above (all Hall of Famers who each won at least one individual trophy as well as multiple Stanley Cups) compare to Wayne Gretzky (second from the right). So, yeah, he was pretty darn good even after accounting for era effects and taking into account his superstar teammates.

Link

So now that I have talked about the project and how to navigate your way around it, here comes the link…

But first… keep in mind that you guide this project using the interactive filters on the right hand side of the page – use those filters!

https://public.tableau.com/profile/joseph.cohen2401#!/vizhome/relative_hockey/Sheet0

Conclusion

Tableau is great! I only know the basics, though, and am still getting my feet wet. I definitely see myself updating the Tableau project linked above and diving deeper into more interesting and exciting visualizations. If you liked my data visualization system, I recommend you also learn some Tableau basics, as it’s fairly simple software. I hope you found some interesting stats and stories in today’s inquiry.

Thanks for Reading!


Dealing With Imbalanced Datasets The Easy Way

Imposing data balance in order to have meaningful and accurate models.


Introduction

Thanks for visiting my blog today!

Today’s blog will discuss what to do with imbalanced data. Let me quickly explain what I’m talking about for all you non-data scientists. Say I am screening people to see if they have a disease, and I accurately classify every single person (let’s say I screen 1,000 people total). Sounds good, right? Well, what if I told you that 999 people had no issues and I predicted them as not having the disease, while the other 1 person had the disease and I got that right too? This clearly holds little meaning. In fact, I would have just about the same overall accuracy if I had predicted that one diseased person to be healthy. There was literally one positive case, and we don’t really know if my screening tactics work or I just got lucky. In addition, if I had predicted that one diseased person to be healthy, then despite my high accuracy, my model would be pointless since it always gives the same answer.

If you read my other blog posts, I have a similar one which discusses confusion matrices. I never really thought about confusion matrices and their link to data imbalance until I wrote this blog, but I guess they’re still pretty different topics, since you don’t normally upsample validation data, which gives the confusion matrix its own unique significance. However, if you generate a confusion matrix after training on imbalanced data, you may not be able to trust your answers.

Back to the main point: imbalanced data causes problems and often leads to meaningless models, as we have seen above. Generally, it is thought that adding more data to any model or system will only lead to higher accuracy, and upsampling a minority class is no different. A really good example of where balancing matters is fraud detection. Most people (I hope) aren’t committing any type of fraud, ever (I highly recommend you don’t ask me about how I could afford that yacht I bought last week). That means that when you look at something like credit card fraud, the vast majority of the time a person makes a purchase, their credit card was not stolen. Therefore, we need more data on cases where people actually are the victims of fraud to better understand what red flags and warning signs to look for. I will discuss two simple methods you can use in Python to solve this problem. Let’s get started!

When To Balance Data

For model validation purposes, it helps to have a set of data with which to train the model and a set with which to test the model. Usually, one should balance the training data and leave the test data unaffected.

First Easy Method

Say we have the following data…

Target class distributed as follows…

The following code lets you very quickly decide how much of each target class to keep in your data. One quick note: you may have to update the library here. It’s always helpful to update libraries every now and then as they evolve.
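The original code isn’t reproduced here, so below is a sketch of the idea using imbalanced-learn’s RandomUnderSampler, which is my assumption about the library being described. The sampling_strategy dictionary says how many rows to keep per class, and the counts are hypothetical.

```python
# pip install -U imbalanced-learn   (the library you may need to update)
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

sampler = RandomUnderSampler(
    sampling_strategy={0: 500, 1: 500},   # keep 500 rows of class 0 and 500 of class 1
    random_state=42,
)
X_balanced, y_balanced = sampler.fit_resample(X_train, y_train)
print(pd.Series(y_balanced).value_counts())
```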

Look at that! It’s pretty simple and easy. All you do is decide how many rows of each class to keep; after that, only that many rows from each target class remain. The sampling strategy states how many rows to keep for each target class. Obviously you cannot exceed the maximum available per class, so this can only serve to downsample, which is not the case with our second easy method. This method works well when you have many observations from each class and doesn’t work as well when one class has far fewer observations.

Second Easy Method

The second easy method is to use resample from sklearn.utils. In the code below, I point out that I am using training data, as I did not make that explicit above. I also generate new rows of class 1 (the sick class) – enough to bring it level with the healthy class. So all the original training data stays the same, but I repeat some rows from the minority class to create that balance.
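Here is a sketch of that step with sklearn.utils.resample, assuming a training frame called train with a binary target column; the names are placeholders.

```python
import pandas as pd
from sklearn.utils import resample

majority = train[train["target"] == 0]        # healthy class
minority = train[train["target"] == 1]        # sick class

minority_upsampled = resample(
    minority,
    replace=True,                 # sample with replacement, i.e. repeat rows
    n_samples=len(majority),      # bring the sick class level with the healthy class
    random_state=42,
)

train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["target"].value_counts(normalize=True))
```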

Here are the results of the new dataset:

As you can see above, each class represents 50% of the data. This method can be extended to cases with more than two classes quite easily as well.

Update!

If you are coming back and seeing this blog for the first time, I am very appreciative! I recently worked on a project that required data balancing. Below, I have included a rough but effective way to build a robust data balancing function that works well without having to specify the situation or context too much. I just finished writing this function, but I think it works well and would encourage any readers to take it and see if they can leverage it themselves. If it has any glitches, I would love to hear feedback. Thanks!
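My original function isn’t shown here, so the version below is a rough sketch of the same idea under my own assumptions: upsample every minority class in a training frame until each class matches the largest one.

```python
import pandas as pd
from sklearn.utils import resample

def balance_classes(df, target, random_state=42):
    """Upsample every class of `target` to the size of the largest class."""
    counts = df[target].value_counts()
    max_size = counts.max()
    pieces = []
    for cls, n in counts.items():
        subset = df[df[target] == cls]
        if n < max_size:
            subset = resample(subset, replace=True,
                              n_samples=max_size, random_state=random_state)
        pieces.append(subset)
    # Shuffle so the repeated rows aren't clumped together
    return pd.concat(pieces).sample(frac=1, random_state=random_state)

# Example (hypothetical): balanced_train = balance_classes(train, "target")
```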

Conclusion

Anyone conducting any type of regular machine learning modeling will likely need to balance data at some point. Conveniently, it’s easy to do and I believe you shouldn’t overthink it. The code above provides a great way to get started balancing your data and I hope it can be helpful to readers.

Thanks for reading!

Abbey Code

A Picture Is Worth 1000 Words. In This Case, I Decided To Stick With 400.


Introduction

Thanks for visiting my blog today!

For those of you who may not know, I love music and also play a couple instruments. One of my top 3 favorite bands of all time is The Beatles (#1 is Chumbawamba). The Beatles do not need an introduction as they are the most influential musical group in history.

Today, I’ve got a blog about the Beatles and their lyrics. I will later do some topic modeling and discuss how I might run an unsupervised learning algorithm with Beatles lyrics. For today, however, I’m going to share how I created a customized word cloud. That may not sound terribly exciting, but trust me when I say that this will be fun and useful.

Data Collection

First things first, we need some data, as this is part of a larger topic modeling project. Now, I’m sure I could find a data repository or two with Beatles lyrics, but I decided to leverage web scraping. I found a lyrics website called lyricsfreak.com (https://www.lyricsfreak.com/b/beatles/), scraped the links for every song, and later went into every link to extract lyrics via more web scraping and put them into a data frame. Interestingly, a couple of my “songs” were literally just short speeches (which were classified online as speeches, thankfully).

One limitation of this source is that the website sometimes cuts repeated lyrics. I first noticed this when I was checking the lyrics of “All My Loving.” It’s a great song that hopefully you’ve heard before. All the unique phrases and groups of lyrics in the song are listed, but some are not repeated even though the song itself repeats them. Also, the site doesn’t have every song. It was a little tricky to find a good lyrics website that would work well with web scraping, so I will be sticking with a slightly limited amount of data for this blog. I still think my results will be highly representative – you’ll see what I mean if you know The Beatles.

So just a quick recap: I web scraped all my lyrics. More importantly, the way I performed this task was rather robust, and I now believe I can quickly and easily scale my process to scrape a whole discography’s worth of lyrics from any other band. We’ll see if anything happens with that some time in the future.

Data Cleaning

I mentioned speeches. The first step in this process was to get rid of those. I’ll show you all the libraries I used (so keep them in mind for later) and what these speeches looked like:

Imports:

Some songs:

One speech:

Here are all the speeches:

Let’s take a look at one song. I picked “A Day In The Life,” as it was the first non-speech and is also widely considered to be the greatest Beatles song ever written given its advanced composition. (I know a bunch of readers are going to disagree and say that a song like “Hey Jude” or “Strawberry Fields Forever” is better, or potentially contend that early Beatles music is better than later Beatles music. You’re entitled to your opinion; this blog is not an authoritative essay on what the best Beatles song may be.)

By the way, the last chord in “A Day In The Life” (https://www.youtube.com/watch?v=YSGHER4BWME&ab_channel=TheBeatles-Topic, around the 4:19 mark) is an E major, even though the song is in the key of G major, which does not have the G# note necessary for an E major chord in its scale. We call that move from E minor to E major (by changing the G note to a G#) a “Picardy third” in music theory, and that difference is part of what makes this last chord so magical. If you’ve never heard the song, you’re missing out on possibly the most famous single chord in Beatles history. The other contender is the first chord (and literally the first thing you hear) in “A Hard Day’s Night.” Google says that chord is an Fadd9 (adding a high G note to the F chord, whose basic composition is just the F, A, and C notes). (Special thanks to https://www.youtube.com/watch?v=jGaNdKabvQ4&ab_channel=DavidBennettPiano.)

Ok, hopefully that made a little bit of sense or at least sparked some intrigue, but back to data cleaning. You may notice that the lyrics are written as a list. Here is how I dealt with that problem:

Here is a cleaning function designed to remove annoying characters like digits and punctuation:
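The screenshot of my function isn’t reproduced here, but a cleaning function along these lines (lowercasing plus stripping digits and punctuation – the exact rules are my assumption) would look like this:

```python
import string

def clean_lyric(text):
    """Lowercase a lyric line and strip digits and punctuation."""
    text = text.lower()
    to_remove = string.digits + string.punctuation
    return text.translate(str.maketrans("", "", to_remove))

print(clean_lyric("I read the news today, oh boy..."))
# -> "i read the news today oh boy"
```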

Here, I combined every lyric into a list:

Next, I created a word count dictionary:

I earlier imported stop words (words like “the” that don’t help a whole lot) and here I have removed them and created a new list:
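Combining the last two steps, here is a sketch of building the word-count dictionary with NLTK stop words filtered out; all_lyrics stands in for the combined list of cleaned lyric strings.

```python
from collections import Counter
from nltk.corpus import stopwords   # nltk.download("stopwords") may be needed once

stop_words = set(stopwords.words("english"))

word_counts = Counter()
for line in all_lyrics:             # all_lyrics: list of cleaned lyric strings
    for word in line.split():
        if word not in stop_words:
            word_counts[word] += 1

print(word_counts.most_common(10))
```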

That’s basically it.

Word Cloud

So here is the basic word cloud (notice we are using the generate_from_frequencies function to use the dictionary created above):
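A sketch of the plain version (the sizes and colors are arbitrary choices, not necessarily the exact ones used here):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color="white", max_words=400)
wc.generate_from_frequencies(word_counts)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```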

Looks pretty simple… and boring. Let’s try and make this a bit cooler. Like wouldn’t it be nice if we could work in an actual Beatles theme? How about the iconic Abbey Road album cover? Here is the picture you’ve likely seen before:

The Abbey Road album cover

First, I decided to find a silhouette online that is a bit more black-and-white. I found this picture:

To actually use this picture, we will need to convert it to a numpy array. I’ve heard the term mask used to describe this process. I’ll copy the file path into python to get started.

What does it look like now?

It’s a bit messy, but…

And one step further…

Word clouds in python allow you to pass in a mask to shape your word cloud. In the code below, we basically have the same word cloud but have just added a mask.
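A sketch of that masked version: the silhouette is opened with PIL, converted to a numpy array, and passed as the mask argument (the file path is a placeholder).

```python
import numpy as np
from PIL import Image

mask = np.array(Image.open("abbey_road_silhouette.png"))   # placeholder path

wc = WordCloud(background_color="white", max_words=400, mask=mask)
wc.generate_from_frequencies(word_counts)

plt.figure(figsize=(12, 12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```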

Output:

I arrived at the magic number of 400 words through trial and error. I needed enough words to make the logo full and capture the letters of “The Beatles” and the individual silhouettes, while still avoiding having so many words that each one becomes hard to read and identify.

In terms of design aesthetics, we are almost done here. The color scheme doesn’t look great. Conveniently, word clouds let you pass in colors.
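A sketch of the recolored version; the specific colormap name below is just an example, not necessarily the one in the final image.

```python
wc = WordCloud(background_color="black", max_words=400,
               mask=mask, colormap="autumn")
wc.generate_from_frequencies(word_counts)

plt.figure(figsize=(12, 12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```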

Output:

That looks a lot better now that we are using a new colormap. Interestingly, the word that stands out the most is “love.” The Beatles were well known for writing love songs, and that is in fact the most common word after data cleaning.

Conclusion

Word clouds are fun and effective. Word clouds can also be utilized in different ways. If you want to customize your word clouds, you can find an image or two that stand out to you, copy the path name (or bring it into the folder), and then just work out the details within the code. It’s pretty scalable. We were able to see in this blog a process of web scraping lyrics, cleaning data, and creating an exciting and unique visualization at the end.

Thanks for reading!

Sources and further reading

(https://towardsdatascience.com/create-word-cloud-into-any-shape-you-want-using-python-d0b88834bc32)

Sink or Swim

Effectively Predicting the Outcome of a Shark Tank Pitch

yikes

Introduction

Thank you for visiting my blog today!

Recently, during my quarantine, I have found myself watching a lot of Shark Tank. In case you are living under a rock, Shark Tank is a thrilling (and often parodied) reality TV show (currently on CNBC) where hopeful entrepreneurs come into the “tank” and face off against five “sharks.” The sharks are successful entrepreneurs who are basically de facto venture capitalists looking to invest in the hopeful entrepreneurs mentioned above. It’s called “Shark Tank” and not something a bit less intimidating because things get intense in the tank. Entrepreneurs are put through the wringer and forced to prove themselves worthy of investment in every way imaginable while standing up to strong scrutiny from the sharks. They need to demonstrate that they have a good product, understand how to run a business, understand the economic climate, are pleasant to work with, are trustworthy – and the list goes on and on. Plus, contestants are on TV for the whole world to watch, and that just adds to the pressure to impress. If one succeeds and manages to agree on a deal with a shark (usually a shark pays a dollar amount for a percentage of equity in the entrepreneur’s business), the rewards are usually quite spectacular, and entrepreneurs tend to get quite rich.

I like to think of the show, even though I watch it so much, as a nice way for regular folks like myself to feel intelligent and business-savvy for a hot second. Plus, it’s always hilarious to see some of the less traditional business pitches (the “Ionic Ear” did not age well: https://www.youtube.com/watch?v=FTttlgdvouY). With that said, I set out to look at the first several seasons of Shark Tank from a data scientist’s / statistician’s perspective and build a model to understand whether or not an entrepreneur would succeed or fail during their moment in the tank. Let’s dive in!


Data Collection

To start off, my data comes from kaggle.com and can be found at https://www.kaggle.com/rahulsathyajit/shark-tank-pitches. My goal was to predict the target feature “deal,” which was either a 0 representing a failure to agree on a deal or a 1 for a successful pitch. My predictive features were (by name): description, episode, category, entrepreneurs, location, website, askedFor, exchangeForStake, valuation, season, shark1, shark2, shark3, shark4, shark5, episode-season, and Multiple Entrepreneurs. Entrepreneurs means the name of the person pitching a new business, askedFor means how much money was requested, exchangeForStake represents the percent ownership offered by the entrepreneur, valuation is the implied valuation of the company, shark1–5 is just who was present (so shark1 could be Mark Cuban or Kevin Harrington, for example), and Multiple Entrepreneurs is a binary indicating whether or not there were multiple business owners. I think those are the only features that require explanation.

I used dummy variables to identify which sharks were present for each pitch (this is different from the shark1 variable, as now Mark Cuban, for example, is a column name with either a zero or a one assigned depending on whether or not he was on for that episode) and also used dummy variables to identify the category of each pitch. I also created some custom features. Thus, before removing highly correlated features, my features also included the dummy variables described above, website converted to a true/false variable depending on whether or not one existed, website length, binned versions of the amount asked for and the valuation, and a numeric label identifying which unique group of sharks sat for each pitch. A sketch of the dummy-variable step is below.
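Here is a rough sketch of that dummy-variable step in pandas. The column names come from the feature list above, but the details (file name, exact handling) are assumptions rather than my exact code.

```python
import pandas as pd

pitches = pd.read_csv("shark_tank_pitches.csv")   # placeholder file name

# One 0/1 column per pitch category
category_dummies = pd.get_dummies(pitches["category"], prefix="cat")

# One 0/1 column per shark, regardless of which of the five seats they sat in
shark_dummies = (
    pd.get_dummies(pitches[["shark1", "shark2", "shark3", "shark4", "shark5"]].stack())
    .groupby(level=0)
    .max()
)

pitches = pd.concat([pitches, category_dummies, shark_dummies], axis=1)
pitches["has_website"] = pitches["website"].notna().astype(int)
pitches["website_length"] = pitches["website"].fillna("").str.len()
```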

EDA (Exploratory Data Analysis)

The main goal of my blog here was to see how strong of a model I could build. However, an exciting part of any data-oriented problem is actually looking at the data and getting comfortable with what it looks like both numerically and visually. This allows one to easily share fun observations, but also provides context on how to think about some features throughout the project. Here are some of my findings:

Here is the distribution of the most common pitches (using top 15):

Here is the likelihood of getting a deal by category with an added filter for how much of a stake was asked for:

Here are some other relationships with the likelihood of getting a deal:

Here are some basic trends from season 1 to season 6:

Here is the frequency of each shark group:

Here are some other trends over the seasons. Keep in mind that the index starts at zero but that relates to season 1:

Here is the average stake offered by leading categories:

Here comes an interesting breakdown of what happens when there is and is not a guest shark like Richard Branson:

Here is a breakdown of where the most common entrepreneurs come from:

In terms of the most likely shark group for people from different locations:

I also made some visuals of the amount of appearances of each unique category by each of the 50 states. We obviously won’t go through every state. Here are a couple, though:

Here is the average valuation by category:

Here is a distribution of pitches to the different shark groups (please ignore the weird formatting):

Here are various visuals related to location:

Here are various visuals related to shark group:

This concludes my EDA for now.

Modeling

After doing some basic data cleaning and feature engineering, it’s time to see if I can actually build a good model.

First Model

For my first model, I used dummy variables for the “category” feature and for information on the sharks. Because the category feature can take values that might not appear in both splits, I split my data into training and test sets after pre-processing. I mixed and matched a couple of scaling methods and machine learning classification models before landing on standard scaling and logistic regression. Here was my first set of results:
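Before those results, here is a hedged sketch of what that scaling-plus-logistic-regression pipeline might look like (the feature handling is simplified relative to my actual notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Keep only numeric and dummy columns as predictors
X = pitches.drop(columns=["deal"]).select_dtypes(include=["number", "bool"]).astype(float)
y = pitches["deal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```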

In terms of an ROC/AUC visual:

64% accuracy on a show where anything can happen is a great start. Here were my coefficients in terms of a visual:

Let’s talk about these results. It seems like having Barbara Corcoran as a shark is the most likely indicator of a potential deal. That doesn’t mean Barbara makes the most deals. Rather, it means that you are likely to get a deal from someone if Barbara happens to be present. I really like Kevin because he always makes a ridiculous offer centered around royalties. His coefficient sits around zero. Effectively, if Kevin is there, we have no idea whether or not there will be a deal. He contributes nothing to my model. (He may as well be dead to me). Location seems to be an important decider. I interpret this to mean that some locations appear very infrequently and just happened to strike a deal. Furniture, music, and home improvement seem to be the most successful types of pitches. I’ll let you take a look for yourself to gain further insights.

Second Model

For my second model, I leveraged target encoding for all categorical data. This allowed me to split up my data before any preprocessing. I also spent time writing a complex backend helper module to automate my notebook. Here’s what my notebook looked like after all that work:

That was fast. Let’s see how well this new model performed given the new method used in feature engineering:

There is clearly a sharp and meaningful improvement present. That said, by using target encoding, I can no longer see the effects of individual categories pitched or sharks present. Here were my new coefficients:

These are far fewer coefficients than in my previous model, since I no longer have the blown-up dummy-variable columns, but this led to higher scores. This second model really shocked me: 86% accuracy for predicting the success of a Shark Tank pitch is surprising given all the variability present in the show.

Conclusion

I was really glad that my first model was 64% accurate given what the show is like and all the variability involved, and I came away with some insightful coefficients for understanding what drove predictions. By sacrificing some of the detailed information I got from dummy variables, I was able to encode categorical data in a different way, which led to an even more accurate model. I’m excited to continue this project and add data from more recent episodes to keep building a more meaningful model.

Thanks for reading and I hope this was fun for any Shark Tank fans out there.

Anyway, this is the end of my blog…

shark tank - Imgflip

Devin Booker vs. Gregg Popovich (vs. Patrick McCaw)

Developing a process to predict which NBA teams and players will end up in the playoffs.

Jim Mora NFL GIF

Introduction

Hello! Thanks for visiting my blog.

After spending the summer being reminded of just how incredible Michael Jordan was as a competitor and leader, the NBA is finally making a comeback and the [NBA] “bubble” is in full effect in Orlando. Therefore, I figured today would be the perfect time to discuss the NBA playoffs (I know the gif is for the NFL, but it is just too good not to use). Specifically, I’d like to answer the following question: if one were to look at some of the more basic NBA statistics recorded by a player or team on a per-year basis, with what accuracy could you predict whether or not that player or team will find themselves in the NBA playoffs?

Since the NBA has changed so much over time and each iteration has brought new and exciting skills to the forefront of strategy, I decided to stay up to date and look at the 2018-19 season for individual statistics (the three-point era). However, having only 30 data points to survey for team success would be insufficient for a meaningful analysis, so I expanded my survey to include data back to and including the 1979-80 NBA season (when Magic Johnson was still a rookie). I web-scraped basketball-reference.com to gather my data, and it contains every basic stat and a couple of advanced ones, such as effective field goal percent (I don’t know how that is calculated, but Wikipedia does: https://en.wikipedia.org/wiki/Effective_field_goal_percentage#:~:text=In%20basketball%2C%20effective%20field%20goal,only%20count%20for%20two%20points.), from every NBA season ever. As you go back in time, you do begin to lose data from some statistical categories that weren’t recorded yet. To give two examples, blocks were not recorded until after the retirement of Bill Russell (widely considered the greatest defender to ever play the game), and three-pointers were first introduced into the NBA in the late 1970s. So just to recap: if we look at fundamental statistics recorded by individuals or in a larger team setting, can we predict who will find themselves in the playoffs?

Before we get started, I need to address the name of this blog. Gregg Popovich has successfully coached the San Antonio Spurs to the playoffs every year since NBA legend Tim Duncan was a rookie. They are known as a team that runs on good teamwork as opposed to outstanding individual play, which is not to say they have not had superstar efforts. Devin Booker, meanwhile, has been setting the league on fire, but his organization, the Phoenix Suns, has not positioned itself to be a playoff contender. (McCaw just got lucky and was a role player for three straight NBA championships.) This divide is the type of motivation that led me to pursue this project.


Plan

I would first like to mention that the main difficulty in this project was developing a good web-scraping function. However, I want to be transparent and let you know that I worked hard on that function a while back and only now realized this would be a great use of the resulting data. Anyhow, in my code I go through the basic data science process; in this blog, I will try to stick to the more exciting observations and conclusions I reached. (Here’s a link to the GitHub repo: https://github.com/ArielJosephCohen/postseason_prediction.)

The Data

First things first, let’s discuss what my data looked like. In the 2018-19 NBA season, there were 530 total players who appeared in at least one NBA game. My features included: name, position, age, team, games played, games started, minutes played, field goals made, field goals attempted, field goal percent, three-pointers made, three-pointers attempted, three-point percent, two-pointers made, two-pointers attempted, two-point percent, effective field goal percent, free throws made, free throws attempted, free throw percent, offensive rebounds per game, defensive rebounds per game, total rebounds per game, assists per game, steals per game, blocks per game, points per game, turnovers per game, year (not really important – it only makes the most marginal difference in the team setting), college (or where a player was before the NBA – many were null and were filled as unknown), and draft selection (un-drafted players were assigned the statistical mean of 34, which is later than I expected; keep in mind the draft used to exceed two rounds).

My survey of teams grouped all the numerical values (with the exception of year, which serves as a grouping key) by team and season, using the mean across all of a team’s players that season. In total, there were 1,104 rows of data, some of which included teams like Seattle that no longer exist in their original form. My target feature was a binary 0 or 1, with 0 representing a failure to qualify for the playoffs and 1 representing a team that successfully earned a playoff spot. A sketch of the team-level aggregation is below.
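The sketch below shows that aggregation, assuming a players_df frame with one row per player-season; the column names are placeholders for my scraped data.

```python
# Average every numeric stat across a team's players, per franchise and season
team_df = (
    players_df
    .groupby(["team", "year"], as_index=False)
    .mean(numeric_only=True)
)
# The 0/1 'made_playoffs' target would then be attached per team-season.
```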

One limitation of this model is that it accounts for trades or other player movement by assigning a player’s entire season stats to the team he ended the season with, regardless of how many games were actually played for that final team. In addition, the model doesn’t account for more advanced metrics like screen assists or defensive rating. Another major limitation is something I alluded to earlier: the NBA changes, and so does strategy. This makes the team analysis more of a study of transcendent influences that remain constant over time, as opposed to what worked well in, say, the 2015-16 NBA season. Also, my individual player model focuses on recent data, not on which individual statistics were of high value in different basketball eras. A great example of this is the so-called “death of the big man.” Basketball used to be focused on big, powerful centers and power forwards who would control and dominate the paint. Now the game has moved much more outside, mid-range twos have been shunned as the worst shot in basketball, and even centers must develop shooting range. Let me show you what I mean:

Shot chart dramatically shows the change in the NBA game

Now let’s look at a “big guy”: 7-foot-tall Brook Lopez:

In a short period of time, he has drastically changed his primary shot selection. I have one more visual from my data to demonstrate this trend.

Exploratory Data Analysis

Before I get into what I learned from my model, I’d like to share some interesting observations from my data. I’ll start with two groups of histograms:

Team data histograms
Individual data histograms
Every year – playoff teams tend to have older players
Trends of various shooting percentages over time

Here, I have another team vs. individual comparison testing various correlations. In terms of my expectations, I hypothesized that three-point percent would correlate negatively with rebounds, assists would correlate positively with three-point percent, blocks would correlate positively with steals, and assists would correlate positively with turnovers.

Team data
Individual data

I seem to be most wrong about assists and 3 point percent while moderately correct in terms of team data.

The following graph displays the common differences over ~40 years between playoff teams and non-playoff teams in the five main statistical categories in basketball. However, since the means and extremes of these categories are expressed in differing magnitudes, I applied scaling to allow for a more accurate comparison.

It appears that the biggest difference between playoff teams and other teams is in assists, while the smallest difference is in rebounding.

I’ll look at some individual stats now, starting with some basic sorting.

Here’s a graph to see the breakdown of current NBA players by position.

In terms of players by draft selection (with 34 = un-drafted basically):

In terms of how many players actually make the playoffs:

Here’s a look at free throw shooting by position in the 2018-19 NBA season (with an added filter for playoff teams):

I had a hypothesis that players drafted earlier tend to have higher scoring averages (note that in my graph there is a large cluster of points hovering around x=34; this is because I used 34 as a mean value to fill nulls for un-drafted players).

It seems like I was right – being picked early in the draft corresponds to higher scoring. I would guess the high point is James Harden at X=3.

Finally I’d like to share the distribution of statistical averages by each of the five major statistics sorted by position:

Model Data

I ran a random forest classifier to get a basic model and then applied scaling and feature selection to improve my models. Let’s see what happened.

2018-19 NBA players:

The above represents an encouraging confusion matrix, with the rows representing predicted data and the columns representing actual data. The brighter and more muted regions in the top left and bottom right correspond to the colors in the vertical bar adjacent to the matrix; brighter colors represent higher values (in this color palette). This means that my model had a lot more predictions that corresponded to the actual outcome than incorrect predictions. The accuracy of this basic model sits around 61%, which is a good start. The precision score represents the percentage of correct predictions among the bottom-left and bottom-right quadrants (in this particular depiction), while recall represents the percentage of correct predictions among the bottom-right and top-right quadrants. In other words, recall looks only at the cases that actually made the playoffs and asks how many of them were correctly predicted. Precision looks at a similar pool, but with a slight difference: it considers everyone who was predicted to make the playoffs and asks how many of those predictions were correct.

Next, I applied feature scaling, which is a process of removing impacts determined entirely by magnitude alone. For example: $40 is a lot to pay for a can of soda but quite little to pay for a functional yacht. In order to compare soda to yachts, it’s better to apply some sort of scaling that might, for example, place their costs in a 0–1 (or 0%–100%) range representing where those costs fall relative to other sodas or yachts. A $40 soda would be close to 1 and a $40 functional yacht would be closer to 0. Similarly, an $18 billion yacht and an $18 billion soda would both sit around 1, and conversely a 10-cent soda or yacht would both sit around 0. A $1 soda would be around 0.5. I have no idea how much the average yacht costs.
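The soda-and-yacht idea in code is essentially min-max scaling; here is a minimal sketch (which specific scaler I used here is an illustrative assumption):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                            # maps each feature into the 0-1 range
X_train_scaled = scaler.fit_transform(X_train)     # fit the min/max on training data only
X_test_scaled = scaler.transform(X_test)           # reuse the training min/max on the test set
```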

Next, I wanted to see how many features I needed for optimal accuracy. I used recursive feature elimination, which is a process of fitting a model and then using that model to look for features that can be removed to improve it.
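A sketch of recursive feature elimination wrapped around the random forest, scanning feature counts with cross-validation (the exact estimator and scoring setup are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,              # drop one feature at a time
    cv=5,
    scoring="accuracy",
)
selector.fit(X_train, y_train)

print("Optimal number of features:", selector.n_features_)
selected_features = X_train.columns[selector.support_]   # assumes X_train is a DataFrame
```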

20 features seems right

After feature selection, here were my results:

64% accuracy is not bad. Considering that a little over 50% of all NBA players make the playoffs every year, I was able to create a model that, without any team context at all, can predict to some degree which players will earn a trip to the playoffs. Let’s look at which features have the most impact (don’t give too much attention to the vertical-axis values). I encourage you to keep these features in mind for later to see if they differ from the driving influences in the larger-scale, team-oriented model.

If I take out that one feature at the beginning that is high in magnitude for feature importance, we get the following visual:

I will now run through the same process with team data. I will skip the scaling step as it didn’t accomplish a whole lot.

First model:

Feature selection.

Results:

78% accuracy while technically being a precision model

Feature importance:

Feature importance excluding age:

I think one interesting observation here is how much age matters in a team context over time, but less so in the 2018-19 NBA season. Effective field goal percent showed the opposite pattern.

Conclusion

I’m excited for sports to be back. I’m very curious to see the NBA playoffs unfold and would love to see if certain teams emerge as dark-horse contenders. Long breaks from basketball can be either terrible or terrific; I’m sure a lot of players improved while others regressed. I’m also excited to see that predicting playoff chances, on both the individual and team levels, can be done with an acceptable degree of accuracy.

I’d like to bring this project full circle. Let’s look at the NBA champion San Antonio Spurs from 2013-14 (considered a strong, team-effort-oriented group) and Devin Booker from 2018-19. I don’t want to spend too much time here, but let’s do a brief demo of the model using feature importance as a guide. The average age of the Spurs was two years above the NBA average, with the youngest members of the team being 22, 22, and 25. The Spurs led the NBA in assists that year and were second to last in fouls per game. They were also top ten in fewest turnovers per game and in free throw shooting. Now, for Devin Booker. Well, you’re probably expecting me to explain why he is a bad team player and doesn’t deserve a spot in the playoffs. That’s not the case. By every leading indicator in feature importance, Devin seems like he belongs in the playoffs. However, let’s remember two things: my individual player model was less accurate, and Booker sure seems like an anomaly.

I think there is a bigger story here, though. Basketball is a team sport. Having LeBron James as your lone star can only take you so far (it’s totally different for Michael Jordan because Jordan is ten times better, and I have no bias at all as a Chicago native). That’s why people love teams like the San Antonio Spurs. They appear to be declining now, as leaders such as Tim Duncan and Kawhi Leonard have recently left the team, but basketball is still largely a team sport, and the team prediction model was fairly accurate. Further, talent alone seems like it is not enough either. Every year, teams that make the playoffs tend to have older players, while talent and athleticism are generally skewed toward younger players. Given these insights, I’m excited to have some fresh eyes with which to follow the NBA playoffs.

Thanks for reading.

Have a great day!


Exciting Graphs for Boring Maths

Understanding style options available in Seaborn that can enhance your explanatory graphs and other visuals.


Introduction

Welcome to my blog! Today, I will be discussing ways to leverage Seaborn (I will often use the ‘sns’ abbreviation later on) to spice up Python visuals. Seaborn [and Matplotlib – abbreviated as ‘plt’] are powerful libraries that enable anyone to swiftly and easily design stunning and impressive visuals that can tell interesting stories and inspire thoughtful questions. Today, however, I would like to shift the focus to a different aspect of data visualization. I may one day (depending on when this is read – maybe it happened already) write about all the amazing things Seaborn can do in terms of telling a story, but today that is not my focus.

What inspired me to write this blog is that all my graphs used to be… boring. Well, I should qualify that. When I use Microsoft Excel, I enjoy taking a “boring” graph and doing something exciting like putting neon colors against a dark background. In Python, my graphs tend to be a lot more exciting in terms of their capability to describe data, but they were boring visually – they were all in red, green, and blue against a blank white background. I bet a lot of readers out there know some ways to spice up graphs by enhancing aesthetics, but I would like to discuss that very topic today for any readers who may not know these options are available (like I didn’t at first) and, hopefully, even provide some insights to more experienced Python users. Now, as a disclaimer: this information can all be found online in the Seaborn documentation; I’m just reminding some people it’s there. Another quick note: Matplotlib and Seaborn go hand in hand, but this blog will focus on Seaborn; I imagine at times it will be implied that you may or may not also need Matplotlib. Okay, let’s go!

Plan

First of all, I found the following image at this blog: https://medium.com/@rachel.hwung/how-to-make-elegant-graphs-quickly-in-python-using-matplotlib-and-seaborn-252344bc7fcf. It’s not exactly a comprehensive description of what I plan to discuss today, but let me use it as an introduction to give you an understanding of some of the style options available in a graph.

Anatomy of a figure — Matplotlib 3.1.2 documentation

(I hope you are comfortable with scatter plots and regression plots.) The plan here is to go through the design aesthetics functions in Seaborn (in particular, the style options) and talk about how to leverage them to spice up data visualizations. One quick note here: I like to use the ‘with’ keyword in Python, as this does not change the design aesthetics for my whole notebook, just that one visual or set of visuals. In other words, if you don’t write ‘with,’ your changes will be permanent. Let me show you what I mean (there’s a small code sketch after the screenshots below):

Using no design aesthetics
Using ‘with’ keyword
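A minimal sketch of the temporary-versus-permanent distinction, using one of Seaborn’s built-in example datasets:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Inside the block, the 'darkgrid' style applies...
with sns.axes_style("darkgrid"):
    sns.regplot(x="total_bill", y="tip", data=tips, color="purple")
    plt.show()

# ...and outside the block, the notebook's existing defaults come back.
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()
```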

It may be annoying to always add the with keyword, so you obviously can just set one style for your whole notebook. I’ll talk about what happens when you want to get rid of that style later on.

One more note: you may notice I added a custom color. Below I have included every basic Python color and color palette (at least, I think I have). I’m fairly certain you can create custom colors, but that is beyond the scope of this blog (I may blog about that some other time, and that very blog may already exist depending on when this is read, so keep your eyes open).

The reason the color palettes are shown in an error message is to give any reader a trick for remembering them all: if you use an incorrect color palette name (which will be discussed shortly), then Python will show you your real options. I really like this feature. Just keep in mind that some of these, like ‘Oranges,’ run on past the end of the line, so make sure that doesn’t confuse you. You may also notice that every palette has a ‘normal’ version and a second ‘_r’ version – look at ‘Blues’ and ‘Blues_r,’ for example. All it means is that the colors are reversed. I’m not deeply familiar with Blues, but if it starts with light blue and moves toward dark blue, the reverse would start at dark and move to light. Color palettes can sometimes be ineffective if you don’t need (or don’t want) a wide range of colors – my regression plot above, for example. For the cases where color palettes are of little value, I have provided the names of individual colors as well.

Okay, let’s start!

sns.reset_defaults()

This is probably the easiest one to explain. If you have toyed with your design aesthetic settings, this will reset everything.

sns.reset_orig()

This function looks similar to the one above but is slightly different in its effect. According to the Seaborn documentation, it resets the style but respects any custom rcParams you already had in place. To be honest, I tried to put this in action and see how it was different from reset_defaults(), and I was not very successful. In case you want to learn more about what rcParams are, check out this link: https://matplotlib.org/3.2.2/users/dflt_style_changes.html. The ‘rc’ stands for ‘runtime configuration’ (the name traces back to the old Unix ‘run commands’ files). If any readers understand how this differs from reset_defaults() in practice, please feel free to reach out.

sns.set_color_codes()

This one is quick but interesting. It controls how the shorthand Matplotlib color codes (‘b’, ‘g’, ‘r’, and so on) are interpreted, remapping them to the corresponding colors of a Seaborn palette. Let me show you what I mean:

‘muted’ palette
‘pastel’ palette

The change above is subtle, but can be effective, especially if you add further customizations. In total, color code options include: deep, muted, pastel, dark, bright, and colorblind.
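A minimal sketch of that remapping, again using the bundled ‘tips’ dataset as a stand-in:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# With 'muted' color codes, the shorthand "b" means the muted blue
sns.set_color_codes("muted")
sns.regplot(x="total_bill", y="tip", data=tips, color="b")
plt.show()

# Switch to 'pastel' and the very same "b" now means the pastel blue
sns.set_color_codes("pastel")
sns.regplot(x="total_bill", y="tip", data=tips, color="b")
plt.show()
```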

I’d like to record a quick realization here: while writing this blog I learned that there are also many Matplotlib customization options that I didn’t know of before and, while that is beyond the scope of this blog, I may write about that later (or have already written about it depending on when this blog is being read). I think the reason I never noticed this is because, and I’m just thinking out loud here, I only ever looked at the matplotlib.pyplot module and never really looked at modules like matplotlib.style. I’m wondering to myself now if the main virtue of Seaborn (as opposed to Matplotlib) is that Seaborn can generate more descriptive and insightful visuals as opposed to ones with creative aesthetics.

sns.set_context()

This one is very similar to the next function we will discuss. Anyways, to quote the documentation: “This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.” There are 4 default contexts: paper, notebook, talk, and poster. This function can also take the following parameters: context (things like paper or poster), font_scale (making the font bigger or smaller), and rc (a dictionary of the parameters discussed earlier, used “to override the values in the preset [S]eaborn context dictionaries” according to the documentation). Here are some visuals, with a short code sketch after them:

notebook context
poster context
rc params in poster context
rc params changing line size in poster context
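A sketch of the kind of call behind the poster-context visuals above (the specific rc override is just an example, not necessarily the exact value I used):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# 'poster' scales labels and line widths up for presentation-sized figures;
# font_scale nudges the text further, and rc overrides individual settings
sns.set_context("poster", font_scale=1.2, rc={"lines.linewidth": 3})
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()
```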

sns.plotting_context()

This function has the same basic application as the one above and even has the same parameters. However, as will be a trend with other functions here, calling it actually gives you a return value. To be clear, typing sns.plotting_context() (I’ll have a visual) returns a dictionary of all the rcParams that make up the current plotting context. So not only can you set a context, you can also see your options if you forget what’s available. One difference from set_context() is that the object plotting_context() returns also works with the ‘with’ keyword as a context manager, whereas set_context() simply applies its changes globally. In any case, we’re going to see another dictionary later, but here is our first one.
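A minimal sketch of both uses, the dictionary and the context manager:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# With no arguments, you get back the dictionary of context rcParams
print(sns.plotting_context())

# The return value also acts as a context manager, so the 'talk' sizing
# only applies to plots drawn inside the 'with' block
tips = sns.load_dataset("tips")
with sns.plotting_context("talk"):
    sns.regplot(x="total_bill", y="tip", data=tips)
    plt.show()
```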

sns.set_style()

I’m going to skip this; you’ll understand why in a moment.

sns.axes_style()

These two functions basically work together in the same way that set_context and plotting_context do. The default options for setting an axes style are: darkgrid, whitegrid, dark, white, and ticks. In addition, entering sns.axes_style() with no arguments returns a dictionary of all your additional options. If you look back at the regplots at the beginning of the blog, you may notice that I leveraged the darkgrid option, and I encourage you to scroll back up to see it in action.

All I did in the graph above was set a black background, but it instantly looks far more exciting and catchy than any of the earlier ones.
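Here is a sketch of how you could get a similar dark look for a single figure (the specific colors are just an example, not necessarily the exact values I used):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Print the dictionary of everything that counts as part of the style
print(sns.axes_style())

# Override a few style rcParams inside the context manager so the black
# background only applies to this one figure
with sns.axes_style("darkgrid", rc={"axes.facecolor": "black",
                                    "figure.facecolor": "black",
                                    "grid.color": "0.3"}):
    sns.regplot(x="total_bill", y="tip", data=tips, color="cyan")
    plt.show()
```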

I’d like to record another thought here. The ‘with’ keyword really does only work with some of these functions: axes_style and plotting_context return parameter dictionaries that double as context managers, so they support ‘with’, while set_style and set_context apply their changes globally and do not. That difference seems to be the whole reason both flavors exist.

sns.set()

Today’s blog is about style options and this is the last one to cover (although I didn’t really go in any particular order). I think this function is a bit of a catch-all. The parameters are: context, style, palette, font, font_scale, color_codes, and rc. The documentation says that style and context correspond to the axes_style and plotting_context dictionaries cited above, respectively. I’d also like to look at the palette option, which I alluded to earlier in the blog with all the available color palettes. In the documentation, the color palette is not part of the style options. I would still encourage using sns.color_palette(), which also works with the ‘with’ keyword, to set interesting color palettes like the ones listed above. It appears that sns.set() itself doesn’t easily work with the ‘with’ keyword. Also, the color_codes parameter is a true/false value, and it’s not entirely clear to me how it works. Here’s what the documentation says: ‘If True and palette is a Seaborn palette, remap the shorthand color codes (e.g. “b”, “g”, “r”, etc.) to the colors from this palette.’ I found this link for all the fonts available for Seaborn / Matplotlib: http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/. When I provide some visuals, I encourage you to pay attention to how I enter the font options.

set background color
set context
The palette only shows one color in this case so I will provide some palette examples later
set font and font size – unlike colors, however, fonts are not entered as one word – they have spaces. A color is written as ‘forestgreen’ as opposed to ‘forest green’ but fonts work differently
set grid color with rcParams
I’ve used the icefire palette referenced above in the error message
here is icefire_r

You may notice how I added the palette in the function that makes the graph. Yes, it is an option. However, if you want to make a lot of graphs it can be inefficient. I suppose one interesting option would be to create a list of palettes and just loop through them if you want to keep mixing things up.
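Here is a sketch of both ideas: one catch-all sns.set() call, and then a loop over a small list of palettes to keep mixing things up (the palette and font choices are just examples):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# One catch-all call: context, style, palette, font, and font scale together
# (note the font name can contain a space, unlike color names)
sns.set(context="notebook", style="darkgrid", palette="icefire",
        font="DejaVu Sans", font_scale=1.1)

# Loop over a list of palettes instead of passing one to every plotting call
for pal in ["icefire", "icefire_r", "rocket", "mako"]:
    sns.set_palette(pal)
    sns.scatterplot(x="total_bill", y="tip", hue="day", data=tips)
    plt.title(pal)
    plt.show()
```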

Conclusion

Creating visuals to tell a story or identify a trend is an exciting and important part of data science inquiries. If that is the case, why not make your visuals catchy as well? People are drawn to things that look ‘cool.’ In addition, while the focus of this blog was on style, we did briefly explore color options. I found this blog that talks about strategically planning how you use colors and color palettes in your visuals: https://medium.com/nightingale/how-to-choose-the-colors-for-your-data-visualizations-50b2557fa335. Like I said, though, the focus here has been on style options. Writing this blog has been a bit of a learning experience for me as well. I have realized that there are a lot more options for spicing up graphs and making them more interesting and meaningful; for example, I probably haven’t taken full advantage of everything available in Matplotlib and its various modules. I’m going to have to do more research and possibly update and correct parts of this post, but I think this is a good starting point.

Thanks for reading and have a great day.

By the way… here are some graphs I made using Matplotlib and Seaborn:

Method to the Madness

Imposing Structure and Guidelines in Exploratory Data Analysis


Introduction

I like to start off my blogs with the inspiration that led me to write them, and this blog is no different. I recently spent time working on a project. Part of my general process when performing data science inquiries and building models is to perform exploratory data analysis (EDA from henceforth). While I think modeling is more important than EDA, EDA tends to be the most fun part of a project and is very easy to share with a non-technical audience who doesn’t know what multiple linear regression or machine learning classification algorithms are. EDA is often where you explore and manipulate your data to generate descriptive visuals. This helps you get a feel for your data and notice trends and anomalies. Generally, what makes this process exciting is the number of unique types of visuals and customizations you can use to design them. So now let’s get back to the beginning of the paragraph: I was working on a project and was up to the EDA step. I got really engrossed in this process and spent a lot of effort trying to explore every possible interesting relationship. Unfortunately, this soon became stressful, as I was just randomly looking around and didn’t really know what I had already covered or at what point I would feel satisfied and plan on stopping. It’s not reasonable to explore every possible relationship in every way. Creating many unique custom data frames, for example, is a headache, even if you get some exciting visuals as a result (and the payoff is not worth the effort you put in). After doing this, I was really uneasy about continuing to perform EDA in my projects. At this point, the most obvious thing in the entire world occurred to me and I felt pretty silly for not having thought of it before: I needed to make an outline and a process for all my future EDA. This may sound counter-intuitive, as some people feel EDA shouldn’t be so rigorous and should be more free-flowing. Despite that opinion from some, I have the mindset that imposing a structure and plan would be very helpful. Doing this would save me a whole lot of time and stress while also allowing me to keep track of what I had done and capture the most important relationships in my data. (I realize some people may not agree with my thoughts about this approach to EDA. My response to them is that if their project is mainly focused on EDA, as opposed to developing a model, then that is not the scenario I have in mind. In addition, one can still impose structure and perform more comprehensive EDA when the project does in fact center on EDA.)

Plan

I’m going to provide my process for EDA at a high level. How you decide to interpret my guidelines based on unique scenarios and which visuals you feel best describe and represent your data are up to you. I’d like to also note that I consider hypothesis testing to be part of EDA. I will not discuss hypothesis testing here, but understand that my process can extend to hypothesis testing to a certain degree.

Step 1 – Surveying Data And Binning Variables

I am of the opinion that EDA should be performed before you transform your data using methods such as feature scaling (unless you want to compare features of different average magnitudes). However, just like you would preprocess data for modeling, I believe there is an analogous way to “preprocess” data for EDA. In particular, I would like to focus on binning features. In case you don’t know, feature binning is a method of transforming old variables or generating new ones by splitting numeric values into intervals. Say we have the data points 1, 4, 7, 8, 3, 5, 2, 2, 5, 9, 7, and 9. We could bin this as [1,5] and (5,9] and assign the labels 0 and 1 to each respective interval. So if we see the number 2, it becomes a 0; if we see the number 8, it becomes a 1. It’s a pretty simple idea. It is so effective because oftentimes you will have large gaps in your data or imbalanced features. I talk more about this in another blog about categorical feature encoding, but I would like to provide a quick example. There are probably about as many people, if not fewer, buying cars between $300k and $2M as there are buying cars between $10k and $50k. That’s all conjecture and may not in fact be true, but you get the idea. The first interval is significantly bigger than the second in terms of car price range, but it is certainly logical to group each of these two types of buyers together. You can easily cut your data into bins (and assign labels), and even have the choice to cut at quantiles, in pandas (pd.cut, pd.qcut). You can also feature engineer before you bin your data: transformations first, binning second. So this is all great, and will make your models better if you have large gaps, and even gives you the option to switch from regression to classification, which can also be good. But why do we need it for EDA? I think there is a simple answer: you can more efficiently add filters and hues to your data visualizations. Let me give you an example. Say you want to find the GDP across all 7 continents (Antarctica must have like a zero GDP) in 2019 and add an extra filter to show large vs. small countries. You would need a separate filter for each unique country size. What a mess! Instead, split countries along a certain population threshold into big and small. Now you have 14 data points instead of 200+ (14 = 7 continents * 2 population groups). I’m sure some readers can think of other ways to deal with this issue, but I’m just trying to prove a point here, not re-invent the wheel. When I do EDA, I like to first look at violin plots and histograms to determine how to split my data and how many groups (2 bins, 3 bins…) I want, but I also keep my old variables, as they remain valuable to EDA, as I will allude to later on.
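Here is a minimal sketch of that idea in pandas, using made-up car prices (the cutoffs and labels are just illustrative):

```python
import pandas as pd

prices = pd.Series([12_000, 18_000, 35_000, 48_000, 300_000, 750_000, 2_000_000],
                   name="price")

# Fixed, hand-picked edges: everyday cars vs. luxury cars
everyday_vs_luxury = pd.cut(prices, bins=[0, 50_000, 2_000_000], labels=[0, 1])

# Quantile-based edges: each bin gets roughly the same number of rows
quartile = pd.qcut(prices, q=4, labels=[0, 1, 2, 3])

print(pd.DataFrame({"price": prices,
                    "everyday_vs_luxury": everyday_vs_luxury,
                    "quartile": quartile}))
```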

Step 2 – Ask Questions Without Diving Too Deep Into Your Data

This idea will be developed further later on and honestly doesn’t need much explanation right now. Once you know what your data is describing, you should begin to ask legitimate questions you would want to know the answers to, decide which relationships you would like to explore given the context of your data, and write them down. I talk a lot about relationships later on. When you first write down your questions, don’t get too caught up in the mechanics; just think about the things in your data you would like to explore using visuals.

Step 3 – Continue Your EDA By Looking At The Target Feature

I would always start the EDA process by looking at the most important piece of your data: the target feature (assuming it’s labeled data). Go through each of your predictive variables individually and compare them with the target variable, with any number of filters you may like. Once you finish exploring the relationship between the target variable and variable number one, don’t include variable number one in any more visualizations. I’ll give an example. Say we have the following features: wingspan, height, speed, and weight, and we wanted to predict the likelihood that any given person played professional basketball. Well, start with variable number one, wingspan. You could see the average wingspan of regular folks versus professional players. However, you could also add other filters such as height to see if the average wingspan of tall professional players vs. tall regular people is different. Here, we could have three variables in one simple bar graph: height, wingspan, and the target of whether a person plays professionally or not. You could continue doing this until you have exhausted all the relationships between wingspan, the target variable, and anything else (if you so choose). Since you have explored all those relationships (primarily using bar graphs, probably), you can now cross one variable off your list, although it will come back later as we will soon see. Rinse and repeat this process until you go through every variable. The outcome is that you have a really good picture of how your target variable moves around with different features as the guiding inputs. This will also give you some insight to reflect on when you look at something such as feature importances or coefficients later on. I would like to add one last warning, though. Don’t let this process get out of hand. Just keep a couple of visuals for each feature’s correlation with the target variable and don’t be afraid to sacrifice some variables in order to keep the more interesting ones. (I realize this part of the process would need to be modified if you have more than a certain threshold of features.)
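A sketch of the “three variables in one simple bar graph” visual described above, with a made-up miniature dataset standing in for real basketball data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data: wingspan, a binned height feature, and the target
df = pd.DataFrame({
    "wingspan_cm": [180, 210, 195, 230, 175, 220, 205, 185],
    "height_group": ["short", "tall", "short", "tall",
                     "short", "tall", "tall", "short"],
    "plays_pro": [0, 1, 0, 1, 0, 1, 1, 0],
})

# Average wingspan by pro status, split further by the binned height feature
sns.barplot(data=df, x="plays_pro", y="wingspan_cm", hue="height_group")
plt.show()
```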

Step 4 – Explore Relationships Between Non-Target Features

The next step I usually take is to look at all the data (again), but without including the target feature. Similar to the process described above, if I have 8 non-target features, I will think of all the ways I can compare variable 1 with variables 2 through 8 (with filters). I then look at ways to compare feature 2 with features 3 through 8, and so on until I am at the comparison between 7 and 8. Now, obviously not every pair of variables will be efficient to compare, easy to compare, or even worth comparing. Here, I tend to use a lot of different types of graphs, since I am not focused on the target feature; you can get more creative than you might be able to when comparing features to the target in a binary classification problem. There are a lot of articles on how to leverage Seaborn to make exciting visuals, and this step of the process is probably a great application of all of them. I described an example above where we had height, wingspan, weight, and speed (along with a target feature). In terms of the relationships between the predictors, one interesting inquiry could be the relationship between height and weight in the form of a scatter plot, regression plot, or even joint plot. Developing visuals to demonstrate relationships can also shape how you later handle correlated features or feature engineering. I’d like to add a Step 4A at this point. When you look at your variables and start writing down all your questions and tracking relationships you want to explore, I highly recommend that you assign certain features as filters only. What I mean is the following. Say you have a (new) example: feature 1 is normally distributed and has 1000 unique values, and feature 2 (non-target) has 3 unique values (or distinct intervals). I highly recommend you don’t spend a lot of time exploring all the possible relationships that feature 2 has with other features. Instead, use it as a filter or “hue” (as it’s called in Seaborn). If you want to find the distribution of feature 1, which has many unique values, you could easily run a histogram or violin plot, but you can tell a whole new story if you split those data points into the three categories dictated by feature 2. So when you write out your EDA plan, figure out what works best as a filter, and that will save you some more time.
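A sketch of both ideas, using Seaborn’s bundled ‘penguins’ dataset as a stand-in (flipper length and body mass play the role of height and weight, and species plays the role of a low-cardinality feature used only as a split):

```python
import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins").dropna()

# Relationship between two continuous predictors as a joint/regression plot
sns.jointplot(data=penguins, x="flipper_length_mm", y="body_mass_g", kind="reg")
plt.show()

# A feature with only a few unique values is used to split the distribution
# rather than being exhaustively compared with everything else
sns.violinplot(data=penguins, x="species", y="body_mass_g")
plt.show()
```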

Step 5 – Keep Track Of Your Work

Not much to say about this step. It’s very important though. You could end up wasting a lot of time if you don’t keep track of your work.

Step 6 – Move On With Your Results In Mind

When I am done with EDA, I often switch to hypothesis testing, which I like to think of as a more equation-y form of EDA. Like EDA, you should have a plan before you start running tests. However, now that you’ve done EDA and gotten a really good feel for your data, keep your results in mind for further investigation when you go into hypothesis testing. Say you find a visual that tells a puzzling story; then you should investigate further! Also, as alluded to earlier, your EDA can impact your feature engineering. Really, there’s too much to list in terms of why EDA is such a powerful process even beyond the initial visuals.

Conclusion:

I’ll be very direct about something here: there’s a strong possibility that I was in the minority as someone who often didn’t plan much before going into EDA. Making visuals is exciting and visualizations tend to be easy to share with others when compared to things like a regression summary. Due to this excitement, I used to dive right in without much of a plan beforehand. It’s certainly possible that everyone else already made plans. However, I think that this blog is valuable and important for two reasons. Number 1: Hopefully I helped someone who also felt that EDA was a mess. Number 2: I provided some structure that I think would be beneficial even to those who plan well. EDA is a very exciting part of the data science process and carefully carving out an attack plan will make it even better.

Thanks for reading and I hope you learned something today.


Out Of Office

Predicting Absenteeism At Work


Introduction

It’s important to keep track of who does and does not show up to work when they are supposed to. I found some interesting data online that records how much work, on a scale of 0 to 40 hours, an employee is expected to miss in a given week. I ran a couple of models and came away with some insights on what my best accuracy would look like and which features are most predictive of the time an employee is expected to miss.

Process

  1. Collect Data
  2. Clean Data
  3. Model Data

Collect Data

My data comes from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). While I will not go through this part in full detail, the link above explains the numerical encoding for “reason for absence.” The features of the data set, other than the target feature of time missed, were: ID, reason for absence, month, age, day of week, season, distance from work, transportation cost, service time, work load, percent of target hit, disciplinary failure, education, social drinker, social smoker, pet, weight, height, BMI, and son. I’m not entirely sure what “son” means. So now I was ready for some data manipulation. Before I did that, however, I performed some exploratory data analysis, with some custom columns created by binning the variables that had many unique values, such as transportation expense.

EDA

First, I have a histogram of each variable.

After filtering outliers, the next three histogram charts describe the distribution of variables in cases of missing low, medium, and high amounts of work, respectively.

Low
Medium
High

Below, I have displayed a sample of the visuals produced in my exploratory data analysis which I feel tell interesting stories. When an explanation is needed it will be provided.

(0 and 1 are binary for “Social Drinker”)
(The legend refers to distance from work)
(0 and 1 are binary for “Social Drinker”)
(The legend refers to the transportation expense to get to work)
(The legend reflects workload numbers)
(0 and 1 are binary for “Social Drinker”)
Histogram
(Values adjusted using Min-Max scaler)

This concludes the EDA section.

Hypothesis Testing

I’ll do a quick run-through here of some of the hypothesis testing I performed and what I learned. I examined the seasons of the year to see if there was a discrepancy between the absences observed in Summer and Spring vs. Winter and Fall. What I found was that there wasn’t much evidence to say a difference exists. I found, with high statistical power, that people with higher travel expenses tend to miss more work. This was also the case for people who have longer distances to work. Transportation costs as well as distance to work also have a moderate effect on service time at the company. Age has a moderate effect on whether people tend to smoke or drink socially, but not enough to reach statistical significance. In addition, there appears to be little correlation between time at the company and whether or not targets were hit. However, this test has low statistical power and a p-value that is somewhat close to 5%, implying that an adjusted alpha may change how we view it both in terms of type 1 error and statistical power. People with less education also tend to drink more, and education has a moderate correlation with service time. Anyway, that is a very quick recap of the main hypotheses I tested, boiled down to the simplest way to communicate their findings.
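For readers curious what one of these tests looks like in code, here is a minimal sketch of the seasons comparison, assuming a two-sample t-test and using made-up stand-in data rather than the real absenteeism columns:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-in data; the real columns come from the absenteeism data set
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "season": rng.choice(["Summer", "Spring", "Winter", "Fall"], size=200),
    "hours_missed": rng.poisson(4, size=200),
})

warm = df.loc[df["season"].isin(["Summer", "Spring"]), "hours_missed"]
cold = df.loc[df["season"].isin(["Winter", "Fall"]), "hours_missed"]

# Two-sample t-test: is mean absence different between the two groups?
t_stat, p_value = stats.ttest_ind(warm, cold, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # compare p to alpha = 0.05
```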

Clean Data

I started out by binning variables with wildly uneven distributions. I then used categorical data encoding to encode all my newly binned features, and filtered the data so that the values I kept fell within 3 standard deviations of each variable’s mean. Having filtered out misleading values, I binned my target variable into three groups and removed highly correlated features. I will come back to some of these topics later in this blog when I discuss the difficulties I faced.
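Here is a rough sketch of those cleaning steps on a hypothetical column (the column names, bin edges, and cutoffs are placeholders, not my exact code):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the absenteeism data
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "transportation_expense": rng.integers(100, 400, 300),
    "hours_missed": rng.integers(0, 41, 300),
})

# 1) Bin an unevenly distributed feature
df["expense_bin"] = pd.cut(df["transportation_expense"], bins=3,
                           labels=["low", "mid", "high"])

# 2) Encode the newly binned categorical feature
df = pd.get_dummies(df, columns=["expense_bin"])

# 3) Keep rows within 3 standard deviations of the column's mean
z = (df["transportation_expense"] - df["transportation_expense"].mean()) \
    / df["transportation_expense"].std()
df = df[z.abs() <= 3]

# 4) Bin the target into three classes: low, medium, and high hours missed
df["missed_class"] = pd.cut(df["hours_missed"], bins=[-1, 4, 16, 40],
                            labels=[0, 1, 2])
```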

Model Data

My next step was to split and model my data. One problem came up: I had a huge imbalance among my classes. The “lowest amount of work missed” class had far more cases than the other two. Therefore, I synthetically created new data so that every class had the same number of cases. To find my ideal model and then improve it, I first needed to find the best baseline model. I applied 6 types of scaling across 9 models (54 results) and found that my best model would be a Random Forest. I even found that adding polynomial features would give me near 100% accuracy on training data without much loss on test data. Anyway, I went back to my random forest model. The most indicative features of time missed, in order from biggest indicator to smallest, were: month, reason for absence, work load, day of the week, season, and social drinker. There are obviously other features, but these are the most predictive ones; the others provide less information.
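A sketch of that workflow, upsampling only the training data and then reading off feature importances. I’m assuming SMOTE from the imbalanced-learn package as one way to create synthetic cases, and the data here is made up:

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE  # imbalanced-learn; one way to synthesize cases
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in features and an imbalanced 3-class target
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(600, 5)),
                 columns=["month", "reason", "work_load", "day", "expense"])
y = rng.choice([0, 1, 2], size=600, p=[0.7, 0.2, 0.1])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Balance the training classes only; the test set stays untouched
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
print("test accuracy:", rf.score(X_test, y_test))

# Feature importances, biggest indicator first
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda pair: pair[1], reverse=True):
    print(name, round(imp, 3))
```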

Problems Faced in Project

The first problem I had was not looking at the distribution of the target variable. It is a continuous variable, but there are very few values in certain ranges. I therefore split it into three bins: missing little work, a medium amount of work, and a lot of work. I also experimented with two bins and with different cutoff points, but three bins worked better. This also affected my upsampling, as the different binning methods resulted in different class breakdowns. The next problem was a similar one: how would I bin the other variables? In short, I tried a couple of ways and found that three bins worked well. None of this binning was done using quantiles, by the way; that would imply no target class imbalance, which was not the case. I tried using quantiles but did not find them effective. I also experimented with different categorical feature encodings and found that the most effective method was to bin based on mean value in connection with the target variable (check my home page for a blog about that concept). I ran a grid search to optimize my random forest at the very end and then printed a confusion matrix. The result was not good, but I need to be intellectually honest. My model was great at predicting class 0 (“missing a low amount of work”), with recall exceeding precision, but it did not work well on the other two classes. Keep in mind that you do not upsample test data, so this could be a total fluke; still, it was frustrating to see. An obvious next step is to collect more data and continue to improve the model.

One last idea I want to talk about is exploratory data analysis. To be fair, this could be inserted into any blog. Exploratory data analysis is both fun and interesting, as it allows you to be creative and take a dive into your data using visuals as quick story-tellers. The project I had scrapped just before acquiring the data for this one drove me kind of crazy because I didn’t really have a plan for my exploratory data analysis. It was arbitrary and unending, and that is never a good plan. EDA should be planned and thought out. I will talk more / have talked more about this (depending on when you read this blog) in another blog, but the main point is that you want to think of yourself as a person who doesn’t program and just wants to ask questions based on the names of the features. Having structure in place for EDA is less free-flowing and exciting than not having structure, but it ensures that you work efficiently and have a clear starting point as well as a stopping point. That really saved me a lot of stress.
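For the grid search and confusion matrix part, here is a sketch of roughly what that looks like, with an illustrative parameter grid and synthetic imbalanced data in place of the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in imbalanced 3-class data
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Small, illustrative parameter grid for the random forest
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [100, 300],
                                "max_depth": [None, 5, 10]},
                    cv=5)
grid.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
preds = grid.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))  # precision and recall per class
```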

Conclusion

It’s time to wrap things up. At this point, I think I would need more data to continue improving this project, and I’m not sure where that data would come from. In addition, there are a lot of ambiguities in this data set, such as the numerical choices for reason for absence. Nevertheless, I think that by doing this project I learned how to create an EDA process and how to take a step back, rephrase my questions, and rethink my thought process. Just because a variable is continuous does not mean it requires regression analysis. Think about your statistical inquiries as questions, think about what makes sense from an outsider’s perspective, and then go crazy!
