Devin Booker vs. Gregg Popovich (vs. Patrick McCaw)

Developing a process to predict which NBA teams and players will end up in the playoffs.

Jim Mora NFL GIF

Introduction

Hello! Thanks for visiting my blog.

After spending the summer being reminded of just how incredible Michael Jordan was as a competitor and leader, the NBA is finally making a comeback and the NBA “bubble” is in full effect in Orlando. So today seems like the perfect time to discuss the NBA playoffs (I know the gif is for the NFL, but it is just too good not to use). Specifically, I’d like to answer the following question: if one were to look at some of the more basic statistics recorded by an NBA player or team on a per-year basis, with what accuracy could you predict whether that player or team would find themselves in the NBA playoffs? Since the NBA has changed so much over time, and each era has brought new and exciting skills to the forefront of strategy, I decided to stay up to date and look at the 2018-19 season for individual statistics (the three-point era). However, only having 30 data points per season to survey for team success would be insufficient for a meaningful analysis, so I expanded my survey to include data back to and including the 1979-80 NBA season (when Magic Johnson was still a rookie). I web-scraped basketball-reference.com to gather my data; it contains every basic stat from every NBA season, plus a couple of advanced ones such as effective field goal percentage (I don’t know how that is calculated, but Wikipedia does: https://en.wikipedia.org/wiki/Effective_field_goal_percentage). As you go back in time, you do begin to lose data from statistical categories that weren’t recorded yet. To give two examples: blocks were not recorded until after the retirement of Bill Russell (widely considered the greatest defender ever to play the game), and three-pointers were only introduced to the NBA in the late 1970s.
So just to recap: if we look at fundamental statistics recorded by individuals or by whole teams, can we predict who will find themselves in the playoffs? Before we get started, I need to address the name of this blog. Gregg Popovich successfully coached the San Antonio Spurs to the playoffs every year from the time NBA legend Tim Duncan was a rookie. They are known as a team that runs on good teamwork as opposed to outstanding individual play, though that is not to say they have never had superstar efforts. Devin Booker, meanwhile, has been setting the league on fire, but his organization, the Phoenix Suns, has not positioned itself to be a playoff contender. (McCaw just got lucky and was a role player for three straight NBA championships.) This divide is the type of motivation that led me to pursue this project.

“Playoff P” art print

Plan

I would first like to mention that the main difficulty in this project was developing a good web-scraping function. In the interest of transparency: I worked hard on that function a while back and only recently realized it would be a great source of data for this question. In my code I go through the standard data science process, but in this blog I will try to stick to the more exciting observations and conclusions I reached. (Here’s a link to the GitHub repo: https://github.com/ArielJosephCohen/postseason_prediction.)
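To give a flavor of the parsing side of that scraping work, here is a minimal, self-contained sketch of pulling rows out of a stats table with BeautifulSoup. A real scraper would fetch the live page (for example with requests); the HTML snippet below is a made-up stand-in, not my actual scraping function.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a basketball-reference-style per-game stats table.
html = """
<table id="per_game_stats">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Devin Booker</td><td>26.6</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="per_game_stats")

# Flatten each table row into a list of cell strings.
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(rows)
```

From here, a list of lists like this drops straight into a pandas DataFrame for analysis.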

The Data

First things first, let’s discuss what my data looked like. In the 2018-19 NBA season, 530 players appeared in at least one NBA game. My features included: name, position, age, team, games played, games started, minutes played, field goals made, field goals attempted, field goal percent, three-pointers made, three-pointers attempted, three-point percent, two-pointers made, two-pointers attempted, two-point percent, effective field goal percent, free throws made, free throws attempted, free throw percent, offensive rebounds per game, defensive rebounds per game, total rebounds per game, assists per game, steals per game, blocks per game, points per game, turnovers per game, year (only marginally important, and only in the team setting), college (or wherever a player was before the NBA – many values were null and were filled as unknown), and draft selection (un-drafted players were assigned the statistical mean of 34, which is later than I expected; keep in mind the draft used to run longer than two rounds). My survey of teams grouped all the numerical values (with the exception of year) by team and season, taking the statistical mean across each team’s players that season. In total, there were 1104 rows of data, some of which included teams, like Seattle, that no longer exist in their original form. My target feature was a binary 0 or 1, with 0 representing a failure to qualify for the playoffs and 1 representing a team that successfully earned a playoff spot.
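As a rough sketch of the team-level aggregation described above, the player-to-team rollup might look like this in pandas. The team abbreviations, column names, and numbers are made up for illustration; only the shape of the operation mirrors the real pipeline.

```python
import pandas as pd

# Toy stand-in for the scraped player table (values are illustrative).
players = pd.DataFrame({
    "team":     ["SAS", "SAS", "PHO", "PHO"],
    "year":     [2019, 2019, 2019, 2019],
    "pts_pg":   [12.0, 18.0, 26.0, 8.0],
    "ast_pg":   [6.0, 2.0, 4.0, 1.0],
    "playoffs": [1, 1, 0, 0],   # 1 = team made the playoffs
})

# Collapse player rows into one row per team-season using the mean,
# carrying the binary playoff target alongside the averaged stats.
teams = (players
         .groupby(["team", "year"], as_index=False)
         .agg({"pts_pg": "mean", "ast_pg": "mean", "playoffs": "first"}))
print(teams)
```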

One limitation of this model is how it handles trades and other player movement: a player’s entire season of stats is assigned to the team he ended the season with, regardless of how many games he actually played there. In addition, this model doesn’t account for more advanced metrics like screen assists or defensive rating. Another major limitation is something I alluded to earlier: the NBA changes, and so does strategy. This makes the team-level analysis more of a study of transcendent influences that remain constant over time, as opposed to what worked well in, say, the 2015-16 NBA season. Likewise, my individual player model focuses on recent data, not on which individual statistics were most valuable in different basketball eras. A great example of this is the so-called “death of the big man.” Basketball used to be focused on big, powerful centers and power forwards who would control and dominate the paint. Now the game has moved outside: mid-range twos have been shunned as the worst shot in basketball, and even centers must develop a shooting range. Let me show you what I mean:

Shot chart dramatically shows the change in the NBA game

Now let’s look at a “big guy”: 7-foot-tall Brook Lopez:

In a short period of time, he has drastically changed his primary shot selection. I have one more visual from my data to demonstrate this trend.

Exploratory Data Analysis

Before I get into what I learned from my model, I’d like to share some interesting observations from my data. I’ll start with two groups of histograms:

Team data histograms
Individual data histograms
Every year – playoff teams tend to have older players
Trends of various shooting percentages over time

Here, I have another team vs. individual comparison testing various correlations. In terms of my expectations, I hypothesized that three-point percent would correlate negatively with rebounds, assists would correlate positively with three-point percent, blocks would correlate positively with steals, and assists would correlate positively with turnovers.

Team data
Individual data

I seem to have been most wrong about assists and three-point percent, while moderately correct in terms of the team data.

The following graph displays the typical differences, over roughly 40 years, between playoff teams and non-playoff teams in the five main statistical categories in basketball. However, since the means and extremes of these categories are all expressed in differing magnitudes, I applied scaling to allow for a more accurate comparison.

It appears that the biggest difference between playoff teams and non-playoff teams is in assists, while the smallest difference is in rebounding.

I’ll look at some individual stats now, starting with some basic sorting.

Here’s a graph to see the breakdown of current NBA players by position.

In terms of players by draft selection (with 34 standing in for un-drafted):

In terms of how many players actually make the playoffs:

Here’s a look at free throw shooting by position in the 2018-19 NBA season (with an added filter for playoff teams):

I had a hypothesis that players drafted earlier tend to have higher scoring averages (note the large cluster of points hovering around x = 34 in my graph; this is because I used 34 as a mean value to fill nulls for un-drafted players).

It seems like I was right – being picked early in the draft corresponds to higher scoring. I would guess the high point is James Harden at x = 3.

Finally I’d like to share the distribution of statistical averages by each of the five major statistics sorted by position:

Model Data

I ran a random forest classifier to get a basic model and then applied scaling and feature selection to improve my models. Let’s see what happened.

2018-19 NBA players:

The above is an encouraging confusion matrix, with rows representing predicted labels and columns representing actual labels. The vertical color bar adjacent to the matrix shows that brighter colors represent higher values (in this palette), so the bright top-left and bottom-right regions indicate that my model made many more correct predictions than incorrect ones. The accuracy of this basic model sits around 61%, which is a good start. Recall and precision are worth unpacking. Recall represents the cases correctly predicted out of the pool of teams that actually made the playoffs: of everyone who made it, how many did the model catch? Precision looks at a similar but slightly different pool: of all the teams the model predicted to make the playoffs, how many of those predictions were correct?
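For readers who want those definitions in code, here is a small, self-contained example using scikit-learn on made-up labels. Note that sklearn’s `confusion_matrix` defaults to rows = actual and columns = predicted, the transpose of the orientation in my plot.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Toy true/predicted playoff labels (1 = made the playoffs); illustrative only.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)   # rows = actual, cols = predicted
acc = accuracy_score(y_true, y_pred)    # fraction of all predictions correct
prec = precision_score(y_true, y_pred)  # of predicted playoff teams, how many really made it
rec = recall_score(y_true, y_pred)      # of actual playoff teams, how many we caught
print(cm, acc, prec, rec)
```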

Next, I applied feature scaling, which removes the impact a variable has purely by virtue of its magnitude. For example: $40 is a lot to pay for a can of soda but very little to pay for a functional yacht. To compare sodas to yachts, it’s better to apply some sort of scaling that might, for example, place their costs in a 0-1 (or 0%-100%) range representing where each cost falls relative to other sodas or yachts. A $40 soda would be close to 1 and a $40 functional yacht would be close to 0. Similarly, an $18 billion yacht and an $18 billion soda would both land around 1, and conversely a 10-cent soda or yacht would both land around 0. A $1 soda would be around 0.5. I have no idea how much the average yacht costs.
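Min-max scaling is one common way to do exactly this: each column is mapped to [0, 1] relative to its own range. A sketch with made-up soda and yacht prices:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices on wildly different scales.
sodas  = np.array([[0.10], [1.00], [40.00]])
yachts = np.array([[40.0], [5e6], [1.8e10]])

# Each column is scaled relative to its own min and max, so a $40 soda
# and a $40 yacht land at opposite ends of the [0, 1] range.
soda_scaled  = MinMaxScaler().fit_transform(sodas)
yacht_scaled = MinMaxScaler().fit_transform(yachts)
print(soda_scaled.ravel(), yacht_scaled.ravel())
```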

Next, I wanted to see how many features I needed for optimal accuracy. I used recursive feature elimination, a process of fitting a model and then using that model’s feature rankings to decide which features can be removed to improve it.

20 features seems right

After feature selection, here were my results:

64% accuracy is not bad. Considering that a little over 50% of all NBA players make the playoffs every year, I was able to create a model that, without any team context at all, can predict to some degree which players will earn a trip to the playoffs. Let’s look at which features have the most impact (don’t give too much attention to the vertical axis values). I encourage you to keep these features in mind for later, to see if they differ from the driving influences in the larger-scale, team-oriented model.

If I take out that one feature at the beginning that is high in magnitude for feature importance, we get the following visual:

I will now run through the same process with team data. I will skip the scaling step as it didn’t accomplish a whole lot.

First model:

Feature selection.

Results:

78% accuracy while technically being a precision model

Feature importance:

Feature importance excluding age:

I think one interesting observation here is how much age matters in a team context over time, but less in the 2018-19 NBA season. Conversely, effective field goal percent had the opposite relationship.

Conclusion

I’m excited for sports to be back. I’m very curious to see the NBA playoffs unfold and would love to see if certain teams emerge as dark-horse contenders. Long breaks from basketball can be either terrible or terrific; I’m sure a lot of players improved while others regressed. I’m also excited to see that predicting playoff chances, on both the individual and team levels, can be done with an acceptable degree of accuracy. I’d like to bring this project full circle. Let’s look at the 2013-14 NBA champion San Antonio Spurs (considered a strong team-effort-oriented team) and Devin Booker from 2018-19. I don’t want to spend too much time here, but let’s do a brief demo of the model using feature importance as a guide. The average age of the Spurs was two years above the NBA average, with the youngest members of the team being 22, 22, and 25. The Spurs led the NBA in assists that year and were second to last in fouls per game. They were also top ten in fewest turnovers per game and in free throw shooting. Now, for Devin Booker. You’re probably expecting me to explain why he is a bad team player and doesn’t deserve a spot in the playoffs. That’s not the case. By every leading indicator in feature importance, Devin seems like he belongs in the playoffs. However, let’s remember two things: my individual player model was less accurate, and Booker sure seems like an anomaly. I think there is a bigger story here, though. Basketball is a team sport. Having LeBron James as your lone star can only take you so far (it’s totally different for Michael Jordan, because Jordan is ten times better, and I have no bias at all as a Chicago native). That’s why people love teams like the San Antonio Spurs. They appear to be declining now, as leaders such as Tim Duncan and Kawhi Leonard have recently left the team, but basketball is still largely a team sport, and the team prediction model was fairly accurate.
Further, talent alone does not seem to be enough either. Every year, teams that make the playoffs tend to have older players, while talent and athleticism are generally skewed toward younger players. Given these insights, I’m excited to have some fresh eyes with which to follow the NBA playoffs.

Thanks for reading.

Have a great day!


Sell Phones

Understanding the main influences and driving factors behind how a phone’s price is determined.

Apple's Surprise iPhone 12 Pro Upgrade Suddenly Confirmed

Introduction

If you made it this far, it means you were not completely turned off by my terrible attempt at a pun in the title, and I thank you for that. This blog will probably be a quick one. Recently, I have been seeing some buzz about the phone pictured above: the alleged iPhone 12. While every new iPhone release and update is exciting (yes – I like Apple and have never had an Android), I have long been of the opinion that there hasn’t been much major innovation in cell phones in recent years. While there have been new features and camera upgrades, few moments stick out as much as when Siri was first introduced. (I remember trying to use voice recognition on my high school phone before everyone had an iPhone.) Let’s look at the following graph for a second:

The graph above indicates that as the popularity of smartphones has gone up, prices have declined. I think there are many influences here, but surely the lack of major innovation is one reason why no one is capitalizing on this obvious trend and creating phones that can command higher prices because of what they bring to the table. This led me to my next question: what actually drives phone pricing? iOS 9 still sticks out to me all these years later because I loved the music app. However, it could be that others care nothing about the music app and are more interested in features like camera quality, so I was really interested in seeing what actually mattered. Conveniently, I was able to find some data on kaggle.com and could investigate further.

Data Preparation

I mentioned my source: kaggle.com. My data contained the following features as described by kaggle.com (I’ll add descriptions in parentheses when needed, but present the names as they appear in the data): battery_power, blue (presence of Bluetooth), clock_speed (microprocessor speed), dual_sim (dual SIM support or not), fc (front camera megapixels), four_g (presence of 4G), int_memory (storage), m_dep (mobile depth in cm), n_cores (number of processor cores), pc (primary camera megapixels), px_height (pixel resolution height), px_width, ram, sc_h (screen height), sc_w (screen width), talk_time (longest possible call before the battery runs out), three_g (presence of 3G), touch_screen, wi_fi. The target feature is price range, which consists of 4 classes, each representing an increasing general price range. In terms of filtering out correlation – there was surprisingly no high correlation to filter out.

There were some other basic elements of data preparation that are not very exciting so will not be described here. I would like to point out one weird observation, though:

In those two smaller red boxes, it appears that some phones have a screen width of zero. That doesn’t really make any sense.

Model Data

I ran a couple of different models and picked the logistic regression model because it worked best. You may notice, when I post my confusion matrix and metrics, that precision = recall = F1 = accuracy. That is always the case for micro-averaged metrics on single-label multiclass data, since every prediction counts exactly once in each score. Even after filtering outliers, I also had a strong balance in my target classes.
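To see why those four numbers coincide, here is a tiny example with made-up four-class labels showing that micro-averaged precision, recall, and F1 all equal accuracy on single-label multiclass data:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy 4-class price-range predictions (labels are illustrative).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

# With micro-averaging every prediction counts exactly once, so
# precision, recall, F1, and accuracy reduce to the same number.
acc = accuracy_score(y_true, y_pred)
p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
assert acc == p == r == f1
print(acc)
```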

In terms of determining what drives pricing by using coefficients determined in logistic regression:

I think these results make sense. At the top you see things like RAM and battery power. At the bottom you see things like touch screen and 3G. Almost every phone includes those features nowadays, and there is nothing special about a phone having them.

Conclusion

We now have an accurate model that can classify the price range of any phone given its features, and we also know which features have the biggest impact on price. If companies are stuck on innovation, they could just keep upping the game with things like higher RAM, I suppose. I think the next steps in this project are to collect more data on features not yet included, and to take a closer look at the specific differences between each iteration of the iPhone as well as how the operating systems they run on have evolved. I actually saw a video today about iOS 14, and it doesn’t look all that innovative. I am curious, though, to see what will happen as Apple begins to develop its own chips in place of Intel’s. At the very least, we can use the model established here as a baseline for understanding what will allow companies to charge more for their phones in the absence of more impactful innovation.

Thanks for reading and have a great day!


Out Of Office

Predicting Absenteeism At Work


Introduction

It’s important to keep track of who does and does not show up to work when they are supposed to. I found some interesting data online that records how many hours of work, on a scale of 0 to 40, an employee is expected to miss in a given week. I ran a couple of models and came away with some insights about my best achievable accuracy and about which features are most predictive of the time an employee is expected to miss.

Process

  1. Collect Data
  2. Clean Data
  3. Model Data

Collect Data

My data comes from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). While I will not go through this part in full detail, the link above explains the numerical encoding of “reason for absence.” The features of the data set, other than the target feature of time missed, were: ID, reason for absence, month, age, day of week, season, distance from work, transportation cost, service time, work load, percent of target hit, disciplinary failure, education, social drinker, social smoker, pet, weight, height, BMI, and son. I’m not entirely sure what “son” means. So now I was ready for some data manipulation. Before that, however, I performed some exploratory data analysis, with some custom columns created by binning the variables with many unique values, such as transportation expense.

EDA

First, I have a histogram of each variable.

After filtering outliers, the next three histogram charts describe the distribution of variables in cases of missing low, medium, and high amounts of work, respectively.

Low
Medium
High

Below, I have displayed a sample of the visuals produced in my exploratory data analysis which I feel tell interesting stories. When an explanation is needed it will be provided.

(0 and 1 are binary for “Social Drinker”)
(The legend refers to distance from work)
(0 and 1 are binary for “Social Drinker”)
(The legend refers to transportation expense to get to work)
(The legend reflects workload numbers)
(0 and 1 are binary for “Social Drinker”)
Histogram
(Values adjusted using Min-Max scaler)

This concludes the EDA section.

Hypothesis Testing

I’ll do a quick run-through of some of the hypothesis testing I performed and what I learned. I examined the seasons of the year to see if there was a discrepancy between the absences observed in summer and spring vs. winter and fall; what I found was that there wasn’t much evidence of a difference. I found, with high statistical power, that people with higher travel expenses tend to miss more work. The same was true of people with longer distances to work. Transportation cost and distance to work also have a moderate effect on service time at the company. Age has a moderate effect on whether people tend to smoke or drink socially, but not enough for statistical significance. In addition, there appears to be little correlation between time at the company and whether or not targets were hit; however, this test has low statistical power and a p-value somewhat close to 5%, implying that an adjusted alpha could change how we view it, both in terms of type 1 error and statistical power. People with less education also tend to drink more, and education has a moderate correlation with service time. Anyway, that is a very quick recap of the main hypotheses I tested, boiled down to the easiest way to communicate their findings.
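As an illustration of the kind of test involved (the numbers below are synthetic, not the actual absenteeism data), a two-sample t-test comparing absence hours across travel-expense groups might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weekly absence hours for high- vs low-travel-expense groups.
high_expense = rng.normal(loc=9.0, scale=2.0, size=40)
low_expense = rng.normal(loc=7.0, scale=2.0, size=40)

# Two-sample t-test: is the difference in mean absence hours significant?
t_stat, p_value = stats.ttest_ind(high_expense, low_expense)
print(t_stat, p_value)
```

A p-value below the chosen alpha (commonly 5%) would support the claim that higher-expense employees miss more work.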

Clean Data

I started out by binning variables with wildly uneven distributions. Next, I used categorical data encoding to encode all my newly binned features. I then applied scaling so that all the data would fall within 3 standard deviations of each variable’s mean. Having filtered out misleading values, I binned my target variable into three groups. Finally, I removed correlated features. I will go back and discuss some of these topics later in this blog when I cover the difficulties I faced.
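A small sketch of hand-picked (non-quantile) binning with pandas, using made-up hour values and cutoffs:

```python
import pandas as pd

# Hypothetical absenteeism hours, binned into three labeled groups using
# hand-picked cutoffs rather than quantiles.
hours = pd.Series([0, 1, 2, 3, 4, 8, 8, 16, 24, 40])
binned = pd.cut(hours, bins=[-1, 3, 10, 40],
                labels=["low", "medium", "high"])
print(binned.value_counts().to_dict())
```

`pd.qcut` would instead force roughly equal-sized bins, which is exactly the quantile behavior I decided against.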

Model Data

My next step was to split and model my data. One problem came up: I had a huge imbalance among my classes. The “lowest amount of work missed” class had far more rows than the other two. Therefore, I synthetically created new data so that every class had the same number of cases. To find my ideal model, I applied 6 types of scaling across 9 models (54 combinations) and found that my best model would be a random forest. I even found that adding polynomial features would give me near-100% accuracy on training data without much loss on test data. Anyway, going back to my random forest model: the features most indicative of time missed, in order from biggest indicator to smallest, were month, reason for absence, work load, day of the week, season, and social drinker. The other features are obviously still present, but they provide less information.
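The exact upsampling technique isn’t pinned down above, so here is a minimal sketch using simple random oversampling with scikit-learn; SMOTE, which synthesizes new points rather than duplicating existing ones, is a common alternative. The class sizes are made up for illustration.

```python
import numpy as np
from collections import Counter
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Imbalanced toy classes: far more "low absence" rows than the others.
X = rng.normal(size=(50, 4))
y = np.array([0] * 30 + [1] * 10 + [2] * 10)

# Upsample each minority class (with replacement) to the majority count.
target = max(Counter(y).values())
X_parts, y_parts = [], []
for cls in np.unique(y):
    X_c, y_c = X[y == cls], y[y == cls]
    X_up, y_up = resample(X_c, y_c, replace=True,
                          n_samples=target, random_state=0)
    X_parts.append(X_up)
    y_parts.append(y_up)
X_bal, y_bal = np.vstack(X_parts), np.concatenate(y_parts)
print(Counter(y_bal))
```

Note that, as discussed below, this is applied to training data only; test data stays at its natural class balance.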

Problems Faced in Project

The first problem I had was not looking at the distribution of the target variable. It is a continuous variable, but there are very few values in certain ranges. I therefore split it into three bins: missing little work, a medium amount of work, and a lot of work. I experimented with two bins as well as different cutoff points, but three bins worked better. This also affected my upsampling, as the different binning methods resulted in different class breakdowns. The next problem was a similar one: how would I bin the feature variables? In short, I tried a couple of ways and found that three bins worked well here too. None of this binning was done using quantiles, by the way; quantile binning would imply no target class imbalance, which was not the case, and when I tried it I did not find it effective. I also experimented with different categorical feature encodings, but found that the most effective method was to bin based on mean value in connection with the target variable (check my home page for a blog about that concept). I ran a grid search to optimize my random forest at the very end and then printed a confusion matrix. It was not good, but I need to be intellectually honest. At predicting when someone would fall into class 0 (“missing a low amount of work”), my model was amazing, and its recall exceeded its precision. However, it did not work well on the other two classes. Keep in mind that you do not upsample test data, so this could be a total fluke, but it was still frustrating to see. An obvious next step is to collect more data and continue to improve the model. One last idea I want to talk about is exploratory data analysis. To be fair, this discussion could be inserted into any blog. Exploratory data analysis is both fun and interesting, as it allows you to be creative and take a dive into your idea using visuals as quick story-tellers.
The project I had scrapped just before acquiring the data for this one drove me kind of crazy because I didn’t really have a plan for my exploratory data analysis. It was arbitrary and unending, and that is never a good plan. EDA should be planned and thought out. I will talk more (or have talked more, depending on when you read this) about this in another blog, but the main point is that you want to think of yourself as a person who doesn’t program and who just wants to ask questions based on the names of the features. Having structure in place for EDA is less free-flowing and exciting than having none, but it ensures that you work efficiently and have a good start point as well as a stop point. That saved me a lot of stress.

Conclusion

It’s time to wrap things up. At this point, I think I would need more data to continue to improve this project, and I’m not sure where that data would come from. In addition, there are a lot of ambiguities in this data set such as the numerical choices for reason for absence. Nevertheless, I think that by doing this project I learned how to create an EDA process and how to take a step back and rephrase your questions as well as rethink your thought process. Just because a variable is continuous, this does not imply it requires regression analysis. Think about your statistical inquiries as questions, think about what makes sense from an outsider’s perspective, and then go crazy!


Feature Selection in Data Science


Introduction

Oftentimes, when addressing data sets with many features, reducing features and simplifying your data can be helpful. One common juncture where you remove a lot of features is correlation filtering, with a cutoff around 70% or so (having highly correlated variables usually leads to overfitting). However, you can continue to reduce features and improve your models by deleting features that not only correlate with each other but also… don’t really matter. A quick example: imagine I was trying to predict whether someone might get a disease, and the information I had was height, age, weight, wingspan, and favorite song. I might have to remove height or wingspan, since they probably have a high degree of correlation, so correlation filtering would get rid of one of those two. Favorite song, on the other hand, likely has no impact on anything one would care about, yet it would not be removed by a correlation filter. If there are features like that, which are irrelevant or can be mathematically shown to have little impact, we should be able to delete them too. There are various methods one could use to accomplish this task. This blog will outline a couple of them, particularly: Principal Component Analysis, Recursive Feature Elimination, and Regularization. The ideas, concepts, benefits, and drawbacks will be discussed, and some coding snippets will be provided.
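A minimal sketch of that 70% correlation filter, on synthetic data where wingspan nearly duplicates height (names echo the disease example above and are purely illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy data: height and wingspan are near-duplicates; song_id is noise.
height = rng.normal(170, 10, 200)
df = pd.DataFrame({
    "height": height,
    "wingspan": height * 1.03 + rng.normal(0, 1, 200),
    "age": rng.normal(40, 12, 200),
    "song_id": rng.integers(0, 1000, 200),
})

# For each pair with |correlation| > 0.7, drop the later column of the pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

Notice that song_id survives this filter even though it is useless, which is exactly why the techniques below are needed on top of correlation filtering.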

Principal Component Analysis (PCA)

So, just off the bat, PCA is complicated and involves a lot of backend linear algebra that I don’t even fully understand myself. This is not a blog about linear algebra, it’s a blog about making life easier, so I plan to keep this discussion at a pretty high level. First, a prerequisite: scale your data. Scaling data is a process of reducing impact based on magnitude alone and bringing all your variables into roughly the same range of values. If one variable represents GDP and another represents the year a country was founded, you can’t compare them easily, as one is far larger in magnitude than the other. There are various ways to scale variables, and I have a separate blog about that if you would like to learn more. For PCA, though, we always apply standard scaling: take each value of a variable, subtract the variable’s mean, and divide by its standard deviation. The effect is that every variable ends up centered at 0 with a standard deviation of 1 (note that individual values are not confined to [-1, 1]; most simply fall within a few standard deviations of zero). Next, as discussed above, we filter out correlated variables. Okay, so now things get real. We’re ready for the hard part. The first important idea to understand, however, is what a principal component is. Principal components are new features, each of which is a weighted (linear) combination of the original features. So if I have the two features weight and height, a component might be something like 0.7 × weight + 0.3 × height. Unfortunately, as we will discuss more later, none of the new components that replace our features has a name; each is just assigned a numeric label such as 0 or 1 (or 2 or…). While we don’t maintain feature names, the ultimate goal is to make life easier. So once we have transformed the structure of our features, we want to find out how many of the new components we actually need and how many are superfluous.
Okay, so we know what a principal component is and what purpose it serves, but how are the components constructed in the first place? We know they are derived from our initial features, but not where they come from. I’ll start by saying this: the number of principal components created always matches the number of features, but we can easily see with visualization tools which ones we plan to delete. The short answer to where these things come from is that for each dimension (feature) in our data, we have corresponding linear algebra objects called eigenvectors and eigenvalues, which you may remember from linear algebra. If you don’t: given a square matrix A with a non-zero determinant, multiplying A by an eigenvector v yields the same result as scaling v by a scalar λ, the eigenvalue (that is, Av = λv). The story these objects tell is apparent when you perform linear transformations: when the transformation reshapes space, the eigenvectors keep their direction and are merely stretched by a factor of λ. That may sound confusing, and it’s not critical to understand completely, but I wanted to leave a short explanation for those more familiar with linear algebra. What matters is that calculating these quantities in the context of data science (on the data’s covariance matrix) gives us information about our features. The eigenvalues with the highest magnitude correspond to the eigenvectors with the most impact on explaining variance. Source two below puts it this way: “Eigenvectors are the set of basis functions that are the most efficient set to describe data variability. The eigenvalues is a measure of the data variance explained by each of the new coordinate axis.” What’s important to keep in mind is that we use the eigenvalues to tell us which of the new, unnamed, transformed features matter most.
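For the linear-algebra-inclined, the defining relationship Av = λv is easy to verify numerically on a small symmetric matrix (a covariance matrix is always symmetric):

```python
import numpy as np

# A small symmetric matrix, standing in for a 2-feature covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: for each pair, A @ v equals lambda * v.
eigenvalues, eigenvectors = np.linalg.eigh(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
print(eigenvalues)
```

Here the larger eigenvalue (3) marks the direction that would explain the most variance, which is exactly how PCA ranks its components.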

Code (from a fraud detection model)

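Since the original screenshots aren't reproduced here, below is a minimal sketch of the same scale-then-PCA workflow on synthetic data (this is my own illustrative code, not the fraud model itself):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)  # a nearly redundant feature

# Step 1: standard-scale so no feature dominates on magnitude alone.
X_scaled = StandardScaler().fit_transform(X)

# Step 2: fit PCA and transform the data into the new (unnamed) components.
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Step 3: inspect how much variance each component explains; the cumulative
# sum tells you how many components you can keep before the rest are superfluous.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```

Because one feature is nearly a copy of another, the last component explains almost no variance and can be dropped with very little information loss.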

Recursive Feature Elimination (RFE)

RFE is different from PCA in the sense that it starts from a model you already have and works backward so you can run a new, leaner model. What I mean by this is that RFE assumes you have a model in place and then uses that model's feature importances or coefficients to rank features. If you were running a linear regression model, for example, you would instantiate the model, fit it, find the variables with the highest coefficients, and drop all the others. This also works with a random forest classifier, for example, which has an attribute called feature_importances_. Usually, I like to find what model works best and then run RFE using that model. RFE then iterates: it fits the model, eliminates the weakest features, refits on what remains, and repeats until it has solved for the features that matter most.

Code

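In place of the original screenshots, here is a minimal RFE sketch using a random forest on synthetic data (the dataset and parameter choices are my own, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, only 3 of which actually carry signal.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the estimator, drops the weakest feature
# (judged by feature_importances_ here), and refits until 3 remain.
selector = RFE(estimator=RandomForestClassifier(random_state=0),
               n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that survived
print(selector.ranking_)   # rank 1 = kept; larger ranks were eliminated earlier
```

Swapping in a different estimator (say, LogisticRegression) is a one-line change, which is why I like to settle on my best model first and then hand it to RFE.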

Regularization (Lasso and Ridge)

Regularization is a process designed to reduce overfitting in regression models by penalizing models for having excessive and misleading predictive features. According to Renu Khandelwal (see below): "When a model tries to fit the data pattern as well as noise then the model has a high variance a[n]d will be overfitting… An overfitted model performs well on training data but fails to generalize." The point is that the model works when you train it but does not deal well with new data. Let's think back to the "favorite song" feature I proposed earlier. If we were to survey people likely to get a disease and find they all have the same favorite song, while this would certainly be interesting, it would be pretty pointless. The real problem arises when we encounter someone who likes a different song but checks off every other box: the model might say this person is unlikely to get the disease. Once we get rid of this feature, well, now we're talking, and we can focus on the real predictors. So we know what regularization is (a method of placing focus on more predictive features and penalizing models that have excessive features), and we know why we need it (overfitting); we don't yet know how it works. Let's get started. In gradient descent, one key term you'd better know is "cost function." It's a mildly complex topic, but basically it tells you how much error is in your model by measuring the differences between the predicted values and the true values (usually the squared differences) and summing them up. You then use calculus to minimize this cost function, finding the coefficients that produce the least error. Now keep in mind that the cost function captures every variable and the error each one contributes. In regularization, an extra term is added to that cost function which penalizes large coefficients.
So the outcome is that you still minimize your cost function and find the coefficients of a regression, but you have now reduced overfitting by shrinking your coefficients using a penalty strength (often called lambda, or alpha in Scikit-Learn) and thus have produced more descriptive coefficients. So what is this ridge and lasso business? Well, there are two common ways of performing regularization (there is a third, less common way which basically combines both). In ridge regularization, the added penalty is proportional to the squared magnitude of each coefficient. We call this L2. Lasso, or L1, is very similar, except its penalty uses the absolute value of each coefficient. The difference in effect is that lasso regularization may actually remove features completely. Not just decrease their impact, but actually drive their coefficients to zero. So ridge may decrease the impact of "favorite song" while lasso would likely remove it completely. In this sense, I believe lasso more closely resembles PCA and RFE than ridge does. In Khandelwal's summary, she mentions that L1 deals well with outliers but struggles with more complex cases, while ridge behaves the opposite way on both accounts. I won't get into that third case I alluded to above. It's called Elastic Net, and you can use it if you're unsure whether you want ridge or lasso. That's all I'm going to say… but I will provide code for it.

Code

(Quick note: alpha is the parameter that determines how strong a penalty is placed on the regression.)

I’ll also quickly add a screenshot to capture context. The variables will not be displayed, but one should instead pay attention to the extreme y (vertical axis) values and see how each type of regularization affects the resulting coefficients.

Initial visual:

[Plot of the initial, unregularized coefficients.]

Ridge

[Ridge code, coefficient plot, and accuracy output; the accuracy here is up from 86% without regularization.]
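In place of the screenshot, a minimal ridge sketch on synthetic data (my own illustrative example, not the original model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha = strength of the L2 penalty

# The L2 penalty shrinks the coefficients toward zero without removing any of them.
print(np.linalg.norm(plain.coef_))
print(np.linalg.norm(ridge.coef_))   # smaller overall magnitude
```

Cranking alpha up shrinks the coefficients further; alpha=0 recovers plain least squares.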

Lasso

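Again in place of the screenshots, a minimal lasso sketch (my own illustrative example) showing the key difference from ridge, namely that coefficients can be driven all the way to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 5 of them actually drive the target.
X, y = make_regression(n_samples=100, n_features=20,
                       n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # alpha = strength of the L1 penalty

# Unlike ridge, the L1 penalty zeroes some coefficients out entirely,
# removing those features from the model.
kept = int(np.sum(lasso.coef_ != 0))
print(f"{kept} of 20 features survived")
```

This is the sense in which lasso acts as a feature selector: a "favorite song" column would most likely end up with a coefficient of exactly zero.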

Elastic Net

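And, as promised, an elastic net sketch (again my own illustrative example). The l1_ratio parameter blends the two penalties: 1.0 is pure lasso, 0.0 is pure ridge:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20,
                       n_informative=5, noise=5.0, random_state=0)

# alpha sets the overall penalty strength; l1_ratio=0.5 is an even
# mix of the L1 (lasso) and L2 (ridge) penalties.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```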

Conclusion

Data science is by nature a bit messy, and inquiries can get out of hand very quickly. By reducing features, you not only make the task at hand easier to deal with and less intimidating, but you tell a more meaningful story. To get back to an earlier example, I really don't care whether everyone whose favorite song is "Sweet Caroline" is at risk for a certain disease or not. Having that information is not only useless, but it will also make your models worse. Here, I have provided a high-level road map to improving models and distinguishing important information from superfluous information. My advice to any reader is to get in the habit of reducing features and honing in on what matters right away. As an added bonus, you'll probably get to make some fun visuals, if you enjoy that sort of thing. I personally spent some time designing a robust function that can handle RFE pretty well in many situations. While I don't have it posted here, it is likely all over my GitHub. It's really exciting to get output and learn what does and doesn't matter in different inquiries. Sometimes the variable you think matters most… doesn't matter much at all, and sometimes variables you don't think matter will matter a lot (not that correlation always equals causation). Take the extra step and make your life easier.

That wraps it up.

——————————————————————————————————————–

Sources and further reading:

(https://builtin.com/data-science/step-step-explanation-principal-component-analysis)

(https://math.stackexchange.com/questions/23312/what-is-the-importance-of-eigenvalues-eigenvectors)

(https://medium.com/@harishreddyp98/regularization-in-python-699cfbad8622)

(https://www.youtube.com/watch?v=PFDu9oVAE-g)

(https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2 )

(https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643)


Feature Scaling In Machine Learning

Accounting for the Effect of Magnitude in Comparing Features and Building Predictive Models


Introduction

The inspiration for this blog post comes from some hypothesis testing I performed on a recent project. I needed to put all my data on the same scale in order to compare it. If I wanted to compare the population of a country to its GDP, for example, well… it doesn’t sound like a good comparison in the sense that those are apples and oranges. Let me explain. Say we have the U.S. as our country. The population in 2018 was 328M and the GDP was $20T. These are not easy numbers to compare. By scaling these features you can put them on the same level and test relationships. I’ll get more into how we balance them later. However, the benefits of scaling data extend beyond hypothesis testing. When you run a model, you don’t want features to have disproportionate impacts based on magnitude alone. The fact is that features come in all different shapes and sizes. If you want to have an accurate model and understand what is going on, scaling is key. Now you don’t necessarily have to do scaling early on. It might be best after some EDA and cleaning. Also, while it is important for hypothesis testing, you may not want to permanently change the structure of your data just yet.

I hope to use this blog to discuss the scaling options available in the Scikit-Learn library in Python.

Plan

I am going to list all the options listed in the Sklearn documentation (see https://scikit-learn.org/stable/modules/preprocessing.html for more details). Afterward, I will provide some visuals and tables to understand the effects of different types of scaling.

  1. StandardScaler
  2. MaxAbsScaler
  3. MinMaxScaler
  4. RobustScaler
  5. PowerTransformer
  6. QuantileTransformer

But First: Generalized Code

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

df['scaled_data'] = ss.fit_transform(df[['data']])

This code can obviously be generalized to fit other scalers.

Anyway… let's get started.

Standard Scaler

The standard scaler corresponds to standardization in statistics. Every value has the feature's mean subtracted from it, and the result is divided by the feature's standard deviation. The general effect is that the data ends up with a mean of zero and a standard deviation of one.
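As a quick sanity check (my own example, not from the original post), the scaler matches doing that arithmetic by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaled = StandardScaler().fit_transform(data)

# Equivalent by hand: subtract the mean, divide by the (population) std.
manual = (data - data.mean()) / data.std()
assert np.allclose(scaled, manual)

print(scaled.mean(), scaled.std())  # ~0.0 and 1.0
```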

Min Max Scaler


The min max scaler effectively compresses your data to [0,1]: every value has the feature's minimum subtracted from it and is then divided by the range (max minus min). One caveat is that a feature with no spread (max equal to min) would mean dividing by zero. More importantly, it does not deal well with outliers: a single extreme value compresses everything else into a tiny sliver of the interval.

Max Abs Scaler

Here, you divide every value by the maximum absolute value of that feature. Effectively all your data gets put into the [-1,1] range.

Robust Scaler

The robust scaler is designed to deal with outliers. It centers each value on the feature's median and scales by the interquartile range (IQR); you can also specify your own quantiles to use for scaling. What does that mean? If your data follows a standard normal distribution (mean 0, standard deviation 1), the 25% quantile is about -0.674 and the 75% quantile is about 0.674 (symmetry is not usually the case; this distribution is special). So once you hit -0.674, you have covered 1/4 of the data. By 0, you hit 50%, and by 0.674, you hit 75% of the data. Q1 represents the lower of those two quartiles, and the IQR is the distance between them. It's similar in spirit to min-max scaling but limits how much outliers affect the majority of your data.
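To see why this matters (an illustrative example of my own), compare the robust scaler with min-max scaling on data containing one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One extreme outlier (1000) in otherwise small data.
data = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(data)
robust = RobustScaler().fit_transform(data)  # centers on the median, scales by the IQR

# Min-max crushes the ordinary values into a tiny sliver near 0...
print(minmax.ravel())
# ...while the robust scaler keeps them nicely spread (median 3, IQR 2)
# and simply lets the outlier sit far away.
print(robust.ravel())
```

Here the robust-scaled ordinary values land at -1, -0.5, 0, and 0.5, while under min-max they are all squeezed below 0.004.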

Power Transform

According to Sklearn’s website (https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html):

"PowerTransformer applies a power transformation to each feature to make the data more Gaussian-like. Currently, PowerTransformer implements the Yeo-Johnson and Box-Cox transforms. The power transform finds the optimal scaling factor to stabilize variance and minimize skewness through maximum likelihood estimation. By default, PowerTransformer also applies zero-mean, unit variance normalization to the transformed output. Note that Box-Cox can only be applied to strictly positive data. Income and number of households happen to be strictly positive, but if negative values are present the Yeo-Johnson transformed is to be preferred."

Quantile Transform

The Sklearn website describes this as a method that maps each feature (independently, of course) onto a target distribution: uniform by default, or normal if you request it. One interesting effect is that this is not a linear transformation, so it may change how certain variables interact with one another. In other words, if you were to plot the values and simply adjust the scale of the axes to match the new scale of the data, it would likely not look the same.

Visuals and Show-and-Tell

I'll start with my first set of random data. Column "a" is the initial data (with a description in the cell above) and the other columns are the transforms (where a prefix like maa indicates MaxAbsScaler).

This next output shows 9 models’ accuracy scores across four types of scaling. I recommend every project contain some type of analysis that resembles this to find your optimal model and optimal scaling type (note: Ran = random forest, Dec = decision tree, Gau = Gaussian Naive Bayes, Log = logistic regression, Lin = linear svm, SVC = support vector machine, SGD = stochastic gradient descent, XGB = xgboost, KNe = K nearest neighbors. You can read more about these elsewhere… I may write a blog about this topic later).

More visuals…

I also generated a set of random data that does not relate to any real world scenario (that I know of) to visualize how these transforms work. Here goes:

So I’ll start with the original data, show everything all together, and then break it into pieces. Everything will be labeled. (Keep in mind that the shape of the basic data may appear to change due to relative scale. Also, I have histograms below which show the frequency of a value in a data set).
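Since the plots and tables themselves aren't reproduced here, below is a sketch of how such a comparison can be generated. The random data and the column prefixes (ss, mms, mas, rs, pt, qt) are my own illustrative choices, in the spirit of the naming convention mentioned above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, RobustScaler, StandardScaler)

rng = np.random.default_rng(0)
a = rng.exponential(scale=10.0, size=(500, 1))  # skewed random data

scalers = {
    "ss": StandardScaler(),
    "mms": MinMaxScaler(),
    "mas": MaxAbsScaler(),
    "rs": RobustScaler(),
    "pt": PowerTransformer(),
    "qt": QuantileTransformer(output_distribution="normal", n_quantiles=500),
}

# One column of raw data plus one column per scaler, side by side.
df = pd.DataFrame({"a": a.ravel()})
for prefix, scaler in scalers.items():
    df[f"{prefix}_a"] = scaler.fit_transform(a)

print(df.describe().round(2))
```

The summary table makes the differences easy to read off: the min-max column lives in [0,1], the standard-scaled column has mean 0 and standard deviation 1, and the power and quantile transforms pull the skewed data toward a bell shape. Plotting a histogram of each column reproduces the kind of visuals described here.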

Review

What I have shown above is how one individual feature may be transformed in different ways and how that data adjusts to a new interval (using histograms). What I have not shown is a visual of how moving many features onto one uniform interval can happen. While this is hard to visualize, I would like to provide the following data frame to give an idea of how scaling features of different magnitudes can change your data.

Conclusion

Scaling is important and essential to almost any data science project. Variables should not have their importance determined by magnitude alone. Different types of scaling move your data around in different ways and can have moderate to meaningful effects depending on which model you apply them to. Sometimes you will need one specific method of scaling (see my blog on feature selection and principal component analysis). If that is not the case, I would encourage trying every type of scaling and surveying the results. I recently worked on a project where I leveraged feature scaling to create a metric for how valuable individual hockey and basketball players are to their teams, compared to the rest of the league, on a per-season basis. Clearly, the virtues of feature scaling extend beyond just modeling purposes. In terms of models, though, I would expect feature scaling to change outputs and results such as coefficients. If this happens, focus on relative relationships. If one coefficient is at… 0.06 and another is at… 0.23, that tells you one feature is nearly four times as impactful on the output. My point is: don't let the change in magnitude fool you. You will find a story in your data.

I appreciate you reading my blog and hope you learned something today.
