5 Statistical Traps Data Scientists Should Avoid



Fallacies are what we call the results of faulty reasoning. Statistical fallacies, a form of misuse of statistics, are instances of poor statistical reasoning: you may have started off with sound data, but your use or interpretation of it, however pure your intent, has gone awry. Whatever decisions you base on these wrong moves will therefore be unsound.

 

There are countless ways to reason incorrectly from data, some of which are much more obvious than others. Given that people have been making these mistakes for so long, many statistical fallacies have been identified and can be explained. The good thing is that once they are identified and studied, they can be avoided. Let's have a look at a few of the more common fallacies and see how we can avoid them.

Out of interest, when misuse of statistics is not intentional, the process bears a resemblance to cognitive biases, which Wikipedia defines as "tendencies to think in certain ways that can lead to systematic deviations from a standard of rationality or good judgment." A statistical fallacy builds incorrect reasoning on top of data and its explicit, active analysis, while a cognitive bias reaches a similar outcome much more implicitly and passively. That distinction is not hard and fast, as there is definitely overlap between the two phenomena. The end result is the same, however: plain ol' wrong.

Here are five statistical fallacies — traps — which data scientists should be aware of and definitely avoid. The failure to do so will be catastrophic in terms of both data outcomes and a data scientist's credibility.

 



1. Cherry Picking

To demonstrate just how obvious and simple statistical fallacies can be, let's start off with the classic which everyone should already know: cherry picking. We can put this in the same category as other easily recognizable fallacies, such as the Gambler's Fallacy, False Causality, biased sampling, overgeneralization, and many others.

The idea of cherry picking is a simple one, and something you have definitely done before: the intentional selection of data points which help support your hypothesis, at the expense of other data points which either do not support your hypothesis or actively oppose it. Have you ever heard a politician talk? Then you've heard cherry picking. Also, if you are a living, breathing human being, you have cherry picked data at some point in your life. You know you have. It's often tempting, a piece of low-hanging fruit which can win over or confound an opponent in a debate, or help push your agenda at the expense of an opposing view.

Why is it bad? Because it's dishonest, that's why. If data is truth, and analysis of data using statistical tools is supposed to help unearth truth, then cherry picking is the antithesis of truth-seeking. Don't do it.

 



2. McNamara Fallacy

The McNamara Fallacy is named after former US Secretary of Defense Robert McNamara, who, during the Vietnam War, based his decisions on quantitative metrics which were easily obtainable while ignoring everything else. This led to his treatment of body counts (an easily obtainable metric) as the sole indicator of success, at the expense of every other measure, including those that were hard or impossible to quantify.

Without expending much mental effort, it should be relatively straightforward to see how a simple body count comparison could lead you astray when evaluating your performance on the battlefield. As one simple example, perhaps the enemy is pushing into your territory with disproportionate numbers of fighters and taking control as it does, but is losing slightly more fighters than you are in the process. As another, perhaps the enemy is taking your fighters prisoner at a much higher rate than you are killing theirs. And so on.

Putting the statistical blinders on and placing all of your trust in a single, simple metric wasn't good enough to paint a full picture of what was happening in Vietnam, and it's not going to paint a full picture of whatever it is that you are doing.

 



3. Cobra Effect

The Cobra Effect occurs when an attempted solution to a problem has an unintended consequence that makes the problem worse. The name comes from a specific instance of the phenomenon which took place in India under British colonial rule and which involved, you guessed it, cobras.

The Wikipedia page has a few examples of the Cobra Effect, my favorite being the attempt to reduce pollutants in Mexico City in the late 1980s. The government intended to lower vehicle emissions by keeping each car off the road one weekday per week, based on the last digit of its license plate, which took roughly 20% of vehicles out of circulation on any given weekday. To circumvent this policy, residents of the city purchased additional vehicles with different license plates, so that they would have a permissible alternative on the days their primary cars were banned. This led to a flood of often cheaper cars into the city, and ultimately made the pollution problem worse.

This is a much trickier issue than cherry picking, given the latent and often difficult-to-predict nature of unintended consequences. Team approaches to data science, and the additional perspectives these extra individuals bring, are a good way to combat Cobra Effect creep.

 



4. Simpson's Paradox

This paradox, named after British statistician Edward H. Simpson (though it had previously been identified by other individuals), refers to the observation of a trend in the subgroups of a dataset which disappears or reverses once these subgroups are combined. In this sense, it can be thought of as unintentional cherry picking. An example from baseball can help to illustrate.

If we compared the batting averages of a pair of professional ballplayers year by year, we might find years in which player A had a higher batting average than player B, perhaps even significantly higher. It is entirely possible, however, that their batting averages over the entirety of their careers would show that player B actually had the higher average, perhaps even significantly higher. In the strict form of the paradox, player A can lead in every individual year and still trail over the full career, because the at-bats are spread unevenly across the years.

If you knew this ahead of time and selectively chose years X, Y, and Z as evidence that player A was a better player, that would be cherry picking. If you were not aware of the aggregate statistic, but chanced upon those individual isolated years and took them as representative of their entire careers — but (hopefully) found out otherwise once looking at the full statistical picture — that would be an example of Simpson's Paradox.

Both scenarios lead to incorrect outcomes, with one being a more innocent way of arriving at the misinterpretation. It's still wrong, though, and should be guarded against. Full statistical analysis should be part of a data scientist's regimen, and is a robust approach to ensuring you don't succumb to this phenomenon.
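The baseball scenario can be sketched with a few lines of Python. The hits and at-bats below are invented purely for illustration, and the names seasons, batting_average, and career_average are hypothetical helpers, not part of any library. Treat it as a minimal sketch of how the reversal can arise rather than an analysis of real players.

```python
# Hypothetical (hits, at_bats) per season; the numbers are made up for illustration.
seasons = {
    "player_a": {"year_1": (104, 411), "year_2": (45, 140)},
    "player_b": {"year_1": (12, 48), "year_2": (183, 582)},
}


def batting_average(hits, at_bats):
    """Batting average is simply hits divided by at-bats."""
    return hits / at_bats


def career_average(player):
    """Pool hits and at-bats over all seasons before dividing."""
    total_hits = sum(hits for hits, _ in seasons[player].values())
    total_at_bats = sum(at_bats for _, at_bats in seasons[player].values())
    return batting_average(total_hits, total_at_bats)


# Compare the two players season by season, then over both seasons combined.
for year in ("year_1", "year_2"):
    a = batting_average(*seasons["player_a"][year])
    b = batting_average(*seasons["player_b"][year])
    print(f"{year}: A={a:.3f}  B={b:.3f}")

print(f"career: A={career_average('player_a'):.3f}  B={career_average('player_b'):.3f}")
```

With these made-up numbers, player A is ahead in both individual seasons while player B is ahead once the seasons are pooled. The reversal comes from the uneven at-bats: each player's career figure is dominated by a different season.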

 



5. Data Dredging

Data dredging, known by other more ominous names such as p-hacking, is the "misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect." This amounts to performing a wide range of statistical tests on the data and cherry picking significant results in order to advance a narrative (meta cherry picking?). While statistical analysis should move from hypothesis to testing, data dredging involves allowing the results of statistical testing to dictate a conforming hypothesis. It amounts to the difference between "I think this is the case, now I will test if I am correct" and "Let's see what I can make the data say with testing, and then come up with an idea that it helps support."

But why is it wrong? Why are we concerned with forming hypotheses first and then testing them, instead of just letting the data dictate what might be a finding we have not thought to look for? With enough data and enough variables to test for correlations, it doesn't take long to accumulate enough individual combinations that some of them will appear to be significant. If we disregard all of the contrary evidence and focus only on these conforming test results, it can appear that there is something there, when in reality there is no there there; it appears so purely due to chance. Capitalizing on, and justifying, chance is clearly not what science should be about.
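A small simulation can make the role of chance concrete. The sketch below is an assumption-laden toy, not a recipe: it compares two groups drawn from the same distribution over and over, so no real effect exists by construction. It uses a simple two-sided z-test with known unit variance in place of whatever test a real analysis would use. The names null_p_value, N_TESTS, and GROUP_SIZE are hypothetical.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the sketch is repeatable

ALPHA = 0.05
N_TESTS = 1000   # number of unrelated "hypotheses" we dredge through
GROUP_SIZE = 30  # observations per group in each test


def null_p_value(n):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1)."""
    group_a = [random.gauss(0, 1) for _ in range(n)]
    group_b = [random.gauss(0, 1) for _ in range(n)]
    # Variance is known to be 1, so the difference in means has std sqrt(2 / n).
    z = (fmean(group_a) - fmean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = [null_p_value(GROUP_SIZE) for _ in range(N_TESTS)]
false_positives = sum(p < ALPHA for p in p_values)

print(f"{false_positives} of {N_TESTS} tests look 'significant' at alpha={ALPHA}")
```

Since every comparison is pure noise, roughly 5% of the tests should fall below the 0.05 threshold by chance alone. Reporting only those, and inventing a hypothesis to match, is exactly the dredging described above.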



For a related concept, and an approach to determining where the "chance determination line" can be drawn, have a look at the Bonferroni correction.
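As a rough illustration, the Bonferroni correction divides the significance level by the number of tests performed, so each individual test has to clear a stricter bar. The p-values below are invented for the example, and bonferroni_significant is a hypothetical helper written for this sketch, not a library function.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the p-values that remain significant after a Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # each test must beat alpha divided by the number of tests
    return [p for p in p_values if p < threshold]


# Made-up p-values from five tests run on the same data.
p_values = [0.001, 0.008, 0.02, 0.04, 0.3]

naive = [p for p in p_values if p < 0.05]
corrected = bonferroni_significant(p_values, alpha=0.05)

print("significant without correction:", naive)
print("significant with Bonferroni:   ", corrected)
```

With five tests the per-test threshold drops from 0.05 to 0.01, so only two of the four nominally significant results survive. The correction is conservative, which is the price paid for keeping chance findings in check.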
