Data Science Career Paths & Skillsets in 2021
A glimpse of Kaggle’s State of Data Science and Machine Learning Survey
Introduction
Each year, Kaggle conducts a worldwide survey among active users to provide a comprehensive view of the data science (DS) and machine learning (ML) community. After preliminary cleaning, a total of 25,973 responses were recorded for analysis. Following the survey, Kaggle hosts an annual Data Science Survey Challenge that awards top notebook authors who tell a rich story about a subset of the DS and ML community.
For this project, the goal is to tell a data science story about data science professionals working in the United States. The questions raised pertain to comparing and contrasting job titles that fall under the general umbrella of data science, such as data scientist, data analyst, machine learning engineer, and statistician. The key features with which to tell this story include, but are not limited to, salary, preferences in programming languages and environments, hardware/cloud utilization, diversity, and inclusion. Not only are these interesting questions in their own right, but the answers also provide insights to help fellow students in the Data Science and Analytics program at Georgetown University narrow down their job searches.
Data and Cleaning
This research draws data from the 2021 Kaggle Machine Learning & Data Science Survey, conducted anonymously among the Kaggle community from 09/01/2021 to 10/04/2021. The dataset consists of 25,973 valid responses to a questionnaire encompassing 42 questions about personal background and preferences in programming languages, IDEs, cloud services, computing hardware, databases, machine learning packages, and workflows. Given the focus of this data science story, a subset of the data is sliced out as the data of interest.
To give a more precise view of the data science and machine learning community, the scope of the research is narrowed down to Kaggle users from the United States of America. As a result, 2,650 responses (10.2%) are selected from the original dataset.
Also, some non-relevant or overly specific questions are removed from the dataset, and only 21 questions (50%) are used in the analysis below.
The next step of data cleaning is removing NA values. In the questionnaire, some questions accept only a single answer and some allow multiple answers. For single-answer questions, an NA value means the respondent did not answer the question; for example, no students answered the question about salary. For multiple-answer questions, an NA value means the option was not selected by the respondent; for example, in the question about programming languages, users did not select every listed answer.
Besides that, some special answers are treated as uninformative and removed. The first is OTHER, which means the answer is not specified, and the second is NONE, which means none of the listed answers applies to the respondent.
The original salary responses are categorical, with 27 classes in formats like `$150,000-199,999`, so two different transformations of salary are performed:
- Aggregation: the categorical salaries are grouped into 6 levels {poverty, low, medium, high, very high, highest}.
- String split: the categorical salaries are separated into two columns {lower bound, upper bound}.
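The two transformations can be sketched as follows (the report itself works in R; the function names and parsing logic here are an illustrative Python stand-in, with the level cutoffs taken from the brackets defined later in the analysis):

```python
import re

# Lower-bound cutoffs for the six aggregated salary levels.
LEVELS = [(0, "poverty"), (15_000, "low"), (50_000, "medium"),
          (100_000, "high"), (200_000, "very high"), (500_000, "highest")]

def split_bounds(answer):
    """String-split approach: '$150,000-199,999' -> (150000, 199999)."""
    nums = [int(n.replace(",", "")) for n in re.findall(r"[\d,]+", answer)]
    return (nums[0], nums[1]) if len(nums) == 2 else (nums[0], None)

def aggregate(answer):
    """Aggregation approach: map the lower bound onto one of six levels."""
    lower, _ = split_bounds(answer)
    label = LEVELS[0][1]
    for cutoff, name in LEVELS:
        if lower >= cutoff:
            label = name
    return label
```

For example, `aggregate("$150,000-199,999")` falls into the "high" level, while an open-ended answer such as `"> $500,000"` maps to "highest".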
Statistical Questions and Methods
Data science is a broad and interdisciplinary field. As a result, there is a lot of variation in the day-to-day responsibilities, workflows, skill sets, and salaries for jobs related to data science. The rough idea is to make some key comparisons among the following job titles: data analyst, data engineer, data scientist, machine learning engineer, software engineer, statistician. Preliminarily, these are the questions of interest regarding career paths, with proposed analytical methods in parentheses:
- What percentage of the survey respondents are working under these job titles? (EDA)
- What are the statistics on salaries for these job titles? (EDA, ANOVA, hypothesis testing)
- What levels of education are required for these job titles? (EDA)
- Is there a significant income gap between genders for these jobs? (EDA, hypothesis testing)
- What is the typical skill set for these jobs? How does it affect the pay rate? (regression, ANOVA)
- Is there a certain correlation between industry and the need for these jobs? (EDA, bootstrap, hypothesis testing)
Additionally, it is also worth investigating the requisite skillsets for each of the data science careers as guidance to our program cohort on which skills and technologies to focus the career preparation on. The inquiry includes but is not limited to the following questions:
- What programming languages and IDEs do they use? (hypothesis test on the two-way table)
- Where do they get and share the knowledge? (EDA)
- What data science packages do they use? (EDA)
- Are they using cloud computing tools? What are their preferences? Will a user’s preference for cloud computing platforms affect his or her preference for big data products/cloud storage products? (Hypothesis testing)
- What is the overall AWS usage percentage among DS practitioners? Is it the same for Google Cloud? (Law of Large Numbers, two proportions Z-test.)
The Analysis and Results
Since one large data science problem can be divided into several smaller parts, it is practical to combine the analysis and interpretation sections. The questions will be addressed and discussed one by one.
The following questions are about data science career paths.
Job Title Distributions
What percentage of the survey respondents are working under these job titles?
The percentages are shown in the legends of the following waffle plot.
Salaries
What are the statistics on salaries for these job titles?
The US federal minimum wage is $7.25 per hour; multiplying by 8 hours per day and a typical 260 workdays per year gives:
\[7.25 \frac{\$}{\text{hour}} \cdot 8 \frac{\text{hours}}{\text{day}} \cdot 260\frac{\text{days}}{\text{year}} = 15080.\]
The new categories* for salaries will be:
- poverty: below the federal minimum wage
- low: 15,000 to 49,999
- medium: 50,000 to 99,999
- high: 100,000 to 199,999
- very high: 200,000 to 499,999
- highest: >= 500,000
*loosely based on US federal income tax brackets.
A chi-squared test of independence can be used to test for statistical significance in the differences in categorical frequencies across salary categories for each job title. First, the test is performed on the data-oriented data science roles; see below:
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: q2_table.1
## X-squared = 180.52, df = NA, p-value = 9.999e-05
Qualitatively, each job title appears to have a different categorical distribution over the salary brackets. The chi-squared test with simulated p-values confirms that this difference is statistically significant, with a p-value of \(9.999\times10^{-5}\). At an alpha level of 0.05, the null hypothesis that the frequency of counts over salary brackets is independent of job title is rejected.
Because the original salary data is categorical rather than a numerical estimate, a two-sample t-test on the difference of mean salaries is not possible. But the visualization above reveals that for data scientists and data engineers, more people fall in the medium and high salary brackets than in the low bracket; the opposite is true for business analysts and data analysts. The visualization alone suggests that an “analyst” is considered a more junior role than an “engineer” or a “data scientist.” There are too few DBA/database engineers and statisticians represented in the dataset for conclusions to be drawn for those categories.
It is also worth noting that most of the highest earning (\(>\$500,000\)) people in the dataset are data scientists.
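R's `chisq.test(..., simulate.p.value = TRUE)` used above draws replicate tables under the null and counts how often a replicate's statistic exceeds the observed one. A minimal stdlib Python analogue is sketched below; note it resamples cells from the fitted independence model rather than holding the margins fixed as R does, so it only approximates R's procedure:

```python
import random

def chi2_stat(table):
    """Pearson X^2 for a contingency table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

def simulated_p_value(table, reps=2000, seed=42):
    """Monte Carlo p-value: draw tables with the same total under the
    fitted independence model and count how often the simulated X^2
    meets or exceeds the observed one (R reports (hits + 1) / (reps + 1))."""
    rng = random.Random(seed)
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(cols))]
    probs = [rows[i] * cols[j] / n ** 2 for i, j in cells]
    observed = chi2_stat(table)
    hits = 0
    for _ in range(reps):
        sim = [[0] * len(cols) for _ in rows]
        for i, j in rng.choices(cells, weights=probs, k=n):
            sim[i][j] += 1
        if chi2_stat(sim) >= observed:
            hits += 1
    return (hits + 1) / (reps + 1)
```

On a strongly dependent toy table such as `[[10, 20], [20, 10]]`, the simulated p-value comes out well below 0.05, mirroring the rejection seen in the survey data.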
Below, the analysis focuses on the job titles that have a focus on software and product management, such as machine learning engineer (MLE), software engineer (SWE), and product manager (PM).
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: q2_table.2
## X-squared = 39.402, df = NA, p-value = 0.0011
As can be seen above, more MLEs command the highest category of salary than any other job title, and the average MLE earns at or above the medium salary bracket. As for PMs, very few of them earn a low salary; in fact, most are in the medium to very high income brackets. The salary distribution for project managers is similar to that of PMs, except that significantly more professionals fall in the low income bracket. Last but not least, most SWEs command a medium to high salary, with a significant number of professionals earning outside this range on both the high end and the low end.
Required Education Levels
What levels of education are required for these job titles?
Aggregate Level
The percentages are shown in the legends of the following waffle plot.
##
## Master’s degree
## 0.466037736
## Bachelor’s degree
## 0.269433962
## Doctoral degree
## 0.164528302
## Some college/university study without earning a bachelor’s degree
## 0.064905660
## Professional doctorate
## 0.016981132
## I prefer not to answer
## 0.011320755
## No formal education past high school
## 0.006792453
At an aggregate level, it is evident that more than half of the data science community on Kaggle holds a degree above the undergraduate level. Combining the percentages of respondents with a master’s degree or a doctoral degree (including professional doctorates) shows that \(64.7\%\) of respondents hold a graduate degree. Given the programming skill requirements combined with the amount of requisite knowledge in statistics and machine learning, this is an expected phenomenon.
Most professional data scientists have a master’s degree or above:
The percentages are shown in the legends of the following waffle plot.
The same doesn’t necessarily go for software engineering:
The percentages are shown in the legends of the following waffle plot.
Gender Gap in Income
Is there a significant income gap between genders for these jobs?
What about at an aggregate level?
##
## poverty low medium high very high highest
## Man 125 108 369 584 187 28
## Nonbinary 1 5 5 5 0 1
## Prefer not to say 11 4 4 12 6 1
## Prefer to self-describe 1 1 0 1 0 0
## Woman 68 57 118 122 19 2
As seen above at an aggregate level, for salary levels high and above there are a total of 799 men but only 143 women. The gender gap is apparent.
What about groups aggregated by years of experience?
## [1] "< 1 years"
##
## poverty low medium high very high highest
## Man 13 20 46 17 4 0
## Nonbinary 0 1 0 0 0 0
## Prefer not to say 0 1 0 0 0 0
## Prefer to self-describe 0 1 0 0 0 0
## Woman 14 15 21 8 3 0
## [1] "10-20 years"
##
## poverty low medium high very high highest
## Man 10 7 34 142 49 11
## Nonbinary 0 0 0 1 0 0
## Prefer not to say 1 1 1 1 1 0
## Woman 6 5 6 24 7 1
## [1] "1-3 years"
##
## poverty low medium high very high highest
## Man 24 24 81 50 9 1
## Nonbinary 0 2 2 0 0 1
## Prefer not to say 3 1 1 0 0 0
## Prefer to self-describe 1 0 0 1 0 0
## Woman 16 10 29 13 0 0
## [1] "5-10 years"
##
## poverty low medium high very high highest
## Man 16 17 67 140 31 3
## Nonbinary 1 0 1 1 0 0
## Prefer not to say 1 1 0 6 2 0
## Woman 7 6 12 29 5 0
## [1] "20+ years"
##
## poverty low medium high very high highest
## Man 34 12 37 160 82 12
## Nonbinary 0 0 1 2 0 0
## Prefer not to say 5 0 1 3 3 1
## Woman 6 2 8 21 4 1
## [1] "3-5 years"
##
## poverty low medium high very high highest
## Man 17 18 76 60 9 1
## Nonbinary 0 1 1 1 0 0
## Prefer not to say 0 0 1 1 0 0
## Prefer to self-describe 0 0 0 0 0 0
## Woman 11 5 21 22 0 0
Skillset vs Income
What is the typical skill set for these jobs? How does it affect the pay rate?
Here a key skill is defined as a skill acquired by more than 10% of the people under a certain job title.
From the table, the huge salary variances make it very hard to tell whether a given skill increases salary or not.
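The key-skill definition above can be sketched as follows; the respondent data structure here is a hypothetical toy, not the report's actual data frame:

```python
from collections import Counter

def key_skills(skill_sets, threshold=0.10):
    """Return skills reported by more than `threshold` of respondents,
    where `skill_sets` holds one set of skills per respondent."""
    counts = Counter(skill for skills in skill_sets for skill in skills)
    n = len(skill_sets)
    return {skill for skill, c in counts.items() if c / n > threshold}

# Toy example: 10 respondents under one job title.
statisticians = [{"R"}] * 7 + [{"R", "SQL"}, {"Python"}, {"Python"}]
```

With this toy input, `R` (80%) and `Python` (20%) qualify as key skills, while `SQL` (exactly 10%, not above the threshold) does not.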
Correlation between Industry and Job
Is there a certain correlation between industry and the need for these jobs?
A `prop.table` version:
## # A tibble: 5 × 6
## Title Academics Computers Finance Internet Medical
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Data Analyst 0.4 0.105 0.258 0.105 0.214
## 2 Data Engineer 0.047 0.091 0.129 0.105 0.071
## 3 Data Scientist 0.353 0.338 0.387 0.386 0.459
## 4 Machine Learning Engineer 0.082 0.15 0.056 0.211 0.071
## 5 Software Engineer 0.118 0.317 0.169 0.193 0.184
Some hypotheses to examine with `prop.test`:
- \(H_a\): `Data Analyst` is more needed in the `Academics` industry.
  - vs `Computers`: X-squared = 38.1, p-value = 3.28e-10, 95% CI = (0.2, 1), H0 rejected.
  - vs `Finance`: X-squared = 4.1, p-value = 2.19e-02, 95% CI = (0.02, 1), H0 rejected.
  - vs `Internet`: X-squared = 13.2, p-value = 1.38e-04, 95% CI = (0.17, 1), H0 rejected.
  - vs `Medical`: X-squared = 6.6, p-value = 5.07e-03, 95% CI = (0.06, 1), H0 rejected.
  - TRUE, `Data Analyst` is more needed in the `Academics` industry.
- \(H_a\): `Data Scientist` is more needed in the `Medical` industry.
  - vs `Academics`: X-squared = 1.7, p-value = 9.56e-02, 95% CI = (-0.02, 1), can’t reject H0.
  - vs `Computers`: X-squared = 4.1, p-value = 2.14e-02, 95% CI = (0.02, 1), H0 rejected.
  - vs `Finance`: X-squared = 0.9, p-value = 1.73e-01, 95% CI = (-0.05, 1), can’t reject H0.
  - vs `Internet`: X-squared = 0.5, p-value = 2.36e-01, 95% CI = (-0.08, 1), can’t reject H0.
  - FALSE, `Data Scientist` has similar demand across these industries.
- \(H_a\): `Software Engineer` is more needed in the `Computers` industry.
  - vs `Academics`: X-squared = 12.2, p-value = 2.39e-04, 95% CI = (0.12, 1), H0 rejected.
  - vs `Finance`: X-squared = 8.8, p-value = 1.51e-03, 95% CI = (0.07, 1), H0 rejected.
  - vs `Internet`: X-squared = 2.9, p-value = 4.32e-02, 95% CI = (0.02, 1), H0 rejected.
  - vs `Medical`: X-squared = 5.8, p-value = 8.17e-03, 95% CI = (0.05, 1), H0 rejected.
  - TRUE, `Software Engineer` is more needed in the `Computers` industry.
- \(H_a\): `Machine Learning Engineer` is more needed in the `Internet` industry.
  - vs `Academics`: X-squared = 2, p-value = 7.77e-02, 95% CI = (0, 1), can’t reject H0.
  - vs `Computers`: X-squared = 3.3, p-value = 9.66e-01, 95% CI = (-0.14, 1), can’t reject H0.
  - vs `Finance`: X-squared = 6.2, p-value = 6.32e-03, 95% CI = (0.04, 1), H0 rejected.
  - vs `Medical`: X-squared = 3.3, p-value = 3.44e-02, 95% CI = (0.02, 1), H0 rejected.
  - FALSE, the demand for `Machine Learning Engineer` in the `Internet` industry is not significantly greater than in `Academics` and `Computers`.
- \(H_a\): Overall, `Data Scientist` is more needed than other jobs.
  - vs `Data Analyst`: X-squared = 53, p-value = 1.66e-13, 95% CI = (0.14, 1), H0 rejected.
  - vs `Data Engineer`: X-squared = 143.1, p-value = 2.74e-33, 95% CI = (0.24, 1), H0 rejected.
  - vs `Software Engineer`: X-squared = 29.5, p-value = 2.76e-08, 95% CI = (0.1, 1), H0 rejected.
  - vs `ML Engineer`: X-squared = 113.3, p-value = 9.36e-27, 95% CI = (0.22, 1), H0 rejected.
  - TRUE, `Data Scientist` is more needed than the other 4 jobs.
Salaries for these jobs vary widely from industry to industry.
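Each comparison above is a one-sided two-proportion test. A bare-bones stdlib sketch is shown below; it uses the pooled standard error and no continuity correction, so its p-values will differ slightly from R's `prop.test` output:

```python
import math

def two_prop_ztest_greater(x1, n1, x2, n2):
    """Test H_a: p1 > p2 with the pooled two-proportion z statistic;
    the upper-tail normal p-value uses sf(z) = erfc(z / sqrt(2)) / 2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 0.5 * math.erfc(z / math.sqrt(2))
```

For illustrative counts like 40/100 vs 20/100, the test rejects H0 at any conventional level, while equal sample proportions give z = 0 and a p-value of 0.5.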
Languages and IDEs
What programming languages and IDEs do they use?
Survey questions Q7 (daily-used programming language), Q9 (IDE).
In the heatmap of programming languages below, some trends can be found:
- `Python`, `R`, and `SQL` are the three most popular languages in the Kaggle community.
- `Statisticians` use `R` more than `Python` and `SQL`.
- `C++`, `Java`, and `Javascript` are mainly used by `Software Engineers`.
- `Julia` and `Swift` are barely used by Kaggle users.
In the heatmap of IDEs below, some trends can be found:
- `Jupyter Notebook` is the most popular IDE among Kaggle users.
- `MATLAB` is barely used in industry (the last 5 jobs).
- `Statisticians` use `RStudio` the most.
- `VSCode` is overall the second most commonly used IDE.
From the two plots above, as Python has become the most popular programming language among Kaggle users, Jupyter Notebook (a Python IDE) has also become the most commonly used IDE. Is there any correlation between them?
##
## Jupyter Python
## Data Analyst 167 186
## Data Engineer 62 80
## Data Scientist 357 407
## Machine Learning Engineer 85 100
## Software Engineer 134 179
## Statistician 9 19
## Student 292 396
\(H_a\): The distribution of Python is significantly different from the distribution of Jupyter Notebook.
All job titles: X-squared = 6.1, p-value = 0.4, can’t reject H0.
Without `Statistician`: X-squared = 4.3, p-value = 0.51, can’t reject H0.
Without `Statistician` and `Student`: X-squared = 2, p-value = 0.73, can’t reject H0.
FALSE. There is no significant difference between the distribution of Python and the distribution of Jupyter Notebook across jobs. In other words, most people who use Python also use Jupyter Notebook as their IDE.
Learning Sources
Where do they get and share the knowledge?
Survey questions Q39 (share and deploy), Q40 (learning resources), Q42 (Media sources).
In the heatmap of learning platforms below, some trends can be found:
- `Coursera` is the most popular learning platform outside the university.
- `Statisticians` learn mainly from `University` courses.
In the heatmap of sharing platforms below, some trends can be found:
- `GitHub` is the most popular sharing platform.
- `Kaggle`, `Colab`, and `Personal Blog` are used by a small proportion of Kaggle users.
- Other platforms listed on the questionnaire are barely used by Kaggle users.
- A large proportion of people don’t share their work publicly.
- No `Student` answered this question.
In the heatmap of media sources below, some trends can be found:
- `Blog`, `Kaggle`, and `YouTube` are the three most popular media sources.
- Many `Statisticians` use `Journal Publications` as their media sources.
Special Job Title: Statistician
From all the analysis above, `Statistician` shows many preferences that differ from other jobs. For example:
- Only a few people identify themselves as `Statistician`.
- They mostly work in the `Academic` industry.
- They use `R` more frequently, and they also use `RStudio` a lot.
- They learn most of their knowledge from `University Courses`.
- Many of them use `Journal Publications` to get reports on data science topics.

Based on these features, `Statistician` is very likely a job title held mainly by professors and researchers in academic institutions (i.e., universities).
Popular Packages
As for popular packages, one can easily name examples that are no strangers to data science practitioners. But the real question is: what can be learned from this? Some conditional probabilities will highlight the findings.
The data is a subset of the survey answers to Q14 and Q16, which focus on visualization libraries and machine-learning-related libraries respectively. A set of niche job titles is excluded due to the limited number of responses, such as “DBA/Database Engineer,” “Currently not employed,” and “Statistician.” The remaining valid responses are visualized as grouped barplots showing the count of respondents familiar with each package.
Additionally, since the number of machine learning libraries covered in the survey exceeds a reasonable number of meaningful palette colors, only the top 10 libraries are kept in the visualization.
There are some obvious results worth mentioning. But before that, it is important to reiterate that the comparison is based on the calculation of some conditional probabilities according to Bayes’ Theorem.
\[P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}\]
where \(A\) and \(B\) are events.
## value
## Q5 Altair Bokeh D3 js Geoplotlib Ggplot / ggplot2
## Business Analyst 3 6 8 7 29
## Data Analyst 2 16 13 18 104
## Data Engineer 2 5 14 8 25
## Data Scientist 15 45 26 19 183
## Machine Learning Engineer 4 7 7 6 17
## Program/Project Manager 0 2 11 7 33
## Research Scientist 2 10 6 5 67
## Software Engineer 4 13 30 6 35
## Student 1 11 17 20 129
## Sum 33 115 132 96 622
## value
## Q5 Leaflet / Folium Matplotlib Plotly / Plotly Express
## Business Analyst 4 56 32
## Data Analyst 12 136 72
## Data Engineer 6 57 28
## Data Scientist 31 325 176
## Machine Learning Engineer 3 87 39
## Program/Project Manager 7 48 18
## Research Scientist 6 113 39
## Software Engineer 6 139 33
## Student 5 286 88
## Sum 80 1247 525
## value
## Q5 Seaborn Shiny Sum
## Business Analyst 39 13 197
## Data Analyst 106 44 523
## Data Engineer 39 9 193
## Data Scientist 270 75 1165
## Machine Learning Engineer 66 5 241
## Program/Project Manager 34 8 168
## Research Scientist 68 25 341
## Software Engineer 83 7 356
## Student 156 26 739
## Sum 861 212 3923
## value
## Q5 Caret Huggingface Keras LightGBM Prophet PyTorch
## Business Analyst 7 3 21 5 6 18
## Data Analyst 28 1 30 11 9 27
## Data Engineer 1 4 24 7 3 27
## Data Scientist 60 55 166 99 56 133
## Machine Learning Engineer 3 29 61 23 7 58
## Program/Project Manager 5 2 17 3 1 18
## Research Scientist 13 15 50 12 2 60
## Software Engineer 6 9 62 12 5 77
## Student 25 14 118 15 4 109
## Sum 148 132 549 187 93 527
## value
## Q5 Scikit-learn TensorFlow Tidymodels Xgboost Sum
## Business Analyst 50 21 9 18 158
## Data Analyst 114 44 25 51 340
## Data Engineer 52 31 6 14 169
## Data Scientist 345 186 35 229 1364
## Machine Learning Engineer 81 69 1 43 375
## Program/Project Manager 44 17 6 17 130
## Research Scientist 97 58 11 39 357
## Software Engineer 119 90 2 31 413
## Student 223 150 10 64 732
## Sum 1125 666 105 506 4038
- Data Scientists and Students are the two major respondent groups for the package preference/familiarity questions.
\[ \begin{split} P(\text{Student}|\text{Visualization}) &= \frac{739}{3923} \approx 0.1884\\ P(\text{Data Scientist}|\text{Visualization}) &= \frac{1165}{3923} \approx 0.2970 \\ P(\text{Student}|\text{Machine Learning}) &=\frac{732}{4038} \approx 0.1813\\ P(\text{Data Scientist}|\text{Machine Learning}) &=\frac{1364}{4038} \approx 0.3378 \end{split} \]
- Python-related packages, or packages with Python support, are overall more popular among the responses.
  - For example, `Altair`, `bokeh`, `geoplotlib`, `leaflet/folium`, `matplotlib`, `plotly/plotly express`, and `seaborn` are categorized as Python-wise, while `D3.js`, `ggplot2`, `leaflet/folium`, `plotly/plotly express`, and `shiny` are categorized as R-wise. Note that there is some overlap among these packages, since several have multiple-language support.
  - Similarly, `Huggingface`, `Keras`, `LightGBM`, `Prophet`, `PyTorch`, `scikit-learn`, `TensorFlow`, and `Xgboost` are Python-compatible, while `Caret`, `Keras`, `LightGBM`, `Prophet`, `TensorFlow`, `tidymodels`, and `Xgboost` are usable in R.
- For example,
\[ \begin{split} P(\text{Python}|\text{Visualization}) &= (33+115+96+80+1247+525+861)/3923 \\ &\approx 0.7538\\ P(\text{R}|\text{Visualization}) &= (132+622+80+525+212)/3923 \\&\approx 0.4005\\ P(\text{Python}|\text{Machine Learning}) &= (132+549+187+93+527+1125+666+506)/4038 \\&\approx 0.9373\\ P(\text{R}|\text{Machine Learning}) &= (148+549+187+93+666+105+506)/4038 \\&\approx 0.5582\\ \end{split} \]
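The conditional probabilities above are straight ratios of marginal sums from the two response-count tables; as a quick arithmetic check:

```python
# Grand totals of the visualization and machine learning tables above.
viz_total, ml_total = 3923, 4038

p_student_viz = 739 / viz_total   # P(Student | Visualization)
p_ds_viz = 1165 / viz_total       # P(Data Scientist | Visualization)
p_student_ml = 732 / ml_total     # P(Student | Machine Learning)
p_ds_ml = 1364 / ml_total         # P(Data Scientist | Machine Learning)
```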
There are more to be discovered depending on the perspectives.
The Loyalty to AWS
Cloud computing has clearly been the general trend in data science over the past couple of years, and it can be expected to remain prevalent in the years to come, especially as the concept of the “metaverse” brews new possibilities that require cloud computation as their foundation. The market is dominated by Amazon, Google, and Microsoft.
Specifically, will a user’s preference for a cloud computing platform affect his or her preference for other tools? For example, a dedicated AWS EC2 user may well prefer AWS S3 over other products.
A list of AWS services is selected for this hypothesis, including EC2, S3, EFS, SageMaker, RedShift, Aurora, RDS, and DynamoDB. The data covers the following survey questions:
- Survey question Q29-A: computing products (Part_1)
- Survey question Q30: Storage (Part_3, Part_4)
- Survey question Q31-A: ML products (Part_1)
The hypotheses to be tested are:
- \(H_0:\) Use of EC2 is independent of use of S3 (S3 could be replaced by any other AWS service).
- \(H_a:\) Use of EC2 is not independent of use of S3.
The hypothesis is tested by checking the \(\chi^2\) as the statistic.
\[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\] The following three tests are conducted between:
- EC2 vs S3
- EC2 vs EFS
- EC2 vs SageMaker
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$s3
## X-squared = 1532, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$efs
## X-squared = 637.99, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: aws_user$ec2 and aws_user$sagemaker
## X-squared = 566.34, df = 1, p-value < 2.2e-16
Since all three p-values are far smaller than the common critical value of 0.05, the null hypothesis is rejected in each case: EC2 usage is not independent of usage of any of the three AWS services tested (S3, EFS, and SageMaker).
A possible explanation is that EC2 is the core service of AWS, and many other AWS services rely on running an elastic computing instance. This implies that a data science practitioner who frequently uses EC2 is more likely to use other AWS services as well.
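For a 2×2 table like the EC2/S3 cross-tabulation, the test above (Pearson chi-squared with Yates' continuity correction, df = 1) can be reproduced with the standard library alone; the table values in the test below are illustrative, not the survey's actual counts:

```python
import math

def chi2_yates(a, b, c, d):
    """Pearson chi-squared test for a 2x2 table [[a, b], [c, d]] with
    Yates' continuity correction (df = 1); the p-value uses the
    identity chi2_sf(x, df=1) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    x2 = n * (abs(a * d - b * c) - n / 2) ** 2 / (row1 * row2 * col1 * col2)
    return x2, math.erfc(math.sqrt(x2 / 2))
```

The df = 1 shortcut for the p-value works here because a chi-squared variable with one degree of freedom is a squared standard normal.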
Cloud Computing Platforms Preferences
Another related question is to compare the market shares of the top 3 cloud computing providers. In the survey, market share is approximated by the usage rate among survey respondents.
Again, two hypotheses are given:
- \(H_0: p_A=p_B\)
- \(H_a:p_A\not=p_B\)
where \(A\) and \(B\) can be replaced by AWS, Azure, or GCP, and \(n_A\), \(n_B\) are sample size of group \(A\) and \(B\) respectively.
The test statistic (z-statistic) can be calculated as follows, where \(p\) is the pooled proportion of the two groups:
\[z=\frac{p_A-p_B}{\sqrt{p(1-p)/n_A+p(1-p)/n_B}}\]
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 81.285, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.07606175 0.11865523
## sample estimates:
## prop 1 prop 2
## 0.2377358 0.1403774
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 1.8823, df = 1, p-value = 0.1701
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.005495458 0.031910552
## sample estimates:
## prop 1 prop 2
## 0.1403774 0.1271698
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 107.85, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.1315249 -0.0896072
## sample estimates:
## prop 1 prop 2
## 0.1271698 0.2377358
The results show that only the p-value of the second test is greater than the common critical value of 0.05. That means:
- Reject \(H_0\) that the usage rates of AWS and Microsoft Azure are the same.
- Fail to reject \(H_0\) that the usage rates of Microsoft Azure and Google Cloud Platform are the same.
- Reject \(H_0\) that the usage rates of Google Cloud Platform and AWS are the same.
Therefore, to simplify: AWS is leading the market by a wide margin, while Microsoft Azure and Google Cloud Platform have roughly the same usage rate.
Another way to interpret the results is through the 95% confidence interval of each test. For instance, the test between AWS usage and Azure usage gives a confidence interval of \((0.076, 0.119)\), so we are 95% confident that the true difference in usage rates lies within that interval.
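That interval can be approximated by hand with the unpooled standard error (no continuity correction, hence the slight difference from R's `prop.test` output); the counts below back out the reported sample proportions from the 2,650 respondents and are therefore approximate:

```python
import math

def diff_prop_ci(x1, n1, x2, n2, z=1.96):
    """Approximate 95% CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# AWS vs. Azure: roughly 0.2377 * 2650 = 630 users vs 0.1404 * 2650 = 372.
lo, hi = diff_prop_ci(630, 2650, 372, 2650)
```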
Linear Regression on Salary
Honestly speaking, the best way to approach this question would be a logistic regression on the dependent variable `salary`, due to its categorical nature. However, in order to experiment with a linear model, we take the bold step of transforming `salary` back into a continuous variable by randomly sampling values within each pay level. Three typical languages used by data science practitioners are included as predictors: Python, R, and SQL.

A quick glimpse at the response uncovers the fact that `salary` is not normally distributed; it is right-skewed with a long tail. A log-transformation of `salary` will not fully fix this issue, but it does make fitting the linear model easier.
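The effect of the log transform can be illustrated with simulated right-skewed salaries; the lognormal draws below are an assumption standing in for the reconstructed salary column, not the survey data:

```python
import math
import random

def skewness(xs):
    """Sample moment skewness: mean of ((x - mean) / sd) ** 3."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / sd) ** 3 for x in xs) / n

rng = random.Random(0)
salaries = [rng.lognormvariate(11, 0.8) for _ in range(5000)]
raw_skew = skewness(salaries)                          # strongly positive
log_skew = skewness([math.log(s) for s in salaries])   # near zero
```

Taking logs turns the lognormal draws into (approximately) normal ones, which is exactly why the report fits `lm()` to `log(salary)` rather than `salary`.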
Afterwards, the function `MASS::stepAIC()` is used to perform a stepwise regression in order to find suitable predictor variables. Note that the selection criterion is AIC, and the procedure starts from the maximal model with all predictors plus some manually added interactions, including:
- Interactions between languages: `sql:r`, `sql:python`, `r:python`, `python:r:sql`.
- Interactions between jobs and languages: `job:sql`, `job:r`, `job:python`.
- Interactions between jobs and other variables: `job:experience`, `job:education`.
Finally, the automatic procedure of model fitting returns the following model as the result:
\[\log(\text{salary})=\beta_0+\beta_1 \text{age_group} +\beta_2\text{gender}+ \beta_3\text{job}+\beta_4\text{experience}\\ +\beta_5\text{python}+\beta_6\text{r}+\beta_7\text{sql}\\ +\beta_8\text{python}\cdot\text{sql} + \beta_9\text{python}\cdot\text{r} + \beta_{10}\text{job}\cdot\text{python} + \epsilon \]
## Analysis of Variance Table
##
## Response: log(salary)
## Df Sum Sq Mean Sq F value Pr(>F)
## age_group 10 103.17 10.3170 14.3705 < 2.2e-16 ***
## gender 4 51.48 12.8702 17.9269 1.913e-14 ***
## job 12 63.62 5.3016 7.3845 2.108e-13 ***
## experience 6 55.41 9.2357 12.8644 2.811e-14 ***
## python 1 0.04 0.0448 0.0625 0.80268
## r 1 0.05 0.0525 0.0731 0.78695
## sql 1 0.04 0.0379 0.0528 0.81830
## python:sql 1 4.76 4.7620 6.6329 0.01009 *
## python:r 1 2.00 2.0006 2.7866 0.09523 .
## job:python 12 17.89 1.4904 2.0760 0.01575 *
## Residuals 1795 1288.68 0.7179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = log(salary) ~ age_group + gender + job + experience +
## python + r + sql + python:sql + python:r + job:python, data = lm_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2002 -0.2588 0.0902 0.4265 3.2106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.530416 0.267915 39.305 < 2e-16
## age_group22-24 0.375789 0.208249 1.805 0.071319
## age_group25-29 0.667577 0.198832 3.357 0.000803
## age_group30-34 0.910161 0.198496 4.585 4.85e-06
## age_group35-39 0.932372 0.199699 4.669 3.25e-06
## age_group40-44 0.858800 0.203040 4.230 2.46e-05
## age_group45-49 0.876625 0.204626 4.284 1.93e-05
## age_group50-54 1.005792 0.205810 4.887 1.11e-06
## age_group55-59 0.792527 0.210650 3.762 0.000174
## age_group60-69 0.725414 0.208899 3.473 0.000528
## age_group70+ 0.400476 0.239091 1.675 0.094109
## genderNonbinary -0.528104 0.213639 -2.472 0.013530
## genderPrefer not to say -0.326387 0.141802 -2.302 0.021466
## genderPrefer to self-describe -0.491400 0.493495 -0.996 0.319504
## genderWoman -0.281689 0.050747 -5.551 3.27e-08
## jobData Analyst -0.056646 0.186378 -0.304 0.761218
## jobData Engineer -0.430005 0.411058 -1.046 0.295659
## jobData Scientist -0.011597 0.222771 -0.052 0.958487
## jobDBA/Database Engineer 0.245765 0.311723 0.788 0.430562
## jobDeveloper Relations/Advocacy 1.933314 0.630497 3.066 0.002199
## jobMachine Learning Engineer -0.552834 0.384643 -1.437 0.150817
## jobOther 0.073818 0.168189 0.439 0.660786
## jobProduct Manager 0.299793 0.288356 1.040 0.298637
## jobProgram/Project Manager 0.193235 0.217665 0.888 0.374786
## jobResearch Scientist 0.110558 0.220402 0.502 0.615996
## jobSoftware Engineer 0.013011 0.208370 0.062 0.950217
## jobStatistician -0.595571 0.289542 -2.057 0.039836
## experience1-3 years 0.091679 0.086024 1.066 0.286690
## experience10-20 years 0.535813 0.089894 5.961 3.02e-09
## experience20+ years 0.557339 0.092407 6.031 1.97e-09
## experience3-5 years 0.270869 0.088927 3.046 0.002353
## experience5-10 years 0.405516 0.085871 4.722 2.51e-06
## experienceI have never written code -0.057973 0.128541 -0.451 0.652039
## python -0.085435 0.209366 -0.408 0.683274
## r 0.163227 0.117684 1.387 0.165616
## sql -0.202115 0.108713 -1.859 0.063167
## python:sql 0.258247 0.119399 2.163 0.030682
## python:r -0.191537 0.127757 -1.499 0.133990
## jobData Analyst:python 0.042105 0.224697 0.187 0.851379
## jobData Engineer:python 0.494905 0.436385 1.134 0.256904
## jobData Scientist:python 0.246436 0.249720 0.987 0.323849
## jobDBA/Database Engineer:python -0.757434 0.383077 -1.977 0.048168
## jobDeveloper Relations/Advocacy:python -1.425580 0.746101 -1.911 0.056202
## jobMachine Learning Engineer:python 0.827107 0.408757 2.023 0.043173
## jobOther:python 0.024478 0.207962 0.118 0.906315
## jobProduct Manager:python 0.312287 0.335536 0.931 0.352128
## jobProgram/Project Manager:python 0.087848 0.263783 0.333 0.739149
## jobResearch Scientist:python -0.306212 0.256948 -1.192 0.233525
## jobSoftware Engineer:python 0.003902 0.242807 0.016 0.987181
## jobStatistician:python 0.448505 0.374177 1.199 0.230825
##
## (Intercept) ***
## age_group22-24 .
## age_group25-29 ***
## age_group30-34 ***
## age_group35-39 ***
## age_group40-44 ***
## age_group45-49 ***
## age_group50-54 ***
## age_group55-59 ***
## age_group60-69 ***
## age_group70+ .
## genderNonbinary *
## genderPrefer not to say *
## genderPrefer to self-describe
## genderWoman ***
## jobData Analyst
## jobData Engineer
## jobData Scientist
## jobDBA/Database Engineer
## jobDeveloper Relations/Advocacy **
## jobMachine Learning Engineer
## jobOther
## jobProduct Manager
## jobProgram/Project Manager
## jobResearch Scientist
## jobSoftware Engineer
## jobStatistician *
## experience1-3 years
## experience10-20 years ***
## experience20+ years ***
## experience3-5 years **
## experience5-10 years ***
## experienceI have never written code
## python
## r
## sql .
## python:sql *
## python:r
## jobData Analyst:python
## jobData Engineer:python
## jobData Scientist:python
## jobDBA/Database Engineer:python *
## jobDeveloper Relations/Advocacy:python .
## jobMachine Learning Engineer:python *
## jobOther:python
## jobProduct Manager:python
## jobProgram/Project Manager:python
## jobResearch Scientist:python
## jobSoftware Engineer:python
## jobStatistician:python
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8473 on 1795 degrees of freedom
## Multiple R-squared: 0.1881, Adjusted R-squared: 0.1659
## F-statistic: 8.484 on 49 and 1795 DF, p-value: < 2.2e-16
By looking at both the `summary()` table and the `anova()` table, the following takeaways can be drawn:

- The predictor variables `age_group`, `gender`, `job`, and `experience` are significant, while the three languages `python`, `r`, and `sql` are not significant on their own.
- However, the interaction terms `python:sql`, `python:r`, and `job:python` do play a role in predicting the response, `log(salary)`.
- Generally speaking, `age_group` shows that new graduates (22-24) and respondents aged 70+ tend to have lower salaries.
- Compared with the baseline `gender`, which is `Male`, the other gender options are associated with lower salaries.
- Years of `experience` and `salary` are positively related.
- Some job titles with `python` as a skillset relate significantly to `salary` changes, although it remains puzzling that the combination `Developer Relations/Advocacy:python` has a sizable negative effect on `salary`.

In general, this linear model is built on many compromising assumptions, so its overall power to explain the variation is not ideal. Nevertheless, it still reveals some insights about the data.
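To make the coefficient sizes above more tangible, a log-scale estimate can be converted into an approximate percentage change in salary by exponentiating it. The snippet below is an illustrative back-of-the-envelope check (not taken from a stored model object), using the `experience10-20 years` estimate printed in the summary table; the comparison group is assumed to be the omitted baseline experience level.

```r
# Illustration: turn a log(salary) coefficient into a percentage effect.
# 0.535813 is the "experience10-20 years" estimate from the summary above.
beta <- 0.535813
pct_effect <- (exp(beta) - 1) * 100  # percentage change relative to baseline
round(pct_effect, 1)                 # about 71% higher salary than the baseline
```

The same transformation applies to any coefficient in the table, since the response is log-transformed.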
Conclusions
Note that since this project covers various aspects of data science, including background information and tool preferences, the conclusions cannot be condensed into one sentence. There are still many takeaways:
- Similar job titles do not imply the same salary expectations. By salary standards, data scientist is a more senior role than data analyst or data engineer.
- Women are severely underrepresented in the highest-earning DS jobs.
- Graduate degree holders (MS, PhD, JD, MD) form the majority of the DS community. The only exceptions by job title are data engineers and software engineers.
- Learn Python, R, and SQL; industry favors Python over R, but R is quite popular among academics.
- Take courses on Coursera, Udacity, and Udemy, as well as at a university.
- Share your work on GitHub.
- Blogs, Kaggle, and YouTube are good sources of data science information.
- Both Python and R offer an extensive set of useful packages for visualization and machine learning.
- Pick a cloud computing platform and learn the products relevant to your interests; AWS is highly recommended.
Limitations
Admittedly, Kaggle is a popular data science learning community with a focus on machine learning competitions, so it is doubtful whether its user group is representative enough to reflect more than a partial status quo of data science practitioners in industry. In addition, one should keep in mind that not all data science professionals are keen on participating in machine learning competitions while working "nine to five".
Nevertheless, the survey still manages to give the audience a snapshot of what a data science practitioner's career looks like, especially in a post-pandemic era. Hopefully, it can provide some guidance to graduate students in the States who are pursuing data science-related careers.
Appendix: R scripts
# invalidate cache when the package version changes
knitr::opts_chunk$set(cache = TRUE,
                      echo = FALSE,
                      eval = TRUE,
                      tidy = FALSE,
                      warning = FALSE,
                      cache.extra = packageVersion('tidyverse'))
options(htmltools.dir.version = FALSE)
if (!require("pacman")) install.packages("pacman")
if (!require("waffle")) install.packages("waffle", repos = "https://cinc.rud.is")
pacman::p_load(tidyverse, ggthemes, latex2exp, glue,
               hrbrthemes, plotly, stringr, DT, extrafont,
               tidymodels, rmdformats, waffle, pals)
set.seed(511)
ggplot2::theme_set(
  theme_fivethirtyeight() +
    theme(
      text = element_text(family = "Roboto Condensed"),
      title = element_text(size = 14),
      plot.subtitle = element_text(size = 12),
      plot.caption = element_text(size = 10),
      axis.title = element_text(size = 14),
      axis.text = element_text(size = 12),
      panel.grid.minor.x = element_blank()
    )
)
col_names <- names(read_csv(
  "data/kaggle_survey_2021_responses.csv",
  n_max = 0))
dat <- read_csv(
  "data/kaggle_survey_2021_responses.csv",
  col_names = col_names, skip = 2)
dat <- dat %>%
  filter(Q3 == "United States of America")
job.dat <- dat %>%
  filter(Q5 %in% c("Data Analyst",
                   "Data Engineer",
                   "Data Scientist",
                   "Machine Learning Engineer",
                   "Software Engineer",
                   "Statistician",
                   "Student")) %>%
  mutate(Q25 = str_remove_all(Q25, "[$,]")) %>%
  mutate(Q25 = str_replace(Q25, ">1000000", "1000000-2000000")) %>%
  separate(Q25, into = c("salary_lb", "salary_ub"), sep = "-") %>%
  mutate(salary_lb = as.numeric(salary_lb)) %>%
  mutate(salary_ub = as.numeric(salary_ub))
# Q1
jtitle <- sort(table(dat$Q5), decreasing = T) %>%
  as.data.frame() %>%
  as.tibble()
jtitle <- rename(jtitle, `Job Title` = Var1)

x <- jtitle$Freq/25
val_names <- sprintf("%s (%s)", jtitle$`Job Title`,
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=15)) +
  labs(
    title = "Job Title Distributions",
    subtitle = "US respondents in Kaggle's annual survey",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))
# Q2
# To change salary categories into FACTOR dtype with descending labels
poverty = c("$0-999", "1,000-1,999", "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499", "7,500-9,999",
            "10,000-14,999")
low = c("15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999", "40,000-49,999")
medium = c("50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999")
high = c("100,000-124,999", "125,000-149,999", "150,000-199,999")
very_high = c("200,000-249,999", "250,000-299,999", "300,000-499,999")
highest = c("$500,000-999,999", ">$1,000,000")

dat$Q25[dat$Q25 %in% poverty] <- "poverty"
dat$Q25[dat$Q25 %in% low] <- "low"
dat$Q25[dat$Q25 %in% medium] <- "medium"
dat$Q25[dat$Q25 %in% high] <- "high"
dat$Q25[dat$Q25 %in% very_high] <- "very high"
dat$Q25[dat$Q25 %in% highest] <- "highest"
dat$Q25 = factor(dat$Q25, levels = c("poverty", "low", "medium", "high", "very high", "highest"), ordered = T)

data_side <- c("Data Scientist", "Data Analyst", "Business Analyst", "Data Engineer", "Statistician", "DBA/Database Engineer")
swe_side <- c("Software Engineer", "Machine Learning Engineer", "Program/Project Manager", "Product Manager")
academia <- c("Student", "Other", "Research Scientist")
dat[dat$Q5 %in% data_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot(aes(x=Q5, y=Q25, color = Q5)) +
  geom_count() +
  ggtitle("Two-Way Salary Visualizations: Data-Oriented Jobs") +
  xlab("") +
  labs(
    y = "salary",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())
dat[dat$Q5 %in% swe_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot(aes(x=Q5, y=Q25, color = Q5)) +
  geom_count() +
  ggtitle("Two-Way Salary Visualizations: Engineering-Oriented Jobs") +
  xlab("") +
  labs(
    y = "salary",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())
dat[dat$Q5 %in% academia & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot(aes(x=Q5, y=Q25, color = Q5)) +
  geom_count() +
  ggtitle("Two-Way Salary Visualizations: Academic Jobs and Others") +
  xlab("") +
  labs(
    y = "salary",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank())
expectedcounts <- function(A){
  r <- rowSums(A)
  c <- colSums(A)
  N = sum(A)
  expected <- outer(r, c)/N
  return(expected)
}
q2_table.1 = table(dat$Q25[dat$Q5 %in% data_side], dat$Q5[dat$Q5 %in% data_side])
q2_table.df = q2_table.1 %>% as.data.frame() %>% as.tibble()
q2_expectedcounts <- expectedcounts(q2_table.1) %>% as.data.frame() %>% as.tibble()

ggplot(q2_table.df, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Histogram of Salary",
    subtitle = "data-oriented data science roles",
    caption = glue("Author: celeritasML
                    Source: Kaggle"))
(chisq.test(q2_table.1, simulate.p.value = T, B = 10000))
q2_table.2 = table(dat$Q25[dat$Q5 %in% swe_side], dat$Q5[dat$Q5 %in% swe_side])
q2_table.df2 = q2_table.2 %>% as.data.frame() %>% as.tibble()
q2_expectedcounts2 <- expectedcounts(q2_table.2) %>% as.data.frame() %>% as.tibble()

ggplot(q2_table.df2, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Histogram of Salary",
    subtitle = "software and product management related data science roles",
    caption = glue("Author: celeritasML
                    Source: Kaggle"))
(chisq.test(q2_table.2, simulate.p.value = T, B = 10000))
# Q3
edu <- sort(table(dat$Q4), decreasing = T) %>%
  as.data.frame() %>%
  as.tibble()
edu <- rename(edu, `Degree` = Var1)

x <- edu$Freq/18
val_names <- sprintf("%s (%s)", edu$Degree,
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=7)) +
  labs(
    title = "Waffle Chart of Education Level",
    subtitle = "US respondents in Kaggle's annual survey",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))
Q3_temp <- sort(table(dat$Q4), decreasing = T)
(Q3_temp <- Q3_temp / sum(Q3_temp))
edu.1 <- sort(table(dat[dat$Q5 == "Data Scientist",]$Q4), decreasing = T) %>%
  as.data.frame() %>%
  as.tibble()
edu.1 <- rename(edu.1, `Degree` = Var1)

x <- edu.1$Freq/5
val_names <- sprintf("%s (%s)", edu.1$Degree,  # label with the subset's degrees, not the full table
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=7)) +
  labs(
    title = "Waffle Chart of Education Level",
    subtitle = "Subset of Data Scientists",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))
EDU <- table(dat$Q5, dat$Q4) %>%
  as.data.frame() %>%
  as.tibble()
EDU <- rename(EDU, `Job Title` = Var1)
EDU <- rename(EDU, `Degree` = Var2)
print_pie_chart <- function(job){
  edu <- sort(table(dat[dat$Q5 == job,]$Q4), decreasing = T) %>%
    as.data.frame() %>%
    as.tibble()
  edu <- rename(edu, `Degree` = Var1)

  x <- edu$Freq/ceiling(min(edu$Freq))
  val_names <- sprintf("%s (%s)", edu$Degree,
                       scales::percent(round(x/sum(x), 3)))
  names(x) <- val_names
  p <- waffle(x, rows = 5, color = pals::parula(n=length(unique(edu$Degree)))) +
    labs(
      title = "Waffle Chart of Education Level",
      subtitle = glue("Subset of ", job),
      caption = glue("Author: celeritasML
                      Source: Kaggle")) +
    theme_ipsum_rc(grid="") +
    theme_enhance_waffle() +
    theme(legend.position = "bottom",
          legend.text = element_text(size=6))
  print(p)
}
jobs <- unique(dat$Q5)

print_pie_chart("Software Engineer")
print_pie_chart("Data Engineer")
# Q5
table(dat$Q2, dat$Q25)
for (agegroup in unique(dat$Q6)){
if (agegroup != "I have never written code"){
print(agegroup)
print(table(dat$Q2[dat$Q6 == agegroup], dat$Q25[dat$Q6 == agegroup]))
}
}
# Q5
skill.set <- job.dat %>%
  filter(Q5 != "Other") %>%
  select(c(Q5, starts_with("Q7_"), starts_with("Q9_"),
           starts_with("Q12_"), starts_with("Q14_"),
           starts_with("Q16_"), starts_with("Q17_"),
           starts_with("Q18_"), starts_with("Q19_"),
           salary_lb)) %>%
  mutate(Total = "`Total`") %>%
  gather("fake_key", "skillset",
         -c(Q5, salary_lb), na.rm = T) %>%
  filter(!skillset %in% c("None", "Other")) %>%
  rename(title = Q5) %>%
  group_by(title, skillset) %>%
  summarise(n = n(),
            salary_mean = round(mean(salary_lb, na.rm = T)),
            salary_sd = round(sd(salary_lb, na.rm = T)),
            .groups = "drop") %>%
  group_by(title) %>%
  mutate(prop = round(n * 100 / max(n), 1)) %>%
  filter(prop >= 0.1) %>%
  select(-n) %>%
  arrange(title, desc(prop))
datatable(skill.set, filter = 'top', width = 800)
# Q6
industry.dat <- job.dat %>%
  filter(Q5 != "Student") %>%
  select(Q5, Q20, salary_lb, salary_ub) %>%
  mutate(Q20 = case_when(
    Q20 == "Academics/Education" ~ "Academics",
    Q20 == "Accounting/Finance" ~ "Finance",
    Q20 == "Computers/Technology" ~ "Computers",
    Q20 == "Medical/Pharmaceutical" ~ "Medical",
    Q20 == "Online Service/Internet-based Services" ~ "Internet",
    TRUE ~ Q20
  )) %>%
  filter(Q20 %in% c("Academics", "Finance", "Computers",
                    "Medical", "Internet"))
p <- industry.dat %>%
  count(Q5, Q20) %>%
  mutate(Q20 = fct_reorder(Q20, n, .fun="sum")) %>%
  rename(title=Q5, Industry=Q20, count=n) %>%
  ggplot(aes(x=title, y=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ Industry) +
  labs(
    title = "Users' work industry",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p)
ind.dis <- industry.dat %>%
  filter(Q5 != "Statistician") %>%
  rename(Title = Q5, Industry = Q20)
ind.dis %>%
  group_by(Industry) %>%
  count(Title) %>%
  mutate(Proportion = round(prop.table(n), 3)) %>%
  select(-n) %>%
  pivot_wider(names_from = Industry, values_from = Proportion)
aca.dis <- ind.dis %>% filter(Industry == "Academics")
com.dis <- ind.dis %>% filter(Industry == "Computers")
fin.dis <- ind.dis %>% filter(Industry == "Finance")
int.dis <- ind.dis %>% filter(Industry == "Internet")
med.dis <- ind.dis %>% filter(Industry == "Medical")

ds.dis <- ind.dis %>% filter(Title == "Data Scientist")
da.dis <- ind.dis %>% filter(Title == "Data Analyst")
de.dis <- ind.dis %>% filter(Title == "Data Engineer")
my.prop.test <- function(target.dis, other.dis, title) {
  target <- sum(target.dis$Title == title)
  other <- sum(other.dis$Title == title)
  t.res <- prop.test(c(target, other),
                     c(nrow(target.dis), nrow(other.dis)),
                     alternative = "greater",
                     correct = TRUE)
  return(paste0("X-squared = ", round(t.res$statistic, 1),
                ", p-value = ", formatC(t.res$p.value,
                                        format = "e", digits = 2),
                ", 95% CI = (", round(t.res$conf.int[1], 2),
                ", ", round(t.res$conf.int[2], 2), "), ",
                if_else(t.res$p.value < 0.05, "H0 rejected",
                        "can't reject H0")))
}
my.prop.test2 <- function(ind.dis, title1, title2) {
  target <- sum(ind.dis$Title == title1)
  other <- sum(ind.dis$Title == title2)
  t.res <- prop.test(c(target, other),
                     c(nrow(ind.dis), nrow(ind.dis)),
                     alternative = "greater",
                     correct = TRUE)
  return(paste0("X-squared = ", round(t.res$statistic, 1),
                ", p-value = ", formatC(t.res$p.value,
                                        format = "e", digits = 2),
                ", 95% CI = (", round(t.res$conf.int[1], 2),
                ", ", round(t.res$conf.int[2], 2), "), ",
                if_else(t.res$p.value < 0.05, "H0 rejected",
                        "can't reject H0")))
}
p <- industry.dat %>%
  filter(salary_lb != 1000000) %>%
  mutate(Q20 = fct_reorder(Q20, salary_lb, .fun='length')) %>%
  ggplot(aes(x=Q20, y=salary_lb)) +
  geom_boxplot() +
  coord_flip() +
  facet_wrap(~ Q5) +
  labs(
    title = "Users' salary vs industry",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
# Q7
programming <- job.dat %>%
  select(c(Q5, starts_with("Q7_"))) %>%
  gather("fake_key", "language", -Q5, na.rm = T) %>%
  rename(title = Q5) %>%
  select(-fake_key) %>%
  filter(!language %in% c("None", "Other")) %>%
  count(title, language, .drop = FALSE) %>%
  complete(title, language) %>%
  replace_na(list(n = 0)) %>%
  group_by(title) %>%
  mutate(prop = prop.table(n))
p <- programming %>%
  mutate(text = paste0("Language: ", language, "\n",
                       "Job title: ", title, "\n",
                       "Count: ", n, "\n",
                       "Proportion: ", round(prop, 3))) %>%
  ggplot(aes(language, title, fill=prop, text=text)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="royalblue") +
  labs(
    title = "Users' favorite programming language",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_text(angle=90, hjust=1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
ide <- job.dat %>%
  select(c(Q5, starts_with("Q9_"))) %>%
  gather("fake_key", "IDE", -Q5, na.rm = T) %>%
  rename(title = Q5) %>%
  select(-fake_key) %>%
  mutate(IDE = case_when(
    IDE == "Visual Studio Code (VSCode)" ~ "VSCode",
    IDE == "Jupyter (JupyterLab, Jupyter Notebooks, etc)" ~ "Jupyter Notebook",
    TRUE ~ IDE
  )) %>%
  filter(!IDE %in% c("None", "Other")) %>%
  count(title, IDE, .drop = FALSE) %>%
  complete(title, IDE) %>%
  replace_na(list(n = 0)) %>%
  group_by(title) %>%
  mutate(prop = prop.table(n))
p <- ide %>%
  mutate(text = paste0("IDE: ", IDE, "\n",
                       "Job title: ", title, "\n",
                       "Count: ", n, "\n",
                       "Proportion: ", round(prop, 3))) %>%
  ggplot(aes(IDE, title, fill=prop, text=text)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="royalblue") +
  labs(
    title = "Users' favorite IDE",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_text(angle=90, hjust=1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
pj.dat <- job.dat %>%
  select(c(Q5, Q7_Part_1, Q9_Part_1, Q9_Part_11)) %>%
  rename(Title = Q5) %>%
  mutate(Python = !is.na(Q7_Part_1),
         Jupyter = !is.na(Q9_Part_1) | !is.na(Q9_Part_11)) %>%
  select(Title, Python, Jupyter) %>%
  pivot_longer(cols = c(Python, Jupyter)) %>%
  filter(value == TRUE)
py.ju.test <- function(excludes) {
  pj.t.dat <- pj.dat %>% filter(!Title %in% excludes)
  print(table(pj.t.dat$name, pj.t.dat$Title))
  pj.t.res <- chisq.test(pj.t.dat$name, pj.t.dat$Title,
                         simulate.p.value = T)
  return(paste0("X-squared = ", round(pj.t.res$statistic, 1),
                ", p-value = ", round(pj.t.res$p.value, 2),
                if_else(pj.t.res$p.value < 0.05, ", H0 rejected",
                        ", can't reject H0")))
}
table(pj.dat$Title, pj.dat$name)
# Q8
learning_platform <- job.dat %>%
  select(c(Q5, starts_with("Q40_"))) %>%
  gather("fake_key", "learning", -Q5, na.rm = T) %>%
  rename(title = Q5) %>%
  select(-fake_key) %>%
  mutate(learning = case_when(
    learning == "Cloud-certification programs (direct from AWS, Azure, GCP, or similar)" ~ "Cloud-certif Programs",
    learning == "University Courses (resulting in a university degree)" ~ "University",
    TRUE ~ learning
  )) %>%
  filter(!learning %in% c("None", "Other")) %>%
  count(title, learning, .drop = FALSE) %>%
  complete(title, learning) %>%
  replace_na(list(n = 0)) %>%
  group_by(title) %>%
  mutate(prop = prop.table(n))
p <- learning_platform %>%
  mutate(text = paste0("Platform: ", learning, "\n",
                       "Job title: ", title, "\n",
                       "Count: ", n, "\n",
                       "Proportion: ", round(prop, 3))) %>%
  ggplot(aes(learning, title, fill=prop, text=text)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="royalblue") +
  labs(
    title = "Users' favorite learning platforms",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_text(angle=90, hjust=1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
share_deploy <- job.dat %>%
  select(c(Q5, starts_with("Q39_"))) %>%
  gather("fake_key", "share", -Q5, na.rm = T) %>%
  rename(title = Q5) %>%
  select(-fake_key) %>%
  mutate(share = case_when(
    share == "I do not share my work publicly" ~ "\'PRIVATE\'",
    TRUE ~ share
  )) %>%
  filter(!share %in% c("Other")) %>%
  count(title, share, .drop = FALSE) %>%
  complete(title, share) %>%
  replace_na(list(n = 0)) %>%
  group_by(title) %>%
  mutate(prop = prop.table(n))
p <- share_deploy %>%
  mutate(text = paste0("Platform: ", share, "\n",
                       "Job title: ", title, "\n",
                       "Count: ", n, "\n",
                       "Proportion: ", round(prop, 3))) %>%
  ggplot(aes(share, title, fill=prop, text=text)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="royalblue") +
  labs(
    title = "Users' favorite share platforms",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_text(angle=90, hjust=1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
media_source <- job.dat %>%
  select(c(Q5, starts_with("Q42_"))) %>%
  gather("fake_key", "media", -Q5, na.rm = T) %>%
  rename(title = Q5) %>%
  select(-fake_key) %>%
  filter(!media %in% c("None", "Other")) %>%
  count(title, media, .drop = FALSE) %>%
  complete(title, media) %>%
  replace_na(list(n = 0)) %>%
  group_by(title) %>%
  mutate(prop = prop.table(n)) %>%
  separate(media, into = c("media", "media_suffix"), sep = " \\(")
p <- media_source %>%
  mutate(text = paste0("Platform: ", media, "\n",
                       "Job title: ", title, "\n",
                       "Count: ", n, "\n",
                       "Proportion: ", round(prop, 3))) %>%
  ggplot(aes(media, title, fill=prop, text=text)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="royalblue") +
  labs(
    title = "Users' favorite media source",
    caption = glue("Author: celeritasML
                    Source: Kaggle")) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_text(angle=90, hjust=1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")
# Q9
viz_lib <- dat %>%
  select(Q5, starts_with("Q14")) %>%
  select(-c(Q14_Part_11, Q14_OTHER))
viz_lib <- viz_lib %>%
  pivot_longer(cols=starts_with("Q14")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                     "Developer Relations/Advocacy",
                     "Currently not employed",
                     "Statistician", "Product Manager")))
ggplot(viz_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite viz libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                    Source: Kaggle")
  ) +
  theme(
    axis.ticks.x = element_line(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank())
ml_lib <- dat %>%
  select(Q5, starts_with("Q16")) %>%
  select(-c(Q16_Part_17, Q16_OTHER))
ml_lib <- ml_lib %>%
  pivot_longer(cols=starts_with("Q16")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                     "Developer Relations/Advocacy",
                     "Currently not employed",
                     "Statistician", "Product Manager")))
top_10_ml <- ml_lib %>%
  group_by(value) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10)
ml_lib <- ml_lib %>%
  filter(value %in% top_10_ml$value)
ggplot(ml_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite ML libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                    Source: Kaggle")
  ) +
  theme(
    axis.ticks.x = element_line(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_text(size = 6),
    axis.text.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank())
viz_lib %>%
  group_by(Q5) %>%
  table() %>%
  addmargins() %>%
  print()
ml_lib %>%
  group_by(Q5) %>%
  table() %>%
  addmargins() %>%
  print()
# Q10
aws_user <- tibble(
  ec2 = dat$Q29_A_Part_1,
  s3 = dat$Q30_A_Part_3,
  efs = dat$Q30_A_Part_4,
  sagemaker = dat$Q31_A_Part_1,
  redshift = dat$Q32_A_Part_11,
  aurora = dat$Q32_A_Part_12,
  rds = dat$Q32_A_Part_13,
  dynamodb = dat$Q32_A_Part_14
) %>%
  mutate(ec2 = if_else(is.na(ec2), 0, 1),
         s3 = if_else(is.na(s3), 0, 1),
         efs = if_else(is.na(efs), 0, 1),
         sagemaker = if_else(is.na(sagemaker), 0, 1),
         redshift = if_else(is.na(redshift), 0, 1),
         aurora = if_else(is.na(aurora), 0, 1),
         rds = if_else(is.na(rds), 0, 1),
         dynamodb = if_else(is.na(dynamodb), 0, 1))
chisq.test(aws_user$ec2, aws_user$s3)
chisq.test(aws_user$ec2, aws_user$efs)
chisq.test(aws_user$ec2, aws_user$sagemaker)
# Q11
cloud_comp <- tibble(
  aws_usage = dat$Q27_A_Part_1,
  azure_usage = dat$Q27_A_Part_2,
  gcp_usage = dat$Q27_A_Part_3
) %>%
  mutate(aws_usage = if_else(is.na(aws_usage), FALSE, TRUE),
         azure_usage = if_else(is.na(azure_usage), FALSE, TRUE),
         gcp_usage = if_else(is.na(gcp_usage), FALSE, TRUE))
prop.test(c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
# Q12
lm_dat <- dat %>%
  select(Q1, Q2, Q4, Q5, Q6,
         Q7_Part_1, Q7_Part_2, Q7_Part_3,
         Q25) %>%
  drop_na(Q25) %>%
  rename(age_group = Q1,
         gender = Q2,
         education = Q4,
         job = Q5,
         experience = Q6,
         salary = Q25,
         python = Q7_Part_1,
         r = Q7_Part_2,
         sql = Q7_Part_3)
set.seed(511)
# - poverty: below the federal minimum wage
# - low: 40,000 to 79,999
# - medium: 80,000 to 124,999
# - high: 125,000 to 199,999
# - very high: 200,000 to 499,999
# - highest: >= 500,000
lm_dat <- lm_dat %>%
  rowwise() %>%
  mutate(salary = case_when(
    salary == "poverty" ~ sample(1:39999, 1),
    salary == "low" ~ sample(40000:79999, 1),
    salary == "medium" ~ sample(80000:124999, 1),
    salary == "high" ~ sample(125000:199999, 1),
    salary == "very high" ~ sample(200000:499999, 1),
    salary == "highest" ~ sample(500000:2000000, 1)
  )) %>%
  ungroup()
ggplot(lm_dat, aes(x=salary)) +
geom_histogram(binwidth = 5000) +
labs(
title = "Histogram of salaries",
subtitle = "raw data without transformation",
caption = glue("Author: celeritasML
Source: Kaggle"))
ggplot(lm_dat, aes(x=log(salary))) +
geom_histogram(binwidth = 0.05) +
labs(
title = "Histogram of log(salary)",
subtitle = "after log-transformation on response",
caption = glue("Author: celeritasML
Source: Kaggle"))
lm_dat <- lm_dat %>%
  mutate(python = if_else(is.na(python), 0, 1),
         r = if_else(is.na(r), 0, 1),
         sql = if_else(is.na(sql), 0, 1))
model1 <- lm(log(salary) ~ . + sql:r + sql:python + r:python + python:r:sql +
               job:sql + job:r + job:python + job:experience +
               job:education, data=lm_dat)
model2 <- MASS::stepAIC(model1, trace = FALSE)

anova(model2)
summary(model2)