Data Science Career Paths & Skillsets in 2021

A glimpse of Kaggle’s State of Data Science and Machine Learning Survey

Introduction

Each year, Kaggle conducts a worldwide survey among active users to provide a comprehensive view of the data science (DS) and machine learning (ML) community. After preliminary cleaning, a total of 25,973 responses were recorded for analysis. Following the survey, Kaggle hosts an annual Data Science Survey Challenge that awards top notebook authors who tell a rich story about a subset of the DS and ML community.

For this project, the goal is to tell a data science story about data science professionals working in the United States. The questions raised compare and contrast job titles that fall under the general umbrella of data science, such as data scientist, data analyst, machine learning engineer, and statistician. The key features used to tell the story include, but are not limited to, salary, preferences for programming languages and environments, hardware/cloud utilization, diversity, and inclusion. Not only are these interesting questions in their own right, but they also provide insights to help fellow students in the Data Science and Analytics program at Georgetown University narrow down their job searches.

Data and Cleaning

This research draws data from the 2021 Kaggle Machine Learning & Data Science Survey conducted anonymously among the Kaggle community from 09/01/2021 to 10/04/2021. The dataset consists of 25,973 valid responses to the survey questionnaire, which encompasses 42 questions about personal background and preferences regarding programming languages, IDEs, cloud services, computing hardware, databases, machine learning packages, and workflows. Given the scope of this data science story, a subset of the data is sliced out as the data of interest.

To give a more precise view of the data science and machine learning community, the scope of the research is narrowed down to only Kaggle users from the United States of America. As a result, 2,650 responses (10.2%) are selected from the original dataset.

Also, some irrelevant or overly specific questions are removed from the dataset, and only 21 questions (50%) are used in the analysis below.

The next step of data cleaning is handling NA values. In the questionnaire, some questions accept only a single answer and others allow multiple answers. For single-answer questions, an NA value means the respondent did not answer the question; e.g., no students answered the question about salary. For multiple-answer questions, an NA value means the respondent did not select that option; e.g., in the question about programming languages, no respondent selected all of the listed languages.

Besides that, some special answers are treated as not useful and removed: OTHER, which means the answer is not specified, and NONE, which means none of the listed answers applies to the respondent.

The original salary responses are categorical, with 27 classes formatted like $150,000-199,999, so two different approaches are used to transform the salary.

  • Aggregation: the categorical salaries are grouped into 6 levels {poverty, low, medium, high, very high, highest}.
  • String Split: the categorical salaries are separated into two columns {lower bound, upper bound}.
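The two transformations above can be sketched as follows. This is a minimal illustration on three of the 27 original categories; the bracket edges follow the six-level scheme defined later in this report.

```r
# A minimal sketch of the two salary transformations, using three of the
# 27 original categories as examples.
salary <- c("$0-999", "25,000-29,999", "$150,000-199,999")

# String split: strip "$" and "," and separate into lower/upper bounds.
clean <- gsub("[$,]", "", salary)
parts <- strsplit(clean, "-")
lower <- as.numeric(sapply(parts, `[`, 1))
upper <- as.numeric(sapply(parts, `[`, 2))

# Aggregation: bucket the lower bound into the six levels.
level <- cut(lower,
             breaks = c(-Inf, 15000, 50000, 100000, 200000, 500000, Inf),
             labels = c("poverty", "low", "medium", "high",
                        "very high", "highest"),
             right  = FALSE)
```

On the example categories this yields lower bounds 0, 25000, 150000 and the aggregated levels poverty, low, and high.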

Statistical Questions and Methods

Data science is a broad and interdisciplinary field. As a result, there is a lot of variation in the day-to-day responsibilities, workflows, skill sets, and salaries for jobs related to data science. The rough idea is to make some key comparisons among the following job titles: data analyst, data engineer, data scientist, machine learning engineer, software engineer, statistician. Preliminarily, these are the questions of interest regarding career paths, with proposed analytical methods in parentheses:

  1. What percentage of the survey respondents are working under these job titles? (EDA)
  2. What are the statistics on salaries for these job titles? (EDA, ANOVA, hypothesis testing)
  3. What levels of education are required for these job titles? (EDA)
  4. Is there a significant income gap between genders for these jobs? (EDA, hypothesis testing)
  5. What is the typical skill set for these jobs? How does it affect the pay rate? (regression, ANOVA)
  6. Is there a certain correlation between industry and the need for these jobs? (EDA, bootstrap, hypothesis testing)

Additionally, it is worth investigating the requisite skillsets for each data science career, as guidance for our program cohort on which skills and technologies to focus their career preparation. The inquiry includes but is not limited to the following questions:

  1. What programming languages and IDEs do they use? (hypothesis test on the two-way table)
  2. Where do they get and share the knowledge? (EDA)
  3. What data science packages do they use? (EDA)
  4. Are they using cloud computing tools? What are their preferences? Will a user’s preference for cloud computing platforms affect his or her preference for big data products/cloud storage products? (Hypothesis testing)
  5. What is the overall AWS usage percentage among DS practitioners? Is it the same for Google Cloud? (law of large numbers, two-proportion z-test)

The Analysis and Results

Since the one large data science problem can be divided into several smaller parts, it is practical to combine the analysis and interpretation sections, addressing and discussing the questions one by one.

The following questions are about data science career paths.

Job Titles Distributions

What percentage of the survey respondents are working under these job titles?

The percentages are shown in the legends of the following waffle plot.

Salaries

What are the statistics on salaries for these job titles?

The US federal minimum wage is $7.25 per hour; multiplying by 8 hours per day and a typical \(260\) workdays per year gives:

\[7.25 \frac{\$}{\text{hour}} \cdot 8 \frac{\text{hours}}{\text{day}} \cdot 260\frac{\text{days}}{\text{year}} = 15080.\]

The new categories* for salaries will be:

  • poverty: below the federal minimum wage
  • low: 15,000 to 49,999
  • medium: 50,000 to 99,999
  • high: 100,000 to 199,999
  • very high: 200,000 to 499,999
  • highest: >= 500,000

*loosely based on US federal income tax brackets.

A chi-squared test of independence can be used to test for statistical significance of the difference in categorical frequencies across salary categories for each job title. First, the test is done on data-oriented data science roles, see below:

## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  q2_table.1
## X-squared = 180.52, df = NA, p-value = 9.999e-05

Qualitatively, it can be seen that each job title has a seemingly different categorical distribution over the salary brackets. The chi-squared test using simulated p-values reveals that this difference is statistically significant as well, with a p-value of \(9.999\times10^{-5}\). At an alpha level of 0.05, the null hypothesis that the frequency of counts over salary brackets is independent of the job title is rejected.

Because the original salary data is categorical instead of a numerical estimation, a two-sample t-test of the difference in mean salary is not possible. But the visualization above reveals that for data scientists and data engineers, there are more people in the medium and high salary brackets than in the low bracket; the opposite is true for business analysts and data analysts. The visualization alone suggests that an “analyst” is considered a more junior role compared to an “engineer” or a “data scientist.” There are too few DBA/database engineers and statisticians represented in the dataset for conclusions to be drawn about those categories.

It is also worth noting that most of the highest earning (\(>\$500,000\)) people in the dataset are data scientists.

Below, the analysis turns to job titles oriented toward software and product management, such as machine learning engineer (MLE), software engineer (SWE), and product manager (PM).

## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  q2_table.2
## X-squared = 39.402, df = NA, p-value = 0.0011

As can be seen above, more MLEs command the highest category of salary than any other job title, and the average MLE earns at or above the medium salary bracket. As for PMs, very few of them earn a low salary; in fact, most of them are in the medium to very high income brackets. The salary distribution for project managers is similar to that of PMs, except that significantly more professionals fall in the low income bracket. Last but not least, most SWEs command a medium to high salary, with a significant number of professionals still earning outside this range on both the high end and the low end.

Required Education Levels

What levels of education are required for these job titles?

Aggregate Level

The percentages are shown in the legends of the following waffle plot.

## 
##                                                   Master’s degree 
##                                                       0.466037736 
##                                                 Bachelor’s degree 
##                                                       0.269433962 
##                                                   Doctoral degree 
##                                                       0.164528302 
## Some college/university study without earning a bachelor’s degree 
##                                                       0.064905660 
##                                            Professional doctorate 
##                                                       0.016981132 
##                                            I prefer not to answer 
##                                                       0.011320755 
##                              No formal education past high school 
##                                                       0.006792453

At an aggregate level, it is evident that more than half of the data science community on Kaggle has a degree above the undergraduate level. Combining the percentages of people with a master’s degree or a doctoral degree (including a professional doctorate) gives us that \(64.7\%\) of respondents hold a graduate degree. Given the level of programming skill required, as well as the amount of requisite knowledge in statistics and machine learning, this is an expected phenomenon.
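The roughly 64.7% figure can be reproduced directly from the proportions printed above:

```r
# Graduate-degree share: master's + doctoral + professional doctorate,
# using the proportions from the output above.
grad <- 0.466037736 + 0.164528302 + 0.016981132
100 * grad  # roughly 64.75
```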

Most professional data scientists have a master’s degree or above:

The percentages are shown in the legends of the following waffle plot.

The same doesn’t necessarily go for software engineering:

The percentages are shown in the legends of the following waffle plot.

Gender Gap in Income

Is there a significant income gap between genders for these jobs?

What about at an aggregate level?

##                          
##                           poverty low medium high very high highest
##   Man                         125 108    369  584       187      28
##   Nonbinary                     1   5      5    5         0       1
##   Prefer not to say            11   4      4   12         6       1
##   Prefer to self-describe       1   1      0    1         0       0
##   Woman                        68  57    118  122        19       2

As seen above at an aggregate level, for salary levels high and above there are a total of 799 men and only 143 women. The gender gap is apparent.
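The head counts can be read directly off the table above:

```r
# Head counts for the "high" bracket and above, from the table above.
men   <- c(high = 584, very_high = 187, highest = 28)
women <- c(high = 122, very_high = 19,  highest = 2)
sum(men)    # 799
sum(women)  # 143
```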

What about groups aggregated by years of experience?

## [1] "< 1 years"
##                          
##                           poverty low medium high very high highest
##   Man                          13  20     46   17         4       0
##   Nonbinary                     0   1      0    0         0       0
##   Prefer not to say             0   1      0    0         0       0
##   Prefer to self-describe       0   1      0    0         0       0
##   Woman                        14  15     21    8         3       0
## [1] "10-20 years"
##                    
##                     poverty low medium high very high highest
##   Man                    10   7     34  142        49      11
##   Nonbinary               0   0      0    1         0       0
##   Prefer not to say       1   1      1    1         1       0
##   Woman                   6   5      6   24         7       1
## [1] "1-3 years"
##                          
##                           poverty low medium high very high highest
##   Man                          24  24     81   50         9       1
##   Nonbinary                     0   2      2    0         0       1
##   Prefer not to say             3   1      1    0         0       0
##   Prefer to self-describe       1   0      0    1         0       0
##   Woman                        16  10     29   13         0       0
## [1] "5-10 years"
##                    
##                     poverty low medium high very high highest
##   Man                    16  17     67  140        31       3
##   Nonbinary               1   0      1    1         0       0
##   Prefer not to say       1   1      0    6         2       0
##   Woman                   7   6     12   29         5       0
## [1] "20+ years"
##                    
##                     poverty low medium high very high highest
##   Man                    34  12     37  160        82      12
##   Nonbinary               0   0      1    2         0       0
##   Prefer not to say       5   0      1    3         3       1
##   Woman                   6   2      8   21         4       1
## [1] "3-5 years"
##                          
##                           poverty low medium high very high highest
##   Man                          17  18     76   60         9       1
##   Nonbinary                     0   1      1    1         0       0
##   Prefer not to say             0   0      1    1         0       0
##   Prefer to self-describe       0   0      0    0         0       0
##   Woman                        11   5     21   22         0       0

Skillset vs Income

What is the typical skill set for these jobs? How does it affect the pay rate?

Here a key skill is defined as a skill that has been acquired by more than 10% of the people under a certain job title.
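The 10% rule amounts to a simple filter; the shares below are made-up toy values, not survey results:

```r
# Keep a skill as a "key skill" if more than 10% of respondents under a
# job title report it (shares here are invented for illustration).
share <- c(python = 0.92, r = 0.35, matlab = 0.04)
names(share)[share > 0.10]  # "python" "r"
```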

From the table, the huge salary variances make it very hard to tell whether a given skill increases the salary or not.

Correlation between Industry and Job

Is there a certain correlation between industry and the need for these jobs?

A prop.table version of the industry-by-title counts:

## # A tibble: 5 × 6
##   Title                     Academics Computers Finance Internet Medical
##   <chr>                         <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
## 1 Data Analyst                  0.4       0.105   0.258    0.105   0.214
## 2 Data Engineer                 0.047     0.091   0.129    0.105   0.071
## 3 Data Scientist                0.353     0.338   0.387    0.386   0.459
## 4 Machine Learning Engineer     0.082     0.15    0.056    0.211   0.071
## 5 Software Engineer             0.118     0.317   0.169    0.193   0.184

Some hypotheses to examine with prop.test:

  1. \(H_a\): Data Analyst is more needed in Academics Industry.

To Computers: X-squared = 38.1, p-value = 3.28e-10, 95% CI = (0.2, 1), H0 rejected.
To Finance: X-squared = 4.1, p-value = 2.19e-02, 95% CI = (0.02, 1), H0 rejected.
To Internet: X-squared = 13.2, p-value = 1.38e-04, 95% CI = (0.17, 1), H0 rejected.
To Medical: X-squared = 6.6, p-value = 5.07e-03, 95% CI = (0.06, 1), H0 rejected.

TRUE, Data Analyst is more needed in Academics Industry.

  2. \(H_a\): Data Scientist is more needed in Medical Industry.

To Academics: X-squared = 1.7, p-value = 9.56e-02, 95% CI = (-0.02, 1), can’t reject H0.
To Computers: X-squared = 4.1, p-value = 2.14e-02, 95% CI = (0.02, 1), H0 rejected.
To Finance: X-squared = 0.9, p-value = 1.73e-01, 95% CI = (-0.05, 1), can’t reject H0.
To Internet: X-squared = 0.5, p-value = 2.36e-01, 95% CI = (-0.08, 1), can’t reject H0.

FALSE, Data Scientist has similar demand across these industries.

  3. \(H_a\): Software Engineer is more needed in Computers Industry.

To Academics: X-squared = 12.2, p-value = 2.39e-04, 95% CI = (0.12, 1), H0 rejected.
To Finance: X-squared = 8.8, p-value = 1.51e-03, 95% CI = (0.07, 1), H0 rejected.
To Internet: X-squared = 2.9, p-value = 4.32e-02, 95% CI = (0.02, 1), H0 rejected.
To Medical:X-squared = 5.8, p-value = 8.17e-03, 95% CI = (0.05, 1), H0 rejected.

TRUE, Software Engineer is more needed in Computers Industry.

  4. \(H_a\): Machine Learning Engineer is more needed in Internet Industry.

To Academics: X-squared = 2, p-value = 7.77e-02, 95% CI = (0, 1), can’t reject H0.
To Computers: X-squared = 3.3, p-value = 9.66e-01, 95% CI = (-0.14, 1), can’t reject H0.
To Finance: X-squared = 6.2, p-value = 6.32e-03, 95% CI = (0.04, 1), H0 rejected.
To Medical: X-squared = 3.3, p-value = 3.44e-02, 95% CI = (0.02, 1), H0 rejected.

FALSE, the demand for Machine Learning Engineers in the Internet industry is not significantly greater than in Academics and Computers.

  5. \(H_a\): Overall, Data Scientist is more needed than other jobs.

To Data Analyst: X-squared = 53, p-value = 1.66e-13, 95% CI = (0.14, 1), H0 rejected.
To Data Engineer: X-squared = 143.1, p-value = 2.74e-33, 95% CI = (0.24, 1), H0 rejected.
To Software Engineer: X-squared = 29.5, p-value = 2.76e-08, 95% CI = (0.1, 1), H0 rejected.
To ML Engineer: X-squared = 113.3, p-value = 9.36e-27, 95% CI = (0.22, 1), H0 rejected.

TRUE, Data Scientist is more needed than the other 4 jobs.
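The pairwise comparisons above use one-sided two-proportion tests. A sketch of one such test follows; the counts are hypothetical (only the proportions, not the raw counts, appear in this report), chosen to match the proportions 0.4 and 0.105 for Data Analysts in Academics vs Computers.

```r
# One-sided test that the share of Data Analysts is higher in Academics
# than in Computers; the counts 34/85 and 22/209 are hypothetical but
# consistent with the proportions in the table above.
res <- prop.test(x = c(34, 22), n = c(85, 209), alternative = "greater")
res$p.value
```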

Salaries for these jobs vary widely from industry to industry.

Languages and IDEs

What programming languages and IDEs do they use?

Survey questions Q7 (daily-used programming language), Q9 (IDE).

In the heatmap of programming language below, some trends can be found:

  • Python, R, and SQL are the three most popular languages in the Kaggle community.
  • Statisticians use R more than Python and SQL.
  • C++, Java, and JavaScript are mainly used by Software Engineers.
  • Julia and Swift are barely used by Kaggle users.

In the heatmap of IDE below, some trends can be found:

  • Jupyter Notebook is the most popular IDE among Kaggle users.
  • MATLAB is barely used in industry (last 5 jobs).
  • Statisticians use RStudio the most.
  • VSCode is overall the second most commonly used IDE.

From the two plots above, Python is the most popular programming language among Kaggle users, and Jupyter Notebook (a Python IDE) is also the most commonly used IDE. Is there any correlation between the two?

##                            
##                             Jupyter Python
##   Data Analyst                  167    186
##   Data Engineer                  62     80
##   Data Scientist                357    407
##   Machine Learning Engineer      85    100
##   Software Engineer             134    179
##   Statistician                    9     19
##   Student                       292    396

\(H_a\): The distribution of Python is significantly different from the distribution of Jupyter Notebook.

All job titles: X-squared = 6.1, p-value = 0.4, can’t reject H0.
Without Statistician: X-squared = 4.3, p-value = 0.51, can’t reject H0.
Without Statistician, Student: X-squared = 2, p-value = 0.73, can’t reject H0.

FALSE. There is no significant difference between the distribution of Python and the distribution of Jupyter Notebook across jobs. In other words, most people who use Python also use Jupyter Notebook as their IDE.
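The test can be reproduced from the printed contingency table:

```r
# Chi-squared test on the Jupyter/Python counts shown above.
tab <- matrix(c(167, 186,
                 62,  80,
                357, 407,
                 85, 100,
                134, 179,
                  9,  19,
                292, 396),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("Data Analyst", "Data Engineer",
                                "Data Scientist", "Machine Learning Engineer",
                                "Software Engineer", "Statistician", "Student"),
                              c("Jupyter", "Python")))
chisq.test(tab)
```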

Learning Sources

Where do they get and share the knowledge?

Survey questions Q39 (share and deploy), Q40 (learning resources), Q42 (Media sources).

In the heatmap of learning platform below, some trends can be found:

  • Coursera is the most popular learning platform outside the university.
  • Statisticians learn mainly from university courses.

In the heatmap of sharing platform below, some trends can be found:

  • GitHub is the most popular sharing platform.
  • Kaggle, Colab, and personal blogs are used by a small proportion of Kaggle users.
  • Other platforms listed on the questionnaire are barely used by Kaggle users.
  • A large proportion of people don’t share their work publicly.
  • No students answered this question.

In the heatmap of media source below, some trends can be found:

  • Blogs, Kaggle, and YouTube are the three most popular media sources.
  • Many statisticians use journal publications as their media sources.

Special Job Title: Statistician

From all the analysis above, Statisticians show many preferences that differ from other jobs. For example:

  • Only a few people identify themselves as Statisticians.
  • They mostly work in the academic industry.
  • They use R more frequently, and they use RStudio a lot.
  • They learn most of their knowledge from university courses.
  • Many of them use journal publications to follow data science topics.

Based on these features, Statistician is very likely a job title held mainly by professors and researchers in academic institutions (e.g., universities).

The Loyalty to AWS

Cloud computing has clearly been the general trend in data science over the past couple of years, and it will likely continue to prevail in the years to come, especially as the concept of the “metaverse” is brewing new possibilities that require cloud computation as their foundation. The market is largely occupied by Amazon, Google, and Microsoft.

Specifically, will a user’s preference for a cloud computing platform affect his or her preference for other tools? For example, a dedicated AWS EC2 user may actually prefer AWS S3 over other products.

A list of AWS services is selected for this hypothesis, including EC2, S3, EFS, SageMaker, Redshift, Aurora, RDS, and DynamoDB. The data covers the following survey questions:

  • Survey question Q29-A: computing products (Part_1)
  • Survey question Q30: Storage (Part_3, Part_4)
  • Survey question Q31-A: ML products (Part_1)

The null hypothesis to be tested is:

  • \(H_0:\) Use of EC2 is independent of use of S3 (S3 could be replaced by any other AWS service).
  • \(H_a:\) Use of EC2 is not independent of use of S3.

The hypothesis is tested by checking the \(\chi^2\) as the statistic.

\[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\]

The following three tests are conducted between:

  • EC2 vs S3
  • EC2 vs EFS
  • EC2 vs SageMaker
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$s3
## X-squared = 1532, df = 1, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$efs
## X-squared = 637.99, df = 1, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  aws_user$ec2 and aws_user$sagemaker
## X-squared = 566.34, df = 1, p-value < 2.2e-16

Since all three p-values are far smaller than the significance level of 0.05, the null hypothesis is rejected in every case. That means the EC2 user group is not independent of the user groups of the other three AWS services (S3, EFS, and SageMaker).

A possible explanation is that EC2 is the core service of AWS, and many other AWS services rely on a running elastic computing instance. That implies a data science practitioner who frequently uses EC2 is more likely to use other AWS services as well.
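The mechanics of these tests can be illustrated on simulated usage indicators (the real aws_user columns are not reproduced here; the dependence is built in by construction):

```r
# Simulated binary usage columns with built-in dependence: S3 usage is
# far more likely among EC2 users, mimicking the pattern found above.
set.seed(42)
n   <- 1000
ec2 <- rbinom(n, 1, 0.3)
s3  <- ifelse(ec2 == 1, rbinom(n, 1, 0.8), rbinom(n, 1, 0.1))
chisq.test(table(ec2, s3))  # small p-value: reject independence
```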

Cloud Computing Platforms Preferences

Another related question is comparing the market shares of the top 3 cloud computing providers. In the survey, market share is approximated by the usage rate among survey takers.

Again, two hypotheses are given:

  • \(H_0: p_A=p_B\)
  • \(H_a:p_A\not=p_B\)

where \(A\) and \(B\) can be replaced by AWS, Azure, or GCP; \(n_A\) and \(n_B\) are the sample sizes of groups \(A\) and \(B\), respectively; and \(p\) is the pooled proportion.

The test statistic (z-statistic) can be calculated as follow:

\[z=\frac{p_A-p_B}{\sqrt{p(1-p)/n_A+p(1-p)/n_B}}\]

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 81.285, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07606175 0.11865523
## sample estimates:
##    prop 1    prop 2 
## 0.2377358 0.1403774
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 1.8823, df = 1, p-value = 0.1701
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.005495458  0.031910552
## sample estimates:
##    prop 1    prop 2 
## 0.1403774 0.1271698
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)) out of c(nrow(cloud_comp), nrow(cloud_comp))
## X-squared = 107.85, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1315249 -0.0896072
## sample estimates:
##    prop 1    prop 2 
## 0.1271698 0.2377358

The results show that only the p-value of the second test is greater than the significance level of 0.05. That means:

  • Reject \(H_0\) that the usage rates of AWS and Microsoft Azure are the same.
  • Fail to reject \(H_0\) that the usage rates of Microsoft Azure and Google Cloud Platform are the same.
  • Reject \(H_0\) that the usage rates of Google Cloud Platform and AWS are the same.

Therefore, to simplify the result: AWS is leading the market by a wide margin, while Microsoft Azure and Google Cloud Platform are roughly tied in usage rate.

Another way to interpret the results is to look at the 95% confidence interval of each hypothesis test. For instance, the test between AWS usage and Azure usage gives a confidence interval of \((0.076, 0.119)\), so we are 95% confident that the true difference in usage rates lies within that interval.
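The AWS-vs-Azure test can be reproduced from the reported proportions. Out of 2,650 respondents, the sample estimates 0.2377 and 0.1404 correspond to roughly 630 and 372 users; these counts are back-solved inferences, since the raw counts are not printed above.

```r
# Two-proportion test with counts back-solved from the printed estimates
# (630/2650 = 0.2377..., 372/2650 = 0.1404...).
res <- prop.test(c(630, 372), c(2650, 2650))
res$p.value   # far below 0.05
res$conf.int  # close to the reported (0.076, 0.119)
```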

Linear Regression on Salary

Honestly speaking, the best way to approach this question would be to apply a logistic regression to the dependent variable salary, given its categorical nature. However, in order to experiment with a linear model, we take the bold step of transforming the salary back to a continuous variable by randomly sampling values within each pay bracket. Three typical languages used by data science practitioners are included as predictors: Python, R, and SQL.

A quick glimpse at the response uncovers the fact that salary is not normally distributed; in fact, it is right-skewed with a long tail. A log transformation of salary will not fully fix this issue, but it does make the linear model easier to fit.

Afterwards, the function MASS::stepAIC() is used to perform a stepwise regression to find suitable predictors. Note that the selection criterion is AIC, and the procedure starts from the maximal model with all the predictors plus some manually added interactions, including:

  • Interactions between languages, sql:r, sql:python, r:python, python:r:sql.
  • Interactions between jobs and languages, job:sql, job:r, job:python.
  • Interactions between jobs and other variables, job:experience, job:education.
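A sketch of this stepwise procedure follows, with simulated data standing in for the survey's lm_dat; the column names mirror the model above, while education is omitted for brevity.

```r
library(MASS)

# Toy stand-in for lm_dat; the real survey columns are not reproduced.
set.seed(1)
n <- 500
lm_dat <- data.frame(
  salary     = exp(rnorm(n, 11, 0.7)),
  job        = factor(sample(c("Data Scientist", "Data Analyst",
                               "Software Engineer"), n, replace = TRUE)),
  experience = factor(sample(c("1-3 years", "3-5 years", "5-10 years"),
                             n, replace = TRUE)),
  python     = rbinom(n, 1, 0.7),
  r          = rbinom(n, 1, 0.4),
  sql        = rbinom(n, 1, 0.5)
)

# Maximal model with the manually added interactions, then backward
# selection by AIC.
full <- lm(log(salary) ~ job + experience + python * r * sql +
             job:python + job:r + job:sql + job:experience,
           data = lm_dat)
fit <- stepAIC(full, trace = FALSE)
```

On the real data, the retained terms are those reported in the ANOVA table below.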

Finally, the automatic procedure of model fitting returns the following model as the result:

\[\log(\text{salary})=\beta_0+\beta_1 \text{age_group} +\beta_2\text{gender}+ \beta_3\text{job}+\beta_4\text{experience}\\ +\beta_5\text{python}+\beta_6\text{r}+\beta_7\text{sql}\\ +\beta_8\text{python}\cdot\text{sql} + \beta_9\text{python}\cdot\text{r} + \beta_{10}\text{job}\cdot\text{python} + \epsilon \]

## Analysis of Variance Table
## 
## Response: log(salary)
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## age_group    10  103.17 10.3170 14.3705 < 2.2e-16 ***
## gender        4   51.48 12.8702 17.9269 1.913e-14 ***
## job          12   63.62  5.3016  7.3845 2.108e-13 ***
## experience    6   55.41  9.2357 12.8644 2.811e-14 ***
## python        1    0.04  0.0448  0.0625   0.80268    
## r             1    0.05  0.0525  0.0731   0.78695    
## sql           1    0.04  0.0379  0.0528   0.81830    
## python:sql    1    4.76  4.7620  6.6329   0.01009 *  
## python:r      1    2.00  2.0006  2.7866   0.09523 .  
## job:python   12   17.89  1.4904  2.0760   0.01575 *  
## Residuals  1795 1288.68  0.7179                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = log(salary) ~ age_group + gender + job + experience + 
##     python + r + sql + python:sql + python:r + job:python, data = lm_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2002 -0.2588  0.0902  0.4265  3.2106 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            10.530416   0.267915  39.305  < 2e-16
## age_group22-24                          0.375789   0.208249   1.805 0.071319
## age_group25-29                          0.667577   0.198832   3.357 0.000803
## age_group30-34                          0.910161   0.198496   4.585 4.85e-06
## age_group35-39                          0.932372   0.199699   4.669 3.25e-06
## age_group40-44                          0.858800   0.203040   4.230 2.46e-05
## age_group45-49                          0.876625   0.204626   4.284 1.93e-05
## age_group50-54                          1.005792   0.205810   4.887 1.11e-06
## age_group55-59                          0.792527   0.210650   3.762 0.000174
## age_group60-69                          0.725414   0.208899   3.473 0.000528
## age_group70+                            0.400476   0.239091   1.675 0.094109
## genderNonbinary                        -0.528104   0.213639  -2.472 0.013530
## genderPrefer not to say                -0.326387   0.141802  -2.302 0.021466
## genderPrefer to self-describe          -0.491400   0.493495  -0.996 0.319504
## genderWoman                            -0.281689   0.050747  -5.551 3.27e-08
## jobData Analyst                        -0.056646   0.186378  -0.304 0.761218
## jobData Engineer                       -0.430005   0.411058  -1.046 0.295659
## jobData Scientist                      -0.011597   0.222771  -0.052 0.958487
## jobDBA/Database Engineer                0.245765   0.311723   0.788 0.430562
## jobDeveloper Relations/Advocacy         1.933314   0.630497   3.066 0.002199
## jobMachine Learning Engineer           -0.552834   0.384643  -1.437 0.150817
## jobOther                                0.073818   0.168189   0.439 0.660786
## jobProduct Manager                      0.299793   0.288356   1.040 0.298637
## jobProgram/Project Manager              0.193235   0.217665   0.888 0.374786
## jobResearch Scientist                   0.110558   0.220402   0.502 0.615996
## jobSoftware Engineer                    0.013011   0.208370   0.062 0.950217
## jobStatistician                        -0.595571   0.289542  -2.057 0.039836
## experience1-3 years                     0.091679   0.086024   1.066 0.286690
## experience10-20 years                   0.535813   0.089894   5.961 3.02e-09
## experience20+ years                     0.557339   0.092407   6.031 1.97e-09
## experience3-5 years                     0.270869   0.088927   3.046 0.002353
## experience5-10 years                    0.405516   0.085871   4.722 2.51e-06
## experienceI have never written code    -0.057973   0.128541  -0.451 0.652039
## python                                 -0.085435   0.209366  -0.408 0.683274
## r                                       0.163227   0.117684   1.387 0.165616
## sql                                    -0.202115   0.108713  -1.859 0.063167
## python:sql                              0.258247   0.119399   2.163 0.030682
## python:r                               -0.191537   0.127757  -1.499 0.133990
## jobData Analyst:python                  0.042105   0.224697   0.187 0.851379
## jobData Engineer:python                 0.494905   0.436385   1.134 0.256904
## jobData Scientist:python                0.246436   0.249720   0.987 0.323849
## jobDBA/Database Engineer:python        -0.757434   0.383077  -1.977 0.048168
## jobDeveloper Relations/Advocacy:python -1.425580   0.746101  -1.911 0.056202
## jobMachine Learning Engineer:python     0.827107   0.408757   2.023 0.043173
## jobOther:python                         0.024478   0.207962   0.118 0.906315
## jobProduct Manager:python               0.312287   0.335536   0.931 0.352128
## jobProgram/Project Manager:python       0.087848   0.263783   0.333 0.739149
## jobResearch Scientist:python           -0.306212   0.256948  -1.192 0.233525
## jobSoftware Engineer:python             0.003902   0.242807   0.016 0.987181
## jobStatistician:python                  0.448505   0.374177   1.199 0.230825
##                                           
## (Intercept)                            ***
## age_group22-24                         .  
## age_group25-29                         ***
## age_group30-34                         ***
## age_group35-39                         ***
## age_group40-44                         ***
## age_group45-49                         ***
## age_group50-54                         ***
## age_group55-59                         ***
## age_group60-69                         ***
## age_group70+                           .  
## genderNonbinary                        *  
## genderPrefer not to say                *  
## genderPrefer to self-describe             
## genderWoman                            ***
## jobData Analyst                           
## jobData Engineer                          
## jobData Scientist                         
## jobDBA/Database Engineer                  
## jobDeveloper Relations/Advocacy        ** 
## jobMachine Learning Engineer              
## jobOther                                  
## jobProduct Manager                        
## jobProgram/Project Manager                
## jobResearch Scientist                     
## jobSoftware Engineer                      
## jobStatistician                        *  
## experience1-3 years                       
## experience10-20 years                  ***
## experience20+ years                    ***
## experience3-5 years                    ** 
## experience5-10 years                   ***
## experienceI have never written code       
## python                                    
## r                                         
## sql                                    .  
## python:sql                             *  
## python:r                                  
## jobData Analyst:python                    
## jobData Engineer:python                   
## jobData Scientist:python                  
## jobDBA/Database Engineer:python        *  
## jobDeveloper Relations/Advocacy:python .  
## jobMachine Learning Engineer:python    *  
## jobOther:python                           
## jobProduct Manager:python                 
## jobProgram/Project Manager:python         
## jobResearch Scientist:python              
## jobSoftware Engineer:python               
## jobStatistician:python                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8473 on 1795 degrees of freedom
## Multiple R-squared:  0.1881, Adjusted R-squared:  0.1659 
## F-statistic: 8.484 on 49 and 1795 DF,  p-value: < 2.2e-16

Looking at both the summary() and anova() tables, the following takeaways emerge:

  • The predictors age_group, gender, job and experience are significant, while the three language indicators python, r and sql are not significant on their own.
  • However, the interaction terms python:sql, python:r and job:python do play a role in predicting the response log(salary).
  • Generally speaking, the age_group coefficients suggest that new graduates (22-24) and respondents aged 70+ tend to have lower salaries.
  • Compared to the baseline gender (Man), the other gender options are associated with lower salaries.
  • Years of coding experience and salary are positively related.
  • For some job titles, having python as a skillset is significantly related to salary. It is still puzzling, though, that the Developer Relations/Advocacy:python interaction has such a large negative effect on salary.
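Because the response is log(salary), each coefficient can be back-transformed into an approximate percentage change via exponentiation. A minimal R sketch, using coefficients copied from the summary table above:

```r
# exp(beta) - 1 converts a log-salary coefficient into the approximate
# fractional change in salary relative to the baseline level,
# holding the other predictors fixed
coefs <- c(
  "experience20+ years" = 0.557339,
  "genderWoman"         = -0.281689,
  "jobStatistician"     = -0.595571
)
round(100 * (exp(coefs) - 1), 1)
# e.g. 20+ years of experience corresponds to roughly a 75% higher
# salary than the baseline experience level, all else equal
```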

In general, this linear model is built upon many compromising assumptions, so its overall power to explain the variation is not ideal (adjusted R-squared of roughly 0.17). Nevertheless, it still reveals some insights about the data.
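A quick way to spot-check those assumptions is the standard set of residual diagnostics. A minimal sketch, assuming the fitted object model2 from the appendix is in scope:

```r
# Residuals vs fitted (linearity/equal variance), normal Q-Q (normality),
# scale-location, and residuals vs leverage (influential points)
par(mfrow = c(2, 2))
plot(model2)
```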

Conclusions

Since this project covers various aspects of data science, including background information and tool preferences, the conclusions cannot be summarized in a single sentence. There are still many takeaways, however:

  • Similar job titles don’t imply the same salary expectations. By salary standards, data scientist is a more senior role than data analyst or data engineer.
  • Women are extremely underrepresented in the highest-earning DS jobs.
  • Graduate degree holders (MS, PhD, JD, MD) are the majority in the DS community. The only exceptions by job title are data engineers and software engineers.
  • Learn Python, R, and SQL; the industry favors Python over R, but R is quite popular in academia.
  • Take courses on Coursera, Udacity, and Udemy, as well as at universities.
  • Share your work on GitHub.
  • Blogs, Kaggle, and YouTube are good sources of data science information.
  • Both Python and R offer an extensive set of useful packages for visualization and machine learning.
  • Pick a cloud computing platform and learn the products that interest you; AWS is highly recommended.

Limitations

Admittedly, Kaggle is a popular data science learning community with a focus on machine learning competitions, so its user base reflects only a partial status quo of data science practitioners in the industry. In addition, one should keep in mind that not all data science professionals are keen on participating in machine learning competitions while working “nine to five”.

Nevertheless, the survey still manages to give the audience a snapshot of what a data science practitioner’s career looks like, especially in a post-pandemic era. Hopefully, it could provide some guidance to graduate students in the States who are pursuing data science-related careers.

Appendix: R scripts

# invalidate cache when the package version changes
knitr::opts_chunk$set(cache = TRUE,
                      echo = FALSE,
                      eval = TRUE,
                      tidy = FALSE,
                      warning = FALSE,
                      cache.extra = packageVersion('tidyverse'))
options(htmltools.dir.version = FALSE)

if (!require("pacman")) install.packages("pacman")
if (!require("waffle")) install.packages("waffle", repos = "https://cinc.rud.is")
pacman::p_load(tidyverse, ggthemes, latex2exp, glue,
               hrbrthemes, plotly, stringr, DT, extrafont,
               tidymodels, rmdformats, waffle, pals)

set.seed(511)

ggplot2::theme_set(
    theme_fivethirtyeight() +
    theme(
        text = element_text(family = "Roboto Condensed"),
        title = element_text(size = 14),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 10),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.minor.x = element_blank()
    )
)

col_names <- names(read_csv(
  "data/kaggle_survey_2021_responses.csv",
  n_max=0))
dat <- read_csv(
  "data/kaggle_survey_2021_responses.csv",
  col_names = col_names, skip=2)

dat <- dat %>%
  filter(Q3=="United States of America" )

job.dat <- dat %>% 
    filter(Q5 %in% c("Data Analyst",
                     "Data Engineer",
                     "Data Scientist",
                     "Machine Learning Engineer",
                     "Software Engineer",
                     "Statistician",
                     "Student")) %>%
    mutate(Q25 = str_remove_all(Q25, "[$,]")) %>%
    mutate(Q25 = str_replace(Q25, ">1000000", "1000000-2000000")) %>%
    separate(Q25, into = c("salary_lb", "salary_ub"), sep = "-") %>%
    mutate(salary_lb = as.numeric(salary_lb)) %>%
    mutate(salary_ub = as.numeric(salary_ub))

# Q1

jtitle <- sort(table(dat$Q5), decreasing = T) %>% 
    as.data.frame() %>%
    as_tibble()
jtitle <- rename(jtitle, `Job Title` = Var1)

x <- jtitle$Freq/25
val_names <- sprintf("%s (%s)", jtitle$`Job Title`,
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=15)) +
  labs(
        title = "Job Title Distributions",
        subtitle = "US respondents in Kaggle's annual survey",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))

# Q2

# To change salary categories into FACTOR dtype with descending labels
poverty   = c("$0-999", "1,000-1,999" , "2,000-2,999", "3,000-3,999", "4,000-4,999", "5,000-7,499", "7,500-9,999",
             "10,000-14,999") 
low       = c("15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999", "40,000-49,999")
medium    = c("50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999") 
high      = c("100,000-124,999", "125,000-149,999", "150,000-199,999")
very_high = c("200,000-249,999", "250,000-299,999", "300,000-499,999")
highest   = c("$500,000-999,999", ">$1,000,000")

dat$Q25[dat$Q25 %in% poverty] <- "poverty"
dat$Q25[dat$Q25 %in% low] <- "low"
dat$Q25[dat$Q25 %in% medium] <- "medium"
dat$Q25[dat$Q25 %in% high] <- "high"
dat$Q25[dat$Q25 %in% very_high] <- "very high"
dat$Q25[dat$Q25 %in% highest] <- "highest"
dat$Q25 = factor(dat$Q25, levels = c("poverty", "low", "medium", "high", "very high", "highest"), ordered = T)

data_side <- c("Data Scientist", "Data Analyst", "Business Analyst", "Data Engineer", "Statistician", "DBA/Database Engineer")
swe_side <- c("Software Engineer", "Machine Learning Engineer", "Program/Project Manager", "Product Manager") 
academia <- c("Student", "Other", "Research Scientist") 

dat[dat$Q5 %in% data_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Data-Oriented Jobs") +
    xlab("") +
    labs(
        y = "salary",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))+
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.title = element_blank())


dat[dat$Q5 %in% swe_side & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Engineering-Oriented Jobs") +
    xlab("") + 
    labs(
        y = "salary",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))+
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.title = element_blank())

dat[dat$Q5 %in% academia & !is.na(dat$Q5) & !is.na(dat$Q25), ] %>%
  ggplot( aes(x=Q5, y=Q25, color = Q5)) +
    geom_count() + 
    ggtitle("Two-Way Salary Visualizations: Academic Jobs and Others") +
    xlab("") + 
    labs(
        y = "salary",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))+
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.title = element_blank())


expectedcounts <- function(A){
  r <- rowSums(A)
  c <- colSums(A)
  N = sum(A)
  expected <- outer(r,c)/N
  return(expected)
}


q2_table.1 = table(dat$Q25[dat$Q5 %in% data_side], dat$Q5[dat$Q5 %in% data_side])
q2_table.df = q2_table.1 %>% as.data.frame() %>% as_tibble()
q2_expectedcounts <- expectedcounts(q2_table.1) %>% as.data.frame() %>% as_tibble()

ggplot(q2_table.df, aes(x = Var2, y = Freq, fill = Var1)) +
    geom_col(position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(
        title = "Histogram of Salary",
        subtitle = "data-oriented data science roles",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

(chisq.test(q2_table.1, simulate.p.value = T, B = 10000))

q2_table.2 = table(dat$Q25[dat$Q5 %in% swe_side], dat$Q5[dat$Q5 %in% swe_side])
q2_table.df2 = q2_table.2 %>% as.data.frame() %>% as_tibble()
q2_expectedcounts2 <- expectedcounts(q2_table.2) %>% as.data.frame() %>% as_tibble()

ggplot(q2_table.df2, aes(x = Var2, y = Freq, fill = Var1)) +
    geom_col(position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(
        title = "Histogram of Salary",
        subtitle = "software and product management related data science roles",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

(chisq.test(q2_table.2, simulate.p.value = T, B = 10000))

# Q3

edu <- sort(table(dat$Q4), decreasing = T) %>% 
    as.data.frame() %>%
    as_tibble()
edu <- rename(edu, `Degree` = Var1)

x <- edu$Freq/18
val_names <- sprintf("%s (%s)", edu$Degree,
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=7)) +
  labs(
        title = "Waffle Chart of Education Level",
        subtitle = "US respondents in Kaggle's annual survey",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))

Q3_temp <- sort(table(dat$Q4), decreasing = T)
(Q3_temp <- Q3_temp / sum(Q3_temp))

edu.1 <- sort(table(dat[dat$Q5 == "Data Scientist",]$Q4), decreasing = T) %>% 
    as.data.frame() %>%
    as_tibble()
edu.1 <- rename(edu.1, `Degree` = Var1)

x <- edu.1$Freq/5
val_names <- sprintf("%s (%s)", edu.1$Degree,
                     scales::percent(round(x/sum(x), 3)))
names(x) <- val_names
waffle(x, rows = 5, color = pals::parula(n=7)) +
  labs(
        title = "Waffle Chart of Education Level",
        subtitle = "Subset of Data Scientists",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
  theme_ipsum_rc(grid="") +
  theme_enhance_waffle() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=6))

EDU <- table(dat$Q5, dat$Q4) %>% 
    as.data.frame() %>%
    as_tibble()
EDU <- rename(EDU, `Job Title` = Var1)
EDU <- rename(EDU, `Degree` = Var2)

print_pie_chart <- function(job){  # despite the name, this draws a waffle chart
    edu <- sort(table(dat[dat$Q5 == job,]$Q4), decreasing = T) %>% 
        as.data.frame() %>%
        as_tibble()
    edu <- rename(edu, `Degree` = Var1)
    
    x <- edu$Freq/ceiling(min(edu$Freq))
    val_names <- sprintf("%s (%s)", edu$Degree,
                         scales::percent(round(x/sum(x), 3)))
    names(x) <- val_names
    p <- waffle(x, rows = 5, color = pals::parula(n=length(unique(edu$Degree)))) +
      labs(
            title = "Waffle Chart of Education Level",
            subtitle = glue("Subset of ", job),
            caption = glue("Author: celeritasML
                       Source: Kaggle")) +
      theme_ipsum_rc(grid="") +
      theme_enhance_waffle() +
      theme(legend.position = "bottom",
            legend.text = element_text(size=6))
    print(p)
}
jobs <- unique(dat$Q5)

print_pie_chart("Software Engineer")
print_pie_chart("Data Engineer")

# Q4

table(dat$Q2, dat$Q25)

for (agegroup in unique(dat$Q6)){
    if (agegroup != "I have never written code"){
        print(agegroup)
        print(table(dat$Q2[dat$Q6 == agegroup], dat$Q25[dat$Q6 == agegroup]))
    }
}

# Q5

skill.set <- job.dat %>% 
    filter(Q5 != "Other") %>%
    select(c(Q5, starts_with("Q7_"), starts_with("Q9_"), 
             starts_with("Q12_"), starts_with("Q14_"),
             starts_with("Q16_"), starts_with("Q17_"),
             starts_with("Q18_"), starts_with("Q19_"),
             salary_lb)) %>%
    mutate(Total = "`Total`") %>%
    gather("fake_key", "skillset", 
           -c(Q5, salary_lb), na.rm = T) %>%
    filter(!skillset %in% c("None", "Other")) %>%
    rename(title = Q5) %>%
    group_by(title, skillset) %>%
    summarise(n = n(), 
              salary_mean = round(mean(salary_lb, na.rm = T)),
              salary_sd = round(sd(salary_lb, na.rm = T)),
              .groups = "drop") %>%
    group_by(title) %>%
    mutate(prop = round(n * 100 / max(n), 1)) %>%
    filter(prop >= 0.1) %>%
    select(-n) %>%
    arrange(title, desc(prop))

datatable(skill.set, filter = 'top', width = 800)

# Q6

industry.dat <- job.dat %>%
    filter(Q5 != "Student") %>%
    select(Q5, Q20, salary_lb, salary_ub) %>%
    mutate(Q20 = case_when(
        Q20 == "Academics/Education" ~ "Academics",
        Q20 == "Accounting/Finance" ~ "Finance", 
        Q20 == "Computers/Technology" ~ "Computers",
        Q20 == "Medical/Pharmaceutical" ~ "Medical",
        Q20 == "Online Service/Internet-based Services" ~ "Internet",
        TRUE ~ Q20
    )) %>%
    filter(Q20 %in% c("Academics", "Finance", "Computers",
                      "Medical", "Internet"))

p <- industry.dat %>% 
    count(Q5, Q20) %>%
    mutate(Q20 = fct_reorder(Q20, n, .fun="sum")) %>%
    rename(title=Q5, Industry=Q20, count=n) %>%
    ggplot(aes(x=title, y=count)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    facet_wrap(~ Industry) +
    labs(
        title = "Users' work industry",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p)

ind.dis <- industry.dat %>% 
    filter(Q5 != "Statistician") %>%
    rename(Title = Q5, Industry = Q20)
ind.dis %>%
    group_by(Industry) %>%
    count(Title) %>%
    mutate(Proportion = round(prop.table(n), 3)) %>%
    select(-n) %>%
    pivot_wider(names_from = Industry, values_from = Proportion)

aca.dis <- ind.dis %>% filter(Industry == "Academics")
com.dis <- ind.dis %>% filter(Industry == "Computers")
fin.dis <- ind.dis %>% filter(Industry == "Finance")
int.dis <- ind.dis %>% filter(Industry == "Internet")
med.dis <- ind.dis %>% filter(Industry == "Medical")

ds.dis <- ind.dis %>% filter(Title == "Data Scientist")
da.dis <- ind.dis %>% filter(Title == "Data Analyst")
de.dis <- ind.dis %>% filter(Title == "Data Engineer")

my.prop.test <- function(target.dis, other.dis, title) {
    target <- sum(target.dis$Title == title)
    other <- sum(other.dis$Title == title)
    t.res <- prop.test(c(target, other),
          c(nrow(target.dis), nrow(other.dis)),
          alternative = "greater",
          correct = TRUE)
    return(paste0("X-squared = ", round(t.res$statistic,1), 
                  ", p-value = ", formatC(t.res$p.value, 
                                          format = "e", digits = 2),
                  ", 95% CI = (", round(t.res$conf.int[1], 2), 
                  ", ", round(t.res$conf.int[2], 2),"), ",
                  if_else(t.res$p.value < 0.05, "H0 rejected", 
                          "can't reject H0")))
}

my.prop.test2 <- function(ind.dis, title1, title2) {
    target <- sum(ind.dis$Title == title1)
    other <- sum(ind.dis$Title == title2)
    t.res <- prop.test(c(target, other),
          c(nrow(ind.dis), nrow(ind.dis)),
          alternative = "greater",
          correct = TRUE)
    return(paste0("X-squared = ", round(t.res$statistic,1), 
                  ", p-value = ", formatC(t.res$p.value, 
                                          format = "e", digits = 2),
                  ", 95% CI = (", round(t.res$conf.int[1], 2), 
                  ", ", round(t.res$conf.int[2], 2),"), ",
                  if_else(t.res$p.value < 0.05, "H0 rejected", 
                          "can't reject H0")))
}

p <- industry.dat %>% 
    filter(salary_lb != 1000000) %>%
    mutate(Q20 = fct_reorder(Q20, salary_lb, .fun='length')) %>%
    ggplot(aes(x=Q20, y=salary_lb)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~ Q5) +
    labs(
        title = "Users' salary vs industry",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_blank(),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

# Q7

programming <- job.dat %>% 
    select(c(Q5, starts_with("Q7_"))) %>%
    gather("fake_key", "language", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    filter(!language %in% c("None", "Other")) %>%
    count(title, language, .drop = FALSE) %>% 
    complete(title, language) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- programming %>% 
    mutate(text = paste0("Language: ", language, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(language, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite programming language",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

ide <- job.dat %>% 
    select(c(Q5, starts_with("Q9_"))) %>%
    gather("fake_key", "IDE", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(IDE = case_when(
        IDE == "Visual Studio Code (VSCode)" ~ "VSCode",
        IDE == "Jupyter (JupyterLab, Jupyter Notebooks, etc)" ~ "Jupyter Notebook",
        TRUE ~ IDE
    )) %>%
    filter(!IDE %in% c("None", "Other")) %>%
    count(title, IDE, .drop = FALSE) %>% 
    complete(title, IDE) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- ide %>% 
    mutate(text = paste0("IDE: ", IDE, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(IDE, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite IDE",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

pj.dat <- job.dat %>% 
        select(c(Q5, Q7_Part_1, Q9_Part_1, Q9_Part_11)) %>% 
        rename(Title = Q5) %>%
        mutate(Python = !is.na(Q7_Part_1),
               Jupyter = !is.na(Q9_Part_1) | !is.na(Q9_Part_11)) %>%
        select(Title, Python, Jupyter) %>%
        pivot_longer(cols = c(Python, Jupyter)) %>%
        filter(value == TRUE)
py.ju.test <- function(excludes) {
    pj.t.dat <- pj.dat %>% filter(!Title %in% excludes)
    print(table(pj.t.dat$name, pj.t.dat$Title))
    pj.t.res <- chisq.test(pj.t.dat$name, pj.t.dat$Title, 
                           simulate.p.value = T)
    return(paste0("X-squared = ", round(pj.t.res$statistic, 1), 
                  ", p-value = ", round(pj.t.res$p.value, 2),
                  if_else(pj.t.res$p.value < 0.05, ", H0 rejected", 
                          ", can't reject H0")))
}
table(pj.dat$Title, pj.dat$name)

# Q8

learning_platform <- job.dat %>%
    select(c(Q5, starts_with("Q40_"))) %>%
    gather("fake_key", "learning", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(learning = case_when(
        learning == "Cloud-certification programs (direct from AWS, Azure, GCP, or similar)" ~ "Cloud-certif Programs",
        learning == "University Courses (resulting in a university degree)" ~ "University",
        TRUE ~ learning
    )) %>%
    filter(!learning %in% c("None", "Other")) %>%
    count(title, learning, .drop = FALSE) %>%
    complete(title, learning) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- learning_platform %>%
    mutate(text = paste0("Platform: ", learning, "\n",
                         "Job title: ", title, "\n",
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(learning, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite learning platforms",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

share_deploy <- job.dat %>% 
    select(c(Q5, starts_with("Q39_"))) %>%
    gather("fake_key", "share", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    mutate(share = case_when(
        share == "I do not share my work publicly" ~ "\'PRIVATE\'",
        TRUE ~ share
    )) %>%
    filter(!share %in% c("Other")) %>%
    count(title, share, .drop = FALSE) %>% 
    complete(title, share) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n))

p <- share_deploy %>% 
    mutate(text = paste0("Platform: ", share, "\n", 
                         "Job title: ", title, "\n",
                         "Count: ", n, "\n",
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(share, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite share platforms",
        x = "",
        y = "",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

media_source <- job.dat %>% 
    select(c(Q5, starts_with("Q42_"))) %>%
    gather("fake_key", "media", -Q5, na.rm = T) %>%
    rename(title = Q5) %>%
    select(-fake_key) %>%
    filter(!media %in% c("None", "Other")) %>%
    count(title, media, .drop = FALSE) %>% 
    complete(title, media) %>%
    replace_na(list(n = 0)) %>%
    group_by(title) %>%
    mutate(prop = prop.table(n)) %>%
    separate(media, into = c("media", "media_suffix"), sep = " \\(")

p <- media_source %>% 
    mutate(text = paste0("Platform: ", media, "\n", 
                         "Job title: ", title, "\n", 
                         "Count: ", n, "\n", 
                         "Proportion: ", round(prop, 3))) %>%
    ggplot(aes(media, title, fill=prop, text=text)) +
    geom_tile() +
    scale_fill_gradient(low="white", high="royalblue") +
    labs(
        title = "Users' favorite media source",
        caption = glue("Author: celeritasML
                   Source: Kaggle")) +
    theme(axis.ticks.x = element_blank(),
          axis.text.x = element_text(angle=90, hjust=1),
          axis.title = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
ggplotly(p, tooltip="text")

# Q9

viz_lib <- dat %>%
  select(Q5, starts_with("Q14")) %>%
  select(-c(Q14_Part_11, Q14_OTHER))

viz_lib <- viz_lib %>%
  pivot_longer(cols=starts_with("Q14")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                   "Developer Relations/Advocacy",
                   "Currently not employed",
                   "Statistician", "Product Manager")))

ggplot(viz_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite viz libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                   Source: Kaggle")
  ) +
  theme(
        axis.ticks.x = element_line(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

ml_lib <- dat %>%
  select(Q5, starts_with("Q16")) %>%
  select(-c(Q16_Part_17, Q16_OTHER))

ml_lib <- ml_lib %>%
  pivot_longer(cols=starts_with("Q16")) %>%
  select(-name) %>%
  drop_na() %>%
  filter(!(Q5 %in% c("Other", "DBA/Database Engineer",
                   "Developer Relations/Advocacy",
                   "Currently not employed",
                   "Statistician", "Product Manager")))

top_10_ml <- ml_lib %>%
  group_by(value) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10)

ml_lib <- ml_lib %>%
  filter(value %in% top_10_ml$value)

ggplot(ml_lib) +
  geom_bar(aes(y = value, fill = value)) +
  facet_wrap(~ Q5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Data science practitioners' favorite ML libraries",
    x = "",
    y = "",
    caption = glue("Author: celeritasML
                   Source: Kaggle")
  ) +
  theme(
        axis.ticks.x = element_line(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

viz_lib %>%
  group_by(Q5) %>%
  table() %>%
  addmargins() %>%
  print()

ml_lib %>%
  group_by(Q5) %>%
  table() %>%
  addmargins() %>%
  print()

# Q10

aws_user <- tibble(
  ec2 = dat$Q29_A_Part_1,
  s3 = dat$Q30_A_Part_3,
  efs = dat$Q30_A_Part_4,
  sagemaker = dat$Q31_A_Part_1,
  redshift = dat$Q32_A_Part_11,
  aurora = dat$Q32_A_Part_12,
  rds = dat$Q32_A_Part_13,
  dynamodb = dat$Q32_A_Part_14
  
) %>%
  mutate(ec2 = if_else(is.na(ec2), 0, 1),
         s3 = if_else(is.na(s3), 0, 1),
         efs = if_else(is.na(efs), 0, 1),
         sagemaker = if_else(is.na(sagemaker), 0, 1),
         redshift = if_else(is.na(redshift), 0, 1),
         aurora = if_else(is.na(aurora), 0, 1),
         rds = if_else(is.na(rds), 0, 1),
         dynamodb = if_else(is.na(dynamodb), 0, 1))
chisq.test(aws_user$ec2, aws_user$s3)
chisq.test(aws_user$ec2, aws_user$efs)
chisq.test(aws_user$ec2, aws_user$sagemaker)

# Q11

cloud_comp <- tibble(
  aws_usage = dat$Q27_A_Part_1,
  azure_usage = dat$Q27_A_Part_2,
  gcp_usage = dat$Q27_A_Part_3
) %>%
  mutate(aws_usage = if_else(is.na(aws_usage), FALSE, TRUE),
         azure_usage = if_else(is.na(azure_usage), FALSE, TRUE),
         gcp_usage = if_else(is.na(gcp_usage), FALSE, TRUE))

prop.test(c(sum(cloud_comp$aws_usage), sum(cloud_comp$azure_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$azure_usage), sum(cloud_comp$gcp_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)
prop.test(c(sum(cloud_comp$gcp_usage), sum(cloud_comp$aws_usage)),
          c(nrow(cloud_comp), nrow(cloud_comp)),
          alternative = "two.sided",
          correct = TRUE)

# Q12

lm_dat <- dat %>%
  select(Q1, Q2, Q4, Q5, Q6,
         Q7_Part_1, Q7_Part_2, Q7_Part_3,
         Q25) %>%
  drop_na(Q25) %>%
  rename(age_group = Q1,
         gender = Q2,
         education = Q4,
         job = Q5,
         experience = Q6,
         salary = Q25,
         python = Q7_Part_1,
         r = Q7_Part_2,
         sql = Q7_Part_3)

set.seed(511)

# - poverty: below the federal minimum wage
# - low: 40,000 to 79,999
# - medium: 80,000 to 124,999
# - high: 125,000 to 199,999
# - very high: 200,000 to 499,999
# - highest: >= 500,000

lm_dat <- lm_dat %>%
  rowwise() %>%
  mutate(salary = case_when(
    salary == "poverty" ~ sample(1:39999, 1),
    salary == "low" ~ sample(40000:79999, 1),
    salary == "medium" ~ sample(80000:124999, 1),
    salary == "high" ~ sample(125000:199999, 1),
    salary == "very high" ~ sample(200000:499999, 1),
    salary == "highest" ~ sample(500000:2000000, 1)
  )) %>%
  ungroup()

ggplot(lm_dat, aes(x=salary)) +
  geom_histogram(binwidth = 5000) +
  labs(
        title = "Histogram of salaries",
        subtitle = "raw data without transformation",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

ggplot(lm_dat, aes(x=log(salary))) +
  geom_histogram(binwidth = 0.05) +
  labs(
        title = "Histogram of log(salary)",
        subtitle = "after log-transformation on response",
        caption = glue("Author: celeritasML
                   Source: Kaggle"))

lm_dat <- lm_dat %>%
  mutate(python = if_else(is.na(python), 0, 1),
         r = if_else(is.na(r), 0, 1),
         sql = if_else(is.na(sql), 0, 1))

model1 <- lm(log(salary) ~ . + sql:r + sql:python + r:python + python:r:sql +
               job:sql + job:r + job:python + job:experience +
               job:education, data=lm_dat)
model2 <- MASS::stepAIC(model1, trace = FALSE)

anova(model2)
summary(model2)