Data Science Jobs Salary Prediction

Analyse Data Science job salaries using statistical modelling to identify patterns and answer key questions

Salaries in the data science industry vary widely depending on factors such as company size and location. It can be difficult to know which job offers the highest, or the most realistic, salary, so a model that predicts salary from these factors can save job seekers the time and trouble of searching for the right role. For example, someone who has been offered a position at a small company could use the predictor to see what a typical data science salary looks like at companies of that size. If they also plan to work full time, they could check how much salaries differ between full-time and part-time employees.

The objective of this project is to predict salaries for different job titles in the data science workforce based on features such as job title, employment type and employee residence. A second objective is to find which features are the most effective at predicting salary. To achieve this, we create a variety of data visualisation plots to represent the relevant data and to make informed decisions about which features to use. The data is also cleaned and preprocessed before modelling.
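As a rough illustration of how a single feature's predictive value could be gauged, the sketch below predicts each salary as the mean salary of its category and reports the mean absolute error (MAE). This is a simple baseline, not the statistical model used in the project; the records, the field names (`company_size`, `experience_level`, `salary_usd`) and the helper `group_mean_mae` are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; not taken from the project's dataset.
records = [
    {"company_size": "S", "experience_level": "EN", "salary_usd": 50_000},
    {"company_size": "S", "experience_level": "SE", "salary_usd": 90_000},
    {"company_size": "M", "experience_level": "EN", "salary_usd": 60_000},
    {"company_size": "M", "experience_level": "SE", "salary_usd": 110_000},
    {"company_size": "L", "experience_level": "EN", "salary_usd": 70_000},
    {"company_size": "L", "experience_level": "SE", "salary_usd": 120_000},
]


def group_mean_mae(rows, feature, target="salary_usd"):
    """MAE when each row is predicted by the mean target of its category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    group_means = {key: mean(values) for key, values in groups.items()}
    return mean(abs(row[target] - group_means[row[feature]]) for row in rows)


# Lower MAE means the feature's categories explain more of the salary spread.
scores = {
    feature: group_mean_mae(records, feature)
    for feature in ("company_size", "experience_level")
}
best_feature = min(scores, key=scores.get)
print(scores, best_feature)
```

The error here is measured on the same rows used to compute the group means, so it is optimistic; a held-out split would give a fairer comparison.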

Salary Histogram

Generally, employees were more likely to work remotely during 2020 and 2021 and started to transition back to face-to-face work in 2022. The exception is medium-sized companies, where a large percentage of people transferred to remote work. All forms of work show a slight increase over time, although hybrid work (part remote, part face-to-face) shows the lowest increase of all the options. Fully remote employees also tended to earn more than employees who worked partially remotely or not remotely at all.

Salary Boxplot

Salary Lineplot

Conclusion

Our initial goal from phase 1 was to create an effective and accurate statistical model that predicts the salaries of different data science job titles, giving users an estimate of what their salary would be depending on a range of factors. Comparing our full model and reduced model, we found that the reduced model's predictions were closest to the actual salaries, so we chose it for predicting salaries. We identified which features were most important for prediction and used them in the reduced model. Narrowing the dataset to these features and removing insignificant ones is how we ultimately achieved our goals.
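One way to make "closest to the actual salaries" concrete is to compare the two models with an error metric. The sketch below uses root mean squared error (RMSE), which is an assumption; the summary above does not state which metric the report used. All salary and prediction values are hypothetical and are not results from the project.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(sum(squared_errors) / len(actual))


# Hypothetical values for illustration only; not results from the report.
actual_salaries = [100_000, 120_000, 80_000, 150_000]
full_model_preds = [90_000, 135_000, 95_000, 140_000]
reduced_model_preds = [98_000, 125_000, 85_000, 145_000]

errors = {
    "full": rmse(actual_salaries, full_model_preds),
    "reduced": rmse(actual_salaries, reduced_model_preds),
}
# The model with the smaller RMSE tracks the actual salaries more closely.
closest_model = min(errors, key=errors.get)
print(errors, closest_model)
```

With these made-up numbers the reduced model has the lower error, mirroring the kind of comparison described above; the real figures are in the full report.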

Predicted Salary Reduced Model

For the full report, please visit the repository for this project:

https://github.com/labelenn/Data-Science-Salary-Prediction