Key Skills for Data Science Jobs, First Half of 2020

This page lists the skills required for the main Data Science job roles (Big Data Software Engineer / Data Engineer, Data Scientist, Machine Learning Engineer, Data Analyst, NLP Engineer / NLP Data Scientist, CV Engineer, Deep Learning Engineer / Deep Learning Research Engineer) in Ukraine. These job postings were analyzed as part of a labor market study covering the first half of 2020.

Big Data Software Engineer / Data Engineer

  1. Linear algebra. Calculus. Statistics and Probability Theory.
  2. Machine Learning Algorithms: regression, simulation, scenario analysis, modeling, clustering, decision trees, etc.
  3. Python 3, Pandas, scikit-learn, Keras, TensorFlow, NumPy, PyTorch.
  4. Data visualization.
  5. Software engineering methodologies, functional programming or object-oriented programming.
  6. DevOps: containerization and orchestration.
  7. Classic DBs (relational or object): MySQL, PostgreSQL, RDS.
  8. NoSQL (document, key-value, wide-column): MongoDB, Cassandra, HBase, Elasticsearch, Redis, DynamoDB.
  9. NewSQL (hybrid/in-memory): MemSQL, VoltDB.
  10. Query engines: Impala, Presto.
  11. Cloud platforms (GCP, AWS). Cloud computation (Dataflow, Dataproc). Streaming (Pub/Sub, Kafka). Data storage (BigQuery, Cloud SQL, Cloud Spanner, Firestore, Bigtable).
  12. ETL Concepts / Processes.
  13. Data Warehouse technologies, Data Lake architecture.
  14. Data modeling: Bachman diagrams, Chen’s Notation, Object-relational mapping, etc.
  15. Processing frameworks: Apache Spark (PySpark/SparkR/sparklyr), Flink, Beam, Kafka Streams.
  16. Data pipeline and workflow management tools: Azkaban, Luigi, Airflow, etc.

Data Scientist

  1. Python (PyCharm, Pandas, NumPy, bs4, sklearn, scipy). R.
  2. Linear algebra. Calculus. Statistics.
  3. Machine Learning techniques (Decision Trees, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and concepts: regression and classification, clustering, feature selection, feature engineering, the curse of dimensionality, bias-variance tradeoff.
  4. Data visualization.
  5. Data Mining (Clustering, Frequent Pattern Mining, Outliers Detection).
  6. Neural Networks and ML Packages (sklearn, XGBoost, TensorFlow, Keras, H2O).
  7. Cloud platforms (GCP, AWS). Cloud computation (Dataflow, Dataproc). Streaming (Pub/Sub, Kafka). Data storage (BigQuery, Cloud SQL, Cloud Spanner, Firestore, Bigtable).
  8. Databases: SQL and NoSQL, AWS cloud storage, GDPR data privacy.
  9. Processing frameworks: Hadoop, Spark.
  10. Business Intelligence Software (Power BI, Tableau, Qlik, Cognos Analytics).

Machine Learning Engineer

  1. Computer science fundamentals, algorithms, mathematics, linear algebra, probability, and statistics.
  2. Python (Pandas, NumPy, scikit-learn, TensorFlow, Keras).
  3. Python visualization tools: matplotlib/seaborn, Plotly.
  4. Machine Learning techniques (Decision Trees, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and concepts: regression and classification, clustering, feature selection, feature engineering, the curse of dimensionality, bias-variance tradeoff.
  5. Deep Learning: Recurrent Neural Network (LSTM/GRU units), Convolutional Neural Network.
  6. Machine learning frameworks (TensorFlow, Caffe2, PyTorch, Spark ML, scikit-learn) and ML techniques: GAN, ASR, RL.
  7. Databases: SQL and NoSQL. Hadoop ecosystem.
  8. Processing frameworks: Apache Spark (PySpark/SparkR/sparklyr).
  9. Cloud platforms (GCP, AWS).

Data Analyst 

  1. Math, Statistics (regression, properties of distributions, statistical tests and their proper usage, etc.) and Probability Theory.
  2. Statistical programming software (R, Python, SAS, Matlab).
  3. Predictive analytics (regression models, time-series analysis and forecasting, survival or duration analysis).
  4. BI tools: Google Data Studio / Microsoft Power BI / Tableau.
  5. Classic DBs: MySQL.
  6. MS Excel.
  7. A/B testing.

NLP Engineer / NLP Data Scientist 

  1. Python (sklearn, nltk, gensim, spacy, TensorFlow, PyTorch, Keras) and Python Data Science toolkit: Jupyter Notebook, Pandas, NumPy, Matplotlib/Seaborn, SciPy.
  2. Databases: SQL and NoSQL (MySQL, MongoDB, PostgreSQL).
  3. NLP libraries: NLTK, SpaCy, Stanford CoreNLP, etc.
  4. NLP techniques for text representation (TF-IDF, Word2Vec), semantic extraction, data structures and modeling.
  5. Methods of Information Extraction (NER, terminology extraction, keyword extraction, etc.).
  6. Machine Learning techniques and concepts (regression, trees, SVM, ensembles) for NLP tasks.

CV Engineer

  1. Linear Algebra. Geometry. Calculus. Statistics and Probability theory.
  2. Python 3, NumPy, pandas, seaborn, SciPy.
  3. Computer vision / image processing libraries such as OpenCV and Pillow.
  4. Neural network architectures: Convolutional Neural Networks (Inception, residual), LSTM, GAN.
  5. Neural network frameworks: TensorFlow, PyTorch.
  6. Computer vision algorithms and architectures: object detection, segmentation, face recognition, image processing, video processing.
  7. Real-time CV systems based on Deep Learning.
  8. Cloud model training (GCP, AWS), Cloud integration, Cloud Platforms.
  9. Performance metrics in object detection and classification, such as mAP and related metrics.
  10. Big Data (Hadoop, Spark, Hive).

Deep Learning Engineer / Deep Learning Research Engineer

  1. Python 3: NumPy, scikit-learn, pandas, SciPy.
  2. Statistics (regression, properties of distributions, statistical tests and their proper usage, etc.) and probability theory.
  3. Deep learning frameworks: TensorFlow, PyTorch, MXNet, Caffe, Keras.
  4. Deep learning architectures: VGG, ResNet, Inception, MobileNet.
  5. Deep neural networks: hyperparameter optimization, visualization, interpretation.
  6. Machine learning models.