- Active learning!!!
- Read the code
- Manage time
- Block of study
- Search Google and read the documentation
Data engineering makes data available to end users for purposes such as analytics, model building, and app development.
MOVE/STORE: Reliable data flow, infrastructure, pipelines, ETL, structured and unstructured data storage;
EXPLORE/TRANSFORM: Cleaning, anomaly detection, preparation.
Common Activities
- Ingest Data from a data source
- Build and maintain a data warehouse
- Create a data pipeline
- Create an analytics table for a specific use case
- Migrate data to the cloud
- Schedule and automate pipelines
- Backfill data
- Debug data quality issues
- Optimize queries
- Design a database
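Several of the activities above (ingesting data, building a pipeline, backfilling) can be sketched in a few lines. This is only a minimal illustration using Python's standard `sqlite3` module; the `SOURCE_ROWS` data, the `daily_events` table, and all function names are hypothetical stand-ins for a real source system and warehouse. The load step deletes a day's rows before inserting them, so running a backfill twice does not duplicate data.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical source data: (event_date, user_id, amount).
# In practice this would come from an API, a file drop, or another database.
SOURCE_ROWS = [
    ("2024-01-01", "u1", 10.0),
    ("2024-01-01", "u2", 5.5),
    ("2024-01-02", "u1", 7.0),
    ("2024-01-03", "u3", None),  # bad record: missing amount
]


def extract(day):
    """Ingest the rows for a single day from the (hypothetical) source."""
    return [row for row in SOURCE_ROWS if row[0] == day]


def transform(rows):
    """Drop records with a missing amount."""
    return [row for row in rows if row[2] is not None]


def load(conn, day, rows):
    """Idempotent load: replace the whole day inside one transaction."""
    with conn:
        conn.execute("DELETE FROM daily_events WHERE event_date = ?", (day,))
        conn.executemany("INSERT INTO daily_events VALUES (?, ?, ?)", rows)


def run_for_day(conn, day):
    load(conn, day, transform(extract(day)))


def backfill(conn, start, end):
    """Re-run the pipeline for every day in [start, end]."""
    day = start
    while day <= end:
        run_for_day(conn, day.isoformat())
        day += timedelta(days=1)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_events (event_date TEXT, user_id TEXT, amount REAL)"
)

backfill(conn, date(2024, 1, 1), date(2024, 1, 3))
backfill(conn, date(2024, 1, 1), date(2024, 1, 3))  # safe to run again

row_count = conn.execute("SELECT COUNT(*) FROM daily_events").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM daily_events").fetchone()[0]
print(row_count, total)
```

The delete-then-insert pattern is one common way to make a daily load repeatable; an upsert keyed on a unique identifier would be another option.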
- Python
- SQL
- Data warehouse modeling
- Multidimensional modeling
- Normal forms
- Linux and cron jobs
- Software engineering for building orchestration routines
- An orchestration tool such as Airflow or Luigi would also be good
https://github.com/igorbarinov/awesome-data-engineering
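To give a feel for what an orchestration routine does, here is a minimal sketch that runs tasks in dependency order with Python's standard `graphlib` module (Python 3.9+). It is not the Airflow or Luigi API; the task names and the `run_log` list are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def make_task(name):
    """Build a placeholder task that only records its own name."""
    def task():
        run_log.append(name)
    return task


# Hypothetical pipeline steps.
tasks = {name: make_task(name) for name in ("extract", "clean", "load", "report")}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}

# static_order() yields every task after all of its dependencies.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

Saved as a script, this could be scheduled with a cron entry such as `0 2 * * * python3 /path/to/pipeline.py` (the path is a placeholder), which would run it every day at 02:00. Dedicated orchestrators add retries, logging, and backfills on top of this basic idea.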
-
Databases
- SQL ANSI/92
- Relational: execution plans, caching, indexing, index fragmentation, locking (transaction isolation), and other properties. PostgreSQL/MySQL and SQLite (https://www.sqlite.org/quirks.html)
- Know at a high level the characteristics of the other database types and a few examples of each: key-value, columnar, and document
- Study: pick one database of one of these types and go a bit deeper. E.g., Redis (https://github.com/fclesio/learning-space/blob/master/Redis/aula-redis.aula)
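As a small hands-on exercise for the execution plan and indexing topics above, the sketch below compares SQLite's query plan before and after creating an index. The `orders` table, its columns, and the index name are made up for the example, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def plan(conn):
    """Return the human-readable detail text of the query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " | ".join(row[3] for row in rows)  # fourth column holds the detail


plan_before = plan(conn)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(conn)   # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

PostgreSQL and MySQL expose the same idea through their own `EXPLAIN` commands, with different output formats.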
-
Batch Processing
- Know at a high level the differences between the tools
- Study: choose between Hadoop and Spark. Recommendation: Spark - https://spark.apache.org/. An older example: https://github.com/fclesio/learning-space/blob/master/Spark%20Structured%20Streaming/default-project/spark-structured-streaming/src/main/scala/com/movile/stream/SparkSQL.scala
-
File System
- Get to know HDFS
- Get to know AWS S3 and AWS Glacier (https://aws.amazon.com/glacier/)
-
Database modeling (Transactional and OLAP)
- Study the 6 normal forms (https://pt.wikipedia.org/wiki/Normaliza%C3%A7%C3%A3o_de_dados)
- Referential integrity and transactional modeling: http://www.methodsandtools.com/archive/archive.php?id=9
- Dimensional modeling: The Data Warehouse Lifecycle Toolkit: how to manage your DWH considering not only the logical and physical modeling but also maintenance aspects. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-lifecycle-toolkit/
- The Data Warehouse Toolkit: the basics of DW modeling. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
- ETL: The Data Warehouse ETL Toolkit. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-etl-toolkit/
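To make dimensional modeling concrete, here is a minimal star schema sketch in SQLite: one fact table surrounded by two dimension tables, plus a typical analytical query. All table names, columns, and rows are invented for illustration; a real design following the books above would involve many more decisions (surrogate keys, slowly changing dimensions, grain).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
-- The fact table holds measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date (date_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_date VALUES
    (20240101, '2024-01-01', '2024-01'),
    (20240201, '2024-02-01', '2024-02');
INSERT INTO dim_product VALUES
    (1, 'Keyboard', 'Hardware'),
    (2, 'Mouse', 'Hardware'),
    (3, 'Editor license', 'Software');
INSERT INTO fact_sales VALUES
    (20240101, 1, 2, 100.0),
    (20240101, 2, 1, 20.0),
    (20240201, 3, 5, 250.0),
    (20240201, 1, 1, 50.0);
""")

# Typical OLAP-style question: revenue per month and product category.
revenue_by_month_category = conn.execute("""
    SELECT d.month, p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()

for row in revenue_by_month_category:
    print(row)
```

In a transactional (normalized) model the same data would usually be split to avoid redundancy; the star schema trades some redundancy for simpler, faster analytical queries.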
-
Stream Processing
- Spark (https://spark.apache.org/streaming/) or Flink (https://flink.apache.org/). Recommendation: Apache Flink.
-
Testing
- Great Expectations: https://greatexpectations.io/
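The kind of check a data testing tool automates can be sketched in plain Python. The functions below are hypothetical helpers written for this example, not the Great Expectations API; they only illustrate the idea of declaring expectations about a dataset and collecting the rows that violate them.

```python
def expect_not_null(rows, column):
    """Fail for every row where the column is missing or None."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Fail for every non-null value outside [low, high]."""
    failed = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} between {low} and {high}", "success": not failed, "failed_rows": failed}


def expect_unique(rows, column):
    """Fail for every row whose value was already seen earlier."""
    seen, failed = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failed.append(i)
        seen.add(value)
    return {"check": f"{column} is unique", "success": not failed, "failed_rows": failed}


# Made-up sample data with one problem per check.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 210},
]

results = [
    expect_not_null(rows, "age"),
    expect_between(rows, "age", 0, 120),
    expect_unique(rows, "id"),
]

for result in results:
    status = "ok" if result["success"] else f"FAILED rows {result['failed_rows']}"
    print(f"{result['check']}: {status}")
```

Running checks like these as a pipeline step, and failing the run when any expectation fails, is one way to catch data quality issues before they reach downstream tables.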
- Beginning Databases With PostgreSQL - From Novice To Professional
- Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming
- Spark: The Definitive Guide: Big Data Processing Made Simple
- PostgreSQL: Up and Running: A Practical Guide to the Advanced Open Source Database
- Hadoop: The Definitive Guide, 4th Edition: Storage and Analysis at Internet Scale
- The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling
- Redis in Action
- The Definitive Guide to SQLite (Second Edition)
- The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
- Using SQLite
- Refactoring Databases: Evolutionary Database Design (Addison-Wesley Signature Series (Fowler))
1. Udacity Nanodegree
- De Cloud