Resources

In the past, I lost track of reality trying to track a gazillion links covering every data-science-friendly programming language under the sun. (Shakes head.) Bad idea. Since I program in MATLAB and Python daily, I like to keep track of Python and VStudio developments. I’m mostly going to share Python and robotics/control dynamics resources that I find useful for analytics, statistical programming, machine learning, data science workflows, robot modelling, and web app development.

I’ve been enjoying MATLAB a lot more recently, so I’ll slowly build up this resources page with MATLAB sub-topics that I find bookmark-worthy.

If you’re looking for the best place to start with data analysis, I recommend learning SQL. It is by far the most widely used data querying language across the corporate and academic landscapes, and if you master SQL, you’ve mastered most of the transformations that are possible for tabular data sets. Nonetheless, I will not cover SQL resources here as I rarely write raw SQL anymore. Instead, I use Python to connect to data warehouses and run my queries through the popular pandas library, with SQLAlchemy handling the database connection (or pandasql when I want to run SQL directly against DataFrames).
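As a rough illustration of the query-from-Python pattern described above, here is a minimal sketch that uses only the standard-library sqlite3 module as a stand-in for a real warehouse connection. The `sales` table, its columns, and the `run_query` helper are hypothetical names invented for this example; in my own setup a SQLAlchemy engine and pandas would take their place.

```python
import sqlite3


def run_query(conn, sql, params=()):
    """Run a SQL query and return the rows as a list of dicts (hypothetical helper)."""
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)
conn.commit()

# A typical tabular transformation: filter, group, aggregate, and sort.
totals = run_query(
    conn,
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= ?
    GROUP BY region
    ORDER BY total DESC
    """,
    (50.0,),
)
print(totals)
```

If pandas is installed, `pandas.read_sql_query(sql, conn)` should give you the same rows as a DataFrame instead of a list of dicts.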

Python and R are open-source programming languages widely used for statistical computing and graphics. These two languages have friendly online (and in-person) communities devoted to making data science easier to consume, easier to apply, and more effective at solving business problems. One of the things that I like most about both languages is the thousands of packages available, which make almost everything in R or Python a little easier, from ETL to method chaining to developing predictive models and interactive web apps. I certainly welcome any suggestions that you might have for the lists below!

Language-Agnostic ETL Frameworks

Arrow: Apache Arrow is a columnar memory format for flat and hierarchical data, organized for efficient analytic operations, supporting zero-copy reads for lightning-fast data access without serialization overhead
DuckDB: DuckDB is an in-process SQL OLAP database management system (that plays nicely with Arrow) capable of larger-than-memory processing of tabular data
Polars: Polars is a lightning-fast DataFrame library/in-memory query engine written in Rust and built upon the Arrow specification; it’s a great tool for efficient data wrangling, data pipelines, snappy APIs and much more
Spark: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters

Python Books

The Quick Python Book (3e): This book by Naomi Ceder is a few years old now (2018), but it’s the best end-to-end intro to Python that I’ve read so far, taking you from basic classes / structures to function writing to working with modules
Python Data Science Handbook: Introduction to the core libraries essential for working with data in Python
Effective Pandas: Patterns for Data Manipulation: Easy to follow tutorials, at your own pace, for mastering the popular Pandas library
Tidy Finance with Python: This is one of my favorite newer books covering complex financial modeling, valuation, and pricing and represents “an opinionated approach to empirical research in financial economics [with an] open-source code base”

Python Packages

NumPy: Brings the computational power of C and Fortran to Python programmers for applying high-level mathematical functions to arrays and more
Pandas: This is the most popular package for data manipulation and analysis with extended operations available for tabular and time series data
Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python
scikit-learn: Built on top of NumPy, SciPy, and matplotlib, “sklearn” makes the development of predictive analysis workflows a simple and reproducible process
Beautiful Soup: The beautifulsoup4 library makes web scraping HTML and XML data a breeze
Streamlit: Using pure Python, this package lets you build interactive web apps in minutes with no UI / front-end experience required
Shiny for Python: The popular Shiny framework for R is finally available for Python. Create highly interactive visualizations, realtime dashboards, data explorers, model demos, sophisticated workflow apps, and anything else you can imagine, all in pure Python, with no web development skills required

R Books: Applied Resources

Tidy Modeling with R: Over the last few months, I’ve learned a lot from this A to Z resource on predictive modeling workflows using the tidymodels framework
Deep Learning with R (2e): In-depth introduction to artificial intelligence and deep learning applications with R using the Keras library
Forecasting: Principles and Practice (3e): Said best by the author, “The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective”
Regression and Other Stories: Super applied textbook on advanced regression techniques, Bayesian inference, and causal inference
Supervised Machine Learning for Text Analysis in R: Written by two Posit software engineers, Emil Hvitfeldt and Julia Silge, this book is a masterclass in natural language processing, taking you from the basics of NLP to real-life applications including inference and prediction
Tidy Finance with R: This is one of my favorite newer books covering complex financial modeling, valuation, and pricing and represents “an opinionated approach to empirical research in financial economics [with an] open-source code base in multiple programming languages”

R Packages

tidyverse: A collection of packages for data manipulation and functional programming (I use dplyr, stringr, and purrr on a daily basis)
tidymodels: Hands-down my preferred collection of packages for building reproducible machine learning recipes, workflows, model tuning, model stacking, and cross-validation
tidyverts: A collection of packages for time series analysis that comes out of Rob Hyndman’s lab
DT: This is an R implementation of the popular DataTables JavaScript library that lets you build polished, configurable tables for use in web reports, slides, and Shiny apps
bs4Dash: This R Shiny framework brings Bootstrap + AdminLTE dependencies to Shiny (including 1:1 support for shinydashboard functions) and it’s my go-to for developing enterprise-grade Shiny apps
leaflet: R implementation of the popular Leaflet JavaScript library for developing interactive maps
plotly: An extensive graphic library for creating interactive visualizations and 3D (WebGL) charts
embed: This is one of my go-to packages for machine learning, and if I’m working on a classification problem, you can count on me incorporating some of the extra steps it provides for the recipes package for embedding predictors into one or more numeric columns