A comprehensive introduction to Data Science. An exploration of the basics
A comprehensive introduction to Data Science. David Sierra Porta (c) 2022
This (little) book is designed to introduce the basic elements of data science through the kinds of problems the discipline usually addresses. We try to cover many topics in a self-contained way, so that the reader learns step by step through modules that grow in complexity and in the range of techniques and methodologies covered. A PDF version of this content is in preparation; it is not yet available but will hopefully be completed soon.
In recent years, industry, academia, and government have been calling for more capable and productive data science professionals, and demand for them in industry has been growing rapidly. This book presents concepts and skills that can help you meet the challenges of real-world data analysis. It covers concepts of probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as Python programming, data management with Pandas, data visualization with Matplotlib (including Pyplot), Seaborn, and Bokeh, algorithm construction, file organization with the UNIX/Linux shell, version control with Git and GitHub, and preparing reproducible documents with Overleaf and Python Markdown. The book is divided into six parts: an introduction to Python, Data Visualization, Data Management, Statistics with Python, Machine Learning, and Productivity Tools. Each part has several chapters, each designed to be presented as a lecture, with dozens of exercises distributed throughout.
This book is written to be self-contained, and all of the material is available in a GitHub repository as a complete course in Data Science. No previous knowledge of Python is required, although some programming experience may be helpful. The statistical concepts used to answer the case study questions are only briefly introduced, so a Probability and Statistics textbook is highly recommended for an in-depth understanding of these concepts. If you read and understand all the chapters and complete all the exercises, you will be well positioned to perform basic data analysis tasks and prepared to learn the more advanced concepts and skills needed to become an expert.
We start by going over the basics of Python and the NumPy and Pandas libraries. You learn Python throughout the book, but in the first part we go over the building blocks needed to keep learning. The growing availability of informative datasets and software tools has led to increased reliance on data visualizations in many fields. In the second part we demonstrate how to use Matplotlib to generate graphs and describe important data visualization principles. In the third part we demonstrate the importance of statistics in data analysis by answering case study questions using probability, inference, and regression with Python. The fourth part uses several examples to familiarize the reader with data wrangling. Among the specific skills we learn are web scraping, using regular expressions, and joining and reshaping data tables. In the fifth part we present several challenges that lead us to introduce machine learning. We learn to use a machine learning package, such as scikit-learn, to build prediction algorithms including K-nearest neighbors and random forests. In the final part, we provide a brief introduction to the productivity tools we use on a day-to-day basis in data science projects: Jupyter Notebook, the UNIX/Linux shell, Git and GitHub, Overleaf, and Python Markdown.
A table of contents with the course materials follows.
A few very basic tools and syntax in Python - Python_Basics