Data Science Essentials In Python: Collect - Or... Fix
Go from messy, unstructured artifacts stored in SQL and NoSQL databases to a neat, well-organized dataset with this quick reference for the busy data scientist. Understand text mining, machine learning, and network analysis; process numeric data with the NumPy and Pandas modules; describe and analyze data using statistical and network-theoretical methods; and see actual examples of data analysis at work. This one-stop solution covers the essential data science you need in Python.
Data Science Essentials in Python: Collect - Or...
Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data-analysis-specific language such as MATLAB or R.
This book will delve directly into Python for data science, providing you with a straight and fast route to solving various data science problems using Python and its powerful data analysis and machine learning packages. The code examples that are provided in this book don't require you to be a master of Python. However, they will assume that you at least know the basics of Python scripting, including data structures such as lists and dictionaries, and the workings of class objects. If you don't feel confident about these subjects or have minimal knowledge of the Python language, before reading this book, we suggest that you take an online tutorial. There are good online tutorials that you may take, such as the one offered by the Code Academy course at -python, the one by Google's Python class at , or even the Whirlwind tour of Python by Jake Vanderplas ( ). All the courses are free, and, in a matter of a few hours of study, they should provide you with all the building blocks that will ensure you enjoy this book to the fullest. In order to provide an integration of the two aforementioned free courses, we have also prepared a tutorial of our own, which can be found in the appendix of this book.
In any case, don't be intimidated by our starting requirements; mastering Python enough for data science applications isn't as arduous as you may think. It's just that we have to assume some basic knowledge on the reader's part because our intention is to go straight to the point of doing data science without having to explain too much about the general aspects of the Python language that we will be using.
Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modeling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
In addition, other programming languages such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, only Python really completes your data scientist skill set with all the key techniques in a scalable and effective way. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp, no matter what your background or experience is.
First, let's proceed and introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.
Conda can help you manage two tasks: installing packages and creating virtual environments. In this paragraph, we will explore how conda can help you easily install most of the packages you may need in your data science projects.
As an example, we can create an environment based on Python Version 3.6, having all the necessary Anaconda-packaged libraries installed. This makes sense, for instance, when installing a particular set of packages for a data science project. In order to create such an environment, just perform the following:
The packages that we are now going to introduce are strongly analytical and they will constitute a complete data science toolbox. All of the packages are made up of extensively tested and highly optimized functions for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided in the following section.
This is a GitHub project that easily allows you to create a report from a pandas DataFrame. The package will present the following measures in an interactive HTML report, which is used to evaluate the data at hand for a data science project:
Started as part of SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout this book. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRIA ( Institut national de recherche en informatique et en automatique, that is the French Institute for Research in Computer Science and Automation):
Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrap out data from HTML and XML files that are retrieved from the internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful:
A scientific approach implies fast experimentation of different hypotheses in a reproducible fashion (as does data exploration and analysis in data science), and when using this interface, you will be able more naturally to implement an explorative, iterative, trial and error research strategy during your code writing.
The commands will install the devtools library on your R, then pull and install all the necessary libraries from GitHub (you need to be connected to the internet while running the other commands), and finally register the R kernel both in your R installation and on Jupyter. After that, every time you call the Jupyter Notebook, you will have the choice of running either a Python or an R kernel, allowing you to use the same format and approach for all your data science projects.
After loading the observations and their features, in order to provide a demonstration of how Jupyter can effectively support the development of data science solutions, we will perform some transformations and analysis on the dataset. We will use classes, such as SelectKBest, and methods, such as .getsupport() or .fit(). Don't worry whether these are not clear to you now; they will all be covered extensively later in this book. Try to run the following code:
If you don't like using Jupyter, there are actually a few alternatives that can help you test the code you will find in this book. If you have experience with R, the RStudio ( ) layout may appeal more to you. In this case, Yhat, a company providing data science solutions for decision APIs, offers their data science IDE for Python free of charge, named Rodeo ( ). Rodeo works by using the IPython kernel of Jupyter under the hood, yet it is an interesting alternative given its different user interface.
As we progress through the concepts presented in this book, in order to facilitate the reader's understanding, learning, and memorizing processes, we will illustrate practical and effective data science Python applications on various explicative datasets. The reader will always be able to immediately replicate, modify, and experiment with the proposed instructions and scripts on the data that we will use in this book.
As for the code that you are going to find in this book, we will limit our discussions to the most essential commands in order to inspire you from the beginning of your data science journey with Python to do more with less by leveraging key functions from the packages we presented beforehand.
We encourage you to experiment a lot with this dataset and with similar ones before you work on other complex real data because the advantage of focusing on an accessible, non-trivial data problem is that it can help you to quickly build your foundations on data science.
After a while anyway, though they are useful and interesting for your learning activities, toy datasets will start limiting the variety of different experimentations that you can achieve. In spite of the insights provided, in order to progress, you'll need to gain access to complex and realistic data science topics. Consequently, we will have to resort to some external data.
LIBSVM Data ( cjlin/libsvmtools/datasets/) is a page that gathers data from many other collections. It is maintained by Chih-Jen Lin, one of the authors of LIBSVM, a support vector machines learning algorithm for predictions (Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011). This offers different regression, binary, and multilabel classification datasets that are stored in the LIBSVM format. This repository is quite interesting if you wish to experiment with the support vector machine's algorithm, and, again, it is free for you to download and use the data.
In the next chapter, Data Munging, we will have an overview of the data science pipeline and explore all the key tools to handle and prepare data before you apply any learning algorithm and set up your hypothesis experimentation schedule. 041b061a72