Quick, Effective, and Productive Data Science

Book Description:

Discover how to use the popular RStudio IDE as a professional tool that includes code refactoring support, debugging, and Git version control integration. This book gives you a tour of RStudio and shows you how it helps you do exploratory data analysis; build data visualizations with ggplot; and create custom R packages and web-based interactive visualizations with Shiny. In addition, you will cover common data analysis tasks including importing data from diverse sources such as SAS files, CSV files, and JSON. You will map out the features in RStudio so that you will be able to customize RStudio to fit your own style of coding.

Finally, you will see how to save a ton of time by adopting best practices and using packages to extend RStudio. Learn RStudio IDE is a quick, no-nonsense tutorial of RStudio that will give you a head start to develop the insights you need in your data science projects.

What You Will Learn

  • Quickly, effectively, and productively use RStudio IDE for building data science applications
  • Install RStudio and program your first Hello World application
  • Adopt the RStudio workflow
  • Make your code reusable using RStudio
  • Use RStudio and Shiny for data visualization projects
  • Debug your code with RStudio
  • Import CSV, SPSS, SAS, JSON, and other data

Who This Book Is For

Programmers who want to start doing data science, but don’t know what tools to focus on to get up to speed quickly.

Data Mining Facebook, Twitter, LinkedIn, Instagram, GitHub, and More

Book Description:

Mine the rich data tucked away in popular social websites such as Twitter, Facebook, LinkedIn, and Instagram. With the third edition of this popular guide, data scientists, analysts, and programmers will learn how to glean insights from social media—including who’s connecting with whom, what they’re talking about, and where they’re located—using Python code examples, Jupyter notebooks, or Docker containers.

In part one, each standalone chapter focuses on one aspect of the social landscape, including each of the major social sites, as well as web pages, blogs and feeds, mailboxes, GitHub, and a newly added chapter covering Instagram. Part two provides a cookbook with two dozen bite-size recipes for solving particular issues with Twitter.

  • Get a straightforward synopsis of the social web landscape
  • Use Docker to easily run each chapter’s example code, packaged as a Jupyter notebook
  • Adapt and contribute to the code’s open source GitHub repository
  • Learn how to employ best-in-class Python 3 tools to slice and dice the data you collect
  • Apply advanced mining techniques such as TFIDF, cosine similarity, collocation analysis, clique detection, and image recognition
  • Build beautiful data visualizations with Python and JavaScript toolkits
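Two of the techniques named above, TF-IDF weighting and cosine similarity, can be sketched in a few lines of plain Python. This is a minimal illustration rather than code from the book: the function names, the particular TF-IDF variant (term frequency times the log of inverse document frequency), and the three sample posts are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency scaled by log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample posts, already split into tokens.
posts = [
    "python makes data mining fun".split(),
    "data mining with python notebooks".split(),
    "docker containers run anywhere".split(),
]
vecs = tf_idf_vectors(posts)
sim_related = cosine_similarity(vecs[0], vecs[1])    # posts share terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # no terms in common
```

Posts that share vocabulary score above zero, while posts with no terms in common score exactly zero.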

Book Description:

Data Wrangling with JavaScript is a hands-on guide that will teach you how to create a JavaScript-based data processing pipeline, handle common and exotic data, and master practical troubleshooting strategies.

Architecting in the Cloud with Azure Data Lake, HDInsight, and Spark

Book Description:

Microsoft Azure has over 20 platform-as-a-service (PaaS) offerings that can act in support of a big data analytics solution. So which one is right for your project? This practical book helps you understand the breadth of Azure services by organizing them into a reference framework you can use when crafting your own big data analytics solution.

You’ll not only be able to determine which service best fits the job, but also learn how to implement a complete solution that scales, provides human fault tolerance, and supports future needs.

  • Understand the fundamental patterns of the data lake and lambda architecture
  • Recognize the canonical steps in the analytics data pipeline and learn how to use Azure Data Factory to orchestrate them
  • Implement data lakes and lambda architectures, using Azure Data Lake Store, Data Lake Analytics, HDInsight (including Spark), Stream Analytics, SQL Data Warehouse, and Event Hubs
  • Understand where Azure Machine Learning fits into your analytics pipeline
  • Gain experience using these services on real-world data that has real-world problems, with scenarios ranging from aviation to Internet of Things (IoT)

A Practical, Step-by-Step Guide to Learning Business Analytics

Book Description:

Apply analytics to business problems using two very popular software tools, SAS and R. No matter your industry, this book will provide you with the knowledge and insights you and your business partners need to make better decisions faster.

Learn Business Analytics in Six Steps Using SAS and R teaches you how to solve problems and execute projects through the “DCOVA and I” (Define, Collect, Organize, Visualize, Analyze, and Insights) process. You no longer need to choose between the two most popular software tools. This book puts the best of both worlds―SAS and R―at your fingertips to solve a myriad of problems, whether relating to data science, finance, web usage, product development, or any other business discipline.

What You’ll Learn

  • Use the DCOVA and I process: Define, Collect, Organize, Visualize, Analyze, and Insights
  • Harness both SAS and R, the star analytics technologies in the industry
  • Use various tools to solve significant business challenges
  • Understand how the tools relate to business analytics
  • See seven case studies for hands-on practice

Who This Book Is For

This book is for all IT professionals, especially data analysts, as well as anyone who

  • Likes to solve business problems and is good with logical thinking and numbers
  • Wants to enter the analytics world and is looking for a structured book to reach that goal
  • Is currently working on SAS, R, or any other analytics software and strives to use its full power

A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets

Book Description:

Learn how to build a data science technology stack and perform good data science with repeatable methods. You will learn how to turn data lakes into business assets.

The data science technology stack demonstrated in Practical Data Science is built from components in general use in the industry. Data scientist Andreas Vermeulen demonstrates in detail how to build and provision a technology stack to yield repeatable results. He shows you how to apply practical methods to extract actionable business knowledge from data lakes consisting of data from a polyglot of data types and dimensions.

What You’ll Learn

  • Become fluent in the essential concepts and terminology of data science and data engineering
  • Build and use a technology stack that meets industry criteria
  • Master the methods for retrieving actionable business knowledge
  • Coordinate the handling of polyglot data types in a data lake for repeatable results

Who This Book Is For

Data scientists and data engineers who are required to convert data from a data lake into actionable knowledge for their business, and students who aspire to be data scientists and data engineers

A Problem-Solution Approach with PySpark2

Book Description:

Quickly find solutions to common programming problems encountered while processing big data. Content is presented in the popular problem-solution format. Look up the programming problem that you want to solve. Read the solution. Apply the solution directly in your own code. Problem solved!

PySpark Recipes covers Hadoop and its shortcomings. The architecture of Spark, PySpark, and RDD is presented. You will learn to apply RDD to solve day-to-day big data problems. Python and NumPy are included and make it easy for new learners of PySpark to understand and adopt the model.

What You Will Learn

  • Understand the advanced features of PySpark2 and SparkSQL
  • Optimize your code
  • Program SparkSQL with Python
  • Use Spark Streaming and Spark MLlib with Python
  • Perform graph analysis with GraphFrames

Who This Book Is For

Data analysts, Python programmers, big data enthusiasts

Real-Time Data and Stream Processing at Scale

Book Description:

Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.

  • Understand publish-subscribe messaging and how it fits in the big data ecosystem
  • Explore Kafka producers and consumers for writing and reading messages
  • Understand Kafka patterns and use-case requirements to ensure reliable data delivery
  • Get best practices for building data pipelines and applications with Kafka
  • Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks
  • Learn the most critical metrics among Kafka’s operational measurements
  • Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems

Building Full-Stack Data Analytics Applications with Spark

Book Description:

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools.

Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, ElasticSearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

  • Build value from your data in a series of agile sprints, using the data-value pyramid
  • Extract features for statistical models from a single dataset
  • Visualize data with charts, and expose different aspects through interactive reports
  • Use historical data to predict the future via classification and regression
  • Translate predictions into actions
  • Get feedback from users after each sprint to keep your project on track

A Test-Driven Approach

Book Description:

Gain the confidence you need to apply machine learning in your daily work. With this practical guide, author Matthew Kirk shows you how to integrate and test machine learning algorithms in your code, without the academic subtext.

Featuring graphs and highlighted code examples throughout, the book includes tests with Python’s NumPy, pandas, scikit-learn, and SciPy data science libraries. If you’re a software engineer or business analyst interested in data science, this book will help you:

  • Reference real-world examples to test each algorithm through engaging, hands-on exercises
  • Apply test-driven development (TDD) to write and run tests before you start coding
  • Explore techniques for improving your machine-learning models with data extraction and feature development
  • Watch out for the risks of machine learning, such as underfitting or overfitting data
  • Work with K-Nearest Neighbors, neural networks, clustering, and other algorithms
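The test-first idea in the list above can be illustrated with a tiny K-Nearest Neighbors classifier. The sketch below is an assumption-laden example, not the book's code: the test is written first, then a minimal `knn_predict` (a hypothetical name) is implemented to make it pass, using only the Python standard library.

```python
import math
from collections import Counter


def test_knn_predicts_majority_label():
    # Written before the implementation: states the behavior we expect.
    train = [
        ((0.0, 0.0), "a"),
        ((0.1, 0.2), "a"),
        ((5.0, 5.0), "b"),
        ((5.2, 4.9), "b"),
    ]
    assert knn_predict(train, (0.2, 0.1), k=3) == "a"
    assert knn_predict(train, (4.8, 5.1), k=3) == "b"


def knn_predict(train, point, k=3):
    """Return the majority label among the k training points nearest to point."""
    # Sort labeled points by Euclidean distance and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Run the test; it raises AssertionError if the behavior regresses.
test_knn_predicts_majority_label()
```

In a real project the test would live in its own file and run under a test runner; it is called directly here to keep the sketch self-contained.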

A Practitioner's Approach

Book Description:

Although interest in machine learning has reached a high point, lofty expectations often scuttle projects before they get very far. How can machine learning—especially deep neural networks—make a real difference in your organization? This hands-on guide not only provides the most practical information available on the subject, but also helps you get started building efficient deep learning networks.

Authors Adam Gibson and Josh Patterson provide theory on deep learning before introducing their open-source Deeplearning4j (DL4J) library for developing production-class workflows. Through real-world examples, you’ll learn methods and strategies for training deep network architectures and running deep learning workflows on Spark and Hadoop with DL4J.

  • Dive into machine learning concepts in general, as well as deep learning in particular
  • Understand how deep networks evolved from neural network fundamentals
  • Explore the major deep network architectures, including Convolutional and Recurrent
  • Learn how to map specific deep networks to the right problem
  • Walk through the fundamentals of tuning general neural networks and specific deep network architectures
  • Use vectorization techniques for different data types with DataVec, DL4J’s workflow tool
  • Learn how to use DL4J natively on Spark and Hadoop

Practical Methods for Scientists and Engineers

Book Description:

Data Science is booming thanks to R and Python, but Java brings the robustness, convenience, and ability to scale critical to today’s data science applications. With this practical book, Java software engineers looking to add data science skills will take a logical journey through the data science pipeline. Author Michael Brzustowicz explains the basic math theory behind each step of the data science process, as well as how to apply these concepts with Java.

You’ll learn the critical roles that data IO, linear algebra, statistics, data operations, learning and prediction, and Hadoop MapReduce play in the process. Throughout this book, you’ll find code examples you can use in your applications.

  • Examine methods for obtaining, cleaning, and arranging data into its purest form
  • Understand the matrix structure that your data should take
  • Learn basic concepts for testing the origin and validity of data
  • Transform your data into stable and usable numerical values
  • Understand supervised and unsupervised learning algorithms, and methods for evaluating their success
  • Get up and running with MapReduce, using customized components suitable for data science algorithms

Explore, understand, and prepare real data using RapidMiner's practical tips and tricks

Book Description:

Data is everywhere and the amount is increasing so much that the gap between what people can understand and what is available is widening relentlessly. There is huge value in data, but much of this value lies untapped. 80% of data mining is about understanding data, exploring it, cleaning it, and structuring it so that it can be mined. RapidMiner is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications.

Exploring Data with RapidMiner is packed with practical examples to help practitioners get to grips with their own data. The chapters within this book are arranged within an overall framework and can additionally be consulted on an ad-hoc basis. It provides simple to intermediate examples showing modeling, visualization, and more using RapidMiner.

Exploring Data with RapidMiner is a helpful guide that presents the important steps in a logical order. This book starts with importing data and then leads you through cleaning, handling missing values, visualizing, and extracting additional information, as well as understanding the time constraints that real data places on getting a result. The book uses real examples to help you understand how to set up processes quickly.

This book will give you a solid understanding of the possibilities that RapidMiner gives for exploring data and you will be inspired to use it for your own work.

What you will learn from this book

  • Import real data from files in multiple formats and from databases
  • Extract features from structured and unstructured data
  • Restructure, reduce, and summarize data to help you understand it more easily and process it more quickly
  • Visualize data in new ways to help you understand it
  • Detect outliers and apply methods to handle them
  • Detect missing data and implement ways to handle it
  • Understand resource constraints and what to do about them

Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world

Book Description:

Hive was developed by Facebook and later open sourced in the Apache community. Hive provides an SQL-like interface for running queries on Big Data frameworks. Its SQL-like syntax, called HiveQL, includes SQL capabilities such as analytical functions, which are in high demand in today’s Big Data world.

This book provides easy installation steps for the different types of metastores supported by Hive. It has simple, easy-to-learn recipes for configuring Hive clients and services. You will also learn different Hive optimizations, including partitions and bucketing. The book also explains the source code of the latest Hive version.

Hive Query Language is used by other frameworks, including Spark. Towards the end, you will cover the integration of Hive with these frameworks.

Who This Book Is For

The book is intended for those who want to start with Hive or who have a basic understanding of the Hive framework. Prior knowledge of basic SQL commands is also required.

What You Will Learn

  • Learn the different features and offerings of the latest Hive
  • Understand the working and structure of the Hive internals
  • Get an insight into the latest developments in the Hive framework
  • Grasp the concepts of the Hive data model
  • Master key concepts such as partitions, buckets, and statistics
  • Know how to integrate Hive with other frameworks such as Spark and Accumulo

Build automatic classification and prediction models using unsupervised learning

Book Description:

Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures. With the superb memory management and the full integration with multi-node big data platforms, the H2O engine has become more and more popular among data scientists in the field of deep learning.

This book will introduce you to the deep learning package H2O with R and help you understand the concepts of deep learning. We will start by setting up important deep learning packages available in R and then move towards building models related to neural networks, prediction, and deep prediction, all of this with the help of real-life examples.

After installing the H2O package, you will learn about prediction algorithms. Moving ahead, concepts such as overfitting data, anomalous data, and deep prediction models are explained. Finally, the book will cover concepts relating to tuning and optimizing models.

What you will learn

  • Set up the R package H2O to train deep learning models
  • Understand the core concepts behind deep learning models
  • Use Autoencoders to identify anomalous data or outliers
  • Predict or classify data automatically using deep neural networks
  • Build generalizable models using regularization to avoid overfitting the training data

Designing Next-Generation Machine Intelligence Algorithms

Book Description:

With the reinvigoration of neural networks in the 2000s, deep learning has become an extremely active area of research that is paving the way for modern machine learning. This book uses exposition and examples to help you understand major concepts in this complicated field.

Large companies such as Google, Microsoft, and Facebook have taken notice, and are actively growing in-house deep learning teams. For the rest of us, however, deep learning is still a pretty complex and difficult subject to grasp. If you have a basic understanding of what machine learning is, have familiarity with the Python programming language, and have some mathematical background with calculus, this book will help you get started.

Book Description:

Dive deeper into SPSS Statistics for more efficient, accurate, and sophisticated data analysis and visualization

SPSS Statistics for Data Analysis and Visualization goes beyond the basics of SPSS Statistics to show you advanced techniques that exploit the full capabilities of SPSS. The authors explain when and why to use each technique, and then walk you through the execution with a pragmatic, nuts and bolts example. Coverage includes extensive, in-depth discussion of advanced statistical techniques, data visualization, predictive analytics, and SPSS programming, including automation and integration with other languages like R and Python. You’ll learn the best methods to power through an analysis, with more efficient, elegant, and accurate code.

IBM SPSS Statistics is complex: true mastery requires a deep understanding of statistical theory, the user interface, and programming. Most users don’t encounter all of the methods SPSS offers, leaving many little-known modules undiscovered. This book walks you through tools you may have never noticed, and shows you how they can be used to streamline your workflow and enable you to produce more accurate results.

  • Conduct a more efficient and accurate analysis
  • Display complex relationships and create better visualizations
  • Model complex interactions and master predictive analytics
  • Integrate R and Python with SPSS Statistics for more efficient, more powerful code

These “hidden tools” can help you produce charts that simply wouldn’t be possible any other way, and the support for other programming languages gives you better options for solving complex problems. If you’re ready to take advantage of everything this powerful software package has to offer, SPSS Statistics for Data Analysis and Visualization is the expert-led training you need.

An Introduction for Data Scientists

Book Description:

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular analyses you can build, the data warehousing techniques that Hadoop provides, and higher order data workflows this framework can produce.

Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.

  • Understand core concepts behind Hadoop and cluster computing
  • Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
  • Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
  • Use Sqoop and Apache Flume to ingest data from relational databases
  • Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
  • Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib
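The MapReduce model mentioned above can be previewed without a cluster. The following sketch simulates the map, shuffle, and reduce phases of a word count in plain Python on one machine. The function names and the sample lines are illustrative assumptions, and nothing here uses a Hadoop or Spark API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the grouped values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could be handled by a different mapper.
lines = ["big data big ideas", "data pipelines"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

On a real cluster the framework performs the shuffle and distributes the map and reduce calls across machines; only the two user-supplied functions resemble what you would write yourself.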

Book Description:

Real-World Machine Learning is a practical guide designed to teach working developers the art of ML project execution. Without overdosing you on academic theory and complex mathematics, it introduces the day-to-day practice of machine learning, preparing you to successfully build and deploy powerful ML systems.

Book Description:

Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. New methods of working with big data, such as Hadoop and MapReduce, offer alternatives to traditional data warehousing.

Big Data Analytics with R and Hadoop is focused on the techniques of integrating R and Hadoop using tools such as RHIPE and RHadoop. A powerful data analytics engine can be built that processes analytics algorithms over a large-scale dataset in a scalable manner. This can be implemented through the data analytics operations of R together with Hadoop’s MapReduce and HDFS.

You will start with the installation and configuration of R and Hadoop. Next, you will discover information on various practical data analytics examples with R and Hadoop. Finally, you will learn how to import/export from various data sources to R. Big Data Analytics with R and Hadoop will also give you an easy understanding of the R and Hadoop connectors RHIPE, RHadoop, and Hadoop streaming.

What you will learn from this book

  • Integrate R and Hadoop via RHIPE, RHadoop, and Hadoop streaming
  • Develop and run a MapReduce application with R and Hadoop
  • Handle HDFS data from within R using RHIPE and RHadoop
  • Run Hadoop streaming and MapReduce with R
  • Import and export from various data sources to R