A recent study of data scientists on Twitter found that they spend 80% of their time cleaning data rather than mining or modelling it, and that 59% of them consider it the least enjoyable part of their work. That got me thinking about ways to overcome this obstacle and make the data-preparation phase more enjoyable. If you are struggling with the same problem, welcome to the world of the tidyverse!
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
What is Tidyverse?
In simple words, the tidyverse is about the connections between the tools that make the workflow possible. It is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy.
If you have any previous experience in R, you can continue with this article. Otherwise, I would recommend working with R for Data Science by Hadley Wickham and Garrett Grolemund.
Core packages in Tidyverse
dplyr is my favourite package in the tidyverse. It is widely used for data manipulation in R and has functions for everything from filtering rows to grouping and summarising data.
Here is a list of functions that dplyr provides for data manipulation:
- select(): Select columns from your dataset
- filter(): Keep only the rows that meet your criteria
- group_by(): Group observations together by one or more variables; the underlying data does not change
- summarise(): Collapse each group (or the whole dataset) into a single row of summary statistics
- arrange(): Sort rows by the values of one or more columns, in ascending or descending order
- left_join(), right_join(), full_join(), inner_join(): Perform left, right, full, and inner joins in R
- mutate(): Create new columns while preserving the existing variables
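As a minimal sketch of how these verbs chain together, the example below uses the mtcars dataset that ships with R. The column choices, the weight threshold, and the cyl_labels lookup table are assumptions made purely for illustration.

```r
library(dplyr)

# mtcars is a built-in dataset; wt is recorded in units of 1000 lbs.
cars_summary <- mtcars %>%
  select(mpg, cyl, wt) %>%                 # keep three columns
  filter(wt < 4) %>%                       # keep rows lighter than 4000 lbs
  mutate(wt_kg = wt * 453.592) %>%         # add a column, keep the others
  group_by(cyl) %>%                        # one group per cylinder count
  summarise(
    mean_mpg   = mean(mpg),
    mean_wt_kg = mean(wt_kg),
    n_cars     = n()
  ) %>%
  arrange(desc(mean_mpg))                  # highest mean mpg first

# Hypothetical lookup table, used only to demonstrate a join.
cyl_labels <- data.frame(
  cyl   = c(4, 6, 8),
  label = c("four", "six", "eight")
)
labelled <- left_join(cars_summary, cyl_labels, by = "cyl")
```

Each verb takes a data frame and returns a new one, which is what makes the pipe-based style possible.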
tidyr is a package that makes it easy to “tidy” your data. It complements dplyr for data manipulation and pre-processing. The two most important properties of tidy data are that each column is a variable and each row is an observation.
tidyr provides four main functions for tidying your messy data:
- gather(): This function “gathers” multiple columns from your dataset and converts them into key-value pairs (since tidyr 1.0.0 it is superseded by pivot_longer(), though it still works)
- spread(): This takes a key column and a value column and “spreads” them into multiple columns (superseded by pivot_wider())
- separate(): This function splits a single column into multiple columns
- unite(): The opposite of separate(). It combines two or more columns into one
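The four functions above can be sketched on a tiny, made-up dataset. The data frames and column names here (wide, country, cases, dates) are hypothetical and exist only for the illustration.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): year columns become key-value pairs (one row per country-year).
long <- gather(wide, key = "year", value = "cases", -country)

# spread(): the inverse, turning the key-value pairs back into columns.
wide_again <- spread(long, key = year, value = cases)

# separate() and unite() on a hypothetical date column.
dates <- data.frame(
  date = c("2020-01-15", "2021-06-30"),
  stringsAsFactors = FALSE
)
parts    <- separate(dates, date, into = c("year", "month", "day"), sep = "-")
rejoined <- unite(parts, "date", year, month, day, sep = "-")
```

In newer code, pivot_longer() and pivot_wider() cover the same ground as gather() and spread().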
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data, and it solves the problem of parsing a flat file into a tibble. It is an improvement over base R's import functions such as read.csv() and is usually noticeably faster on large files.
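Here is a small sketch of reading a csv file into a tibble. The file contents and column names are invented for the example, and the file is written to a temporary path so the snippet is self-contained.

```r
library(readr)

# Write a tiny, made-up csv file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,9.5", "2,Bob,7.25"), csv_path)

# Declaring col_types up front makes parsing explicit and avoids guessing.
scores <- read_csv(
  csv_path,
  col_types = cols(
    id    = col_integer(),
    name  = col_character(),
    score = col_double()
  )
)
```

If col_types is left out, readr guesses the column types from the data and reports what it chose.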
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which let you replace many for loops with code that is both more succinct and easier to read.
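To make the comparison concrete, here is a minimal sketch that computes the same result with a for loop and with map_dbl(). The values list is a made-up example.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors.
values <- list(a = 1:3, b = 4:6, c = 7:9)

# for-loop version: pre-allocate the result, then fill it element by element.
means_loop <- numeric(length(values))
for (i in seq_along(values)) {
  means_loop[i] <- mean(values[[i]])
}

# map_dbl() version: apply mean() to each element, return a double vector.
means_map <- map_dbl(values, mean)

# map() always returns a list; the formula is shorthand for a function of .x.
doubled <- map(values, ~ .x * 2)
```

The typed variants (map_dbl(), map_chr(), map_lgl(), map_int()) fail loudly if a result is not of the expected type, which catches mistakes a loop would let through.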
We work with data frames in R. It’s one of the first things we learn about R: convert your data into a data frame. A tibble is a modern take on the data frame. Tibbles are data frames that are lazy and surly: they do less (for example, they never change variable names or types) and complain more (for example, when you refer to a column that does not exist), forcing you to confront problems earlier and typically leading to cleaner, more expressive code. The tibble package also makes it easier to handle large datasets and columns that contain complex objects such as lists.
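A few of these differences can be seen in a short sketch. All object and column names below are made up for the comparison.

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

# Selecting one column with [ : a data frame drops to a vector,
# while a tibble always stays a tibble.
df_col <- df[, "value"]
tb_col <- tb[, "value"]

# Column names: data.frame() rewrites non-syntactic names by default,
# tibble() keeps them exactly as given.
df_names <- names(data.frame(`my var` = 1))
tb_names <- names(tibble(`my var` = 1))

# List columns let a tibble hold complex objects in a single column.
nested <- tibble(id = 1:2, payload = list(1:3, letters))
```

Printing a tibble also shows only the first ten rows and the columns that fit on screen, which helps with larger datasets.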
Dealing with string variables can get complicated at times. They can have a huge impact on our expected output. stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
Some basic functions that you can perform with the stringr package are:
- str_sub(): Extract substrings from a character vector
- str_trim(): Remove whitespace from the start and end of a string
- str_length(): Return the number of characters in each string
- str_to_lower()/str_to_upper(): Convert strings to lower case or upper case
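The functions listed above can be tried on a small character vector. The input strings here are invented for the example.

```r
library(stringr)

# Made-up input with stray whitespace.
raw <- c("  Tidyverse ", "dplyr  ")

trimmed <- str_trim(raw)            # "Tidyverse" "dplyr"
lengths <- str_length(trimmed)      # 9 5
prefix  <- str_sub(trimmed, 1, 4)   # "Tidy" "dply"
upper   <- str_to_upper(trimmed)    # "TIDYVERSE" "DPLYR"
lower   <- str_to_lower(trimmed)    # "tidyverse" "dplyr"
```

Every stringr function takes the string as its first argument and is vectorised, so the same call works on one string or a whole column.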
The forcats package is devoted to working with factors, R's representation of categorical variables. Anybody who has worked with categorical data knows what a nightmare it can be.
R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve the display. The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include:
- fct_reorder(): Reordering a factor by another variable.
- fct_infreq(): Reordering a factor by the frequency of values.
- fct_relevel(): Changing the order of a factor by hand.
- fct_lump(): Collapsing the least/most frequent values of a factor into “other”
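The following sketch applies each of these functions to a small, made-up factor. The sizes, groups, and scores objects are hypothetical.

```r
library(forcats)

# Made-up factor: "small" appears 3 times, "large" twice, "medium" once.
sizes <- factor(c("small", "large", "medium", "small", "small", "large"))
levels(sizes)  # default order is alphabetical: "large" "medium" "small"

# fct_infreq(): order levels from most to least frequent.
by_freq <- fct_infreq(sizes)

# fct_relevel(): set the level order by hand.
by_hand <- fct_relevel(sizes, "small", "medium", "large")

# fct_lump(): keep the n most frequent levels, lump the rest into "Other".
lumped <- fct_lump(sizes, n = 1)

# fct_reorder(): order levels by another variable (median by default).
groups     <- factor(c("a", "b", "c"))
scores     <- c(3, 1, 2)
by_score   <- fct_reorder(groups, scores)
```

None of these calls change the values in the factor; they only change the order or grouping of its levels, which is what controls the display order in plots and tables.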
ggplot2 is probably the package every one of us is familiar with. It is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots — you name it and ggplot2 has a function for it.
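As a minimal sketch, here is a box plot built from the mtcars dataset that ships with R. The choice of variables and the axis labels are assumptions made for the example.

```r
library(ggplot2)

# Data, aesthetic mapping, and a geom are combined with `+`.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon")

# Printing the object renders the plot on the active graphics device.
print(p)
```

Swapping geom_boxplot() for geom_violin() gives a violin plot of the same data without touching the rest of the code.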
I hope you found this article helpful. I want to hear your thoughts, feedback, and experience with Tidyverse. Let me know in the comments section below!