Data Science

Data Cleaning with Python

A guide to data cleaning using the Airbnb NY data set

Benedict Neo
bitgrit Data Science Publication
9 min read · May 21, 2021


Photo by Filiberto Santillán on Unsplash

It is widely known that data scientists spend a lot of their time cleaning data. You might even have heard that data scientists spend 80% of their time finding, cleaning, and reorganizing data, and only 20% analyzing it and producing insights.

What is Data Cleaning?

According to Wikipedia, data cleaning is:

the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

From this definition, we can say that the aspects of clean data are:

  • Complete
  • Correct
  • Accurate
  • Relevant

Why is it important?

Data cleaning is a crucial step in the data science pipeline, as the insights and results you produce are only as good as the data you have. As the old adage goes: garbage in, garbage out.

Using dirty data to produce analysis will result in erroneous predictions that engender bad decisions and dangerous outcomes. Not only that, most machine learning algorithms only work when your data is properly cleaned and fit for modeling.

How to clean data

With all that said, in this article I will be sharing a few common tasks you will perform in data cleaning, using the New York City Airbnb Open Data from Kaggle as an example. Note that the version of the dataset I used might differ from the one currently on Kaggle.

In particular, this article will be centered around these three topics:

  • Missing data
  • Inconsistent data/Irrelevant features
  • Outliers

I hope you find this article practical and useful for your learning or your work.

You can find the notebook with the code I use in this article hosted on Jovian.ai here.

I recommend downloading that notebook and experimenting with it yourself so you get hands-on practice with this article.

Now let’s dive in!

Importing Libraries

First we start by importing the necessary libraries for data cleaning.
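The notebook may import more, but a minimal set along these lines covers everything used in this article (pandas and NumPy for the data work, matplotlib and seaborn for the plots later on):

import pandas as pd                # data loading and manipulation
import numpy as np                 # numerical helpers
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # heatmaps and boxplots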

Load the data

Then we load the data. In my case, I loaded it from a CSV file hosted on GitHub, but you can also upload the CSV file and import the data using pd.read_csv().

Notice that I copy the original dataset using .copy(). This is for a data cleaning example later on in the article. It’s also good practice to copy the data set when you want to test something out so you won’t have to re-run the entire notebook if you make a mistake somewhere.
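As a rough sketch, loading the CSV and keeping an untouched copy could look like this; the file name AB_NYC_2019.csv is the Kaggle name and is only an assumption here:

# load the Airbnb data (file name assumed; swap in your path or URL)
airbnb = pd.read_csv("AB_NYC_2019.csv")

# keep an untouched copy so a mistake doesn't force a full re-run
df = airbnb.copy()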

Exploratory Data Analysis

Running head() gives us a peek at our dataset, although it doesn't tell us much besides showing us what the data looks like.
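In code, that's just:

df.head()   # first five rows: a quick peek at the columns and example values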

Data info

Calling info() on our dataset tells us tons of information about our data frame like the shape (rows, columns), the data type of our features, and the memory usage.
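Again a one-liner:

df.info()   # shape, column dtypes, non-null counts, and memory usage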

Data Type

Separating our features into numerical and categorical early on is useful, and here is a helper function that does that. It uses the select_dtypes() function, treating columns with the ‘object’ data type as categorical and the rest as numerical.
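A helper along these lines does the job; the function name is just illustrative:

def split_by_dtype(df):
    """Return lists of categorical and numerical column names."""
    categorical = df.select_dtypes(include="object").columns.tolist()
    numerical = df.select_dtypes(exclude="object").columns.tolist()
    return categorical, numerical

categorical_cols, numerical_cols = split_by_dtype(df)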

Now that we have some idea of our data, we can move on to cleaning it, starting by checking for missing values.

Missing Data

Missing data is common in all kinds of data and is tricky to deal with. Most machine learning techniques do not work with missing values, so they have to be addressed early on.

The two common methods to deal with missing values are

  • dropping the rows / columns
  • imputing them based on other observations, e.g. the mean or median

There are a few problems with these approaches.

For example, by dropping rows/columns, you're essentially losing information that might be useful for prediction.

On the other hand, imputing values will introduce bias into your data, but it might still be better than removing the feature entirely.

Here is a great analogy for this dilemma in this article by Elite Data Science.

Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.

How to properly deal with missing data depends on the context of the problem you're trying to solve, and it's recommended to experiment with different methods to make the most of the data you have on hand.

Here are a couple of useful helper functions you can use to inspect and visualize missing values.

Columns with missing values

This function prints out only the columns with missing values, along with how many values each is missing.
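One possible way to write such a helper (the name is illustrative):

def missing_value_counts(df):
    """Print only the columns that contain missing values, with their counts."""
    counts = df.isnull().sum()
    print(counts[counts > 0])

missing_value_counts(df)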

If you want to see missing values for all columns, use this command:

df.isnull().sum()

Percentage missing

This gives you the percentage of missing values in each of the columns. Knowing the percentage can be useful in determining whether you should drop the column.

The percentage is calculated using the mean, since the mean of a boolean column is the fraction of True (i.e. missing) values.
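A sketch of that calculation:

# isnull() gives a True/False frame; the column means are the fractions missing
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))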

Heatmap of missing values

Heatmaps are also useful for visualizing missing values, in particular for seeing where in the data the missing values occur.
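With seaborn, this can be a one-liner over the boolean mask; a sketch:

# each highlighted cell marks a missing value, so gaps cluster visibly
sns.heatmap(df.isnull(), cbar=False)
plt.show()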

Now that you know which columns have missing values, it’s time to deal with them.

Technique to deal with missing values

The common techniques I will be sharing are:

1. Drop the feature
2. Drop the row
3. Impute the missing value
4. Replace it

Dropping feature

Dropping a feature usually isn't recommended because you're losing information. But if you're sure that a column isn't important, or it simply has too many missing values, you can choose to drop it. For example, for this dataset, the host_name column was removed for ethical reasons, and id was removed because it was unnecessary.

To drop features, use drop and set axis to 1 and inplace to True. axis is 1 because we want to drop columns (0 means rows), and inplace is True because the change is applied directly to your dataset.
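For this dataset, that looks something like:

# drop the columns we decided not to keep; axis=1 means columns
df.drop(["host_name", "id"], axis=1, inplace=True)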

Dropping the row

If you want to remove rows, you can do so using dropna. I'm not going to do that here because there are no missing values in price.
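For illustration only, removing rows with a missing price would look like this:

# drop any row where price is missing (shown for illustration; not run here)
df.dropna(subset=["price"], inplace=True)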

Imputing

For imputing, there are 3 main techniques shown below.

  1. fillna: filling in null values with a given value (mean, median, mode, or a specified value)
  2. bfill / ffill: backward fill and forward fill (filling in missing values with the value that comes after or before them in the column)
  3. SimpleImputer: scikit-learn's built-in class for imputing missing values (commonly used inside a pipeline when building ML models)

Below you can find examples of applying these methods to the price column if it had missing values.
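A sketch of all three, assuming price had gaps to fill:

# 1. fillna: impute with a statistic such as the median
df["price"] = df["price"].fillna(df["price"].median())

# 2. bfill / ffill: propagate the next or previous observed value
df["price"] = df["price"].bfill()   # backward fill
df["price"] = df["price"].ffill()   # forward fill

# 3. SimpleImputer: scikit-learn's imputer, handy inside ML pipelines
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
df[["price"]] = imputer.fit_transform(df[["price"]])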

Replace

To replace values, the fillna function is also used.

You pass a dictionary where the key is the column name and the value is what its missing entries should be filled with: {column_name: replacement_for_NA}

Here are examples of replacing missing values in the reviews_per_month and name columns.
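A sketch; the replacement values (0 and a placeholder name) are assumptions, so pick whatever fits your analysis:

# fill missing reviews_per_month with 0 and missing name with a placeholder
df.fillna({"reviews_per_month": 0, "name": "No name"}, inplace=True)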

Hooray, we have no missing values left!

Now let’s move on to dealing with inconsistent or irrelevant features.

Inconsistent data/Irrelevant features

Inconsistent data refers to things like spelling errors in your data, column names that are not relevant to the data, the wrong data type, etc.

Here are a couple of examples of dealing with these issues.

Remove rows based on regex

Let’s say you want to remove rows that contain a certain word. For my example, I chose the word noisy/Noisy as my target, and I used the str.contains() function to find the indexes of the rows containing that word.

Then, using the drop function and setting axis to 'index', I can supply the indexes I found and drop those rows.
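A sketch of that two-step approach; using the name column as the target is an assumption here:

# find the index of rows whose name mentions "noisy" (case-insensitive)
noisy_idx = df[df["name"].str.contains("noisy", case=False, na=False)].index

# drop those rows by index
df.drop(noisy_idx, axis="index", inplace=True)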

Printing out the number of rows, you can see it reduced by three.

Spelling errors in categorical data

Sometimes your categorical data might have spelling errors or different capitalization that can mess up your categorization.

I will be using the neighbourhood_group column as an example.

You can see the different types of neighborhoods are already well categorized. But what if they weren't?

To simulate a scenario where some of the data had capitalization or spelling issues, I sampled 2 rows from the data and replaced their values with misspelled versions.

You can see now how the categorization is messed up. “Manhattan” and “manhatann” refer to the same thing, but they aren’t in the same category because of capitalization. Same goes for “brookln” due to spelling issues.

We can fix this by using the replace function in pandas. We first give the values that are wrong, then supply the right ones. Notice that the two lists have to line up positionally, i.e. “manhatann” → “Manhattan”.
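A sketch of that fix:

# map the misspelled / miscapitalized values back to the correct categories
df["neighbourhood_group"] = df["neighbourhood_group"].replace(
    ["manhatann", "brookln"],
    ["Manhattan", "Brooklyn"],
)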

Now our groups are fixed again!

Renaming columns

There are cases where you want to rename your columns as well.

You can do this by using a dictionary, setting the key as the original column name, and the value as the new column name.

Then, using the rename function, we pass in our dictionary and voilà, the column names have changed.
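A sketch; the new column names here are only examples:

# keys are the current column names, values are the new names
df.rename(columns={"minimum_nights": "min_nights",
                   "number_of_reviews": "review_count"}, inplace=True)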

Converting to DateTime

If you have data that should be a datetime object but is stored as strings, you can use pd.to_datetime and pass it the format that matches your data.
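In this dataset the last_review column holds dates stored as strings; a sketch, assuming ISO-style YYYY-MM-DD dates:

# parse last_review strings into proper datetime values
df["last_review"] = pd.to_datetime(df["last_review"], format="%Y-%m-%d")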

Just like that, the column has been converted into a datetime data type.

Duplicates

There are cases where your rows have duplicate values; this could have happened due to some mishap in your data collection.

To find out if you have duplicated values, call duplicated().any() on your data frame, and if it returns True, use the drop_duplicates function.

You can also specify the columns in which to look for duplicate values, like below.
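A sketch of both variants; the subset columns are an arbitrary choice for illustration:

# check whether any fully identical rows exist, and drop them if so
if df.duplicated().any():
    df.drop_duplicates(inplace=True)

# alternatively, treat rows as duplicates when only certain columns match
df.drop_duplicates(subset=["host_id", "name"], inplace=True)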

Change data type to reduce memory

Changing data types is common when you want to reduce memory usage.

To do so, you can use the astype(‘dtype’) function where you specify the dtype you want.

In my example, I changed the data type for the host_id column from int64 to int32.
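A sketch of that change, checking the reported memory before and after:

df.info()                                       # note the "memory usage" line before
df["host_id"] = df["host_id"].astype("int32")   # downcast host_id from int64 to int32
df.info()                                       # memory usage is slightly lower after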

Observe the memory before changing the data type

And after.

You can see the memory reduced from 6.5+ to 6.3+ MB.

Here is more information on changing data types.

That’s all I have on inconsistent data, let’s move on to outliers.

Outliers

Outliers can be dangerous as they can skew your model and give you predictions that are biased and erroneous.

A quick way to find outliers is to use the describe function and look at statistics such as the maximum and the mean.

You can also plot a histogram and look at the distribution of your data.

In this histogram, you can see that most of the data is around 0 to 5000.

A boxplot is also useful in detecting outliers.

As you can see, the price column has multiple data points that are outliers (above the upper whisker of the boxplot).
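A sketch of those three checks on the price column:

# summary statistics: compare the max and mean to spot suspiciously large values
print(df["price"].describe())

# histogram of the price distribution
df["price"].plot(kind="hist", bins=50)
plt.show()

# boxplot: points beyond the whiskers are potential outliers
sns.boxplot(x=df["price"])
plt.show()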

For categorical data, you can plot a bar chart to view the count of each category and see whether any particular category stands out.
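For example, counting listings per neighbourhood group:

# bar chart of category counts
df["neighbourhood_group"].value_counts().plot(kind="bar")
plt.show()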

Outliers in categorical data are tricky, because you have to determine from context whether it's appropriate to call a rare category an outlier.

Some outliers are more obvious. Let’s say there’s an experiment where 1000 people choose between a glass of water and a glass of milk. If the final result is 1 person who chose a glass of water and 999 people who chose a glass of milk, that 1 person can be considered an outlier.

However, in some cases, outliers depend on context. In my example, you see that Manhattan and Brooklyn have significantly more data than Staten Island. This doesn't count as an outlier, since Manhattan and Brooklyn have a higher housing density compared to Staten Island.

Dealing with outliers

Dealing with outliers is similar to dealing with missing values; the only difference is the way you find them.

To flag numerical values as outliers, there are statistical techniques such as using the standard deviation or the interquartile range (IQR). You can refer to this article for more code examples.
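As a sketch, the IQR rule flags values that fall more than 1.5 IQRs outside the middle 50% of the data:

# interquartile range rule applied to the price column
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["price"] < lower) | (df["price"] > upper)]
print(f"{len(outliers)} rows fall outside [{lower}, {upper}]")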

For categorical values with a very low frequency (like Staten Island in the example above), the rarity itself may still become a problem for your model. Depending on the context and the nature of the data, you can choose to group such categories into one, for example “Others”. This way your model will be less biased and you're not losing any information.
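A sketch of that grouping; the 5% frequency threshold is an arbitrary choice:

# lump categories that make up less than 5% of rows into "Others"
freqs = df["neighbourhood_group"].value_counts(normalize=True)
rare = freqs[freqs < 0.05].index
df["neighbourhood_group"] = df["neighbourhood_group"].replace(list(rare), "Others")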

The importance of documenting

For all the data cleaning tasks above, it's important to document your process: what tools you used, what functions you created, and the approach you took.

This is so that others can understand what you did, and it can eventually become part of the culture or process for how your organization deals with dirty data.

Learn by doing

The best way to learn is to practice hands-on, and this definitely applies to data cleaning, because different datasets require different cleaning methods.

That said, you’re in luck! We have a new competition released — the Viral Tweets Prediction Challenge.

This is the perfect opportunity for you to practice data cleaning, as the given dataset consists of tweets and will be an interesting challenge to tackle. The competition ends on July 6, 2021, so sign up now and test your skills today!

That’s all for this article, thanks for reading and I hope you learned something new from it!
