4 Steps for Cleaning Your Messy Data

Data analytics, which helps with decision-making, market research, and more, is becoming an important practice for many businesses. Before analyzing your data, however, you want to make sure that it is “ready”. In other words, your data should be clean, well-structured and unambiguous, so that problems in the raw data don’t distort your analysis.

Here are the steps that you should take to check and clean your raw data:

1.Identify & remove invalid data

For example, in a database of survey responses, you should remove the responses that came from survey tests or previews. Responses that are invalid because of technical errors or other reasons should also be removed. For incomplete responses, the decision depends on how many questions were completed and whether the answers are good enough for analysis. If a participant completed the whole survey except for one demographic question or open-ended question, I would keep the response as valid data.
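As a rough sketch of this filtering step in plain Python: the field names (`status`, `q1`, `age`) and the "at most one skipped question" threshold are my own assumptions for illustration, so adjust them to whatever your survey tool exports.

```python
# Hypothetical survey export: one dict per response, None marks a skipped question.
responses = [
    {"id": 1, "status": "preview", "q1": 5, "q2": 3, "q3": 6, "age": 30},
    {"id": 2, "status": "complete", "q1": 4, "q2": 2, "q3": 7, "age": None},
    {"id": 3, "status": "complete", "q1": None, "q2": None, "q3": None, "age": 41},
    {"id": 4, "status": "complete", "q1": 6, "q2": 5, "q3": 2, "age": 25},
]

QUESTIONS = ["q1", "q2", "q3", "age"]
MAX_MISSING = 1  # assumed threshold: tolerate one skipped question


def is_valid(response):
    """Keep real (non-test, non-preview) responses with few missing answers."""
    if response["status"] in {"test", "preview"}:
        return False
    missing = sum(1 for q in QUESTIONS if response[q] is None)
    return missing <= MAX_MISSING


clean = [r for r in responses if is_valid(r)]
print([r["id"] for r in clean])  # ids of the responses that survive the filter
```

Here the preview response and the mostly empty response are dropped, while the response that only skipped the age question is kept.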

[Image: messy data.png]

2.Recode/rename data

Make sure your column names are the actual variable names, not values or ambiguous labels. For example, you should use the variable names “Ad Display Time”, “Ad theme” and “Age” as column headers instead of “Independent variable 1”, “Independent variable 2” and “Demographic”.
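If your data is loaded as a list of rows, the renaming can be sketched like this. The mapping uses the header names from the example above; the sample values are made up for illustration.

```python
# Mapping from ambiguous export headers to the real variable names.
RENAME = {
    "Independent variable 1": "Ad Display Time",
    "Independent variable 2": "Ad theme",
    "Demographic": "Age",
}

# Made-up sample rows, just to have something to rename.
rows = [
    {"Independent variable 1": 15, "Independent variable 2": "Humor", "Demographic": 34},
    {"Independent variable 1": 30, "Independent variable 2": "Family", "Demographic": 27},
]


def rename_columns(row, mapping):
    """Return a copy of the row with keys renamed; unmapped keys are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


renamed = [rename_columns(row, RENAME) for row in rows]
print(list(renamed[0]))  # column names after renaming
```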

Sometimes values are stored as codes. For example, in the gender column, “Female” and “Male” may be coded as “0” and “1” respectively. In order to analyze your data accurately and conveniently later, you should change “0” and “1” to “Female” and “Male”. Otherwise, some statistical tools may treat the column as a numeric variable instead of a nominal variable.
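A minimal sketch of this recoding, assuming the 0 = Female, 1 = Male coding from the example and a column called `gender` (a name I picked for illustration). Values that are not a known code are left untouched so you can inspect them later.

```python
GENDER_LABELS = {"0": "Female", "1": "Male"}  # coding taken from the example above

# Codes may arrive as strings or as numbers depending on the export.
rows = [{"gender": "0"}, {"gender": 1}, {"gender": ""}]


def recode_gender(value):
    """Map a 0/1 code (string or int) to its label; leave anything else untouched."""
    return GENDER_LABELS.get(str(value).strip(), value)


recoded = [{**row, "gender": recode_gender(row["gender"])} for row in rows]
print([row["gender"] for row in recoded])
```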

Also, when you ask two questions, one in a positive tone and one in a negative tone, and use a 7-point scale for both, you should reverse the responses to one of them to keep things consistent. For example, when you ask “do you like hamburgers” and “are you uncomfortable with fast food” in your survey, a score of 3 or less usually indicates a negative attitude (dislike, dissatisfaction) on the first question but a positive one (comfort, liking) on the second, and vice versa for high scores.
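Reversing a score can be sketched as follows. I am assuming the 7-point scale runs from 1 to 7, so a reversed score is simply 8 minus the original; the column names are hypothetical.

```python
def reverse_score(score, scale_min=1, scale_max=7):
    """Flip a score on a scale, e.g. 1 <-> 7, 2 <-> 6, 4 stays 4 on a 1-7 scale."""
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - score


# Hypothetical column names for the two questions in the example.
rows = [
    {"like_hamburgers": 6, "uncomfortable_fast_food": 2},
    {"like_hamburgers": 2, "uncomfortable_fast_food": 7},
]

for row in rows:
    # Store the reversed item under a new, positively worded name.
    row["comfortable_fast_food"] = reverse_score(row["uncomfortable_fast_food"])

print([row["comfortable_fast_food"] for row in rows])
```

After this step, a high score means a positive attitude in both columns.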

[Image: messy data 2.png]

3.Form variables & set up formulas

As the demo above shows, sometimes you ask the same question in both a positive and a negative tone to make sure participants understand the question and are not biased by the wording. When you clean the data, you can then set up a column that averages the scores of these multi-measurement questions.
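Here is one way the averaged column could look in Python. It assumes the negatively worded item has already been reversed, and the column names (`like_hamburgers`, `comfortable_fast_food`, `fast_food_attitude`) are hypothetical. Skipping missing answers instead of dropping the row is my own choice, not a rule.

```python
from statistics import mean

ITEMS = ["like_hamburgers", "comfortable_fast_food"]  # hypothetical item columns

rows = [
    {"like_hamburgers": 6, "comfortable_fast_food": 6},
    {"like_hamburgers": 2, "comfortable_fast_food": 3},
    {"like_hamburgers": 5, "comfortable_fast_food": None},  # one item skipped
]


def composite_score(row, items):
    """Average the answered items; return None if none of them was answered."""
    answered = [row[item] for item in items if row[item] is not None]
    return mean(answered) if answered else None


for row in rows:
    row["fast_food_attitude"] = composite_score(row, ITEMS)

print([row["fast_food_attitude"] for row in rows])
```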

4.Double check your data 

Double check whether any column contains more than one variable and whether anything is still ambiguous.
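Part of this check can be automated. The sketch below flags columns whose text values contain a separator, which often means two variables were squeezed into one cell. The list of separators and the sample column names are assumptions, and free-text answers will trigger false alarms, so treat the result as a list of columns to look at by hand.

```python
import re

SEPARATORS = re.compile(r"[,;/|]")  # assumed delimiters that hint at packed values

rows = [
    {"Age": 34, "Gender": "Female", "Profile": "34, Female"},
    {"Age": 27, "Gender": "Male", "Profile": "27, Male"},
]


def suspicious_columns(rows):
    """Return the sorted names of columns where any text value contains a separator."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and SEPARATORS.search(value):
                flagged.add(column)
    return sorted(flagged)


print(suspicious_columns(rows))
```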

Those are the steps for cleaning your messy survey data. In my next post, I will introduce how to conduct simple data analysis, hypothesis tests (including t-test, ANOVA, regression), and visualization in JMP, a statistical software package from SAS.

