Data analytics, which supports decision-making, market research, and more, is becoming an important practice for many businesses. Before analyzing your data, however, you want to make sure that it is “ready”. In other words, the data should be clean, well-structured, and unambiguous so that flaws in it won’t distort the analysis that follows.
Here are the steps that you should take to check and clean your raw data:
1.Identify & remove invalid data
For example, in a database of survey responses, you should remove responses that came from survey tests or previews. Responses that are invalid because of technical errors or other problems should also be removed. For incomplete responses, the decision depends on how many questions were completed and whether the answers are good enough for analysis. If a participant completed the whole survey except for one demographic or open-ended question, I would keep that response as valid data.
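A minimal Python sketch of this filtering step, using only the standard library. The field names (`status`, `answers`), the `"test"` and `"preview"` labels, and the 80% completion cutoff are all assumptions made for illustration; they do not come from any particular survey tool, and the right cutoff depends on your own survey.

```python
# Hypothetical raw responses; None marks an unanswered question.
responses = [
    {"id": 1, "status": "live", "answers": [5, 3, 6, 2, 4]},
    {"id": 2, "status": "preview", "answers": [1, 1, 1, 1, 1]},  # preview run
    {"id": 3, "status": "live", "answers": [4, None, 5, 6, 3]},  # one item skipped
    {"id": 4, "status": "live", "answers": [2, None, None, None, None]},
    {"id": 5, "status": "test", "answers": [7, 7, 7, 7, 7]},     # test run
]

MIN_COMPLETION = 0.8  # assumed cutoff: keep responses with at least 80% answered


def completion_rate(answers):
    """Share of questions that have an answer."""
    answered = sum(1 for a in answers if a is not None)
    return answered / len(answers)


def is_valid(response):
    """Drop test/preview runs and responses that are too incomplete."""
    if response["status"] in {"test", "preview"}:
        return False
    return completion_rate(response["answers"]) >= MIN_COMPLETION


clean = [r for r in responses if is_valid(r)]
valid_ids = [r["id"] for r in clean]
print(valid_ids)
```

Response 3 is kept because only one question is missing, while response 4 is dropped because most of it is blank.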
2.Rename & recode variables
Make sure your column names are the actual variable names, not values or ambiguous labels. For example, use the variable names “Ad Display Time”, “Ad theme” and “Age” as column headers instead of “Independent variable 1”, “Independent variable 2” and “Demographic”.
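If the data is loaded as a list of rows, the renaming could look roughly like this in Python. The sample rows and values are made up for illustration; only the header names come from the example above.

```python
# Mapping from placeholder headers to descriptive variable names.
RENAME = {
    "Independent variable 1": "Ad Display Time",
    "Independent variable 2": "Ad theme",
    "Demographic": "Age",
}

# Hypothetical rows, as they might come out of csv.DictReader.
rows = [
    {"Independent variable 1": 15, "Independent variable 2": "Humor", "Demographic": 34},
    {"Independent variable 1": 30, "Independent variable 2": "Family", "Demographic": 27},
]

# Rebuild each row with the new headers; unmapped columns keep their name.
renamed = [{RENAME.get(col, col): val for col, val in row.items()} for row in rows]
print(list(renamed[0].keys()))
```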
Sometimes variables are indicated as codes. For example, in the gender column, “Female” and “Male” may be indicated as “0” and “1” respectively. In order to analyze your data accurately and conveniently later, you should change “0” and “1” to “Female” and “Male”. Otherwise, some statistical tools may recognize them as numeric variables instead of nominal variables.
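A small Python sketch of this recoding, under the assumption that the codes arrive as text. The sample column is hypothetical, and mapping unknown codes to `None` is just one possible choice; it makes unexpected values easy to spot instead of silently keeping them.

```python
GENDER_LABELS = {"0": "Female", "1": "Male"}  # coding from the example above

# Hypothetical raw column; "" and "2" are codes the mapping does not cover.
raw_gender = ["0", "1", "1", "0", "", "2"]


def decode(code, labels):
    """Return the label for a code, or None if the code is not recognized."""
    # str(...).strip() lets 0, "0" and " 0 " all resolve to the same key.
    return labels.get(str(code).strip())


gender = [decode(code, GENDER_LABELS) for code in raw_gender]
print(gender)
```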
Also, when you ask two questions, one in a positive tone and one in a negative tone, and use a 7-point scale for both, you should reverse the responses to one of them so that the scores point in the same direction. For example, suppose your survey asks “do you like hamburgers” and “are you uncomfortable with fast food”. A score of 3 or less usually indicates dislike/dissatisfaction/negative in the first question but comfort/like/positive in the second, and vice versa for high scores.
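Reverse-scoring can be sketched in a few lines of Python. This assumes the 7-point scale runs from 1 to 7, so a reversed score is simply 8 minus the original; the function name and sample scores are made up for illustration.

```python
def reverse_score(score, low=1, high=7):
    """Flip a score on a scale running from low to high (assumed 1 to 7)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return low + high - score


# Hypothetical answers to "are you uncomfortable with fast food".
uncomfortable = [2, 7, 4, 1]

# After reversing, a high score means "comfortable", matching the positive item.
comfortable = [reverse_score(s) for s in uncomfortable]
print(comfortable)
```

The midpoint of the scale (4) stays where it is, and the endpoints swap.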
3.Form variables & set up formulas
As in the example above, sometimes you ask the same question in both positive and negative tones to make sure that participants understand it and are not biased by the wording. When you clean the data, you can then set up a column that averages the scores of these multi-measurement questions (after reverse-scoring the negatively worded one).
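A rough Python sketch of such an averaged column. The column names are hypothetical, and the negatively worded item is assumed to be reverse-scored already. Skipping missing answers rather than dropping the whole row is also an assumption that you may want to change.

```python
from statistics import mean

# Hypothetical cleaned rows; "fast_food_comfort" is already reverse-scored.
rows = [
    {"id": 1, "like_hamburgers": 6, "fast_food_comfort": 7},
    {"id": 2, "like_hamburgers": 2, "fast_food_comfort": 3},
    {"id": 3, "like_hamburgers": 5, "fast_food_comfort": None},  # one item missing
]

ITEMS = ["like_hamburgers", "fast_food_comfort"]


def composite(row, items):
    """Average the answered items; None if no item was answered."""
    scores = [row[item] for item in items if row[item] is not None]
    return mean(scores) if scores else None


# Add the averaged score as a new column on every row.
for row in rows:
    row["attitude_score"] = composite(row, ITEMS)

print([row["attitude_score"] for row in rows])
```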
4.Double check your data
Check again whether any column contains more than one variable and whether any name or value is still ambiguous.
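Part of this check can be automated. The Python sketch below flags cells that contain a separator character, which often means two variables ended up in one column. The separator list and the sample rows are assumptions for illustration, and a flagged cell still needs a human look, since free-text answers can legitimately contain commas or slashes.

```python
SEPARATORS = (";", ",", "/", "|")  # assumed signs of several values in one cell

# Hypothetical rows after the earlier cleaning steps.
rows = [
    {"Age": "34", "Gender": "Female", "Ad theme": "Humor"},
    {"Age": "27", "Gender": "Male", "Ad theme": "Family; 30s"},
    {"Age": "41/Male", "Gender": "", "Ad theme": "Humor"},
]


def suspicious_cells(rows, separators=SEPARATORS):
    """Return (row index, column, value) for every cell containing a separator."""
    flagged = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if any(sep in str(value) for sep in separators):
                flagged.append((index, column, value))
    return flagged


flagged = suspicious_cells(rows)
for index, column, value in flagged:
    print(f"row {index}, column {column!r}: {value!r}")
```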
Those are the steps for cleaning messy survey data. In my next post, I will introduce how to conduct simple data analysis, hypothesis tests (including t-tests, ANOVA, and regression), and visualization in JMP, a statistical software package from SAS.