How to Remove Duplicate Rows from a Survey Dataset

Learn how to remove duplicate rows from a survey dataset using Python and pandas. Clean your data by identifying and dropping repeated entries to ensure accurate analysis.




Duplicate rows in survey data occur when the same response is recorded more than once, often due to system errors or repeated submissions. 

Each duplicate is counted as an extra response, which inflates totals and skews percentages, so duplicates should be removed before analysis.


Step 0: Load the data

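import pandas as pd
from google.colab import files  # available only inside Google Colab

# Open Colab's upload dialog and take the name of the first uploaded file
uploaded = files.upload()
file_name = list(uploaded.keys())[0]

# Skip the first 4 rows of the sheet so the column names are read from row 5
df = pd.read_excel(file_name, skiprows=4)

# Drop columns that are completely empty
df = df.dropna(axis=1, how='all')

# Preview the first 5 rows
df.head()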
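To see what skiprows=4 and dropna(axis=1, how='all') do without uploading a file, here is a minimal sketch on made-up data. It is an illustration only: pd.read_csv stands in for pd.read_excel so the example runs from an in-memory string, and the preamble lines and column names (id, answer, unused) are assumptions, not the layout of the real survey file.

```python
import io
import pandas as pd

# Made-up export: 4 preamble lines, then the real header row and 2 data rows.
# The "unused" column has no values at all.
raw = "\n".join([
    "Survey export",
    "Source: demo",
    "Round: demo",
    "Notes: none",
    "id,answer,unused",
    "1,yes,",
    "2,no,",
])

# skiprows=4 discards the 4 preamble lines, so line 5 becomes the header
demo = pd.read_csv(io.StringIO(raw), skiprows=4)

# Drop columns in which every value is missing
demo = demo.dropna(axis=1, how="all")

print(demo.columns.tolist())
print(len(demo))
```

If the column names in your own file look wrong after loading, the number passed to skiprows is the first thing to check.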



Step 1: Identify duplicates

df.duplicated().sum()
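The count tells you how many rows are repeats, but not which ones. A small sketch on a toy DataFrame shows how to look at the repeated rows themselves; the column names respondent_id and answer are hypothetical and are not taken from the survey file.

```python
import pandas as pd

# Toy data standing in for the survey; column names are hypothetical
sample = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3],
    "answer": ["yes", "no", "no", "yes"],
})

# duplicated() flags every repeat after the first occurrence
dup_count = int(sample.duplicated().sum())

# keep=False flags all copies, including the first, so they can be inspected together
all_copies = sample[sample.duplicated(keep=False)]

print(dup_count)
print(all_copies)
```

Looking at the flagged rows before dropping them is a quick way to check that they really are repeats and not two different respondents who happened to give the same answers.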





Step 2: Remove duplicates

df = df.drop_duplicates()
print(df)
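Called with no arguments, drop_duplicates() only removes rows that match in every column. A repeated submission often differs in one field, such as a timestamp, and is then not caught. The sketch below assumes the data has an identifier column; respondent_id and submitted_at are hypothetical names used for illustration.

```python
import pandas as pd

# Respondent 2 submitted twice, one minute apart (hypothetical columns)
sample = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3],
    "submitted_at": ["09:00", "09:05", "09:06", "09:10"],
    "answer": ["yes", "no", "no", "yes"],
})

# Exact-match deduplication removes nothing here because the timestamps differ
exact = sample.drop_duplicates()

# Deduplicate on the identifier only, keeping the first submission per respondent
by_id = sample.drop_duplicates(subset=["respondent_id"], keep="first")
by_id = by_id.reset_index(drop=True)

print(len(exact), len(by_id))
```

Whether to keep the first or the last submission (keep="last") depends on the survey's rules, so treat the choice here as an assumption.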



Step 3: Confirm removal

# Should print 0 once all duplicates are gone
print(df.duplicated().sum())
print(df)
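Another way to confirm the removal is to compare row counts before and after. This is a sketch on toy data with hypothetical column names; the same checks can be applied to df.

```python
import pandas as pd

# Toy data with one exact duplicate row (hypothetical columns)
sample = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3],
    "answer": ["yes", "no", "no", "yes"],
})

rows_before = len(sample)
expected_removed = int(sample.duplicated().sum())

deduped = sample.drop_duplicates()
rows_after = len(deduped)

# The number of rows dropped should equal the number of rows flagged earlier
assert rows_before - rows_after == expected_removed

# No duplicates should remain
remaining = int(deduped.duplicated().sum())
assert remaining == 0

print(f"Removed {rows_before - rows_after} of {rows_before} rows")
```

Recording how many rows were removed also gives you a figure to report alongside the cleaned dataset.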




Key point

Always deduplicate early in your pipeline to ensure each survey response is counted only once.

