How to Standardise Inconsistent Text Values (e.g. 'Nairobi' vs 'nairobi')
In real-world datasets, text columns are rarely clean. You’ll often see variations like "Kenya", "kenya ", "KENYA", or "Ken-ya".
Standardising these values is essential for reliable grouping, analysis, and reporting.
Below is a minimal, practical workflow using a CSV file in Python (Pandas).
1. Load the Data
import pandas as pd
df = pd.read_csv("survey_data.csv")
df.head()
2. Inspect the inconsistent column
df['country'].unique()
This helps you identify variations like:
extra spaces
inconsistent casing
spelling differences
3. Standardise text values
Step 1: Lowercase everything
Step 2: Strip spaces
Step 3: Replace known inconsistencies
df['country_clean'] = (
df['country']
.str.lower()
.str.strip()
.replace({
'ken-ya': 'kenya',
'k e n y a': 'kenya'
})
)
4. Optional: Capitalise properly
If you want clean reporting format:
df['country_clean'] = df['country_clean'].str.title()
5. Validate results
df[['country', 'country_clean']].drop_duplicates()
Key idea
Standardisation is not just cleaning—it is making categories comparable so your analysis doesn’t split the same entity into multiple fake groups.
Advance Your Career With 16 Python Projects in Data & ML — All for $288.
Comments
Post a Comment