How to Standardise Inconsistent Text Values (e.g. 'Nairobi' vs 'nairobi')

In real-world datasets, text columns are rarely clean. You’ll often see variations like "Kenya", "kenya ", "KENYA", or "Ken-ya"




Standardising these values is essential for reliable grouping, analysis, and reporting.

Below is a minimal, practical workflow using a CSV file in Python (Pandas).


1. Load the Data

import pandas as pd

df = pd.read_csv("survey_data.csv")

df.head()






2. Inspect the inconsistent column

df['country'].unique()

This helps you identify variations like:

  • extra spaces

  • inconsistent casing

  • spelling differences




3. Standardise text values

Step 1: Lowercase everything

Step 2: Strip spaces

Step 3: Replace known inconsistencies

df['country_clean'] = (
    df['country']
    .str.lower()
    .str.strip()
    .replace({
        'ken-ya': 'kenya',
        'k e n y a': 'kenya'
    })
)



4. Optional: Capitalise properly

If you want clean reporting format:

df['country_clean'] = df['country_clean'].str.title()


5. Validate results

df[['country', 'country_clean']].drop_duplicates()




Key idea

Standardisation is not just cleaning—it is making categories comparable so your analysis doesn’t split the same entity into multiple fake groups.



Advance Your Career With 16 Python Projects in Data & ML — All for $288.

Comments

Popular posts from this blog

How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

How to Decide Whether to Drop or Fill Missing Value

How to create your first line chart with World Bank Kenya GDP data