2024-08-30

How to Remove Diacritics and Sanitize Strings in Python

When working with international text data, you often need to convert characters with diacritical marks (like accents, umlauts, or other language-specific symbols) to their basic Latin equivalents. This process, known as diacritic removal or string sanitization, can be crucial for tasks such as:

  • Improving search functionality
  • Normalizing data for analysis
  • Ensuring compatibility with systems that only support ASCII characters

In this guide, we'll show you how to easily sanitize strings in Python using the unidecode library.

Step 1: Install the unidecode Library

First, you need to install the unidecode library. Open your terminal or command prompt and run:

pip install unidecode

Step 2: Import the Library

In your Python script, import the unidecode module:

import unidecode

Step 3: Create a Sanitization Function

Define a function that will sanitize your strings:

def sanitize_string(text):
    return unidecode.unidecode(text)

This function takes a string as input and returns the sanitized version.

Step 4: Use the Function

Now you can use this function to sanitize any string. Here's an example with various languages:

# Example usage
examples = [
    "Łódź",  # Polish
    "España",  # Spanish
    "München",  # German
    "François",  # French
    "Νίκος",  # Greek
    "Россия"  # Russian
]

for example in examples:
    sanitized = sanitize_string(example)
    print(f"Original: {example}")
    print(f"Sanitized: {sanitized}")
    print()

Step 5: Run the Script

When you run this script, you'll see output like this:

Original: Łó
Sanitized: Lodz

Original: España
Sanitized: Espana

Original: München
Sanitized: Munchen

Original: François
Sanitized: Francois

Original: Νίκος
Sanitized: Nikos

Original: Россия
Sanitized: Rossiia

Alternative Libraries

While unidecode is comprehensive, you might also explore alternatives like unicodedata from the Python standard library for more control over the process.