Fixing Encoding Issues in Ruby with String#scrub

When working with data in Ruby, especially when processing user input or reading from external sources, it’s common to come across strings containing bytes that are not valid in the string’s encoding. This typically happens when data is labeled with the wrong encoding, truncated partway through a multi-byte character, or corrupted in transit. Thankfully, Ruby (since version 2.1) offers a built-in method called String#scrub to handle these cases cleanly and efficiently.

What is the scrub Method?

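The scrub method returns a copy of a string in which every invalid byte sequence has been replaced with a string of your choice. If you pass no argument, the replacement is U+FFFD (the Unicode replacement character) for Unicode encodings and "?" otherwise. The in-place variant is scrub!. It’s handy when you need to ensure that a string is valid UTF-8, which matters for email addresses, names, and other user-facing or system-critical data, especially when external sources send malformed input.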
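
As a minimal sketch of the three ways to call scrub (no argument, a replacement string, or a block), the snippet below uses a made-up sample string, `"caf\x80 latte"`. It assumes the source file uses Ruby’s default UTF-8 encoding, so the literal is tagged as UTF-8 even though it contains a stray byte.

```ruby
# Sample input (hypothetical): a UTF-8 string with one stray \x80 byte.
raw = "caf\x80 latte"

# No argument: the invalid byte becomes U+FFFD, the Unicode replacement character.
default_scrubbed = raw.scrub

# String argument: the invalid byte becomes the given replacement.
custom_scrubbed = raw.scrub("?")

# Block form: the block receives the invalid bytes and returns their replacement.
# Here the bad bytes are rendered as hex so they remain visible in logs.
block_scrubbed = raw.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }

puts default_scrubbed
puts custom_scrubbed
puts block_scrubbed

# scrub returns a new string, so the original is still invalid.
puts raw.valid_encoding?
```

The block form can be useful when you want to keep a trace of what was removed rather than silently dropping it.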

Example

Imagine you received an email string from an external system, but it contains an invalid byte:

email = "michael.smith\x80@example.com"

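The byte \x80 is not valid here: in UTF-8 it is a continuation byte, which may only appear after the lead byte of a multi-byte character, and in this string it follows the ASCII letter h. Trying to use this string as-is could cause errors, such as an ArgumentError when matching it against a regular expression, or unexpected behavior in libraries that expect properly encoded strings.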
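
To make the problem concrete, here is a small sketch that checks the string with valid_encoding? and then attempts a regular expression match. It assumes the default UTF-8 source encoding. The pattern is only an illustration and is not taken from any particular library.

```ruby
email = "michael.smith\x80@example.com"

puts email.encoding         # the literal is tagged with the source encoding
puts email.valid_encoding?  # false: the bytes do not form valid UTF-8

begin
  # Regexp matching refuses to run against a string with a broken encoding.
  email =~ /@example\.com\z/
rescue ArgumentError => e
  puts "#{e.class}: #{e.message}"
end
```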

To clean up the string and ensure it’s safe to use, you can pass an empty string to scrub, which removes the invalid byte entirely:

cleaned_email = email.scrub('')
puts cleaned_email

Output:

michael.smith@example.com

Now the string is valid UTF-8, and you can safely use or validate the email address without issues.

The scrub method is a simple but powerful tool for dealing with corrupted or malformed data in Ruby. By ensuring your strings are clean and properly encoded, you can avoid subtle bugs and keep your application behaving reliably.

If you’re working with data from external sources like CSV files, APIs, or form submissions, using scrub can save you a lot of debugging time and protect your application from unpredictable behavior.
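
For data that arrives as raw bytes, one possible pattern is to tag the bytes as UTF-8 and then scrub them. The sketch below uses a hypothetical helper, `clean_utf8`, and simulates external input with a binary string. Both the helper name and the sample data are assumptions made for illustration.

```ruby
# Hypothetical helper: treat incoming bytes as UTF-8 and replace anything invalid.
def clean_utf8(raw, replacement = "")
  # dup so the caller's string is untouched; force_encoding only relabels
  # the bytes, and scrub then replaces whatever is not valid UTF-8.
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# Simulated external input: bytes tagged as binary, as File.binread would return.
incoming = "name,email\nMichael,michael.smith\x80@example.com\n".b

cleaned = clean_utf8(incoming)

puts cleaned.encoding
puts cleaned.valid_encoding?
puts cleaned
```

This only makes sense when the data is supposed to be UTF-8. If the source actually uses another encoding, such as ISO-8859-1, transcoding with String#encode is likely the better fit, since scrubbing would discard characters that are in fact legitimate.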