I recently published an article about Input sanitization and escaping for database and STDOUT using PHP. Announcing the article started a discussion on Freenode: when data should be sanitised, how, and whether it should be sanitised at all. I was in favour of sanitising data whenever possible. It turns out I was wrong.
Allow me to illustrate the argument and why I was wrong.
$input = filter_input_array(INPUT_POST, [
    'name' => FILTER_SANITIZE_STRING,
    'username' => FILTER_SANITIZE_STRING,
    'password' => FILTER_SANITIZE_STRING
]);
Name
Name is expected to be an alphabetical string, possibly including a dot, a hyphen or an apostrophe. "Tom Becker" and "Martha Tilly-Watson" will pass the filter untouched. "Charles de Batz-Castelmore d'Artagnan" will not: by default FILTER_SANITIZE_STRING also encodes quotes, so the apostrophe comes back as &#39; unless FILTER_FLAG_NO_ENCODE_QUOTES is set. That is easy to miss, because the manual describes the filter only as "Strip tags, optionally strip or encode special characters."
(http://am.php.net/manual/en/filter.filters.sanitize.php)
Username
Username is often an alphanumeric string, possibly containing other characters and symbols, such as >.!()<*. The first realisation is that anything in the username that looks like an HTML tag will be stripped, e.g. "3<5 6>2" will become "32" and "<tron>" will vanish completely.
If you are using data sanitisation in a similar way to the example below, then you are doing it wrong.
$username = filter_input(INPUT_POST, 'username', FILTER_SANITIZE_STRING);
if(!empty($username) /* and not in the database */) { /* proceed with the registration */ }
You are changing the user's input without even warning the user. Instead, you should check whether the original value has been changed (e.g. $username !== $_POST['username'], using a strict comparison so that numeric-looking strings are not compared as numbers); if it has, there are two things you can do about it:
- Redirect the user back to the form and tell them that the value of field X has been sanitised. The user can either approve the change by resubmitting the form or amend the value before doing so.
- Redirect the user back to the form, keep the original data in the form and inform the user that field X does not comply with the expected format.
I personally prefer the second option because the data is kept intact. Some (as I did earlier) will argue that this adds redundancy, since the user has to amend the input manually. However, consider this example: what if I wrote a long equation in my notebook, it vanished the moment I submitted the form, and I can no longer recall it?
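The second option could be sketched roughly as follows. This is only an illustration under assumptions: the function name validateUsername(), the allowed character set and the length limits are hypothetical, not rules taken from the discussion. The point is that the input is checked but never modified, and the original value is handed back so the form can be re-populated.

```php
<?php
/**
 * Validate a username without altering it.
 * The whitelist and length limits below are assumptions for illustration.
 *
 * @return array{valid: bool, value: string, error: ?string}
 */
function validateUsername(string $raw): array
{
    // Assumed rule: 3 to 32 characters from a fixed set of letters, digits and symbols.
    $ok = preg_match('/\A[A-Za-z0-9._!()*-]{3,32}\z/', $raw) === 1;

    return [
        'valid' => $ok,
        'value' => $raw, // always the untouched input, so nothing the user typed is lost
        'error' => $ok ? null : 'Username does not comply with the expected format.',
    ];
}

// Example with a literal value; in a real form handler the value would come from the request.
$result = validateUsername('<tron>');
if (!$result['valid']) {
    // Send the user back to the form with $result['value'] and $result['error'].
    echo $result['error'], PHP_EOL;
}
```

When the form is rendered again, the original value still has to be escaped for HTML (for example with htmlspecialchars()) before it is placed back into the input field.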
Password
However, the example that struck me the most was the password field. I used to apply the FILTER_SANITIZE_STRING filter without realising that I was corrupting the data. The worst part is that you are not given any warning: if you filter the data the same way in the registration and authentication flows, then the password hash is the same. That is, until you fix the issue and users start getting seemingly random "wrong password" errors.
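To make the failure mode concrete, here is a small sketch. FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, so strip_tags() stands in for the sanitiser here; that substitution and the sample password are assumptions made for the illustration. password_hash() and password_verify() are the standard PHP functions.

```php
<?php
// Hypothetical password that happens to contain a tag-like sequence.
$raw = 'p<a>ssw0rd';

// What the old registration flow stored: a hash of the *altered* password.
// strip_tags() is a stand-in for the sanitising filter (assumption).
$altered = strip_tags($raw);
$hashFromSanitisedFlow = password_hash($altered, PASSWORD_DEFAULT);

// While login also sanitises, both sides agree and nothing looks wrong.
$loginWhileSanitising = password_verify(strip_tags($raw), $hashFromSanitisedFlow);

// After the "fix", login passes the raw value and the stored hash no longer matches.
$loginAfterFix = password_verify($raw, $hashFromSanitisedFlow);

var_dump($loginWhileSanitising, $loginAfterFix);
```

The user typed the same password both times; only the server-side handling changed, which is why the resulting errors look random from the user's point of view.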
The truth is that in most cases there is no need for FILTER_SANITIZE_STRING; FILTER_UNSAFE_RAW (or no filter at all) can be used instead. Unless you specifically intend otherwise, the data should be stored in the database intact and converted to HTML entities only when it is output on the frontend.
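A minimal sketch of "store intact, encode on output" might look like this. The helper name e() and the sample value are hypothetical; htmlspecialchars() is the standard PHP function for encoding HTML special characters.

```php
<?php
/**
 * Escape a stored value for an HTML context at output time.
 * The helper name e() is hypothetical.
 */
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Stored exactly as the user typed it (e.g. read with FILTER_UNSAFE_RAW).
$storedUsername = '<tron>';

// The value is only encoded at the point where it is written into HTML.
echo '<p>Hello, ' . e($storedUsername) . '</p>', PHP_EOL;
```

Because the encoding happens at the point of output, the same stored value can still be used unchanged in other contexts, such as a plain-text email or a JSON response.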