Tutorial: Always these invisible control characters in data feeds!

How to identify and remove control characters in data feeds

A data feed contains structured information, often in rows and columns, that is transferred automatically between systems. Typical examples are product data for services such as Google Shopping, AdWords or affiliate programs. The great advantage of data feeds is that they can also transfer large data sets. However, stowaways keep creeping into data feeds and triggering problems: control characters. They end up in our feeds for various reasons and stay unnoticed until a technical tool such as a hex editor makes them visible. The BOM (byte order mark), for example, is one such invisible character that causes problems when a feed is read or processed. You should therefore identify and remove these characters, which is not always easy precisely because you cannot see them.
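If no hex editor is at hand, a few lines of Python can serve a similar purpose. The following is only a minimal sketch: the function name find_invisible and the sample string are made up for illustration, and it treats every character in the Unicode categories Cc (control) and Cf (format) as "invisible", which is an assumption you may want to narrow.

```python
import unicodedata

def find_invisible(text):
    """Return (index, code point, Unicode name) for each control or format character."""
    hits = []
    for i, ch in enumerate(text):
        # Cc = control characters (this includes tabs and line breaks),
        # Cf = format characters such as the BOM or the zero-width space.
        if unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical feed value: a BOM in front and a zero-width space in the middle.
sample = "\ufeffABC\u200b123"
for hit in find_invisible(sample):
    print(hit)
# (0, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')
# (4, 'U+200B', 'ZERO WIDTH SPACE')
```

Note that U+FEFF, the code point used as the BOM, is listed under its Unicode name ZERO WIDTH NO-BREAK SPACE.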

How unwanted control characters arise

There are various reasons why control characters appear in a data feed. One common cause is external sources. For example, when data comes from suppliers or is copied and pasted from a Word or Excel file into the company's own systems, invisible characters can travel along via the clipboard. They are invisible because the browser does not display them after pasting, so it looks as if nothing extra was transferred. Unless someone explicitly looks for them, they remain undetected and can cause considerable trouble for string comparisons and matching between systems, among other things.

Common problems caused by invisible control characters

The following example shows the problems that control characters in a data feed can cause.

The file name "myproductfile.xls" is copied and pasted into the data feed from an external Word file, and an invisible character comes along at the beginning or end of the name. The download process then looks for a file whose name includes the invisible character. No such file exists, because the only file available has a name without this character. The result is an error message and usually a futile search for the cause, since nothing looks wrong in the data feed.

However, this problem can also occur in reverse: a file (data feed) may already contain invisible control characters, for example in the article number. The stored value might be ABC[strangeCharacter]123, yet only ABC123 is displayed when you read the file. If you now search for ABC123 with Ctrl+F, you will not find any entry.
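The effect is easy to reproduce. In this small sketch, a zero-width space (U+200B) stands in for the [strangeCharacter]; that choice and the variable names are assumptions for illustration only.

```python
clean_id = "ABC123"
feed_id = "ABC\u200b123"  # looks like ABC123 on screen, but hides a zero-width space

print(feed_id == clean_id)          # False: the two strings are not equal
print(clean_id in feed_id)          # False: a search for ABC123 finds nothing
print(len(clean_id), len(feed_id))  # 6 7
print(feed_id.encode("utf-8"))      # b'ABC\xe2\x80\x8b123'
```

The extra bytes in the last line are what a hex editor would show at this position.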

Besides file names and article numbers, many other fields in the data feed can be affected, for example order numbers or customer names. In short, control characters cause a flood of error messages that are easy to fix but difficult to track down.

Remove control characters from the data feed

If you encounter such problems, it is advisable to check all ID and identifier fields such as article numbers, order numbers and EANs. We recommend using the so-called RegEx function for this. How exactly this works is explained in our Cookbook entry "How do I remove control characters like bom-byte-order-mark?"
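To give a rough idea of what such a regular expression can look like outside the platform, here is a minimal Python sketch. The pattern is an assumption and not the expression from the Cookbook: it covers the C0 and C1 control characters plus a handful of common invisible format characters (zero-width characters, direction marks, word joiner, BOM). The field names in ID_FIELDS are hypothetical as well.

```python
import re

# Assumed character class: C0/C1 controls, U+200B to U+200F, U+2060 and the BOM (U+FEFF).
INVISIBLE = re.compile(r"[\x00-\x1f\x7f-\x9f\u200b-\u200f\u2060\ufeff]")

# Hypothetical names of the identifier columns in the feed.
ID_FIELDS = ("article_number", "order_number", "ean")

def clean_identifier(value):
    """Remove every character matched by INVISIBLE from an identifier."""
    return INVISIBLE.sub("", value)

def clean_row(row):
    """Clean only the identifier fields of one feed row; leave other fields untouched."""
    return {key: clean_identifier(value) if key in ID_FIELDS else value
            for key, value in row.items()}

row = {"article_number": "\ufeffABC\u200b123", "ean": "4006381333931\r", "title": "Blue pen"}
print(clean_row(row))
# {'article_number': 'ABC123', 'ean': '4006381333931', 'title': 'Blue pen'}
```

The pattern also removes tabs and line breaks. That is usually what you want in an ID field, but it would be too aggressive for free-text fields such as descriptions.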

Conclusion

Control characters in the data feed are a tricky problem in data processing: the characters are "invisible" but still there. They get into our data feeds by accident, especially through copy & paste from Word or Excel files, and cause a lot of chaos there.

To avoid this, you should remove invisible characters from all data before processing it further. This is possible with regular expressions in combination with the "Find & Replace" function or the Freemarker Replace script. These can be applied to data feeds using the Mapper Step.

More about this in our Cookbook

Last updated November 2, 2022