Q: I have an IFS file that's in EBCDIC. How can I translate it to ASCII?
Q: My file is ASCII. How can I make it UTF-8?
Q: My customer has uploaded a document, and the IFS thinks it's ASCII, but it's really EBCDIC. How do I fix this problem?
A: The IBM i operating system provides simple tools for converting from any character set to any other character set. It also provides an easy way to describe the character set of a file, so programs will know what data to expect in the file.
I'll start out with some background information that will help you understand how this works. If you already know how it works, or you want the "abridged version," you can skip down to the sections titled "Changing the CCSID in the IFS" and "Translating a file in the IFS."
There are many different character sets in the world. We're all familiar with the concept of ASCII and EBCDIC, but if you think about it, there are many different character sets called ASCII! The ASCII used on the Apple II that I grew up with is different from the one used on the original IBM PC, which in turn is different from the one used on the Unix machines of the time. Furthermore, it varies from culture to culture. For example, the ASCII character set used in the U.K. is different from the one used in the U.S., because the U.K. uses some different symbols, such as a different currency symbol. Even more different is the ASCII used in Russia, which includes many Cyrillic characters that we do not use in English. Then consider the Asian languages, whose writing systems have thousands of characters, all different from what we use in English. Those are just a few examples. I'm sure you'll agree that there are many, many different varieties of ASCII.
The same is, of course, true of EBCDIC. Just as there are many varieties of ASCII, there are also many varieties of EBCDIC, and for the same reasons.
Then, there's also Unicode. Unicode is designed to be universal, containing all the characters in all cultures. Yet there are still several varieties of it: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, and so forth.
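To make the "several varieties" point concrete, here's a small Python sketch that encodes one Unicode character three ways. The choice of the euro sign and of the big-endian byte order is mine, purely for illustration.

```python
# One Unicode character, encoded by three different Unicode encodings.
# (Illustrative only; the euro sign and big-endian variants are arbitrary picks.)
text = "\u20ac"  # U+20AC EURO SIGN

encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Show how many bytes each encoding needs, and what those bytes are.
    print(f"{name:10} {len(raw)} bytes  {raw.hex()}")

# Every encoding decodes back to the same character.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
```

The character is the same in every case; only the bytes stored on disk differ, which is why "Unicode" alone isn't enough to describe a file.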
Indeed, there are many thousands of character sets in the world. So how do you keep track of them all? How do you specify the precise character set you want? That's where CCSIDs come in.
CCSID stands for Coded Character Set Identifier--though most people just refer to them as CCSIDs. The acronym CCSID is often pronounced like "see-sid."
What is a CCSID, really? Well, suppose you were to take all those different flavors of ASCII, EBCDIC and Unicode and list them on a piece of paper. Then, assign a number to every character set you came up with. That way, whenever you refer to that number, the system would know precisely which character set you mean to use. That would be a CCSID! That's all a CCSID is: A number that identifies a specific character set.
Here's a list of a few CCSIDs that I use, just to give you the idea:
| CCSID | Description |
|---|---|
| 37 | EBCDIC used in U.S., Canada, Brazil, Portugal, New Zealand, Australia, Netherlands |
| 285 | EBCDIC used in U.K. |
| 819 | ISO 8859-1 (a Latin-1 variant of ASCII) |
| 1252 | Microsoft Windows variant of Latin-1 ASCII |
| 1208 | UTF-8 (Unicode) |
| 1200 | UTF-16 (Unicode) |
| 13488 | UCS-2 (Unicode) |
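If you'd like to see why the number matters, here's a rough Python sketch. Python doesn't know about CCSIDs, so the `CCSID_TO_CODEC` table below is my own hypothetical mapping from a few of the CCSIDs above to Python codec names that I'm assuming are close equivalents. The helper `decode_as` is likewise made up for this example.

```python
# Hypothetical mapping from a few CCSIDs to roughly equivalent Python codecs.
# This table is an assumption for illustration, not an IBM-supplied mapping.
CCSID_TO_CODEC = {
    37: "cp037",      # U.S. EBCDIC
    819: "latin-1",   # ISO 8859-1
    1252: "cp1252",   # Windows Latin-1
    1208: "utf-8",    # UTF-8
}


def decode_as(raw: bytes, ccsid: int) -> str:
    """Interpret raw bytes as text in the character set named by ccsid."""
    return raw.decode(CCSID_TO_CODEC[ccsid])


# The very same byte means different things under different CCSIDs.
raw = bytes([0x5B])
for ccsid in (37, 819, 1252):
    print(ccsid, repr(decode_as(raw, ccsid)))
# 37 '$'
# 819 '['
# 1252 '['
```

The byte never changes. Only the number that says how to interpret it does.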
The IBM i operating system is a fantastic operating system for working with different character sets because every disk object has a CCSID attribute, and the operating system has built-in tools that automatically convert from one CCSID to another as desired.
Contrast this with Windows. When you install Windows, you select your language and region, and Windows figures out the appropriate character set. From that point on, all files you create are assumed to be in that character set. If they're not, it's up to the application to somehow keep track of each object and which character set it was created with. Windows doesn't do it; the application has to. Likewise, since Windows assumes everything to be in the same character set, it doesn't have the capability of automatically translating the data.
On the IBM i operating system (like most operating systems), all disk objects have attributes. These attributes are maintained by the operating system to provide users and programs with information about the objects. Common attributes are the date the object was created, the date it was last used, the size of the object on disk, and so forth.
On IBM i, there's also an attribute that keeps track of the CCSID of an object. If a program opens a disk object in "text" mode, the system automatically translates it from the CCSID of the disk object to the CCSID requested by the program. That way, it's easy to write programs that work with any possible character set! The program has to worry only about its own CCSID and lets the OS take care of translation.
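Python's file API offers a loose analogy to "text" mode, which may help if you haven't seen this behavior before. The big difference is that Python doesn't store a CCSID with the file, so in this sketch the `FILE_CODEC` constant (my own name) plays the role of the file's CCSID attribute, and `cp037` is assumed to be a reasonable stand-in for CCSID 37.

```python
import os
import tempfile

# Stand-in for the file's CCSID attribute (assumption: cp037 ~ CCSID 37).
FILE_CODEC = "cp037"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Create a small "EBCDIC file" on disk.
    with open(path, "wb") as f:
        f.write("HELLO 123".encode(FILE_CODEC))

    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: the bytes are translated to the program's own
    # representation while they are being read.
    with open(path, "r", encoding=FILE_CODEC) as f:
        text = f.read()
finally:
    os.remove(path)

print(raw.hex())  # c8c5d3d3d640f1f2f3
print(text)       # HELLO 123
```

On IBM i the program doesn't have to pass the file's character set the way this sketch does, because the system reads it from the object's CCSID attribute.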
It's important to understand that the CCSID is just an attribute of a file, and a file is only a collection of bytes. The system doesn't know that those bytes are text, and doesn't know which character set that text is supposed to be in, unless you tell it.
Think of the CCSID attribute in much the same way you'd think of a label on a jar of jelly. You might have strawberry, raspberry, or grape jelly. How do you know which one is in the jar? Well, presumably at the time the makers filled the jar, they knew what it was, and they took the care to put the right label on it. When the time comes to use that jelly, you can read that label, and you'll know what's inside.
Suppose, however, that the person canning the jelly was less careful, and put a raspberry label on a jar of strawberry jelly? Now you have a bit of a problem.
The same sort of problem occurs with CCSIDs. When a user creates a file, he has to mark it with a CCSID (or take the default). When a program reads the file, if it was marked with the wrong CCSID, characters will be mistranslated. A file that contains ASCII data but was marked with an EBCDIC CCSID will look like "garbage" when it's viewed. Likewise, a file with U.S. EBCDIC data that is marked with a German EBCDIC CCSID will be mistranslated. Since they're both EBCDIC, most characters will look good, but some, like the $ (dollar sign) or @ (at symbol), will be mistranslated.
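Here's a minimal Python sketch of both kinds of mislabeling. I'm assuming Python's `cp037` and `cp273` codecs are fair stand-ins for the U.S. and German EBCDIC CCSIDs, and the sample strings are invented.

```python
# Case 1: ASCII data that is wrongly treated as U.S. EBCDIC.
ascii_bytes = "Total: 5 items".encode("ascii")
as_ascii = ascii_bytes.decode("ascii")   # right label: readable
as_ebcdic = ascii_bytes.decode("cp037")  # wrong label: "garbage"

print(repr(as_ascii))
print(repr(as_ebcdic))

# Case 2: U.S. EBCDIC data that is wrongly treated as German EBCDIC.
# (Assumption: cp273 approximates the German EBCDIC CCSID.)
us_bytes = "ORDER@EXAMPLE".encode("cp037")
as_german = us_bytes.decode("cp273")

# The letters survive, but the at symbol does not.
print(repr(as_german))
```

In the first case nothing is readable. In the second, the text looks nearly right, which makes the problem easy to miss.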
Mismarked CCSIDs are extremely common when a file is created by a file transfer application such as FTP. Oftentimes, a file is transferred from a system such as Windows or Unix that has no notion of the character set of the file--so the FTP server doesn't know which CCSID to mark the data with. Instead, it marks it with whatever the default CCSID of the FTP server is (it's usually 819, but you can set this with the CHGFTPA command). If the data in the file doesn't happen to be in that default character set, you'll have a mismarked file.
Changing the CCSID in the IFS

When you find a file in the IFS marked with the wrong CCSID, you can change it with the CHGATR CL command. For example:
CHGATR OBJ('/cust/89421/uploads/dailybatch.csv') ATR(*CCSID) VALUE(1252)
This simply changes the CCSID attribute, in other words, the label on the outside of the jelly jar. It does not change the contents of the file. So if the file was marked with the wrong CCSID, you can use this command to correct the problem.
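To illustrate the "label versus contents" distinction, here's a toy Python model. The `TaggedFile` class and `retag` function are hypothetical names of my own. They imitate the idea of a CCSID attribute but are not how IBM i implements it. The codec names are assumed stand-ins for CCSIDs 819 and 1252.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaggedFile:
    """Toy model of a file: raw bytes plus a label naming their character set."""
    data: bytes
    codec: str  # plays the role of the CCSID attribute

    def read_text(self) -> str:
        # Interpret the bytes according to the label, right or wrong.
        return self.data.decode(self.codec)


def retag(f: TaggedFile, codec: str) -> TaggedFile:
    """Change only the label, the way a CCSID attribute change would."""
    return replace(f, codec=codec)


# Windows Latin-1 data that arrived labeled as ISO 8859-1.
euro_price = "price: 10\u20ac"  # ends with a euro sign
upload = TaggedFile(data=euro_price.encode("cp1252"), codec="latin-1")

fixed = retag(upload, "cp1252")

print(repr(upload.read_text()))  # euro sign shows up as a control character
print(repr(fixed.read_text()))   # euro sign shows up correctly
print(fixed.data == upload.data)  # True
```

The bytes are identical before and after. Only the interpretation changed.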
Translating a file in the IFS

Sometimes the CCSID is marked correctly, but you want to translate a file from one character set to another. For example, perhaps you have software running in PASE that expects a file to be in ASCII--after all, PASE software is AIX software, so it can't take advantage of the IBM i CCSID support, and therefore the data must be translated to ASCII before it can be used.
Other common examples: You have data in EBCDIC that needs to be ASCII so a Windows program can read it. Or it needs to be Unicode so a Java program can read it.
It's easy to translate IFS files from any CCSID to any other CCSID by copying them with the CPY CL command:
CPY OBJ('/cust/89421/work/temp.csv') +
    TOOBJ('/cust/89421/downloads/dailyoutput.csv') +
    FROMCCSID(*OBJ) +
    TOCCSID(1252) +
    DTAFMT(*TEXT)
The special value of *OBJ means that the CPY command will use the CCSID that the file is marked with. In the above example, I've created a temporary file (temp.csv) that's in EBCDIC CCSID 37. I'm using the CPY command to translate it from 37 to Windows Latin-1 ASCII (CCSID 1252). In this example, my customer will later retrieve the file using SFTP, and that customer won't have to worry about translating it, because I've already converted it to ASCII.
The same technique works anytime you need the file to be ASCII. It doesn't matter whether FTP, SFTP, HTTP, or Windows Networking is used to transfer the file, or whether you need it to be ASCII for use locally, such as a PASE program. The point is, I've translated the data from EBCDIC to ASCII by using the CPY command.
Simply by changing the TOCCSID, I can translate it to a different character set entirely. For example, if I had used TOCCSID(1208), the output would've been UTF-8 instead of ASCII.
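Conceptually, translation means decoding the bytes from the source character set and re-encoding them in the target one. Here's a rough Python sketch of that idea. The `translate` helper is my own invention, and the codec names are assumed equivalents of CCSIDs 37, 1252, and 1208. It is not what CPY does internally.

```python
def translate(data: bytes, from_codec: str, to_codec: str) -> bytes:
    """Re-encode data from one character set to another."""
    return data.decode(from_codec).encode(to_codec)


# Sample EBCDIC data containing one accented character.
ebcdic = "caf\u00e9,10".encode("cp037")

windows = translate(ebcdic, "cp037", "cp1252")  # like TOCCSID(1252)
utf8 = translate(ebcdic, "cp037", "utf-8")      # like TOCCSID(1208)

print(windows)  # b'caf\xe9,10'
print(utf8)     # b'caf\xc3\xa9,10'
```

Unlike changing the CCSID attribute, this changes the bytes themselves. The text they represent stays the same, though the UTF-8 result is one byte longer because the accented character takes two bytes.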