!==
!== Japanese-HOWTO.txt for Samba release 2.2.x
!==
Contributor: TAKAHASHI Motonobu
Date: 26 Nov 2001
Status: Current

How to use Japanese on Samba 2.0.x/2.2.x
========================================

This document explains how to use Japanese file names and share names
in Samba, and gives notes on doing so. Although the Samba Japanese
Edition provided by Samba Users Group Japan is widely used in Japan,
this document describes the original versions of Samba unless it
clearly states otherwise.

In addition, Samba can handle Japanese file names even if your UNIX
itself cannot handle Japanese.

Settings for using Japanese
===========================

To use Japanese, set two parameters appropriately: "client code page"
and "coding system". They specify that Japanese is used and which
encoding method is used for Japanese, respectively.

Both "client code page" and "coding system" must be placed at the top
of your smb.conf. Because these parameters also define the encoding of
any Japanese text in smb.conf itself, Samba cannot recognize that
encoding without them. Be careful when you edit smb.conf by hand! When
smb.conf is edited with SWAT, they are placed at the top of smb.conf
automatically.

Set "client code page" to 932, which is the code page for Japanese.

"coding system" is a parameter intended for Japanese environments. It
determines the encoding method used for Japanese file names on your
Samba server. Mainly for historical reasons, Japanese has several
encoding methods that are not fully compatible with each other.
Moreover, Samba offers several encoding methods of its own to keep
interoperability with UNIX systems that cannot use Japanese file
names. "coding system" defines which encoding method to use.

The decision of "coding system" to use
======================================

Deciding which "coding system" value to use is difficult.
At least five values are in general use: SJIS, EUC, CAP, HEX and UTF8
(UTF8 is available only in the Samba 2.2 series). Each has merits and
demerits.

The standard encoding method on Windows is "Shift_JIS", equivalent to
SJIS (although the Windows NT series uses Unicode 2.0 encoded as UCS-2
internally, Shift_JIS is used externally for Japanese, just as ASCII
is used for English). BUT using SJIS in Samba, the same as Windows, is
not always the best choice, as described below. Please read the
following explanation and choose a value suitable for you.

There are more values, but they are not explained here because they
are hardly ever used.

You can determine the value for "coding system" by the following
order of judgment. The details of each value are given later.

1. Set it to "HEX" unless one of the subsequent conditions is
   satisfied.

   "HEX" is safe because it uses only the ASCII characters [0-9a-f:]
   to express Japanese file names.

2. Set it to "CAP" if the directory shared with Samba is also shared
   with CAP or Netatalk.

   Since CAP and Netatalk usually write file names in "CAP" form,
   Samba needs to use the same encoding method. However, Netatalk
   with the EUC-JP patch applied writes file names in "EUC-JP" form,
   so in that case Samba needs to use "EUC" as well.

3. If you need to use Japanese file names on UNIX, set it to "EUC" if
   the form used on the UNIX is EUC-JP, to "SJIS" if it is Shift_JIS,
   and to "UTF8" if it is UTF-8.

   Usually, EUC-JP is used on Linux, FreeBSD, Solaris, IRIX and Tru64
   UNIX, and Shift_JIS is used on HP-UX and AIX. Most commercial UNIX
   systems can use either by changing the locale. However, much free
   software works only with EUC-JP regardless of the UNIX setting, so
   EUC-JP is also worth considering if you mainly use such software.

No single choice satisfies all conditions. If some conditions
conflict, unfortunately you need to give up one of them.
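
The judgment order above can be sketched as a small Python function.
This is only an illustration, not part of Samba: the function name and
its parameters are hypothetical, and letting the CAP/Netatalk condition
win over the UNIX locale condition is an assumption made here, since
the text only says that one conflicting condition has to be given up.

```python
def choose_coding_system(shared_with_cap_or_netatalk=False,
                         netatalk_has_eucjp_patch=False,
                         unix_japanese_encoding=None):
    """Return a "coding system" value following the judgment order.

    All parameter names are hypothetical (they are not Samba options).
    unix_japanese_encoding is None when Japanese file names are not
    needed on UNIX, else "EUC-JP", "Shift_JIS" or "UTF-8".
    """
    # Condition 2: the directory is also shared with CAP or Netatalk,
    # so Samba must write names the same way they do.
    # (Assumption: this condition takes priority over condition 3.)
    if shared_with_cap_or_netatalk:
        return "EUC" if netatalk_has_eucjp_patch else "CAP"

    # Condition 3: Japanese file names are used on UNIX itself,
    # so follow the encoding that the UNIX side uses.
    by_unix_encoding = {"EUC-JP": "EUC", "Shift_JIS": "SJIS", "UTF-8": "UTF8"}
    if unix_japanese_encoding in by_unix_encoding:
        return by_unix_encoding[unix_japanese_encoding]

    # Condition 1: nothing else applies, so use the ASCII-only form.
    return "HEX"


if __name__ == "__main__":
    print(choose_coding_system())
    print(choose_coding_system(unix_japanese_encoding="EUC-JP"))
```

If your own priorities differ, for example if the UNIX locale matters
more than Netatalk compatibility, swap the order of the two checks.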

Details of each "coding system" value
=====================================

Here are the details, merits and demerits of each value of "coding
system".

o HEX

  With "HEX", for example, if a Japanese file name consisting of
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes ":8b:a4:97:4c.txt" (a 16-byte ASCII
  string). This form is original to Samba.

  The greatest merit of "HEX" is interoperability with English
  environments. With "HEX", every Japanese file name is written on
  UNIX in this Samba-specific form, using only a few ASCII characters.
  This is very safe: file names are never broken and commands never
  abort while parsing file names, even if your UNIX cannot handle
  Japanese characters.

  On the other hand, since 6 bytes are used to express a 2-byte
  character, long file names may exceed 128 bytes, which is the limit
  on file name length in the Samba 2.0 series. Moreover, it is very
  inconvenient for users who need to use, on UNIX, a Japanese file
  name written from Windows, since the file name is visible only as
  an encoded string of ASCII characters.

o CAP

  With "CAP", for example, if a Japanese file name consisting of
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes ":8b:a4:97L.txt" (a 14-byte ASCII string).
  This form is the one used by CAP and Netatalk, file server software
  for Macintosh.

  The difference from "HEX" is that when a 2-byte Japanese character
  is divided into 2 bytes, a byte that can be expressed as an ASCII
  character is not encoded in ":xx" form but is written as the ASCII
  character itself. A "CAP"-encoded file name may therefore contain
  characters that are allowed in UNIX file names but are troublesome.
  In particular, take care with file names that contain "\ (0x5c)".
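
The two ASCII-only forms can be reproduced with a short Python sketch.
This is a simplified illustration and not Samba's actual code: the
function names are hypothetical, the input is assumed to be a file
name already in Shift_JIS bytes, and the handling of a literal ":" in
a file name is ignored.

```python
def is_sjis_lead_byte(b):
    """True if b is the first byte of a 2-byte Shift_JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def to_hex_form(raw):
    """Sketch of "HEX": both bytes of a 2-byte character become :xx."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if is_sjis_lead_byte(b) and i + 1 < len(raw):
            out.append(":%02x:%02x" % (b, raw[i + 1]))
            i += 2
        elif b >= 0x80:
            # single-byte non-ASCII character (half-width kana)
            out.append(":%02x" % b)
            i += 1
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


def to_cap_form(raw):
    """Sketch of "CAP": only bytes of 0x80 and above become :xx."""
    return "".join(":%02x" % b if b >= 0x80 else chr(b) for b in raw)


if __name__ == "__main__":
    share = b"\x8b\xa4\x97\x4c.txt"   # the "share" example in Shift_JIS
    print(to_hex_form(share))         # :8b:a4:97:4c.txt
    print(to_cap_form(share))         # :8b:a4:97L.txt
    # A character whose second byte is 0x5c leaves a raw backslash
    # in the "CAP" form but not in the "HEX" form.
    print(to_hex_form(b"\x95\x5c"))   # :95:5c
    print(to_cap_form(b"\x95\x5c"))   # :95\
```

The last two lines show why "\ (0x5c)" needs care with "CAP": the
second byte of the character is written out as a backslash.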

  The greatest merit of "CAP" is compatibility with the file name
  encoding of CAP and Netatalk, file server software for Macintosh.
  Since they usually write file names on UNIX in CAP form, if a
  directory is shared by both Samba and Netatalk you need to use
  "CAP" to prevent Japanese file names from being broken.

  However, some recent systems (e.g. Vine Linux, a distribution
  originating in Japan) ship a Netatalk that is patched to write file
  names in EUC-JP. On such systems you need to choose "EUC" instead
  of "CAP".

  Most merits and demerits of "CAP" are basically the same as those
  of "HEX", except that "HEX" is safer. It is better to use "HEX" or
  another value unless you need to use "CAP".

o EUC

  With "EUC", for example, if a Japanese file name consisting of
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes 0xb6a6, 0xcdad, ".txt" (an 8-byte BINARY
  string).

  "EUC" is equivalent to the industry standard called EUC-JP, widely
  used on Japanese UNIX (although EUC also has specifications for
  languages other than Japanese, such as EUC-KR, "EUC" in Samba means
  EUC-JP only).

  The greatest merit of "EUC" is interoperability with "Japanized"
  UNIX. EUC-JP is usually the default Japanese character code on open
  source UNIX systems such as Linux and FreeBSD and on commercial
  UNIX systems such as Solaris, IRIX and Tru64 UNIX (although Solaris
  can also use Shift_JIS and UTF-8, and Tru64 UNIX can also use
  Shift_JIS). With "EUC", therefore, most Japanese file names created
  from Windows can also be referred to on UNIX.

  Also, most Japanized free software works mainly with EUC-JP only.
  "EUC" is a good choice when you use Japanese file names on these
  UNIX systems.

  However, when your locale is not set for EUC-JP, some characters
  cannot be displayed correctly.
  Although no character needs careful treatment in the way that
  "\ (0x5c)" does, broken file names may be displayed and some
  commands may abort while parsing file names.

  Moreover, "EUC" is NOT fully compatible with Windows: the user
  defined characters available in Windows are not available with
  "EUC" because of its specification. Therefore, if you use "EUC",
  you need to avoid using incompatible characters in file names.

o SJIS

  "SJIS" is equivalent to Shift_JIS, the standard on Japanese
  Windows. With "SJIS", for example, if a Japanese file name
  consisting of 0x8ba4 and 0x974c (a 4-byte Japanese character string
  meaning "share") followed by ".txt" is written from Windows to
  Samba, the file name on UNIX becomes 0x8ba4, 0x974c, ".txt" (an
  8-byte BINARY string), the same as on Windows.

  The greatest merit of "SJIS" is, in contrast to "EUC",
  interoperability with Windows. Since there is no conversion, it is
  fully compatible with Windows, and the "user defined characters"
  and "vendor defined characters", whose problems are described
  later, can be used comparatively safely.

  However, as with "EUC", broken file names may be displayed and some
  commands may abort while parsing file names. In particular, unlike
  "EUC", file names may contain "\ (0x5c)", which needs to be treated
  carefully.

  Shift_JIS is usually the default Japanese character code on some
  commercial UNIX systems such as HP-UX and AIX (although they can
  also use EUC-JP). With "SJIS", therefore, most Japanese file names
  created from Windows can also be referred to on UNIX. However, as
  mentioned in the description of "EUC", most Japanized free software
  actually works with EUC-JP only. You had better confirm that the
  Japanized free software you use works with Shift_JIS.

  If your UNIX is already working with Shift_JIS and there is a user
  who needs to use Japanese file names written from Windows, "SJIS"
  is basically the best choice.
  If you use "SJIS" on a UNIX that cannot handle Shift_JIS because
  compatibility with Windows matters most, you should not touch files
  written from Windows on the UNIX side.

o UTF8

  "UTF8" is equivalent to UTF-8, the international standard defined
  by Unicode.org. In UTF-8, a *character* is expressed with 1 - 3
  *bytes*. In the case of Japanese, most characters are expressed
  with 3 bytes. Since Windows uses Shift_JIS, where a character is
  expressed with 2 bytes, to express Japanese, the byte length of a
  UTF-8 string is basically 1.5 times that of the original Shift_JIS
  string.

  With "UTF8", for example, if a Japanese file name consisting of
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes 0xe585, 0xb1e6, 0x9c89, ".txt" (a 10-byte
  BINARY string).

  As far as Japanese processing in Samba is concerned, "UTF8" has no
  merit unless your UNIX uses UTF-8 as its current locale and can
  then handle Japanese file names.

  As with "EUC", when your locale is not set for UTF-8, broken file
  names may be displayed and some commands may abort while parsing
  file names. Like "EUC" and unlike "SJIS", however, a multi-byte
  character never contains the byte "\ (0x5c)", because every byte of
  a multi-byte UTF-8 sequence is 0x80 or above.

  UTF-8 can be used on some commercial UNIX systems such as Solaris
  and HP-UX. However, as mentioned in the description of "EUC", most
  Japanized free software actually works with EUC-JP only, and even
  fewer programs work correctly with UTF-8 than with Shift_JIS. You
  had better confirm that the Japanized free software you use works
  with UTF-8. Therefore there are few cases where UTF-8 is actually
  used as the encoding method of a file system.
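
The byte values quoted for "SJIS", "EUC" and "UTF8" can be checked
with Python's standard codecs. This is only a sketch for comparison:
it uses Python's "shift_jis", "euc_jp" and "utf_8" codecs, which are
assumed here to match Samba's conversions for these ordinary
characters, and the variable names are made up for illustration. It
also reports whether the byte 0x5c occurs inside the encoded form of
one well-known character.

```python
# "share" written with two KANJI characters, plus the ".txt" suffix.
NAME = "\u5171\u6709.txt"
# A character whose Shift_JIS form is known to end in 0x5c.
BACKSLASH_CASE = "\u8868"

CODECS = ("shift_jis", "euc_jp", "utf_8")

# Encoded bytes of the example file name under each encoding method.
encoded = {codec: NAME.encode(codec) for codec in CODECS}

# Does the encoded form of BACKSLASH_CASE contain the byte 0x5c?
contains_0x5c = {codec: 0x5C in BACKSLASH_CASE.encode(codec)
                 for codec in CODECS}

if __name__ == "__main__":
    for codec in CODECS:
        raw = encoded[codec]
        print("%-9s %-22s %2d bytes  0x5c inside a character: %s"
              % (codec, raw.hex(), len(raw), contains_0x5c[codec]))
```

The two KANJI characters take 4 bytes in Shift_JIS and EUC-JP and 6
bytes in UTF-8, which is the 1.5 times growth described above.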

  In addition, although this is not directly related to Samba, the
  conversion table between Shift_JIS and Unicode differs subtly
  between the iconv() function generally used on UNIX and the
  functions used on other platforms such as Windows and Java, so you
  should be careful when handling UTF-8.

  Therefore "UTF8" is not worth considering for Samba at present.

  Although Mac OS X uses UTF-8 as the encoding method of its file
  names, it uses Unicode 3.1 as its character set instead of Unicode
  2.0, which is the character set Samba assumes for "UTF8". If you
  use "UTF8" on Mac OS X, therefore, some characters are broken, so
  it is not recommended there either at present.

Notes for changing "coding system"
==================================

If you change "coding system" once it has been set up, you also need
to change the encoding of the Japanese file names that already exist
on the file system.

The easiest way is to back up the files on Windows before changing
"coding system" and to restore them after the change.

The archive of Samba Japanese Edition, which is described in detail
later, contains a perl script named "smbchartool" that supports this
work. With it you can do the conversion simply on UNIX.

HOWTO and Notes for including Japanese characters in smb.conf
=============================================================

In Samba 2.0.7 and later, you can include Japanese (and some other
languages') characters in smb.conf by setting the "coding system" and
"client code page" parameters appropriately.

You need to write Japanese characters in the encoding method that is
set by the "coding system" parameter. For example, to create a
section whose Japanese name is 0x8ba4 and 0x974c (a 4-byte Japanese
character string meaning "share") under "coding system = HEX", write
as follows:

-----
[global]
    client code page = 932
    coding system = HEX
    ...
[:8b:a4:97:4c]
    path = /tmp
-----

SUGJ (Samba Users Group Japan) has tested that Japanese strings can
be included not only in share names (and file names) but also in
these parameters:

- the comment of the server (server string)
- the comment of the share (comment)
- user names in username map

Although Japanese strings can be included in most parameters that
take a string as their value, several problems have been found with
Japanese in those parameters, so it is recommended not to use
Japanese strings there.

Issues for using Japanese and Samba Japanese Edition
====================================================

Japanese processing in Samba has some problems apart from the
question of which encoding method to use.

The biggest one is that the Shift_JIS codes for some Windows-oriented
*characters* differ between the Windows 9x series (Windows 95/98/Me)
and the Windows NT series (Windows NT/2000/XP) for historical
reasons. Samba must treat these codes as the same, but this is not
implemented correctly in the current Samba, so a Japanese file name
written from Windows 9x sometimes cannot be read from Windows NT.

Another problem is that "user defined characters" cannot be used in
EUC-JP, which is generally used on UNIX. Although the use of "user
defined characters" is decreasing with the spread of the Internet,
they are still indispensable in some commercial or public systems,
where many KANJI characters are required, for example to display
names correctly. These characters are Windows-oriented but are widely
used as an industry standard, so in business it is indispensable to
treat them correctly.

Moreover, the Windows NT series treats some special KANJI characters,
such as full-width Roman numerals (and the full-width alphabet,
full-width Cyrillic and full-width Greek alphabets), as
case-insensitive, the same as ASCII characters, but the Windows 9x
series treats them as case-sensitive.
The current implementation of Samba is different from both.

In addition, some parts of the implementation do not expect Japanese
characters, so using Japanese as you would on Windows may cause
trouble in unexpected places.

Samba Japanese Edition, developed by SUGJ, was created to solve these
problems. In Samba 2.0.7-ja-2.2 or later, these problems are solved
for file and directory names and share names. Regrettably, this work
has not been fully merged into the original Samba because many places
in the source code are modified.

In Samba Japanese Edition, Japanese processing is extended as
follows:

1. "Normalization of Shift_JIS"

   In Samba 2.0.7-ja-2.2 or later, the differences between the
   Windows 9x series and the Windows NT series in case recognition
   and in the codes for some characters are handled.

2. "User defined characters" can be used with EUC-JP

   In Samba 2.0.7-ja-2.2 or later, "EUC3", a value original to Samba
   Japanese Edition, is available for the "coding system" parameter.
   It is based on eucJP-open, an industry standard for using "user
   defined characters" with EUC-JP.

3. Implementation of UTF-8

   In Samba 2.0.7-ja-2.0 or later, "UTF8" is available for the
   "coding system" parameter. It allows file names to be written in
   UTF-8, based on Unicode 2.0. This has been merged into the Samba
   2.2 series.

4. Mac OS X issues

   In Samba 2.0.10-ja-1.0 or later, "UTF8-MAC" is available for the
   "coding system" parameter. It allows file names to be written in
   UTF-8 based on Unicode 3.1 and "Normalization Form", which is
   necessary for writing to the file systems of Mac OS X.

SUGJ is currently developing Samba Japanese Edition for the Samba 2.2
series, which incorporates this work. This work will also be merged
into the HEAD branch, which will become the Samba 3.0 series.
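
The "Normalization Form" mentioned for Mac OS X in item 4 can be
illustrated with Python's unicodedata module. This is only a sketch
under an assumption: the text does not say which normalization form
"UTF8-MAC" writes, and a decomposed form (NFD) is assumed here for
illustration. The variable names are made up.

```python
import unicodedata

# KATAKANA LETTER GA as a single precomposed character.
composed = "\u30ac"

# Assumption: a decomposed form (NFD) stands in for what Mac OS X
# file systems expect. It splits GA into KA plus a combining
# voiced sound mark.
decomposed = unicodedata.normalize("NFD", composed)

if __name__ == "__main__":
    for label, text in (("composed", composed), ("decomposed", decomposed)):
        code_points = " ".join("U+%04X" % ord(ch) for ch in text)
        print("%-10s %-14s UTF-8: %s"
              % (label, code_points, text.encode("utf-8").hex()))
    # Both forms are equal again once brought to one normalization form.
    print(unicodedata.normalize("NFC", decomposed) == composed)
```

The same visible character becomes 3 bytes in one form and 6 bytes in
the other, so a file name written in one form is not found by a
byte-wise comparison against the other.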