[Winpcap-users] pcap_open_offline and unicode charsets

Sun Nov 8 12:55:01 PST 2009

> -----Original Message-----
> From: Guy Harris [mailto:guy at alum.mit.edu]
> Sent: Sunday, November 08, 2009 2:07 PM
> To: voytechs at yahoo.com; winpcap-users at winpcap.org
> Subject: Re: [Winpcap-users] pcap_open_offline and unicode charsets
>
>
> On Nov 7, 2009, at 5:23 PM, Mark Bednarczyk wrote:
>
> > What support is there, for unicode character based file names in
> > WinPcap to functions such as pcap_open_offline?
> >
> >  I have users that are trying to open a file with some Chinese
> > characters in its filename. As far as I understand it, fopen under
> > unix (especially under linux) should handle unicode 8-bit with no
> > problems.
>
> I presume by "unicode 8-bit" you mean UTF-8-encoded Unicode.
>

Yes, exactly.

> fopen() on UN*Xes passes the pathname on to open(); that
> means that it should handle any sequence of octets as long as
> the octet value 0x2f is used *only* as a pathname component
> separator and the octet value 0x00 is used *only* as a
> pathname string terminator.
>
> Most local file systems will not attempt to interpret that
> string, except to treat 0x2f (/) as a component separator
> and 0x00 ('\0') as a pathname string terminator.  Whether a
> particular file name is encoded as UTF-8, or ISO 8859/1, ...
> is another matter; I have the impression that various UN*Xes
> are tending towards UTF-8 as the most common encoding, but
> there are probably still systems using other encodings.

Ok, that makes sense.

>
> > Linux also handles wider widths but only in a non-intentional way
> > where wider width chars are handled as 8-bit entities (ie. 0x1065 is
> > handled as 2 separate 8-bit chars: 0x65 and 0x10 where order is
> > dependent on processor endianness.)
>
> I would hope it does no such thing, especially with, for
> example, the wide character 0x2f65 (?) - if you hand any UN*X
> API that takes pathnames an octet sequence containing the
> octet 0x2f followed by the octet 0x65, I would hope that it
> would be interpreted as containing "/ e", and, similarly, if
> you hand it a string containing the octet 0x65 followed by the
> octet 0x2f, I would hope that it would be interpreted as
> containing "e/".

Right, that's a good point. Some Google searches turned up that explanation 
of how fopen handles UTF-16 strings, but you are right, that would not 
work in practice.
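
To make the point concrete, here is a minimal Java sketch. The class and 
method names (PathOctets, hasSeparatorOrNul) are made up for illustration, 
and StandardCharsets assumes Java 7 or later. It encodes a name containing 
the wide character 0x2f65 and checks whether the resulting octets include 
0x2f or 0x00, the two values a UN*X pathname API treats specially:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of libpcap, WinPcap or jNetPcap.
final class PathOctets {
    /** Returns true if the encoded name contains a '/' (0x2f) or NUL (0x00) octet. */
    static boolean hasSeparatorOrNul(byte[] encoded) {
        for (byte b : encoded) {
            if (b == 0x2f || b == 0x00) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "\u2F65.pcap"; // file name starting with the wide character 0x2f65
        System.out.println("UTF-8:    " + hasSeparatorOrNul(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16LE: " + hasSeparatorOrNul(name.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE: " + hasSeparatorOrNul(name.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

UTF-8 only ever uses octets below 0x80 for ASCII characters, so a name 
without a literal '/' cannot produce one by accident. Both UTF-16 byte orders 
produce a 0x2f octet for this character and 0x00 octets for the ASCII part of 
the name.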

>
> > Under MSFC is different and you have to use MS specific wfopen and
> > wopen calls which take unicode (or wide chars).
> >
> > Does WinPcap provide any support for unicode and call the appropriate
> > "open" function?
>
> No, it just uses fopen(), just as libpcap does on UN*X.
>

I did not know that. I thought they provided a separate call for UTF-16 (or 
wide strings, as they call them).

> In theory, it could convert from UTF-8 to UTF-16 and call
> _wfopen(), but that could conceivably break existing
> applications that either explicitly or implicitly expect the
> path argument to pcap_open_offline() to work the same as the path
> argument to fopen().
>
> My inclination would be to, in WinPcap, provide
> pcap_wopen_offline(), or something such as that, taking a
> UTF-16 pathname as an argument.

That would work for me and my users very well.

>
> > My library gets its filename from a java string and it currently
> > converts it to plain UTF-8 charset and that works fine.
>
> On UN*X, it should perhaps be converted to whatever the
> locale's filename character set is.

But I don't actually make any fopen calls directly. I rely on libpcap to 
work with the filesystem. Therefore I would like to go by the spec that 
libpcap provides for the pcap_open_offline call. It would be nice to have a 
definitive specification of the encoding expected when passing in a string.
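
For reference, the conversion I mean looks roughly like the sketch below. 
NativePath and toUtf8CString are illustrative names only, not the actual 
jNetPcap code, and the sketch assumes the native side wants a NUL-terminated 
standard UTF-8 string. As far as I know, JNI's GetStringUTFChars returns 
"modified UTF-8", which encodes supplementary characters differently, so 
encoding on the Java side avoids that difference:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; the names are for illustration only.
final class NativePath {
    /** Encodes a Java (UTF-16) string as standard UTF-8 plus a terminating 0x00 octet. */
    static byte[] toUtf8CString(String path) {
        if (path.indexOf('\0') >= 0) {
            // An embedded U+0000 would silently truncate the C string.
            throw new IllegalArgumentException("path contains U+0000");
        }
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);
        // copyOf pads the extra slot with zero, which becomes the C terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        byte[] cstr = toUtf8CString("a\u4E2D.pcap");
        System.out.println(cstr.length + " octets including the terminator");
    }
}
```

The resulting array is what would be handed to a char * parameter such as 
the path argument of pcap_open_offline().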

>
> I'm not sure how that would be determined, however.  I might
> be tempted to assume that, if the environment variable
> LC_CTYPE is set that specifies the encoding, otherwise if
> LANG is set that specifies the encoding, otherwise it might
> be the C locale (which, I think, unfortunately says the
> encoding is ASCII).  However, GLib (not glibc,
> GLib) has its own additional environment variables:
>
> 	http://library.gnome.org/devel/glib/stable/glib-running.html
>
> and I'm not sure why that's the case.
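
A rough Java sketch of that lookup, under a few assumptions of mine: locale 
names follow the usual language[_territory][.codeset][@modifier] form, LC_ALL 
is checked first because POSIX lets it override LC_CTYPE and LANG, and a 
locale without a codeset part (such as "C") is reported as unknown rather 
than guessed. LocaleCodeset and its methods are hypothetical names, and the 
GLib-specific variables are not handled:

```java
// Hypothetical helper; it only parses strings and does not consult the C library.
final class LocaleCodeset {
    /** Extracts the codeset part of a locale name, or returns null if there is none. */
    static String codesetOf(String localeName) {
        if (localeName == null || localeName.isEmpty()) {
            return null;
        }
        int dot = localeName.indexOf('.');
        if (dot < 0) {
            return null; // e.g. "C" or "en_US": no explicit codeset
        }
        int at = localeName.indexOf('@', dot);
        return at < 0 ? localeName.substring(dot + 1) : localeName.substring(dot + 1, at);
    }

    /** Picks the first variable that is set and non-empty: LC_ALL, LC_CTYPE, then LANG. */
    static String guessCodeset(String lcAll, String lcCtype, String lang) {
        for (String value : new String[] {lcAll, lcCtype, lang}) {
            if (value != null && !value.isEmpty()) {
                return codesetOf(value);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guessCodeset(
                System.getenv("LC_ALL"), System.getenv("LC_CTYPE"), System.getenv("LANG")));
    }
}
```
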
>
> > But in reality I'd like to support all unicode widths 8, 16 and even
> > 32 bit. I'm not sure how those wider unicode chars would be handled.
>
> How are they handled elsewhere in Java?  The File class seems
> to work with Strings, and the String class, at least as I
> understand the documentation, uses UTF-16 (presumably that's
> what you mean by "unicode [width] ... 16 ... bit").

Java has extensive Unicode support, including the supplementary characters 
where 2 UTF-16 chars are combined to describe a single character. This of 
course stems from the fact that Unicode now defines more than 65K characters, 
too many for a single 16-bit value. You can easily define "charsets" to build 
strings. I'm trying to balance Java's Unicode support against what 
libpcap/WinPcap provides, which it looks like will be OS and even filesystem 
dependent.

Here is how Java represents Unicode characters:

The char data type (and therefore the value that a Character object 
encapsulates) are based on the original Unicode specification, which defined 
characters as fixed-width 16-bit entities. The Unicode standard has since been 
changed to allow for characters whose representation requires more than 16 
bits. The range of legal code points is now U+0000 to U+10FFFF, known as 
Unicode scalar value. (Refer to the definition of the U+n notation in the 
Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the 
Basic Multilingual Plane (BMP). Characters whose code points are greater than 
U+FFFF are called supplementary characters. The Java 2 platform uses the 
UTF-16 representation in char arrays and in the String and StringBuffer 
classes. In this representation, supplementary characters are represented as a 
pair of char values, the first from the high-surrogates range, 
(\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code 
points, including the surrogate code points, or code units of the UTF-16 
encoding. An int value represents all Unicode code points, including 
supplementary code points. The lower (least significant) 21 bits of int are 
used to represent Unicode code points and the upper (most significant) 11 bits 
must be zero. Unless otherwise specified, the behavior with respect to 
supplementary characters and surrogate char values is as follows:

The methods that only accept a char value cannot support supplementary 
characters. They treat char values from the surrogate ranges as undefined 
characters. For example, Character.isLetter('\uD840') returns false, even 
though this specific value if followed by any low-surrogate value in a string 
would represent a letter.

The methods that accept an int value support all Unicode characters, including 
supplementary characters. For example, Character.isLetter(0x2F81A) returns 
true because the code point value represents a letter (a CJK ideograph).

In the J2SE API documentation, Unicode code point is used for character values 
in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 
16-bit char values that are code units of the UTF-16 encoding. For more 
information on Unicode terminology, refer to the Unicode Glossary.

Source: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html
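
A short sketch of what that means in practice, using only standard 
java.lang.Character and String methods. The SurrogateDemo class is just 
scaffolding for illustration, and the code point is the one from the 
documentation quoted above:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: one supplementary character seen as chars, code points and UTF-8.
final class SurrogateDemo {
    /** Builds a String holding a single code point (one or two UTF-16 chars). */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x2F81A); // CJK ideograph outside the BMP

        System.out.println("UTF-16 chars: " + s.length());
        System.out.println("code points:  " + s.codePointCount(0, s.length()));
        System.out.println("high/low:     " + Character.isHighSurrogate(s.charAt(0))
                + "/" + Character.isLowSurrogate(s.charAt(1)));
        System.out.println("UTF-8 octets: " + s.getBytes(StandardCharsets.UTF_8).length);

        // The char overload sees only half of a pair; the int overload sees the whole code point.
        System.out.println(Character.isLetter('\uD840'));
        System.out.println(Character.isLetter(0x2F81A));
    }
}
```

So a file name with such a character is two chars in a Java String but a 
single four-octet sequence once encoded as UTF-8 for the native call.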

So in summary, I think the answer is that UTF-8 is supported on all/most 
platforms and filesystem types right now. UTF-16, which is what my user is 
using for some Chinese characters in the filename, will not work with 
libpcap's pcap_open_offline(). The platform he is on is Ubuntu, which is a 
Debian derivative. I'm not sure what application created the file in the 
first place. Maybe we can discern whether fopen was used to create the file 
using UTF-16 encoding, or some other system call.
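
One check that might help narrow it down, sketched in Java: given the raw 
octets of the file name as stored on disk (obtaining them is outside this 
sketch and would have to be done natively), test whether they decode as 
strict UTF-8. NameEncodingCheck and isValidUtf8 are hypothetical names. A 
failure only says the name is not UTF-8, not which encoding it is, and a 
pure ASCII name passes trivially:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosing a file name's encoding.
final class NameEncodingCheck {
    /** Returns true only if the octets are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Name = "\u4E2D\u6587.pcap".getBytes(StandardCharsets.UTF_8);
        byte[] utf16Name = "\u4E2D\u6587.pcap".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(utf8Name));
        System.out.println(isValidUtf8(utf16Name));
    }
}
```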

Cheers,
mark..
http://jnetpcap.com