[Winpcap-users] pcap_open_offline and unicode charsets
Guy Harris
guy at alum.mit.edu
Sun Nov 8 14:28:28 PST 2009
On Nov 8, 2009, at 12:55 PM, Mark Bednarczyk wrote:
>>> My library gets its filename from a java string and it currently
>>> converts it to plain UTF-8 charset and that works fine.
>>
>> On UN*X, it should perhaps be converted to whatever the
>> locale's filename character set is.
>
> But I don't actually call on any fopen calls directly. I rely on
> libpcap to work with the filesystem. Therefore I would like to go by
> the specs that libpcap provides for the pcap_open_offline call. It
> would be nice to somehow handle and provide a definitive
> specification when passing in a string.
The definitive specification is "it calls fopen(), so it does the same
thing as fopen()".
*If* a file name happens to be encoded, in the file system, using
UTF-8, you would hand that UTF-8 string to fopen() to open it, so you
would do the same with pcap_open_offline(). If, instead, it happens
to be encoded using ISO 8859/1, or 8859/2, or 8859/15, or..., or
KOI-8, or Shift-JIS, or EUC-JP, or..., you'd hand a string in *that*
encoding. (Sorry, but UN*X internationalization antedated Unicode, so
they had to do *something*, and ended up doing a variety of different
things in different locales. Oh, and don't get me started about
Unicode normalization forms....)
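
To illustrate why the encoding matters, here is a minimal Java sketch that encodes the same name in two charsets and prints the resulting octets. The class and method names are made up for this illustration; nothing here is part of libpcap or jNetPcap. It only shows that the octets handed to fopen() differ between encodings, so the string has to be in whatever encoding the file name actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class FilenameBytes {
    // Returns the octets a C API such as fopen() would see for this name
    // if it were encoded in the given charset (without the trailing '\0').
    static byte[] encode(String name, Charset cs) {
        return name.getBytes(cs);
    }

    // Formats octets as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String name = "M\u00fcller.pcap"; // u-umlaut written as an escape
        System.out.println(hex(encode(name, StandardCharsets.UTF_8)));
        System.out.println(hex(encode(name, StandardCharsets.ISO_8859_1)));
    }
}
```

In UTF-8 the "ü" becomes the two octets c3 bc; in ISO 8859/1 it is the single octet fc, so the two strings name different files as far as the file system is concerned.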
>
>>
>> I'm not sure how that would be determined, however. I might
>> be tempted to assume that, if the environment variable
>> LC_CTYPE is set that specifies the encoding, otherwise if
>> LANG is set that specifies the encoding, otherwise it might
>> be the C locale (which, I think, unfortunately says the
>> encoding is ASCII). However, GLib (not glibc,
>> GLib) has its own additional environment variables:
>>
>> http://library.gnome.org/devel/glib/stable/glib-running.html
>>
>> and I'm not sure why that's the case.
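
A rough Java sketch of the guess described above, under several assumptions: the class and method names are hypothetical, a locale name is assumed to look like language_TERRITORY.codeset@modifier, and a value with no codeset part (such as "C") is treated as ASCII. LC_ALL is checked first because POSIX gives it priority over LC_CTYPE and LANG; that is an addition to the order described above. GLib's extra variables are not handled.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
public class LocaleCodeset {
    // Guesses the codeset name from locale environment variables.
    // Falls back to "US-ASCII" when nothing usable is set (C locale).
    static String guess(Map<String, String> env) {
        for (String var : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String value = env.get(var);
            if (value == null || value.isEmpty()) {
                continue; // unset: try the next variable
            }
            int dot = value.indexOf('.');
            if (dot < 0) {
                return "US-ASCII"; // e.g. "C" or "POSIX": no codeset given
            }
            int at = value.indexOf('@', dot);
            return (at < 0) ? value.substring(dot + 1)
                            : value.substring(dot + 1, at);
        }
        return "US-ASCII";
    }

    public static void main(String[] args) {
        System.out.println(guess(System.getenv()));
    }
}
```

The result is only a name; whether the file names on disk are really in that encoding is a separate question.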
>>
>>> But in reality I'd like to support all unicode widths 8, 16 and even
>>> 32 bit. I'm not sure how those wider unicode chars would be handled.
>>
>> How are they handled elsewhere in Java? The File class seems
>> to work with Strings, and the String class, at least as I
>> understand the documentation, uses UTF-16 (presumably that's
>> what you mean by "unicode [width] ... 16 ... bit").
>
> Java has extensive unicode support for even the extended unicode
> widths where they combine 2 UTF-16 chars to describe a single
> character.
If that's "surrogate pairs", that's more like "combining two 16-bit
codes" - a surrogate pair is a single character, represented as two
"code units":
http://unicode.org/standard/principles.html
"Encoding Forms
Character encoding standards define not only the identity of each
character and its numeric value, or code point, but also how this
value is represented in bits.
The Unicode Standard defines three encoding forms that allow the same
data to be transmitted in a byte, word or double word oriented format
(i.e. in 8, 16 or 32-bits per code unit). All three encoding forms
encode the same common character repertoire and can be efficiently
transformed into one another without loss of data. The Unicode
Consortium fully endorses the use of any of these encoding forms as a
conformant way of implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable length encoding of
bytes. It has the advantages that the Unicode characters corresponding
to the familiar ASCII set have the same byte values as ASCII, and that
Unicode characters transformed into UTF-8 can be used with much
existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient
access to characters with economical use of storage. It is reasonably
compact and all the heavily used characters fit into a single 16-bit
code unit, while all other characters are accessible via pairs of
16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width,
single code unit access to characters is desired. Each Unicode
character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32-bits) of data for
each character."
At least as I read the description of the String class:
http://java.sun.com/javase/6/docs/api/java/lang/String.html
it's based on UTF-16:
"A String represents a string in the UTF-16 format in which
supplementary characters are represented by surrogate pairs (see the
section Unicode Character Representations in the Character class for
more information). Index values refer to char code units, so a
supplementary character uses two positions in a String.
The String class provides methods for dealing with Unicode code points
(i.e., characters), in addition to those for dealing with Unicode code
units (i.e., char values)."
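
A small sketch of the code unit vs. code point distinction, using only standard String and Character methods; the class name and the choice of U+20000 (a CJK ideograph outside the BMP) are just for illustration.

```java
// Illustration only: the class name is made up.
public class CodeUnits {
    // Number of 16-bit char code units in the String.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in the String.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build a String holding the single supplementary character U+20000.
        String s = new String(Character.toChars(0x20000));
        System.out.println(codeUnits(s));  // 2: a surrogate pair
        System.out.println(codePoints(s)); // 1: one character
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```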
> Here is how java represents unicode characters:
>
> The char data type (and therefore the value that a Character object
> encapsulates) are based on the original Unicode specification, which
> defined characters as fixed-width 16-bit entities.
Meaning a single char can't represent a character outside the BMP.
However, from your example in "Decoding packets manually":
String file = "capturefile.pcap";
...
Pcap pcap = Pcap.openOffline(file, errbuf);
it appears that you use Strings for pathnames. As per my earlier
mail, pathnames seem to be Strings, hence UTF-16-encoded, so the
pathnames you'll be handed are UTF-16, not UCS-2 (UCS-2 encodes only
the BMP, with one 16-bit code unit per code point).
> So in summary, I think the answer is that UTF-8 is supported on
> all/most platforms and filesystem types right now.
It's supported on UN*Xes where file names happen to be encoded in
UTF-8. Mac OS X does that (in fact, that's all that's supported in
HFS+, although, *on disk*, HFS+ uses, I think, UTF-16, but what you see
in the UN*X APIs is UTF-8; the OS X SMB client assumes all file names
are UTF-8, mapping them to UTF-16 over the wire and mapping stuff
received from over the wire from UTF-16 back to UTF-8). Other UN*Xes
probably allow other encodings, hence my comment about mapping from
UTF-8 to the native file name encoding.
On Windows, however, it's not going to work - on Windows, I don't
think fopen() takes UTF-8-encoded pathnames, I think it takes
pathnames encoded in whatever the current "code page" is. That means
that there could be unopenable files (e.g., if your current code page
is an Asian DBCS code page, you probably won't be able to open a file
named "Müller's network problem.pcap").
You'd need pcap_wopen_offline(), or something such as that, to fully
support Unicode pathnames.
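
The "unopenable file" case can be illustrated in Java with a charset encoder. This sketch uses ISO-8859-1 as a stand-in for "the current code page" only because every Java runtime is required to provide it, so the failing name here is a Chinese one rather than "Müller"; the class and method names are hypothetical and nothing here calls WinPcap.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class CodePageCheck {
    // True if every character of the name has a representation in cs.
    static boolean representable(String name, Charset cs) {
        return cs.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        Charset codePage = StandardCharsets.ISO_8859_1; // stand-in code page
        // Latin-1 has u-umlaut, so this name can be expressed.
        System.out.println(representable("M\u00fcller's network problem.pcap", codePage));
        // Two Chinese characters: no Latin-1 representation exists.
        System.out.println(representable("\u7f51\u7edc.pcap", codePage));
    }
}
```

If the name cannot be expressed in the code page at all, no narrow-string API can be handed the right octets, which is why a wide-character entry point would be needed.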
> The UTF-16 which is what my user is using for some Chinese
> characters in filename will not work with libpcap's
> pcap_open_offline(). The platform he is on is Ubuntu
...which, being a Linux distribution, and hence a UN*X, expects
pathnames to be sequences of octets, with '/' as separators and '\0'
as a terminator. Handing it a UTF-16 string isn't going to work very
well.
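
One concrete reason, sketched in Java: UTF-16 code units for ASCII characters contain a zero octet, which a C API reads as the end of the string. The class and method names are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class NulScan {
    // Index of the first 0x00 octet in the encoded name, or -1 if none.
    static int firstNul(String name, Charset cs) {
        byte[] bytes = name.getBytes(cs);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String name = "capturefile.pcap";
        // UTF-16LE: 'c' is 63 00, so the C string would end after "c".
        System.out.println(firstNul(name, StandardCharsets.UTF_16LE)); // 1
        // UTF-8: no zero octets inside the name.
        System.out.println(firstNul(name, StandardCharsets.UTF_8));    // -1
    }
}
```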
*If* the file's name is encoded with UTF-8, handing it a UTF-8 string
should work. If it's encoded in some other encoding, such as Big5:
http://en.wikipedia.org/wiki/Big5
or GB 2312:
http://en.wikipedia.org/wiki/GB2312
it probably won't work.
> I'm not sure what application created the file in the first place.
> Maybe we can discern if fopen was used to create the file using
> UTF-16 encoding or some other system call.
Ultimately, the system call used to create the file was either open()
or creat() (and the former is a superset of the latter); they take
octet strings in some superset-of-ASCII encoding (UTF-8, ISO 8859/x,
Big5, GB 2312, Shift JIS, etc.), so that all octets in the range 0x00
through 0x7F represent the corresponding ASCII character, and only
octets with the 0x80 bit set are used to encode other characters.
The issue probably doesn't involve UTF-16, as that's not an
octet-string superset-of-ASCII encoding; it probably involves UTF-8 vs.
some other encoding of Chinese.
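
A rough Java check of the property described above, for one string at a time: it counts the octets below 0x80 in the encoded form and compares that with the number of ASCII characters in the original. Equal counts are what a superset-of-ASCII octet encoding gives; this is a necessary condition only, not a proof, and the class and method names are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class AsciiSuperset {
    // True if octets in 0x00..0x7F occur exactly as often in the encoded
    // form as ASCII characters occur in the original string.
    static boolean lowOctetsMatchAscii(String s, Charset cs) {
        int asciiChars = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) < 0x80) {
                asciiChars++;
            }
        }
        int lowOctets = 0;
        for (byte b : s.getBytes(cs)) {
            if (b >= 0) { // Java bytes are signed: >= 0 means the 0x80 bit is clear
                lowOctets++;
            }
        }
        return asciiChars == lowOctets;
    }

    public static void main(String[] args) {
        String path = "\u7f51\u7edc/a.pcap"; // two Chinese characters, then ASCII
        System.out.println(lowOctetsMatchAscii(path, StandardCharsets.UTF_8));    // true
        System.out.println(lowOctetsMatchAscii(path, StandardCharsets.UTF_16LE)); // false
    }
}
```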
As for the other user who filed
http://jnetpcap.com/node/456
he's using Windows (as per the reference to MinGW and jnetpcap.dll),
so his problem may ultimately be caused by the lack of
pcap_wopen_offline().
>
> Cheers,
> mark..
> http://jnetpcap.com