Python Standard Encodings / Python Specific Encodings

发布于 2021-10-04  990 次阅读

Python Standard Encodings

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.

Many of the character sets support the same languages. They vary in individual characters (e.g. whether the EURO SIGN is supported or not), and in the assignment of characters to code positions. For the European languages in particular, the following variants typically exist:

  • an ISO 8859 codeset
  • a Microsoft Windows code page, which is typically derived from an 8859 codeset, but replaces control characters with additional graphic characters
  • an IBM EBCDIC code page
  • an IBM PC code page, which is ASCII compatible
Codec Aliases Languages
ascii 646, us-ascii English
big5 big5-tw, csbig5 Traditional Chinese
big5hkscs big5-hkscs, hkscs Traditional Chinese
cp037 IBM037, IBM039 English
cp424 EBCDIC-CP-HE, IBM424 Hebrew
cp437 437, IBM437 English
cp500 EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 Western Europe
cp720 Arabic
cp737 Greek
cp775 IBM775 Baltic languages
cp850 850, IBM850 Western Europe
cp852 852, IBM852 Central and Eastern Europe
cp855 855, IBM855 Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp856 Hebrew
cp857 857, IBM857 Turkish
cp858 858, IBM858 Western Europe
cp860 860, IBM860 Portuguese
cp861 861, CP-IS, IBM861 Icelandic
cp862 862, IBM862 Hebrew
cp863 863, IBM863 Canadian
cp864 IBM864 Arabic
cp865 865, IBM865 Danish, Norwegian
cp866 866, IBM866 Russian
cp869 869, CP-GR, IBM869 Greek
cp874 Thai
cp875 Greek
cp932 932, ms932, mskanji, ms-kanji Japanese
cp949 949, ms949, uhc Korean
cp950 950, ms950 Traditional Chinese
cp1006 Urdu
cp1026 ibm1026 Turkish
cp1140 ibm1140 Western Europe
cp1250 windows-1250 Central and Eastern Europe
cp1251 windows-1251 Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp1252 windows-1252 Western Europe
cp1253 windows-1253 Greek
cp1254 windows-1254 Turkish
cp1255 windows-1255 Hebrew
cp1256 windows-1256 Arabic
cp1257 windows-1257 Baltic languages
cp1258 windows-1258 Vietnamese
euc_jp eucjp, ujis, u-jis Japanese
euc_jis_2004 jisx0213, eucjis2004 Japanese
euc_jisx0213 eucjisx0213 Japanese
euc_kr euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 Korean
gb2312 chinese, csiso58gb231280, euc- cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso- ir-58 Simplified Chinese
gbk 936, cp936, ms936 Unified Chinese
gb18030 gb18030-2000 Unified Chinese
hz hzgb, hz-gb, hz-gb-2312 Simplified Chinese
iso2022_jp csiso2022jp, iso2022jp, iso-2022-jp Japanese
iso2022_jp_1 iso2022jp-1, iso-2022-jp-1 Japanese
iso2022_jp_2 iso2022jp-2, iso-2022-jp-2 Japanese, Korean, Simplified Chinese, Western Europe, Greek
iso2022_jp_2004 iso2022jp-2004, iso-2022-jp-2004 Japanese
iso2022_jp_3 iso2022jp-3, iso-2022-jp-3 Japanese
iso2022_jp_ext iso2022jp-ext, iso-2022-jp-ext Japanese
iso2022_kr csiso2022kr, iso2022kr, iso-2022-kr Korean
latin_1 iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 West Europe
iso8859_2 iso-8859-2, latin2, L2 Central and Eastern Europe
iso8859_3 iso-8859-3, latin3, L3 Esperanto, Maltese
iso8859_4 iso-8859-4, latin4, L4 Baltic languages
iso8859_5 iso-8859-5, cyrillic Bulgarian, Byelorussian, Macedonian, Russian, Serbian
iso8859_6 iso-8859-6, arabic Arabic
iso8859_7 iso-8859-7, greek, greek8 Greek
iso8859_8 iso-8859-8, hebrew Hebrew
iso8859_9 iso-8859-9, latin5, L5 Turkish
iso8859_10 iso-8859-10, latin6, L6 Nordic languages
iso8859_11 iso-8859-11, thai Thai languages
iso8859_13 iso-8859-13, latin7, L7 Baltic languages
iso8859_14 iso-8859-14, latin8, L8 Celtic languages
iso8859_15 iso-8859-15, latin9, L9 Western Europe
iso8859_16 iso-8859-16, latin10, L10 South-Eastern Europe
johab cp1361, ms1361 Korean
koi8_r Russian
koi8_u Ukrainian
mac_cyrillic maccyrillic Bulgarian, Byelorussian, Macedonian, Russian, Serbian
mac_greek macgreek Greek
mac_iceland maciceland Icelandic
mac_latin2 maclatin2, maccentraleurope Central and Eastern Europe
mac_roman macroman Western Europe
mac_turkish macturkish Turkish
ptcp154 csptcp154, pt154, cp154, cyrillic-asian Kazakh
shift_jis csshiftjis, shiftjis, sjis, s_jis Japanese
shift_jis_2004 shiftjis2004, sjis_2004, sjis2004 Japanese
shift_jisx0213 shiftjisx0213, sjisx0213, s_jisx0213 Japanese
utf_32 U32, utf32 all languages
utf_32_be UTF-32BE all languages
utf_32_le UTF-32LE all languages
utf_16 U16, utf16 all languages
utf_16_be UTF-16BE all languages (BMP only)
utf_16_le UTF-16LE all languages (BMP only)
utf_7 U7, unicode-1-1-utf-7 all languages
utf_8 U8, UTF, utf8 all languages
utf_8_sig all languages

Python Specific Encodings

A number of predefined codecs are specific to Python, so their codec names have no meaning outside Python. These are listed in the tables below based on the expected input and output types (note that while text encodings are the most common use case for codecs, the underlying codec infrastructure supports arbitrary data transforms rather than just text encodings). For asymmetric codecs, the stated purpose describes the encoding direction.

The following codecs provide unicode-to-str encoding [1] and str-to-unicode decoding [2], similar to the Unicode text encodings.

Codec Aliases Purpose
idna Implements RFC 3490, see also encodings.idna
mbcs dbcs Windows only: Encode operand according to the ANSI codepage (CP_ACP)
palmos Encoding of PalmOS 3.5
punycode Implements RFC 3492
raw_unicode_escape Produce a string that is suitable as raw Unicode literal in Python source code
rot_13 rot13 Returns the Caesar-cypher encryption of the operand
undefined Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired.
unicode_escape Produce a string that is suitable as Unicode literal in Python source code
unicode_internal Return the internal representation of the operand

New in version 2.3: The idna and punycode encodings.

The following codecs provide str-to-str encoding and decoding [2].

Codec Aliases Purpose Encoder/decoder
base64_codec base64, base-64 Convert operand to multiline MIME base64 (the result always includes a trailing 'n') base64.encodestring(),base64.decodestring()
bz2_codec bz2 Compress the operand using bz2 bz2.compress()bz2.decompress()
hex_codec hex Convert operand to hexadecimal representation, with two digits per byte binascii.b2a_hex()binascii.a2b_hex()
quopri_codec quopri, quoted-printable, quotedprintable Convert operand to MIME quoted printable quopri.encode() with quotetabs=True,quopri.decode()
string_escape Produce a string that is suitable as string literal in Python source code
uu_codec uu Convert the operand using uuencode uu.encode()uu.decode()
zlib_codec zip, zlib Compress the operand using gzip zlib.compress()zlib.decompress()
[1] str objects are also accepted as input in place of unicode objects. They are implicitly converted to unicode by decoding them using the default encoding. If this conversion fails, it may lead to encoding operations raising UnicodeDecodeError.
[2] (12) unicode objects are also accepted as input in place of str objects. They are implicitly converted to str by encoding them using the default encoding. If this conversion fails, it may lead to decoding operations raising UnicodeEncodeError.