Different encodings in PHP

  1. Publisher
  2. x64 (aka andi)

Novice script writers tend not to care about encoding. As a result, you can sometimes find a terrible mess on sites: the data comes from the database in one encoding, the page is built in another, and the server announces a third. Such a page can be deciphered, if at all, only on the second attempt at best. So why does this problem occur, and how can it be overcome?

In the Russian segment of the web, the most common encoding is the so-called Windows encoding. It goes by several names: windows-1251, cp1251, or even ANSI. Next comes utf-8. You may also see it called Unicode, but that is not quite accurate, since Unicode is the general name of the standard behind a whole group of encodings (utf-8, utf-16, utf-32). A rarity these days is koi8-r, or simply koi-8, once a popular encoding on Linux. Of course, you may come across something else in the Russian segment, but that is more a whim of the author.

The main difference between utf-8 and the others (primarily windows-1251 and koi8-r) is that the latter are single-byte, so the number of characters they can represent is limited to 256. Needless to say, that may not be enough to represent a text in full. For HTML a workaround was found: so-called mnemonics (character entities). For example:

© - &copy;

Besides the fact that each such character is written as a group of characters, the markup becomes hard to read and the text harder to work with. This is where multibyte utf-8 comes to the rescue: it makes it easy to use letters from different alphabets and all kinds of symbols in a single text.
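To make the difference concrete, here is a minimal sketch. It assumes PHP 7 or later with the mbstring extension loaded, and the variable names are only illustrative. It compares the byte length of the same Russian word in utf-8 and in windows-1251, and shows that the &copy; mnemonic decodes to an ordinary utf-8 character.

```php
<?php
// "Привет" written with \u escapes so the example does not depend
// on the encoding this file itself is saved in.
$utf8 = "\u{41F}\u{440}\u{438}\u{432}\u{435}\u{442}";

// The same text converted to the single-byte windows-1251 encoding.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

$bytesUtf8   = strlen($utf8);   // 12: two bytes per Cyrillic letter
$bytesCp1251 = strlen($cp1251); // 6: one byte per letter

// The copyright sign as an HTML mnemonic and as a plain UTF-8 character.
$copy     = html_entity_decode('&copy;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$sameChar = ($copy === "\u{A9}"); // true: both are U+00A9
```

With utf-8 pages the character can simply be typed into the text, so the mnemonic is needed only for characters that are special in HTML itself.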

Thus, the most comfortable starting point is this: the encoding of the database, the PHP scripts, and the HTML pages / JS scripts should be the same. You can use different ones, of course, but then there is a risk of getting confused. Which code page is used does not matter in itself. If the site is intended only for a Russian-speaking audience, windows-1251 is quite sufficient; otherwise utf-8 is the logical choice. The first option is more or less straightforward. A multibyte encoding takes some extra effort.

When working with utf-8, the standard Windows Notepad will not do! When saving a file in this encoding, that editor adds a signature at the beginning: 3 bytes, the so-called BOM (byte order mark), which can be used to detect the encoding when the file is opened. In a PHP script these bytes sit before the opening <?php tag, so they are sent to the browser as output ahead of everything else and can prevent headers from being sent. It is better to choose another editor, such as Notepad2 or Notepad++, and to select saving without a signature in its settings.
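If a file with a signature has already slipped into the project, the signature can be detected and removed programmatically. The following is only a sketch: stripUtf8Bom is a hypothetical helper name, not a built-in PHP function, and it relies on the fact that the utf-8 signature is the byte sequence EF BB BF.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM (bytes EF BB BF) if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3); // drop exactly the three signature bytes
    }
    return $text; // no signature, return the text unchanged
}

$withBom = "\xEF\xBB\xBFhello";
$clean   = stripUtf8Bom($withBom); // "hello"
```

The same check can be applied to the result of file_get_contents() when processing templates or data files that were saved by an unknown editor.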

The next important step is working with the database. It is highly desirable that the encoding of the database / table / text field match the encoding of the scripts (cp1251, utf-8, or something else). If data from the database comes back as gibberish, the connection encoding most likely differs from the encoding of the stored data. The following query helps (run it immediately after connecting to the database; shown here in its MySQL form):

SET NAMES utf8

If the site uses windows-1251, specify cp1251 instead.
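As an illustration, and assuming the database is MySQL accessed through the mysqli extension (the text does not name the database, so this is an assumption), the connection encoding can be set right after connecting. The helper mysqlCharsetFor and the connection parameters are hypothetical. Note that the sketch picks utf8mb4 rather than utf8, because MySQL's utf8 stores at most three bytes per character.

```php
<?php
// Hypothetical helper: maps a page charset label to the MySQL character set name.
function mysqlCharsetFor(string $pageCharset): string
{
    $map = [
        'utf-8'        => 'utf8mb4',
        'windows-1251' => 'cp1251',
        'koi8-r'       => 'koi8r',
    ];
    $key = strtolower($pageCharset);
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("Unsupported charset: $pageCharset");
    }
    return $map[$key];
}

// Usage sketch (placeholder credentials, left commented out on purpose):
// $db = new mysqli('localhost', 'user', 'password', 'site');
// $db->set_charset(mysqlCharsetFor('utf-8')); // has the effect of SET NAMES
```

Calling set_charset() is generally preferable to sending the query by hand, because the client library then also knows the encoding when escaping strings.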

On the whole, there is nothing difficult here. The only catch is that the standard PHP string functions are not designed to work with multibyte strings. There are, however, standard extensions that fix this: iconv and mbstring. Regular expressions need a switch of their own, enabled with the u modifier.
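A short sketch of what "not designed for multibyte strings" means in practice, assuming PHP 7 or later with mbstring loaded. The byte-oriented substr cuts a Cyrillic letter in half, while the mb_ counterparts work on whole characters.

```php
<?php
// "привет" written with \u escapes, so the file encoding does not matter.
$s = "\u{43F}\u{440}\u{438}\u{432}\u{435}\u{442}";

// substr counts bytes: 3 bytes is one and a half Cyrillic letters.
$broken      = substr($s, 0, 3);
$brokenValid = mb_check_encoding($broken, 'UTF-8'); // false: truncated character

// mb_substr counts characters: the first three letters, "при".
$first3 = mb_substr($s, 0, 3, 'UTF-8');

// mb_strtoupper knows about Cyrillic case: "ПРИВЕТ".
$upper = mb_strtoupper($s, 'UTF-8');
```

Passing the encoding explicitly, as here, avoids depending on the mbstring default configured on the server.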

So, the data has been fetched from the database and the scripts are written by the book. What remains is to send the correct header and output the page to the user's browser. The header is sent like this:

header('Content-Type: text/html; charset=utf-8');

If a single-byte encoding is used, the charset value will be different: windows-1251. After that, no problems should remain.
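One way to keep the header consistent across a site is to build it in a single place. This is only a sketch: htmlContentType is a hypothetical helper name, and the headers_sent() guard is there because a header can only be sent before any output.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function htmlContentType(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be sent before any output (including a stray BOM).
if (!headers_sent()) {
    header(htmlContentType('utf-8'));
}
```

For a site in a single-byte encoding the call becomes htmlContentType('windows-1251').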

A few simple examples of working with utf-8 in PHP:

Example 1: iconv, the number of characters in a string

$s = 'строка'; # string in UTF-8
$cnt1 = strlen($s); # will contain 12, the number of bytes
$cnt2 = iconv_strlen($s, 'UTF-8'); # correct value, 6

Example 2: mbstring, the number of characters in a string

$s = 'строка'; # string in UTF-8
$cnt1 = strlen($s); # will contain 12, the number of bytes
$cnt2 = mb_strlen($s, 'UTF-8'); # correct value, 6

Example 3: regular expressions, search and replace

$s = 'Рок'; # string in UTF-8
$s = preg_replace('/р/i', 'д', $s); # no replacement happens
$s = preg_replace('/р/iu', 'д', $s); # the result is the word 'док'

The i modifier makes the search case-insensitive, and the u modifier tells the regular expression engine to treat the pattern and the subject as utf-8 strings. Without u, the engine compares bytes, and the uppercase and lowercase Cyrillic letters have different byte sequences.
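The effect of the u modifier is easy to see by counting what the dot matches. A minimal sketch, assuming PHP 7 or later; the variable names are illustrative.

```php
<?php
// "Строка" written with \u escapes, so the file encoding does not matter.
$s = "\u{421}\u{442}\u{440}\u{43E}\u{43A}\u{430}";

// Without u the dot matches one byte at a time.
$bytes = preg_match_all('/./', $s);

// With u the dot matches one whole character at a time.
$chars = preg_match_all('/./u', $s);

// Splitting into characters also needs u; without it the string
// would be cut into separate bytes.
$letters = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
```

Here $bytes is 12 and $chars is 6, matching the strlen and mb_strlen results for a six-letter Russian word.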

Anyone who says that PHP cannot work with utf-8 is wrong. I have been doing all my projects in this encoding for several years now and have had no problems at all. The search engines themselves have long been using this wonderful encoding.

