
Byte

Byte (symbol: B) is a basic unit of information, nowadays practically always consisting of 8 bits (in which case it is also called an octet), which allows it to store 2^8 = 256 distinct values (for example a number in range 0 to 255). It is usually the smallest unit of memory a CPU is able to operate on, and memory addresses are assigned in steps of one byte. We use bytes to measure the size of memory and derive higher memory units from it, such as a kilobyte (kB, 1000 bytes), kibibyte (KiB, 1024 bytes), megabyte (MB, 10^6 bytes) etc. In programming a one byte variable is nowadays seen as very small and is used if we are really limited by memory constraints (e.g. embedded) or to mimic older 8bit computers ("retro games" etc.): one byte can be used to store very small numbers (while in mainstream processors numbers nowadays mostly take 4 or 8 bytes), text characters (ASCII, ...), very primitive colors (see RGB332, palettes, ...) etc.
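For example, a minimal C sketch (assuming the common 8bit byte) showing a one byte variable, its limited range and the wrap-around that happens on overflow:

```
#include <stdio.h>
#include <stdint.h>

int main(void)
{
  uint8_t b = 255; /* one byte variable, can hold values 0 to 255 */

  printf("%d\n", b); /* prints 255 */

  b++; /* 256 doesn't fit in 8 bits, the value wraps around to 0 */

  printf("%d\n", b); /* prints 0 */

  return 0;
}
```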

Historically byte stood for the basic addressable unit of memory that could store one text character or another "basic value" and could therefore have a size different from 8 bits: e.g. ASCII machines might have had a 7bit byte, 16bit machines a 16bit byte etc.; in C (C99) char is the "byte" data type, its size in bytes is always 1 (sizeof(char) == 1), though its number of bits (CHAR_BIT) can be greater than or equal to 8; if you need an exact 8bit byte use types such as int8_t and uint8_t from the standard stdint.h header. From now on we will implicitly talk about 8bit bytes.
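A small C sketch of the above (all names here are standard C99):

```
#include <stdio.h>
#include <limits.h> /* for CHAR_BIT */
#include <stdint.h> /* for int8_t, uint8_t */

int main(void)
{
  printf("%d\n", (int) sizeof(char)); /* always prints 1, by definition */
  printf("%d\n", CHAR_BIT);           /* at least 8, on common platforms exactly 8 */

  uint8_t u = 200; /* exactly 8 bits, unsigned: 0 to 255 */
  int8_t s = -100; /* exactly 8 bits, signed: -128 to 127 */

  printf("%d %d\n", u, s);

  return 0;
}
```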

The value of one byte can be written exactly with two hexadecimal digits, each digit always corresponding to the higher/lower 4 bits, which makes mental conversion between binary and hexadecimal very easy; this is very convenient compared to decimal representation, so programmers prefer to write byte values in hexadecimal. For example a byte whose binary value is 11010010 is D2 in hexadecimal (1101 is always D and 0010 is always 2), while in decimal it is 210.
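A quick C sketch of this correspondence, extracting the two hexadecimal digits of the byte from the example above:

```
#include <stdio.h>

int main(void)
{
  unsigned char b = 0xd2; /* binary 11010010, decimal 210 */

  /* each hexadecimal digit maps to one half (4 bits) of the byte: */
  printf("%x %x\n", b >> 4, b & 0x0f); /* prints "d 2" */
  printf("%d\n", b);                   /* prints "210" */

  return 0;
}
```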

Byte frequency/probability: it may be interesting and/or useful (e.g. for compression) to know how often different byte values appear in the data we process with computers -- indeed, this always DEPENDS; if we are working with plain ASCII text, we will never encounter values above 127, and on the other hand if we are processing photos from a polar expedition, the byte value 255 will likely be the most common one (as snow will make most pixels completely white). In general we may expect values such as 0, 255, 1 and 2 to be most frequent, as many times these are e.g. assigned special meanings in data encodings, they may be cutoff values etc. Here is a table of measured byte frequencies in real data (a simple program for measuring this is sketched below the table):

{ Measured by me. ~drummyfish }

| type of data              | least common | 2nd least common | 3rd least common | 3rd most common | 2nd most common | most common   |
| ------------------------- | ------------ | ---------------- | ---------------- | --------------- | --------------- | ------------- |
| GNU/Linux x86 executable  | 0x9e (0%)    | 0xb2 (0%)        | 0x9a (0%)        | 0x48 (2%)       | 0xff (3%)       | 0x00 (32%)    |
| bare metal ARM executable | 0xcf (0%)    | 0xb7 (0%)        | 0xa7 (0%)        | 0xff (2%)       | 0x01 (3%)       | 0x00 (15%)    |
| UTF8 English txt book     | 0x00 (0%)    | 0x01 (0%)        | 0x02 (0%)        | 0x74 (t, 6%)    | 0x65 (e, 8%)    | 0x20 ( , 14%) |
| C source code             | 0x00 (0%)    | 0x01 (0%)        | 0x02 (0%)        | 0x31 (1, 6%)    | 0x20 ( , 12%)   | 0x2c (,, 16%) |
| raw 24bit RGB photo image | 0x07 (0%)    | 0x09 (0%)        | 0x08 (0%)        | 0xdd (0%)       | 0x00 (1%)       | 0xff (25%)    |
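
A minimal C sketch of how such frequencies can be measured (assuming an 8bit byte): it reads data from standard input byte by byte and prints a count for each byte value that appeared.

```
#include <stdio.h>

int main(void)
{
  unsigned long counts[256] = {0};
  int b;

  while ((b = getchar()) != EOF) /* read input byte by byte */
    counts[b]++;

  for (int i = 0; i < 256; ++i) /* print counts of byte values that appeared */
    if (counts[i])
      printf("0x%02x: %lu\n", i, counts[i]);

  return 0;
}
```

Run it e.g. as ./bytefreq < somefile; dividing each count by the total number of bytes gives the percentages as in the table above.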