# Byte
Byte (symbol: B) is a basic unit of [information](information.md), nowadays practically always consisting of 8 [bits](bit.md) (for which it's also called an **octet**), which allow it to store 2^8 = 256 distinct values (for example a number in range 0 to 255). It is commonly the smallest unit of computer memory a [CPU](cpu.md) is able to operate on; memory addresses are assigned in steps of one byte. We use bytes to measure the size of [memory](memory.md) and derive higher memory [units](memory_units.md) such as a kilobyte (kB, 1000 bytes), kibibyte (KiB, 1024 bytes), megabyte (MB, 10^6 bytes) and so forth. In conventional [programming](programming.md) a one byte [variable](variable.md) is seen as very small and is used if we are really limited by memory constraints (e.g. [embedded](embedded.md)) or to mimic older 8bit computers ("[retro](retro.md) games" etc.): one byte can be used to store very small numbers (while on mainstream processors numbers nowadays mostly have 4 or 8 bytes), text characters ([ASCII](ascii.md), ...), very primitive [colors](color.md) (see [RGB332](rgb332.md), [palettes](palette.md), ...) etc.
Historically *byte* stood for the basic addressable unit of memory capable of storing one text character or another "basic value" and could therefore have a different size than 8 bits: for example ASCII machines might have had a 7bit byte, 16bit machines a 16bit byte etc. In [C](c.md) (standard 99) `char` is the "byte" data type: its size in bytes is always 1 (`sizeof(char) == 1`), though its number of bits (`CHAR_BIT`) can be greater than or equal to 8; if you need an exact 8bit byte use types such as `int8_t` and `uint8_t` from the standard `stdint.h` header (they exist only on platforms that have such a type). From now on we will implicitly talk about 8bit bytes.
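
The C facts above can be checked with a few lines of code. The following is only a minimal sketch, assuming a platform that provides `uint8_t`; the function names `bitsPerByte`, `addBytes` and `printByteInfo` are made up for this illustration.

```c
#include <limits.h> // CHAR_BIT
#include <stdint.h> // uint8_t
#include <stdio.h>

// Returns the number of bits in one C "byte" (char), always at least 8.
unsigned int bitsPerByte(void)
{
  return CHAR_BIT;
}

// Adds two 8bit values, the result wraps around modulo 256.
uint8_t addBytes(uint8_t a, uint8_t b)
{
  return (uint8_t) (a + b);
}

// Prints what the platform says about its bytes.
void printByteInfo(void)
{
  printf("sizeof(char) = %u\n", (unsigned int) sizeof(char));
  printf("CHAR_BIT = %u\n", bitsPerByte());
  printf("250 + 10 in one byte = %u\n", (unsigned int) addBytes(250, 10));
}
```

Calling `printByteInfo()` from `main` prints the values for the given platform; with an 8bit byte the last sum wraps around to 4, because 260 doesn't fit into the range 0 to 255.
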
**The value of one byte can be written exactly with two [hexadecimal](hexadecimal.md) digits**, each digit always corresponding to the higher/lower 4 bits, which makes mental conversions very easy; this is very convenient compared to [decimal](decimal.md) representation, so programmers prefer to write byte values in hexadecimal. For example a byte whose binary value is *11010010* is *D2* in hexadecimal (*1101* is always *D* and *0010* is always *2*), while in decimal we get 210.
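
The digit-by-digit correspondence can be sketched in C as follows. This is just an illustration, assuming 8bit bytes; the function name `byteToHex` is made up for this example.

```c
#include <stdint.h>

// Writes the two hexadecimal digits of given byte plus a terminating 0
// to out (which must have space for at least 3 chars).
void byteToHex(uint8_t byte, char out[3])
{
  const char digits[] = "0123456789ABCDEF";

  out[0] = digits[byte >> 4];   // higher 4 bits give the first digit
  out[1] = digits[byte & 0x0f]; // lower 4 bits give the second digit
  out[2] = 0;
}
```

For the byte from the example (decimal 210) the function writes the string `D2`: each half of the byte is converted on its own, without any division or carrying between digits, which is exactly what makes the conversion easy to do in one's head.
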
**Byte frequency/probability**: it may be [interesting](interesting.md) and/or useful (e.g. for [compression](compression.md)) to know how often different byte values appear in the data we process with computers -- of course, this always DEPENDS: if we are working with plain [ASCII](ascii.md) text, we will never encounter values above 127, and on the other hand if we are processing photos from a polar expedition, we will likely mostly encounter byte values of 255 (as snow will cause most pixels to be completely white). In general we may expect values such as [0](zero.md), 255, [1](one.md) and [2](two.md) to be most frequent, as these are often e.g. assigned special meanings in data encodings, they may be cutoff values etc. Here is a table of measured byte frequencies in real data:
{ Measured by me. ~drummyfish }
| type of data | least common | 2nd least common | 3rd least common | 3rd most common | 2nd most common | most common |
| -------------------------- | ------------ | ---------------- | ---------------- | --------------- | --------------- | ------------- |
| GNU/Linux x86 executable | 0x9e (0%) | 0xb2 (0%) | 0x9a (0%) | 0x48 (2%) | 0xff (3%) | 0x00 (32%) |
| bare metal ARM executable | 0xcf (0%) | 0xb7 (0%) | 0xa7 (0%) | 0xff (2%) | 0x01 (3%) | 0x00 (15%) |
| UTF8 English txt book | 0x00 (0%) | 0x01 (0%) | 0x02 (0%) |0x74 (`t`, 6%)|0x65 (`e`, 8%) |0x20 (` `, 14%)|
| C source code | 0x00 (0%) | 0x01 (0%) | 0x02 (0%) |0x31 (`1`, 6%)|0x20 (` `, 12%)|0x2c (`,`, 16%)|
| raw 24bit RGB photo image | 0x07 (0%) | 0x09 (0%) | 0x08 (0%) | 0xdd (0%) | 0x00 (1%) | 0xff (25%) |
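
Frequencies like the ones in the table can be obtained by simply counting occurrences of each of the 256 possible values. The following is a minimal sketch and not necessarily the method used for the measurements above; it assumes 8bit bytes and data already loaded in memory, and the function names `countBytes` and `mostCommonByte` are made up for this example.

```c
#include <stddef.h> // size_t
#include <stdint.h> // uint8_t

// Counts how many times each byte value occurs in data; counts must have
// 256 items, which get zeroed first.
void countBytes(const uint8_t *data, size_t length, unsigned long counts[256])
{
  for (int i = 0; i < 256; ++i)
    counts[i] = 0;

  for (size_t i = 0; i < length; ++i)
    counts[data[i]]++;
}

// Returns the most frequent byte value (the lowest one in case of a tie).
uint8_t mostCommonByte(const unsigned long counts[256])
{
  int best = 0;

  for (int i = 1; i < 256; ++i)
    if (counts[i] > counts[best])
      best = i;

  return (uint8_t) best;
}
```

Dividing `counts[i]` by the data length gives the relative frequency of value `i`; to process a file one could read it byte by byte (e.g. with `fgetc`) and increment the corresponding counter the same way.
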