less_retarded_wiki/3d_rendering.md
2024-08-31 14:44:45 +02:00

3D Rendering

See also 3D modeling.

In computer graphics 3D rendering is the process of computing images which represent a projected view of 3D objects through a virtual camera.

There are many methods and algorithms for doing so, differing in many aspects such as computational complexity, implementation complexity, realism of the result, representation of the 3D data, limitations of viewing and so on. If you are just interested in the realtime 3D rendering used in gaymes nowadays, you are probably interested in GPU-accelerated 3D rasterization with APIs such as OpenGL and Vulkan.

LRS has a simple 3D rendering library called small3dlib.

Methods

As most existing 3D "frameworks" are harmful, a LRS programmer is likely to write his own 3D rendering system that suits his program best, therefore we should list some common methods of achieving 3D. Besides that, it's just pretty interesting to see what there is in store.

A very important realization of a graphics programmer is that 3D rendering is to a great extent about faking (especially the mainstream realtime 3D) -- it is an endeavor that seeks to produce something that looks somehow familiar to HUMAN sight specifically and so even though the methods are mathematical, the endeavor is really an art in the end, not dissimilar to that of a magician who searches for "smoke and mirrors" hacks to produce illusions for the audience. Reality is infinitely complex, so we use nothing but approximations and simplifications that rely on assumptions about human sight such as "60 FPS looks like smooth movement to human eye", "infrared spectrum is invisible", "humans can't tell a mirror reflection is a bit off", "inner corners are usually darker than flat surfaces", "no shadow is completely black because light scatters in the atmosphere" etc. Really 3D graphics is nothing but searching for what looks good enough, and deciding this relies on a SUBJECTIVE judgement of a human (and sometimes every individual). In theory -- if we had infinitely powerful computers -- we would just program in a few lines of electromagnetic equations and run the precise simulation of light propagating in 3D environment to produce an absolutely realistic result, but though some methods try to come close to said approach, we simply won't ever have infinitely powerful computers. Because of this we have to resort to a somewhat uglier approach of identifying specific notable real-life phenomena (for example caustics, Fresnel, mirror reflections, refractions, subsurface scattering, metallicity, noise, motion blur and myriads of others) and addressing each one individually with special treatment, many times correcting and masking our imperfections (e.g. applying antialiasing because we dared to use a simplified model of light sampling, applying texture filtering because we dared to only use finite amount of memory for our data, applying postprocessing etc.).

Rendering spectrum: The book Real-Time Rendering mentions that methods for 3D rendering can be seen as lying on a spectrum, one extreme of which is appearance reproduction and the other physics simulation. Methods closer to trying to imitate the appearance try to simply focus on imitating the look of an object on the monitor that the actual 3D object would have in real life, without being concerned with how that look arises in real life (i.e. closer to the "faking" approach mentioned above) -- these may e.g. use image data such as photographs; these methods may rely on lightfields, photo textures etc. The physics simulation methods try to replicate the behavior of light in real life -- their main goal is to solve the rendering equation, still only more or less approximately -- and so, through internally imitating the same processes, come to similar visual results that arise in real world: these methods rely on creating 3D geometry (e.g. that made of triangles or voxels), computing light reflections and global illumination. This is often easier to program but more computationally demanding. Most methods lie somewhere in between these two extremes: for example billboards and particle systems may use a texture to represent an object while at the same time using 3D quads (very simple 3D models) to correctly deform the textures by perspective and solve their visibility. The classic polygonal 3D models are also usually somewhere in between: the 3D geometry and shading are trying to simulate the physics, but e.g. a photo texture mapped on such 3D model is the opposite appearance-based approach (PBR further tries to shift the use of textures more towards the physics simulation end).

With this said, let's now take a look at possible classifications of 3D rendering methods. As seen, there are many ways:

  • by order:
    • object order: The method iterates on objects and draws object by object, one after another. This results in pixels being drawn to "random" places on the screen and possibly already drawn pixels being overdrawn with new pixels (though this can be further reduced). Typically requires a frame buffer and double buffering, often also z-buffer (or sorting), i.e. requires a lot of memory. This method is also a bit ugly but typically also faster than the alternative, so it is prevailing nowadays.
    • image order: The method iterates on screen pixels, typically going pixel by pixel from left to right, top to bottom, deciding the color of each pixel independently. May be easier to program and require less memory (no frame buffer is needed, see e.g. frameless rendering), however though parallelism is applicable here (many pixels may potentially be independently computed in parallel, speeding up rendering), the algorithms used (e.g. path tracing) often have to expensively simulate light behavior and so performance is still an issue.
  • by speed:
    • realtime: Able to render at interactive FPS, typically used in games etc.
    • offline: Spends a lot of time (even many minutes) on rendering each frame with the goal to produce output of extreme quality, typically used to render 3D movies etc.
  • by relative limitation:
    • primitive/"pseudo3D"/2.5D/...: Older methods that produce 3D views but had great limitations e.g. in camera degrees of freedom or possible environment geometry that was usually limited to a "2D sector map" (see e.g. Doom).
    • full/"true" 3D: The "new" way of 3D rendering that allows freely rotating camera, arbitrary 3D geometry etc. Though this still has limitations (as any computer approximation of reality), many people just call this the "true" 3D.
  • by approach (sides of above mentioned rendering spectrum):
    • appearance based: Focuses on achieving desired appearance by any means necessary, faking, "cheating", not trying to stay physically correct. This is typically faster.
    • physics simulation (see also physically based rendering): Focuses on simulating the underlying physics of reality with high correctness so that we also get a very realistic result.
  • by main method/algorithm (see also the table below):
    • rasterization: Appearance based object order methods further based on a relatively simple algorithm capable of drawing (rasterizing) a simple geometric shape (such as a triangle) which we then use to draw the whole complex 3D scene (composed of great many of triangles).
    • ray casting/tracing: Physics simulation image order methods further based on tracing paths of light in a manner that's closer to reality.
    • ...
  • by 3D data (vector vs raster classification applies here just as in 2D graphics):
  • by hardware:
    • software rendering: Rendering only with CPU. This is typically slower as a CPU mostly performs sequential computation, eliminating the possible parallelism optimization, however the approach is more KISS and portable.
    • GPU accelerated: Making use of specialized graphics rendering hardware (GPU) that typically uses heavy parallelism to immensely speed up rendering. While this is the mainstream, extremely fast way of rendering, it is also greatly bloated while often being an overkill that greatly complicates programming and makes programs less portable, less future proof etc.
  • by realism of output:
  • hybrids: Methods may be combined and/or lie in between different extremes, for example we may see a rasterizer 3D renderer that uses ray tracing to add detail (shadows, reflections, ...) to the scene, we may see renderers that allow triangle meshes as well as voxels etc. { One nice hybrid looking engine is e.g. Chasm: The Rift. ~drummyfish }
  • ...
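To make the object order vs image order distinction above a bit more concrete, here is a minimal sketch in C. Everything in it is an assumption made up for illustration (the `Rect` type, the two-rectangle `scene`, the tiny `W` x `H` ASCII image): the "3D objects" are just flat rectangles with a depth. Both functions should produce the same image, they only differ in which loop is the outer one.

```c
#include <limits.h>

#define W 16          // hypothetical image width
#define H 8           // hypothetical image height
#define RECT_COUNT 2

// Hypothetical scene object: screen rectangle [x0,x1) x [y0,y1) at some depth.
typedef struct { int x0, y0, x1, y1, depth; char c; } Rect;

static const Rect scene[RECT_COUNT] =
{
  {2, 1, 9, 5, 10, 'a'},  // nearer rectangle
  {6, 3, 14, 7, 20, 'b'}  // farther rectangle
};

// Object order: outer loop goes over objects, each one writes its pixels.
// A z-buffer resolves overlaps, pixels may get overdrawn.
void renderObjectOrder(char image[W * H])
{
  int zBuffer[W * H];

  for (int i = 0; i < W * H; ++i)
  {
    image[i] = ' ';
    zBuffer[i] = INT_MAX;
  }

  for (int r = 0; r < RECT_COUNT; ++r)
    for (int y = scene[r].y0; y < scene[r].y1; ++y)
      for (int x = scene[r].x0; x < scene[r].x1; ++x)
        if (scene[r].depth < zBuffer[y * W + x])
        {
          zBuffer[y * W + x] = scene[r].depth;
          image[y * W + x] = scene[r].c;
        }
}

// Image order: outer loop goes over pixels, each pixel independently finds
// the nearest object covering it. No z-buffer is needed.
void renderImageOrder(char image[W * H])
{
  for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x)
    {
      int nearest = INT_MAX;
      char c = ' ';

      for (int r = 0; r < RECT_COUNT; ++r)
        if (x >= scene[r].x0 && x < scene[r].x1 &&
            y >= scene[r].y0 && y < scene[r].y1 &&
            scene[r].depth < nearest)
        {
          nearest = scene[r].depth;
          c = scene[r].c;
        }

      image[y * W + x] = c;
    }
}
```

Notice that the object order version needs the extra `zBuffer` array while the image order version only needs a few local variables per pixel, which matches the memory tradeoff mentioned in the list.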

Finally a table of some common 3D rendering methods follows, including the most simple, most advanced and some unconventional ones. Note that here we talk about methods and techniques rather than algorithms, i.e. general approaches that are often modified and combined into a specific rendering algorithm. For example the traditional triangle rasterization is sometimes combined with raytracing to add e.g. realistic reflections. The methods may also be further enriched with features such as texturing, antialiasing and so on. The table below should help you choose the base 3D rendering method for your specific program.

The methods may be tagged with the following:

  • 2.5D: primitive 3D, often called pseudo 3D or fake 3D, having significant limitations e.g. in degrees of freedom of the camera
  • off: slow method usually used for offline (non-realtime) rendering (even though they indeed may run in real time e.g. with the help of powerful GPUs)
  • IO vs OO: image order (rendering by pixels) vs object order (rendering by objects)

| method | notes |
| --- | --- |
| 3D raycasting | IO off, shoots rays from camera |
| 2D raycasting | IO 2.5D, e.g. Wolf3D |
| AI image synthesis | "just let AI magic do it" |
| beamtracing | IO off |
| billboarding | OO |
| BSP rendering | 2.5D, e.g. Doom |
| conetracing | IO off |
| "dungeon crawler" | OO 2.5D, e.g. Eye of the Beholder |
| edge list, scanline, span rasterization | IO, e.g. Quake 1 |
| ellipsoid rasterization | OO, e.g. Ecstatica |
| flat-shaded 1 point perspective | OO 2.5D, e.g. Skyroads |
| reverse raytracing (photon tracing) | OO off, inefficient |
| image based rendering | generally using images as 3D data |
| light fields | image-based, similar to holography |
| mode 7 | IO 2.5D, e.g. F-Zero |
| parallax scrolling | 2.5D, very primitive |
| pathtracing | IO off, Monte Carlo, high realism |
| portal rendering | 2.5D, e.g. Duke3D |
| prerendered view angles | 2.5D, e.g. Iridion II (GBA) |
| raymarching | IO off, e.g. with SDFs |
| raytracing | IO off, recursive 3D raycasting |
| segmented road | OO 2.5D, e.g. Outrun |
| shear warp rendering | IO, volumetric |
| splatting | OO, rendering with 2D blobs |
| texture slicing | OO, volumetric, layering textures |
| triangle rasterization | OO, traditional in GPUs |
| voxel space rendering | OO 2.5D, e.g. Comanche |
| wireframe rendering | OO, just lines |

TODO: Rescue On Fractalus!

TODO: find out how build engine/slab6 voxel rendering worked and possibly add it here (from http://advsys.net/ken/voxlap.htm seems to be based on raycasting)

TODO: VoxelQuest has some innovative voxel rendering, check it out (https://www.voxelquest.com/news/how-does-voxel-quest-work-now-august-2015-update)

3D Rendering Basics For Nubs

If you're a complete noob and are asking what the essence of 3D is or just how to render simple 3Dish pictures for your game without needing a PhD, here are the very basics. Yes, you can use some 3D engine such as Godot that has all the 3D rendering preprogrammed, but you'll surrender to bloat, you won't really know what's going on and your ability to tinker with the rendering or optimize it will be basically zero... AND you'll miss out on all the fun :) So here we just foreshadow some concepts you should start with if you want to program your own 3D rendering.

The absolute basic thing in 3D is probably perspective, or the concept which says that "things further away look smaller". This is basically the number one thing you need to know and with which you can make simple 3D pictures, even though there are many more effects and concepts that "make pictures look 3D" and which you can potentially study later (lighting, shadows, focus and blur, stereoscopy, parallax, visibility/obstruction etc.). { It's probably possible to make something akin to "3D" even without perspective, just with orthographic projection, but that's just getting to details now. Let's just suppose we need perspective. ~drummyfish }

If you don't have rotating camera and other fancy things, perspective is actually mathematically very simple, you basically just divide the object's size by its distance from the viewer, i.e. its Z coordinate (you may divide by some multiple of Z coordinate, e.g. by 2 * Z to get different field of view) -- the further away it is, the bigger number its size gets divided by so the smaller it becomes. This "dividing by distance" ultimately applies to all distances, so in the end even the details on the object get scaled according to their individual distance, but as a first approximation you may just consider scaling objects as a whole. Just keep in mind you should only draw objects whose Z coordinate is above some threshold (usually called a near plane) so that you don't divide by 0! With this "dividing by distance" trick you can make an extremely simple "3Dish" renderer that just draws sprites on the screen and scales them according to the perspective rules (e.g. some space simulator where the sprites are balls representing planets). There is one more thing you'll need to handle: visibility, i.e. nearer objects have to cover the further away objects -- you can do this by simply sorting the objects by distance and drawing them back-to-front (painter's algorithm).
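The sprite renderer idea described above (divide by distance, skip things too close, draw back-to-front) might be sketched roughly like this. Note the names here (`Sprite`, `ScreenSprite`, `projectSprites`, `NEAR_PLANE`, `FOV_SCALE`) are all made up for this sketch, and the actual drawing of the scaled sprite is left out: the function only computes where and how big each sprite should be drawn, in the order in which it should be drawn.

```c
#include <stdlib.h>

#define NEAR_PLANE 1  // hypothetical threshold: sprites with smaller Z are skipped
#define FOV_SCALE 8   // hypothetical constant affecting "field of view"

typedef struct { int x, y, z, size; } Sprite;      // sprite placed in 3D space
typedef struct { int x, y, size; } ScreenSprite;   // sprite projected to 2D

// qsort comparator: bigger Z (further away) goes first => back-to-front.
static int compareBackToFront(const void *a, const void *b)
{
  int za = ((const Sprite *) a)->z, zb = ((const Sprite *) b)->z;
  return (za < zb) - (za > zb);
}

// Sorts sprites back-to-front (painter's algorithm) and applies perspective.
// Writes the projected sprites to out in drawing order and returns how many
// there are (sprites in front of the near plane are skipped).
int projectSprites(Sprite *sprites, int count, ScreenSprite *out)
{
  int drawn = 0;

  qsort(sprites, count, sizeof(Sprite), compareBackToFront);

  for (int i = 0; i < count; ++i)
  {
    if (sprites[i].z < NEAR_PLANE)
      continue; // too close or behind the viewer, would divide by <= 0

    // "dividing by distance": position and size both shrink with Z
    out[drawn].x = (FOV_SCALE * sprites[i].x) / sprites[i].z;
    out[drawn].y = (FOV_SCALE * sprites[i].y) / sprites[i].z;
    out[drawn].size = (FOV_SCALE * sprites[i].size) / sprites[i].z;
    ++drawn;
  }

  return drawn;
}
```

Drawing the resulting array from index 0 upwards means nearer sprites get drawn last and so cover the further ones.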

Here is some "simple" C code that demonstrates perspective and draws a basic animated wireframe cuboid as ASCII in terminal:

#include <stdio.h>

#define SCREEN_W 50       // ASCII screen width
#define SCREEN_H 22       // ASCII screen height
#define LINE_POINTS 64    // how many points for drawing a line
#define FOV 8             // affects "field of view"
#define FRAMES 30         // how many animation frames to draw

char screen[SCREEN_W * SCREEN_H];

void showScreen(void)
{
  for (int y = 0; y < SCREEN_H; ++y)
  {
    for (int x = 0; x < SCREEN_W; ++x)
      putchar(screen[y * SCREEN_W + x]);

    putchar('\n');
  }
}

void clearScreen(void)
{
  for (int i = 0; i < SCREEN_W * SCREEN_H; ++i)
    screen[i] = ' ';
}

// Draws point to 2D ASCII screen, [0,0] means center.
void drawPoint2D(int x, int y, char c)
{
  x = SCREEN_W / 2 + x;
  y = SCREEN_H / 2 + y;

  // only draw inside the screen (y must be strictly below SCREEN_H)
  if (x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H)
    screen[y * SCREEN_W + x] = c;
}

// Divides coord. by distance taking "FOV" into account => perspective.
int perspective(int coord, int distance)
{
  return (FOV * coord) / distance;
}

void drawPoint3D(int x, int y, int z, char c)
{
  if (z <= 0)
    return; // at or behind camera, don't draw

  drawPoint2D(perspective(x,z),perspective(y,z),c);
}

int interpolate(int a, int b, int n)
{
  return a + ((b - a) * n) / LINE_POINTS;
}

void drawLine3D(int x1, int y1, int z1, int x2, int y2, int z2, char c)
{
  for (int i = 0; i < LINE_POINTS; ++i) // draw a few points to form a line
    drawPoint3D(interpolate(x1,x2,i),interpolate(y1,y2,i),interpolate(z1,z2,i),c);
}

int main(void)
{
  int shiftX, shiftY, shiftZ;

  #define N 12  // side length
  #define C '*'

  // cuboid points:
  //      X                   Y            Z
  #define PA -2 * N + shiftX, N + shiftY,  N + shiftZ
  #define PB 2 * N + shiftX,  N + shiftY,  N + shiftZ
  #define PC 2 * N + shiftX,  N + shiftY,  2 * N + shiftZ
  #define PD -2 * N + shiftX, N + shiftY,  2 * N + shiftZ
  #define PE -2 * N + shiftX, -N + shiftY, N + shiftZ
  #define PF 2 * N + shiftX,  -N + shiftY, N + shiftZ
  #define PG 2 * N + shiftX,  -N + shiftY, 2 * N + shiftZ
  #define PH -2 * N + shiftX, -N + shiftY, 2 * N + shiftZ

  for (int i = 0; i < FRAMES; ++i) // render animation
  {
    clearScreen();

    shiftX = -N + (i * 4 * N) / FRAMES; // animate
    shiftY = -N / 3 + (i * N) / FRAMES;
    shiftZ = 0; 

    // bottom:
    drawLine3D(PA,PB,C); drawLine3D(PB,PC,C); drawLine3D(PC,PD,C); drawLine3D(PD,PA,C);

    // top:
    drawLine3D(PE,PF,C); drawLine3D(PF,PG,C); drawLine3D(PG,PH,C); drawLine3D(PH,PE,C);

    // sides:
    drawLine3D(PA,PE,C); drawLine3D(PB,PF,C); drawLine3D(PC,PG,C); drawLine3D(PD,PH,C);

    drawPoint3D(PA,'A'); drawPoint3D(PB,'B'); // corners
    drawPoint3D(PC,'C'); drawPoint3D(PD,'D');
    drawPoint3D(PE,'E'); drawPoint3D(PF,'F');
    drawPoint3D(PG,'G'); drawPoint3D(PH,'H');

    showScreen();

    puts("press key to animate");
    getchar();
  }

  return 0;
}

One frame of the animation should look like this:

                 E*******************************F
                 * *                       ***   *
                 *  **                 ***       *
                 *   H***************G*          *
                 *   *               *           *
                 *   *               *           *
                 *   *               *           *
                 *   *               *           *
                 *   *               *           *
                 *   *               *           *
                 *   D***************C           *
                 *  **                ***        *
                 *  *                    *       *
                 * *                       **    *
                 ***                         * * *
                 A*******************************B

press key to animate

PRO TIP: It will also help if you learn a bit about photography because 3D usually tries to simulate cameras and 3D programmers adopt many terms and concepts from photography. At least learn the very basics such as focal length, pinhole camera, the "exposure triangle" (shutter speed, aperture, ISO) etc. You should know how focal length is related to FOV, what the "f number" means, how to use exposure settings to increase or decrease things like motion blur and depth of field, what HDR means etc.

Mainstream Realtime 3D

You may have come here just to learn about the typical realtime 3D rendering used in today's games because aside from research and niche areas this kind of 3D is what we normally deal with in practice. This is what this section is about.

These days "game 3D" means a GPU accelerated 3D rasterization done with rendering APIs such as OpenGL, Vulkan, Direct3D or Metal (the last two being proprietary and therefore shit) and higher level engines above them, e.g. Godot, OpenSceneGraph etc. The methods seem to be evolving to some kind of rasterization/pathtracing hybrid, but rasterization is still the basis.

This mainstream rendering uses an object order approach (it blits 3D objects onto the screen rather than determining each pixel's color separately) and works on the principle of triangle rasterization, i.e. 3D models are composed of triangles (or higher polygons which are however eventually broken down into triangles) and these triangles are projected onto the screen according to the position of the virtual camera and laws of perspective. Projecting the triangles means finding the 2D screen coordinates of each of the triangle's three vertices -- once we have these coordinates, we draw (rasterize) the triangle to the screen just as a "normal" 2D triangle (well, with some asterisks).
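For illustration, here is one possible way of rasterizing such a "normal" 2D triangle once its three screen coordinates are known. This is a brute force sketch using so called edge functions, NOT necessarily what any particular GPU or library does. The names (`edge`, `rasterizeTriangle`, `RW`, `RH`) are made up, it tests every pixel of the image (a real rasterizer would at least limit itself to the triangle's bounding box or walk scanlines) and it ignores fill rules for pixels lying exactly on shared edges.

```c
#define RW 12  // hypothetical raster width
#define RH 8   // hypothetical raster height

// Doubled signed area of triangle (a, b, p). Its sign says on which side of
// the edge going from a to b the point p lies.
static int edge(int ax, int ay, int bx, int by, int px, int py)
{
  return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// Fills every pixel inside the triangle (including its border) with c.
// Returns the number of pixels filled.
int rasterizeTriangle(char image[RW * RH],
  int x0, int y0, int x1, int y1, int x2, int y2, char c)
{
  int filled = 0;

  for (int y = 0; y < RH; ++y)
    for (int x = 0; x < RW; ++x)
    {
      int e0 = edge(x0, y0, x1, y1, x, y);
      int e1 = edge(x1, y1, x2, y2, x, y);
      int e2 = edge(x2, y2, x0, y0, x, y);

      // pixel is inside if it's on the same side of all three edges,
      // checking both signs makes this work for both windings
      if ((e0 >= 0 && e1 >= 0 && e2 >= 0) ||
          (e0 <= 0 && e1 <= 0 && e2 <= 0))
      {
        image[y * RW + x] = c;
        ++filled;
      }
    }

  return filled;
}
```

The same three edge values can also be reused as barycentric weights for interpolating depth, colors or texture coordinates across the triangle, which is one reason this approach is popular.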

Additionally things such as z-buffering (for determining correct overlap of triangles) and double buffering are used, which makes this approach very memory (RAM/VRAM) expensive -- of course mainstream computers have more than enough memory but smaller computers (e.g. embedded) may suffer and be unable to handle this kind of rendering. Thankfully it is possible to adapt and imitate this kind of rendering even on "small" computers -- even those that don't have a GPU, i.e. with pure software rendering. For this we e.g. replace z-buffering with painter's algorithm (triangle sorting), drop features like perspective correction, MIP mapping etc. (of course quality of the output will go down).

Also additionally there's a lot of bloat added in such as complex screen space shaders, pathtracing (popularly known as raytracing), megatexturing, shadow rendering, postprocessing, compute shaders etc. This may make it difficult to get into "modern" 3D rendering. Remember to keep it simple.

On PCs the whole rendering process is hardware-accelerated with a GPU (graphics card). GPU is a special hardware capable of performing many operations in parallel (as opposed to a CPU which mostly computes sequentially with low level of parallelism) -- this is ideal for graphics because we can for example perform mapping and drawing of many triangles at once, greatly increasing the speed of rendering (FPS). However this hugely increases the complexity of the whole rendering system, we have to have a special API and drivers for communication with the GPU and we have to upload data (3D models, textures, ...) to the GPU before we want to render them. Debugging gets a lot more difficult. So again, this is bloat, consider avoiding GPUs.

GPUs nowadays are no longer just focusing on graphics, but are kind of a general device that can be used for more than just 3D rendering (e.g. crypto mining, training AI etc.) and can no longer even perform 3D rendering completely by themselves -- for this they have to be programmed. I.e. if we want to use a GPU for rendering, not only do we need a GPU but also some extra code. This code is provided by "systems" such as OpenGL or Vulkan which consist of an API (an interface we use from a programming language) and the underlying implementation in a form of a driver (e.g. Mesa3D). Any such rendering system has its own architecture and details of how it works, so we have to study it a bit if we want to use it.

The important part of a system such as OpenGL is its rendering pipeline. The pipeline is the "path" that data take through the rendering process. Each rendering system, and potentially even each of its versions, may have a slightly different pipeline (but generally all mainstream pipelines somehow achieve rasterizing triangles, the difference is in details of how they achieve it). The pipeline consists of stages that follow one after another (e.g. the mentioned mapping of vertices and drawing of triangles constitute separate stages). A very important fact is that some (not all) of these stages are programmable with so called shaders. A shader is a program written in a special language (e.g. GLSL for OpenGL) running on the GPU that processes the data in some stage of the pipeline (therefore we distinguish different types of shaders based on at which part of the pipeline they reside). In early GPUs stages were not programmable but they became so as to give a greater flexibility -- shaders allow us to implement all kinds of effects that would otherwise be impossible.

Let's see what a typical pipeline might look like, similarly to something we might see e.g. in OpenGL. We normally simulate such a pipeline also in software renderers. Note that the details such as the coordinate system handedness and presence, order, naming or programmability of different stages will differ in any particular pipeline, this is just one possible scenario:

  1. Vertex data (e.g. 3D model space coordinates of triangle vertices of a 3D model) are taken from a vertex buffer (a GPU memory to which the data have been uploaded).
  2. Stage: vertex shader: Each vertex is processed with a vertex shader, i.e. one vertex goes into the shader and one vertex (processed) goes out. Here the shader typically maps the vertex 3D coordinates to the screen 2D coordinates (or normalized device coordinates), usually by transforming the vertex from model space to world space, then to view space and finally to clip space.
  3. Possible optional stages that follow are tessellation and geometry processing (tessellation shaders and geometry shader). These offer possibility of advanced vertex processing (e.g. generation of extra vertices which vertex shaders are unable to do).
  4. Stage: vertex post processing: Usually not programmable (no shaders here). Here the GPU does things such as clipping (handling vertices outside the screen space), primitive assembly and perspective divide (transforming from homogeneous coordinates to traditional cartesian coordinates).
  5. Stage: rasterization: Usually not programmable, the GPU here turns triangles into actual pixels (or fragments), possibly applying backface culling, perspective correction and things like stencil test and depth test (even though if fragment shaders are allowed to modify depth, this may be postponed to later).
  6. Stage: pixel/fragment processing: Each pixel (fragment) produced by rasterization is processed here by a pixel/fragment shader. The shader is passed the pixel/fragment along with its coordinates, depth and possibly other attributes, and outputs a processed pixel/fragment with a specific color. Typically here we perform shading and texturing (pixel/fragment shaders can access texture data which are again stored in texture buffers on the GPU).
  7. Now the pixels are written to the output buffer which will be shown on screen. This can potentially be preceded by other operations such as depth tests, as mentioned above.

Complete Fucking Detailed Example Of Rendering A 3D Model By Hand

WORK IN PROGRESS

{ This turned out to be long as hell, sowwy. ~drummyfish }

This is an example of how two very simple 3D models would be rendered using the traditional triangle rasterization pipeline. Note that this is VERY simplified, it's just to give you an idea of the whole process, BUT if you understand this you will basically get an understanding of it all.

Keep in mind this all can be done just with fixed point, floating point is NOT required.

First we need to say what conventions we'll stick to:

  • We'll be using ROW VECTORS, i.e. we'll be writing vectors like [x,y,z]. Some people rather use column vectors, which then means their matrices are also transposed and they do multiplication in opposite direction etc. Watch out about this, it's quite important to know which convention you're using, because e.g. matrix multiplication is non-commutative (i.e. with matrices A * B does NOT generally equal B * A) and the order you need to multiply in depends on this convention, so be careful.
  • Counterclockwise triangles are front facing, clockwise ones are back facing (invisible).
  • We'll be using LEFT HANDED coordinate systems, i.e. X axis goes right, Y goes up, Z goes forward (right handed system would be the same except Z would go the opposite way -- backwards). Watch out: some systems (for example OpenGL) use the other one. I.e. our coordinate system looks like this:
  Y ^     _
    |    _/| Z
    |  _/
    |_/
    '-------> X

Now let's have a simple 3D model data of a quad. Quad is basically just a square made of four vertices and two triangles, it will look like this:

quadModel:

v3________v2
 |     _/ |
 |   _/   |
 | _/     |
 |/_______|
v0        v1

In a computer this is represented with two arrays: vertices and triangles. Our vertices here are (notice all Z coordinates are zero, i.e. it is a 3D model but it's flat):

quadVertices:

v0 = [-1, -1,  0]
v1 = [ 1, -1,  0]
v2 = [ 1,  1,  0]
v3 = [-1,  1,  0]

And our triangles are (they are indices to the vertex array, i.e. each triangle says which three vertices from the above array to connect to get the triangle):

quadTriangles:

t0 = [0,1,2]
t1 = [0,2,3]

Note the triangles here (from our point of view) go counterclockwise -- this is called winding and is usually important because of so called backface culling -- the order of vertices basically determines which is the front side of the triangle and which is the back side, and rendering systems often just draw the front sides for efficiency (back faces are understood to be on the inside of objects and invisible).
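As a small sketch, the two arrays and a winding check could look like this in C. The array layout and the `isFrontFacing` function are just assumptions for illustration. The check only looks at X and Y and supposes we are viewing the triangle along the positive Z direction with X going right and Y going up (as in our conventions); on a screen where Y goes down the sign would flip.

```c
// Hypothetical in-memory layout of the quad model: an array of vertices and
// an array of triangles that are triples of indices into the vertex array.
static const int quadVertices[4][3] =
{
  {-1, -1, 0}, // v0
  { 1, -1, 0}, // v1
  { 1,  1, 0}, // v2
  {-1,  1, 0}  // v3
};

static const int quadTriangles[2][3] =
{
  {0, 1, 2}, // t0
  {0, 2, 3}  // t1
};

// Returns 1 if vertices a, b, c go counterclockwise (front facing by our
// convention), 0 otherwise. Only X and Y coordinates are used.
int isFrontFacing(const int a[3], const int b[3], const int c[3])
{
  // doubled signed area of the triangle, positive means counterclockwise
  int doubledArea =
    (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);

  return doubledArea > 0;
}
```

Backface culling then simply means skipping every triangle for which a check like this says it's not front facing (in a real renderer this would be done on the projected coordinates, not on model space ones).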

Now the vertex coordinates of the model above are in so called model space -- these are the coordinates that are stored in the 3D model's file, it's the model's "default" state of coordinates. The concept of different spaces is also important because 3D rendering is a lot about just transforming coordinates between different spaces ("frames of reference"). We'll see that the coordinates in model space will later on be transformed to world space, view space, clip space, screen space etc.

OK, so next let's have 2 actual 3D model instances that use the above defined 3D model data. Notice the difference between 3D model DATA and 3D model INSTANCE: instance is simply one concrete, specific model that has its own place in the global 3D world (world space) and will be rendered, while data is just numbers in memory representing some 3D geometry etc. There can be several instances of the same 3D data (just like in OOP there can be multiple instances/objects of a class); this is very efficient because there can be just one 3D data (like a model of a car) and then many instances of it (like many cars in a virtual city). Our two model instances will be called quad0 and quad1.

Each model instance has its own transformation. Transformation says where the model is placed, how it's rotated, how it's scaled and so on -- different 3D engines may offer different kinds of transformations, some may support things like flips, skews, non-uniform scaling, but usually at least three basic transforms are supported: translation (AKA offset, shift, position), rotation and scale. The transforms can be combined, e.g. a model can be shifted, rotated and scaled at the same time. Here we'll just rotate quad0 by 45 degrees (pi/4 radians) around Y (vertical) axis and translate quad1 one unit to the right and 2 units forward (along positive Z, i.e. away from the camera):

quad0:

  • translation = [0,0,0]
  • rotation = [0,pi/4,0]

quad1:

  • translation = [1,0,2]
  • rotation = [0,0,0]
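Applying such a transformation to a vertex (i.e. getting it from model space to world space) might be sketched as follows, with fixed point only. Assumptions of this sketch: 1024 represents 1.0, only rotation around Y followed by translation is handled, and the rotation direction is chosen so that a positive angle moves the positive X side of the model towards positive Z (the opposite direction is an equally valid convention). The names `UNIT`, `Vec3` and `modelToWorld` are made up.

```c
#define UNIT 1024   // fixed point: 1024 represents 1.0 (assumption)
#define SIN_45 724  // sin(pi/4) * UNIT, rounded
#define COS_45 724  // cos(pi/4) * UNIT, rounded

typedef struct { int x, y, z; } Vec3; // coordinates in fixed point

// Rotates v around the Y axis by an angle given by its sine and cosine
// (in fixed point), then adds translation. Y stays unchanged by rotation.
Vec3 modelToWorld(Vec3 v, int sinY, int cosY, Vec3 translation)
{
  Vec3 r;

  r.x = (v.x * cosY - v.z * sinY) / UNIT + translation.x;
  r.y = v.y + translation.y;
  r.z = (v.x * sinY + v.z * cosY) / UNIT + translation.z;

  return r;
}
```

For example v1 = [1,-1,0] of quad0 (passing `SIN_45`, `COS_45` and zero translation) ends up at about [0.707,-1,0.707], while v0 = [-1,-1,0] of quad1 (passing sine 0, cosine `UNIT` and translation [1,0,2]) ends up at [0,-1,2].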

So now we have two model instances in our world. For rendering we'll also need a camera -- the virtual window to our world. Camera is also a 3D object: it doesn't have 3D model data associated but it does have a transformation so that we can, naturally, adjust the view we want to render. Camera also has additional properties like field of view (FOV), aspect ratio (we'll just consider 1:1 here), near and far distances and so on. Our camera will just be shifted two units back (so that it can see the first quad that stays at position [0,0,0]):

camera:

  • translation = [0,0,-2]
  • rotation = [0,0,0]

It is important to mention the near and far planes. Imagine a camera placed at some point in space, looking in certain direction: the volume of space that it sees forms a kind of infinitely long pyramid (whose "steepness" depends on the camera field of view) with its tip at the point where the camera resides. Now for the purpose of rendering we define two planes, perpendicular to the camera's viewing direction, that are defined by the distance from the camera: the near plane (not surprisingly the one that's the nearer of the two) and the far plane. For example let's say our camera will have the near plane 1 unit in front of it and the far plane 5 units in front of it. These planes will CUT OFF anything that's in front of the near plane or beyond the far plane, so that only things that are BETWEEN the two planes will be rendered (you can notice this in games with render distance -- if this is not cleverly masked, you'll see things in the distance suddenly cut off from the view). These two planes will therefore CUT OFF the viewing pyramid so that now it's a six sided, finite-volume shape that looks like a box with the front side smaller than the back side. This is called the view frustum. Nothing outside this frustum will be rendered -- things will basically be sliced by the sides of this frustum.

You may ask WHY do we establish this frustum? Can't we just leave the near and far planes out and render "everything"? Well, firstly it's obvious that having a far cutoff view distance can help performance if you have a very complex model, but this is not the main reason why we have near and far planes. We basically have them for mathematical convenience -- as we'll see for example, perspective mapping means roughly "dividing by distance from camera" and if something was to be exactly where the camera is, we'd be dividing by zero! Attempting to render things that are just very near or on the back side of the camera would also do very nasty stuff. So that's why we have the near plane. In theory we might kind of get away with not having a strict far plane but it basically creates a nice finite-volume that will e.g. allow us to nicely map depth values for the z-buffer. Don't worry if this doesn't make much sense, this is just to say there ARE good reasons for this stuff.

Now let's summarize what we have with this top-down view of our world (the coordinates here are now world space):

                Z
                :
- - - - - - - 3 + - - - - - - - - far plane
                :
                :
              2 +-----*------ quad1
                :
                :
              1 +
                :    quad0
              0 : _/
   .|.....|.....*/....|.....|. X
   -2    -1   _/:     1     2
        .    /  :       .
- - - - -'.=====:=====.'- - - - - near plane
           '.   :   .'
             '. : .'
            -2 '*' camera
                :
                :

NOW actually let's see how to in fact render this. The big picture overview is this:

  1. Get the model instances from model space to world space, i.e. transform their vertex coordinates according to each instance's transformation.
  2. Get the model instances from world space to view space (AKA camera space). View space is the coordinate system of the camera in which the camera sits in the origin (point [0,0,0]) and is looking straight forward along the positive Z axis direction.
  3. Get the model instances from view space to clip space. This applies perspective (deforms objects so that the further away ones become smaller, proportionally to their distance) and transforms everything in the view frustum to a nice cube of fixed size and with walls aligned with principal axes (unlike view frustum itself).
  4. Clip everything outside the clip space, i.e. throw away everything outside the box we've got. If some things (triangles) are partially in and partially out, we CLIP them, i.e. we literally cut them into several parts and throw away the parts that aren't in (some simpler renderers just do simpler stuff like throw away anything that sticks outside or just force push the vertices inside but it will look a bit awkward).
  5. Get everything from clip space into screen space, i.e. to actual pixel coordinates of the screen we are rendering to.
  6. Rasterize the triangles of the models between the points we have mapped to the screen now, i.e. literally fill all the triangle pixels so that we get the final image.
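
Step 5 of the list above is the only one that won't be shown with matrices here, so here is a rough C sketch of it. It assumes the X and Y coordinates after clipping are in the range from -1 to 1 and that the pixel [0,0] is the top left corner of the screen (so Y gets flipped); `screenX` and `screenY` are hypothetical names, other renderers may use different conventions.

```c
#include <assert.h>

/* Maps normalized X in range [-1,1] to a pixel column, -1 being the
   leftmost column and 1 the rightmost one. */
int screenX(float x, int screenWidth)
{
  assert(screenWidth > 0);
  return (int) ((x + 1) * 0.5f * (screenWidth - 1) + 0.5f);
}

/* Maps normalized Y in range [-1,1] to a pixel row. Assumes row 0 is at
   the top of the screen, so Y = 1 maps to row 0. */
int screenY(float y, int screenHeight)
{
  assert(screenHeight > 0);
  return (int) ((1 - y) * 0.5f * (screenHeight - 1) + 0.5f);
}
```

The `+ 0.5f` is there just to round to the nearest pixel instead of always rounding down.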

As seen this involves doing many transformations between different spaces. We do these using linear algebra, i.e. with vectors and matrices. Key things here are:

  • We can represent all the transformations that we need using matrices.
  • Every such transformation (translation, rotation, scale, ...) can be represented with one 4x4 matrix.
  • To apply a transformation to a model we simply multiply each of its vertices with the transformation matrix. So if we want to rotate a model by 30 degrees, we make a matrix that represents this rotation and just multiply all the model's vertices and it's done. Pretty elegant!
  • AMAZING stuff: any number of these transformations combined in ANY order can still be represented just by a single 4x4 matrix! You just take the matrix of each of the transformations, multiply them together and you get a matrix that just does all these transformations at once. Duuuude what? Yeah that's right, this is extremely awesome, isn't it? We can basically create a single matrix that combines in it all the above mentioned rendering steps and it will just do everything. This is not only elegant but also very efficient (instead of just moving, rotating and scaling points there and back many times we simply perform ONE matrix multiplication for each vertex and that's it).

You HAVE TO learn how to multiply vector with matrix and matrix with matrix (it's not hard) else you will understand nothing now.
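
In case it helps, both multiplications can be sketched in C like this. The sketch assumes the convention used in this article, i.e. a row vector multiplied from the left side of the matrix, and plain `float` arrays; the names `vecMat` and `matMat` are just picked for this example.

```c
#include <assert.h>

/* Multiplies row vector v by 4x4 matrix m (vector on the left, matrix on
   the right), the result is written to out. */
void vecMat(float v[4], float m[4][4], float out[4])
{
  assert(out != v); /* the result must not overwrite the input */

  for (int col = 0; col < 4; ++col)
  {
    out[col] = 0;

    for (int i = 0; i < 4; ++i)
      out[col] += v[i] * m[i][col];
  }
}

/* Multiplies matrices a * b. With row vectors the resulting matrix applies
   transformation a first and b second. */
void matMat(float a[4][4], float b[4][4], float out[4][4])
{
  assert(out != a && out != b);

  for (int row = 0; row < 4; ++row)
    for (int col = 0; col < 4; ++col)
    {
      out[row][col] = 0;

      for (int i = 0; i < 4; ++i)
        out[row][col] += a[row][i] * b[i][col];
    }
}
```

Try multiplying a scale matrix with a translation matrix in both orders and transforming the same vector by both results, you'll get two different points, which shows that the order of multiplication matters.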

BIG BRAIN MOMENT: homogeneous coordinates. Please DO NOT ragequit, it looks complicated as hell (it is a little bit) but it makes sense in the end, OK? We have to learn what homogeneous coordinates are because we need them to be able to do all the awesome matrix stuff described above. In essence: in 3D space we can perform linear transformations with 3x3 matrices -- linear operations are for example scaling and rotation, BUT some, most importantly translation (shifting an object, which we absolutely NEED), are not linear (but rather affine) so they cannot be performed by a 3x3 matrix. But it turns out that if we use a special kind of coordinates, we CAN do affine 3D transformations with 4x4 matrices, OK? These special coordinates are homogeneous coordinates, and they simply add one extra coordinate, w, to the original x, y and z, while it holds that multiplying all the x, y, z and w components by the same non-zero number does nothing with the point they represent. Let's show it like this:

If we have a 3D point [1,2,3], in homogeneous coordinates we can represent it as [1,2,3,1] or [2,4,6,2] or [3,6,9,3] and so on. That's easy no? So we will ONLY add an additional 1 at the end of our vertex coordinates and that's basically it.
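
A tiny C sketch of converting there and back may make this clearer. The function names are hypothetical and the sketch simply refuses to convert coordinates with w equal to zero, as these don't correspond to any ordinary 3D point.

```c
/* Converts a 3D point to homogeneous coordinates by appending w = 1. */
void toHomogeneous(const float p[3], float out[4])
{
  out[0] = p[0];
  out[1] = p[1];
  out[2] = p[2];
  out[3] = 1;
}

/* Converts homogeneous coordinates back to a 3D point by dividing by w.
   Returns 1 on success, 0 if w is zero (out is then left untouched). */
int fromHomogeneous(const float h[4], float out[3])
{
  if (h[3] == 0)
    return 0;

  for (int i = 0; i < 3; ++i)
    out[i] = h[i] / h[3];

  return 1;
}
```

Feeding [2,4,6,2] or [3,6,9,3] to `fromHomogeneous` gives the same point [1,2,3] in both cases.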

Let's start doing this now!

Firstly let us transform quad0 from model space to world space. For this we construct so called model matrix based on the transformation that the model instance has. Our quad0 is just rotated by pi/4 radians and for this the matrix will look like this (you don't have to know why, you usually just look up the format of the matrix somewhere, but you can derive it, it's EZ):

quad0 model matrix:

       | cos(A) 0 sin(A) 0|    |0.7  0  0.7  0|
 Mm0 = | 0      1 0      0| ~= |0    1  0    0|
       |-sin(A) 0 cos(A) 0|    |-0.7 0  0.7  0|
       | 0      0 0      1|    |0    0  0    1|

Let's see if this works, we'll try to multiply the first model vertex with this matrix (notice we add 1 at the end of the vertex, to convert it to homogeneous coordinates):

                            |0.7   0  0.7 0|
                            |0     1  0   0|
                            |-0.7  0  0.7 0|
                            |0     0  0   1|

v0 * Mm0 = [-1, -1, 0, 1]   [-0.7 -1 -0.7 1]  <-- result

So from [-1,-1,0] we got [-0.7,-1,-0.7] -- looking at the top-down view picture above this seems pretty correct (look at the coordinates of the first vertex). Try it also for the rest of the vertices. Now for the model matrix of quad1 (again, just look up what translation matrix looks like):

quad1 model matrix:

       |1  0  0  0|
 Mm1 = |0  1  0  0|
       |0  0  1  0|
       |1  0  2  1|

Here you can even see that multiplying a vector by this will just add 1 to x and 2 to z, right? Again, try it.
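
If you'd rather have the computer fill in the two matrix "templates" above, it may look something like the following C sketch. It follows the row vector layout used here (translation sits in the last row). To stay free of any library the rotation function takes an already computed cosine and sine of the angle, which is just a choice made for this example, as are the function names.

```c
/* Fills m with the identity matrix (a transformation that does nothing). */
void makeIdentity(float m[4][4])
{
  for (int row = 0; row < 4; ++row)
    for (int col = 0; col < 4; ++col)
      m[row][col] = (row == col) ? 1 : 0;
}

/* Fills m with a matrix that shifts points by [x,y,z]. */
void makeTranslation(float x, float y, float z, float m[4][4])
{
  makeIdentity(m);
  m[3][0] = x;
  m[3][1] = y;
  m[3][2] = z;
}

/* Fills m with a rotation around the Y axis, cosA and sinA are the cosine
   and sine of the rotation angle (assumed to be computed by the caller). */
void makeRotationY(float cosA, float sinA, float m[4][4])
{
  makeIdentity(m);
  m[0][0] = cosA;
  m[0][2] = sinA;
  m[2][0] = -sinA;
  m[2][2] = cosA;
}
```

Calling `makeRotationY(0.7f,0.7f,m)` gives the approximate quad0 matrix and `makeTranslation(1,0,2,m)` the quad1 matrix.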

NEXT, the view matrix (matrix that will transform everything so that it's "in front of the camera") will basically just do the opposite transformation of that which the camera has. Imagine if you shift a camera 1 unit to the right -- that's as if the camera stands still and everything shifts 1 unit to the left. Pretty logical. So our view matrix looks like this (notice it just pushes everything by 2 to the front):

view matrix:

      |1  0  0  0|
 Mv = |0  1  0  0|
      |0  0  1  0|
      |0  0  2  1|

Then we'll need to apply perspective and get everything to the clip space. This will be done with so called projection matrix which will in essence make the x and y distances be divided by the z distance so that further away things will shrink and appear smaller. You can derive the projection matrix, its values depend on the field of view, near and far plane etc., here we'll just copy paste numbers into a "template" for the projection matrix, so here it is:

projection matrix (n = near plane distance, f = far plane distance, r = right projection plane distance = 1, t = top projection plane distance = 1):

      |n/r 0   0            0|   |1  0  0    0|
 Mp = |0   n/t 0            0| = |0  1  0    0|
      |0   0   (f+n)/(f-n)  1|   |0  0  3/2  1|
      |0   0   -2*f*n/(f-n) 0|   |0  0 -5/2  0|

This matrix will basically make the points so that their w coordinates will be such that when in the end we divide all components by it (to convert them back from homogeneous coordinates), we'll get the effect of the perspective (it's basically the "dividing by distance from the camera" that perspective does). That is what the homogeneous coordinates allow us to do. To visually demonstrate this, here is a small picture of how it reshapes the view frustum into the clip space box (it kind of "squishes" all in the back of the frustum pyramid and also squeezes everything to shape it into that little box of the clipping space, notice how the further away objects became closer together -- that's the perspective):

     ___________________ far plane           _____________
     \    A   B   C    /                    |     ABC     |
      \               /                     |             |
       \  D   E   F  /                      |   D  E  F   |
        \           /                       |             |
         \G   H   I/                        |G     H     I|
          \_______/ near plane              |_____________|
           :     :                          :             :
            :   :                           :   screen    :
             : :                            :             :
              * camera
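
Filling in the projection matrix "template" can be sketched in C as follows. The parameters n, f, r and t have the same meaning as above, the function names are made up, and `projectedDepth` is just a small helper added here to check where a given distance ends up after the division by w.

```c
#include <assert.h>

/* Fills m with the projection matrix for near plane distance n, far plane
   distance f and projection plane half-extents r (right) and t (top). */
void makeProjection(float n, float f, float r, float t, float m[4][4])
{
  assert(f != n && r != 0 && t != 0);

  for (int row = 0; row < 4; ++row)
    for (int col = 0; col < 4; ++col)
      m[row][col] = 0;

  m[0][0] = n / r;
  m[1][1] = n / t;
  m[2][2] = (f + n) / (f - n);
  m[2][3] = 1; /* copies the view space z into w */
  m[3][2] = -2 * f * n / (f - n);
}

/* Depth (z after dividing by w) of a point [0,0,z,1] in view space
   transformed by projection matrix m. Assumes the resulting w isn't zero. */
float projectedDepth(float z, float m[4][4])
{
  return (z * m[2][2] + m[3][2]) / (z * m[2][3] + m[3][3]);
}
```

For n = 1 and f = 5 a point on the near plane gets depth -1 and a point on the far plane gets depth 1, i.e. the frustum really gets squeezed into a box of fixed size.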

At this point we have the matrices of the individual transforms, but as we've said, we can combine them into a single matrix. First let's combine the view matrix and projection matrix into a single view-projection matrix by multiplying the two matrices (WATCH OUT: the order of multiplication matters here! It defines in which order the transformations are applied):

view-projection matrix:

                 |1 0 0   0|
 Mvp = Mv * Mp = |0 1 0   0|
                 |0 0 3/2 1|
                 |0 0 1/2 2|

The rendering will begin with quad0, we'll combine its model matrix and the view-projection matrix into a single uber matrix that will just do the whole transformation for this model instance:

quad0 model-view-projection matrix:

                     |0.7  0  21/20  0.7|
 Mm0vp = Mm0 * Mvp = |0    1  0      0  |
                     |-0.7 0  21/20  0.7|
                     |0    0  1/2    2  |

Now we'll just transform all of the model's vertices by multiplying with this matrix, and then we'll convert back from the homogeneous coordinates to "normal" coordinates by dividing all components by w (AKA "perspective divide") like this:

v0: [-1, -1,  0, 1]  (matrix multiplication) =>  [-0.7, -1, -0.55, 1.3]  (w divide) =>  [-0.54, -0.77, -0.42]
v1: [ 1, -1,  0, 1]  (matrix multiplication) =>  [ 0.7, -1,  1.55, 2.7]  (w divide) =>  [ 0.26, -0.37,  0.57]
v2: [ 1,  1,  0, 1]  (matrix multiplication) =>  [ 0.7,  1,  1.55, 2.7]  (w divide) =>  [ 0.26,  0.37,  0.57]
v3: [-1,  1,  0, 1]  (matrix multiplication) =>  [-0.7,  1, -0.55, 1.3]  (w divide) =>  [-0.54,  0.77, -0.42]
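
The two steps applied to each vertex above (multiplication by the combined matrix and the w divide) may be sketched in C like this. It's a minimal sketch with a made up function name that assumes the matrix never produces w equal to zero for the given vertex (which the near plane is supposed to guarantee for anything that gets drawn).

```c
#include <assert.h>

/* Transforms homogeneous vertex v by 4x4 matrix m (row vector times matrix)
   and then does the perspective divide, writing x, y and depth to out. */
void projectVertex(float v[4], float m[4][4], float out[3])
{
  float t[4];

  for (int col = 0; col < 4; ++col)
    t[col] = v[0] * m[0][col] + v[1] * m[1][col] +
      v[2] * m[2][col] + v[3] * m[3][col];

  assert(t[3] != 0); /* we can't divide by zero w */

  for (int i = 0; i < 3; ++i)
    out[i] = t[i] / t[3];
}
```

Passing the quad0 model-view-projection matrix and the four quad vertices to this function should give numbers very close to those listed above.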

And let's also do this for quad1.

quad1 model-view-projection matrix:

                     |1 0  0    0|
 Mm1vp = Mm1 * Mvp = |0 1  0    0|
                     |0 0  3/2  1|
                     |1 0  7/2  4|

and

v0: [-1, -1,  0, 1]  (matrix multiplication) =>  [0, -1, 3.5, 4]  (w divide) =>  [0,   -0.25, 0.875]
v1: [ 1, -1,  0, 1]  (matrix multiplication) =>  [2, -1, 3.5, 4]  (w divide) =>  [0.5, -0.25, 0.875]
v2: [ 1,  1,  0, 1]  (matrix multiplication) =>  [2,  1, 3.5, 4]  (w divide) =>  [0.5,  0.25, 0.875]
v3: [-1,  1,  0, 1]  (matrix multiplication) =>  [0,  1, 3.5, 4]  (w divide) =>  [0,    0.25, 0.875]

Hmmm mkay let's draw the transformed points to an X/Y grid:

     Y
      |
[-1,1]|_______________:_______________[1,1]
      |                               |
      |   v3 +--....                  |
      |      |      '''---+ v2        |
      |      |          .'|           |
      |      |     v3 +:--|---+ v2    |
      |      |      .'|   |.' |       |
    --|      |      : |  :|   |       |--
      |      |    .'  |.' |   |       |
      |      |   : v0 +---|---+ v1    |
      |      |  :         |           |
      |      |.'    ...---+ v1        |
      |   v0 +--''''                  |
      |_______________________________|___X
[-1,-1]               :            [1,-1]

HOLY SHIT IT'S 3D!!1! Magic! In the front we see quad0, rotated slightly around the vertical (Y) axis, behind it is quad1, non-rotated, shifted a bit to the right and smaller because it's further away. This looks very, very good! We're almost there.

Also notice that the points -- now nicely projected onto a 2D X/Y plane -- still have 3 coordinates, i.e. they retain the Z coordinate which now holds their depth, or distance from the camera projection plane kind of. This depth is now in range from -1 (near plane) to 1 (far plane). The depth will be important in actually drawing pixels, to decide which are more in the front and therefore visible (this is the problem of visibility). The depth value can also be used for cool effects like the distance fog and so on.

The work up until now -- i.e. transforming the vertices with matrices -- is what vertex shaders do. Now comes the rasterization part -- here we literally draw triangles (as in individual pixels) between the points we have now mapped on the screen. In systems such as OpenGL this is usually done automatically, you don't have to care about rasterization, however you will have to write the algorithm if you're writing e.g. your own software renderer. Triangle rasterization isn't trivial, it has to be not only efficient but also deal with things such as not leaving holes between adjacent triangles, interpolating triangle attributes and so on. We won't dive deeper into this, let's just suppose we have a rasterization algorithm now. For example rasterizing the first triangle of quad0 may look like this:

  _______________________________
 |                               |
 |      +--....                  |
 |      |      '''---#           |
 |      |          .##           |
 |      |        +####---+       |
 |      |       .#####.' |       |
 |      |      :######   |       |
 |      |     ########   |       |
 |      |   ##########---+       |
 |      |  ###########           |
 |      | ############           |
 |      #######                  |
 |_______________________________|

During this rasterization process the Z coordinates of the mapped vertices are important -- the rasterizer interpolates the depth between the vertices, so it knows the depth of each pixel it creates. This depth is written into so called z-buffer (AKA depth buffer) -- basically an off-screen buffer that stores one numeric value (the depth) for each pixel. Now when let's say the first triangle of quad1 starts to be rasterized, the algorithm compares the rasterized pixel's depth to that stored at the same position in the z-buffer -- if the z-buffer value is smaller (i.e. nearer to the camera), the new pixel mustn't be drawn because it's covered by a previously drawn pixel already (here probably that of the triangle shown in the picture).
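
The z-buffer test itself is very simple, here is a minimal C sketch of it. It assumes a tiny fixed-size screen, depth in the range from -1 (near) to 1 (far) as described above and one byte per pixel for color; all the names and the resolution are picked just for this example.

```c
#include <assert.h>

#define SCREEN_W 16
#define SCREEN_H 16

float zBuffer[SCREEN_W * SCREEN_H];             /* depth of each pixel */
unsigned char frameBuffer[SCREEN_W * SCREEN_H]; /* color of each pixel */

/* Has to be called before rendering each frame. */
void clearBuffers(void)
{
  for (int i = 0; i < SCREEN_W * SCREEN_H; ++i)
  {
    zBuffer[i] = 1;     /* far plane depth, i.e. nothing drawn yet */
    frameBuffer[i] = 0; /* background color */
  }
}

/* Draws a pixel only if nothing nearer has been drawn at the same place
   yet. Returns 1 if the pixel was drawn, otherwise 0. */
int drawPixel(int x, int y, float depth, unsigned char color)
{
  assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);

  int index = y * SCREEN_W + x;

  if (zBuffer[index] <= depth)
    return 0; /* covered by a previously drawn pixel */

  zBuffer[index] = depth;
  frameBuffer[index] = color;
  return 1;
}
```

Thanks to this test the triangles can be drawn in any order and the nearest one still ends up visible in the end.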

So the rasterization algorithm just shits out individual pixels and hands them over to the fragment shader (AKA pixel shader). Fragment shader is a program that just takes a pixel and says what color it should have (basically) -- this is called shading. For this the rasterizer also hands over additional info to the fragment shader which may include: the X/Y coordinates of the pixel, its interpolated depth (that used in z-buffer), vertex normals, ID of the model and triangle and, very importantly, the barycentric coordinates. These are three-component coordinates that say where exactly the pixel is in the triangle. These are used mainly for texturing, i.e. if the model we're rendering has a texture map (so called UV map) and a bitmap image (texture), the fragment shader will use the UV map and barycentric coords to compute the exact pixel of the texture that the rasterized pixel falls onto AND this will be the pixel's color. Well, not yet actually, there are more things such as lighting, i.e. determining what brightness the pixel should have depending on how the triangle is angled towards scene lights (for which we need the normals), how far away from them it is, what colors the lights have etc. And this is not nearly all, there are TONS and tons of other things, for example the interpolation done by the rasterizer has to do perspective correction (linearly interpolating in screen space looks awkward), then there is texture filtering to prevent aliasing (see e.g. mipmapping), transparency, effects like bump mapping, environment mapping, screen space effects, stencil buffer etc. -- you can read whole books about this. That's beyond the scope of this humble tutorial -- in simple renderers you can get away with ignoring a lot of this stuff, you can just draw triangles filled with constant color, or even just draw lines to get a wireframe renderer, all is up to you. But you can see it is a bit bloated if everything is to be done correctly -- don't forget there also exist other ways of rendering, see for example raytracing which is kind of easier.
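
For the curious, the barycentric coordinates mentioned above can be computed for example like in the following C sketch. It works purely in 2D screen space and interpolates linearly, i.e. it does NOT do the perspective correction, so take it only as a simple illustration; the function names are hypothetical.

```c
#include <assert.h>

/* Computes barycentric coordinates of 2D point p inside triangle a, b, c.
   out[0], out[1] and out[2] are the weights of a, b and c, they always
   add up to 1. The triangle must not be degenerate (zero area). */
void barycentric(float px, float py, float ax, float ay,
  float bx, float by, float cx, float cy, float out[3])
{
  /* twice the signed area of the whole triangle */
  float area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);

  assert(area != 0);

  /* weight of a vertex = area of the subtriangle opposite to it */
  out[0] = ((bx - px) * (cy - py) - (cx - px) * (by - py)) / area;
  out[1] = ((cx - px) * (ay - py) - (ax - px) * (cy - py)) / area;
  out[2] = 1 - out[0] - out[1];
}

/* Point is inside the triangle if none of its weights is negative. */
int insideTriangle(const float w[3])
{
  return w[0] >= 0 && w[1] >= 0 && w[2] >= 0;
}

/* Interpolates a per-vertex value (depth, U or V texture coordinate,
   ...) with given barycentric weights. */
float interpolate(const float w[3], float atA, float atB, float atC)
{
  return w[0] * atA + w[1] * atB + w[2] * atC;
}
```

To get a texture pixel you'd call `interpolate` twice, once with the U coordinates of the three vertices and once with the V coordinates.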

See Also