This commit is contained in:
Miloslav Ciz 2024-03-08 16:56:58 +01:00
parent f920115b91
commit 75f99e7b81
69 changed files with 2008 additions and 1791 deletions

View file

@ -110,7 +110,7 @@ uint32_t strHash(const char *s)
TODO: more
BONUS: Here is a kind of string *pseudohash* for identifiers made only of character `a-z`, `A-Z`, `0-9` and `_`, not starting with digit -- it may be useful for symbol tables in compilers. It is parametrized by length *n*, which must be greater than 4. It takes an arbitrary length identifier in this format and outputs another string, also in this format (i.e. also being this kind of identifier), of maximum length *n - 1* (last place being reserved for terminating zero), which remains somewhat human readable (and is the same as input if under limit length), which may be good e.g. for debugging and transpiling (in transpilation you can just directly use these pseudohashes from the table as identifiers). In principle it works something like this: the input characters are cyclically written over and over to a buffer, and when the limit length is exceeded, a three character hash (made of checksum, "checkproduct" and string length) is written on positions 1, 2 and 3 (keeping the first character at position 0 the same). This means e.g. that the last characters will always be recorded, so if input identifiers differ in last characters (like `myvar1` and `myvar2`), they will always give different pseudohash. Also if they differ in first character, length (modulo something like 64), checksum or "checkproduct", their pseudohash is guaranteed to differ. Basically it should be hard to find a collision. Here is the code: { I found no collisions in my dataset of over 5000 identifiers, for *n = 16*. ~drummyfish }
BONUS: Here is a kind of string *pseudohash* for identifiers made only of character `a-z`, `A-Z`, `0-9` and `_`, not starting with digit -- it may be useful for symbol tables in compilers. It is parameterized by length *n*, which must be greater than 4. It takes an arbitrary length identifier in this format and outputs another string, also in this format (i.e. also being this kind of identifier), of maximum length *n - 1* (last place being reserved for terminating zero), which remains somewhat human readable (and is the same as input if under limit length), which may be good e.g. for debugging and transpiling (in transpilation you can just directly use these pseudohashes from the table as identifiers). In principle it works something like this: the input characters are cyclically written over and over to a buffer, and when the limit length is exceeded, a three character hash (made of checksum, "checkproduct" and string length) is written on positions 1, 2 and 3 (keeping the first character at position 0 the same). This means e.g. that the last characters will always be recorded, so if input identifiers differ in last characters (like `myvar1` and `myvar2`), they will always give different pseudohash. Also if they differ in first character, length (modulo something like 64), checksum or "checkproduct", their pseudohash is guaranteed to differ. Basically it should be hard to find a collision. Here is the code: { I found no collisions in my dataset of over 5000 identifiers, for *n = 16*. ~drummyfish }
```
char numPseudohash(unsigned char c)
@ -173,4 +173,4 @@ Here are some example inputs and output strings:
"TPE_bodyEnvironmentResolveCollision" -> "TxMJCollisionve"
"_TPE_body2Index" -> "_TPE_body2Index"
"_SAF_preprocessPosSize" -> "_RpwPosSizecess"
```
```