Update

2023-12-23 19:56:56 +01:00 · 2023-12-23 19:56:56 +01:00 · 8312cd92c6
commit 8312cd92c6
parent b6b5090c4c
13 changed files with 139 additions and 113 deletions
--- a/hash.md
+++ b/hash.md
@@ -88,7 +88,7 @@ uint16_t hash(uint16_t n)
 }
 ```

-Here is a nice string hash, works even for short strings, all bits look pretty random: { Made by me. ~drummyfish }
+Here is a nice string hash; it works even for short strings and all bits look pretty random: { Made by me. I tested this on my dataset of programming identifiers; on average there was one colliding pair of strings in 1000. ~drummyfish }

 ```
 uint32_t strHash(const char *s)
@@ -108,4 +108,69 @@ uint32_t strHash(const char *s)
 }
 ```

-TODO: more
+TODO: more
+
+BONUS: Here is a kind of string *pseudohash* for identifiers made only of the characters `a-z`, `A-Z`, `0-9` and `_`, not starting with a digit -- it may be useful for symbol tables in compilers. It is parametrized by length *n*, which must be greater than 4. It takes an arbitrary length identifier in this format and outputs another string, also in this format (i.e. also being this kind of identifier), of maximum length *n - 1* (the last place being reserved for the terminating zero). The output remains somewhat human readable (and is the same as the input if the input fits within the limit), which may be good e.g. for debugging and transpiling (in transpilation you can just directly use these pseudohashes from the table as identifiers). In principle it works like this: the input characters are written to a buffer, wrapping around to position 4 whenever the limit length is reached, and if the limit was exceeded, a three character hash (made of checksum, "checkproduct" and string length) is written on positions 1, 2 and 3 (keeping the first character at position 0 the same). This means e.g. that the last characters will always be recorded, so if input identifiers of the same length differ in their last characters (like `myvar1` and `myvar2`), they will always give different pseudohashes. Also, for inputs over the limit, if they differ in the first character, or in length, checksum or "checkproduct" (each taken modulo 64), their pseudohashes are guaranteed to differ. Basically it should be hard to find a collision. Here is the code: { I found no collisions in my dataset of over 5000 identifiers, for *n = 16*. ~drummyfish }
+
+```
+char numPseudohash(unsigned char c)
+{
+  c %= 64;
+
+  if (c < 26)
+    return 'a' + c;
+  else if (c < 52)
+    return 'A' + (c - 26);
+  else if (c < 62)
+    return '0' + (c - 52);
+ 
+  return '_';
+}
+
+void pseudohash(char *s, int n)
+{
+  unsigned char
+    v1 = 0,     // checksum
+    v2 = 0,     // "checkproduct"
+    v3 = 0,     // character count
+    pos = 0;
+
+  const char *s2 = s;
+
+  while (*s2)
+  {
+    if (pos >= n - 1)
+      pos = 4;
+
+    v1 += *s2;
+    v2 = (v2 + 1) * (*s2);
+    v3++;
+
+    s[pos] = *s2;
+
+    pos++;
+    s2++;
+  }
+
+  if (v3 != pos) // write position wrapped, i.e. input was longer than n - 1
+  {
+    s[1] = numPseudohash(v1);
+    s[2] = numPseudohash(v2);
+    s[3] = numPseudohash(v3);
+  }
+
+  s[n - 1] = 0; // always written, so s must point to at least n bytes
+}
+```
+
+Here are some example inputs and output strings:
+
+```
+"CMN_DES"                             -> "CMN_DES"
+"CMN_currentInstrTypeEnv"             -> "CBcxrTypeEnvnst"
+"LONG_prefix_my_variable1"            -> "L4kyvariable1y_"
+"TPE_DISTANCE"                        -> "TPE_DISTANCE"
+"TPE_bodyEnvironmentResolveCollision" -> "TxMJCollisionve"
+"_TPE_body2Index"                     -> "_TPE_body2Index"
+"_SAF_preprocessPosSize"              -> "_RpwPosSizecess"
+```
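
The following is a minimal usage sketch, not part of the commit, showing one way the in-place `pseudohash` from the listing above might be called safely. The two functions are copied from the listing; the helper `pseudohashOf` and the constant `PSEUDOHASH_BUF` are hypothetical names introduced here only for illustration. The sketch assumes the caller's buffer must hold at least *n* bytes, since the function writes the terminating zero at index *n - 1* even for short inputs.

```c
#include <string.h>

/* Copied from the listing above: maps a byte to one identifier character. */
char numPseudohash(unsigned char c)
{
  c %= 64;

  if (c < 26)
    return 'a' + c;
  else if (c < 52)
    return 'A' + (c - 26);
  else if (c < 62)
    return '0' + (c - 52);

  return '_';
}

/* Copied from the listing above: rewrites s in place. */
void pseudohash(char *s, int n)
{
  unsigned char
    v1 = 0,     // checksum
    v2 = 0,     // "checkproduct"
    v3 = 0,     // character count
    pos = 0;

  const char *s2 = s;

  while (*s2)
  {
    if (pos >= n - 1)
      pos = 4;

    v1 += *s2;
    v2 = (v2 + 1) * (*s2);
    v3++;

    s[pos] = *s2;

    pos++;
    s2++;
  }

  if (v3 != pos)
  {
    s[1] = numPseudohash(v1);
    s[2] = numPseudohash(v2);
    s[3] = numPseudohash(v3);
  }

  s[n - 1] = 0;
}

#define PSEUDOHASH_BUF 256 /* hypothetical size limit chosen for this sketch */

/* Hypothetical helper: copies the input into a static buffer that is large
   enough for both the input and the n bytes pseudohash may touch, then
   hashes the copy. Returns NULL if the arguments do not fit. The result is
   overwritten by the next call. */
const char *pseudohashOf(const char *in, int n)
{
  static char buf[PSEUDOHASH_BUF];
  size_t len = strlen(in);

  if (n <= 4 || n > PSEUDOHASH_BUF || len >= PSEUDOHASH_BUF)
    return NULL;

  memcpy(buf, in, len + 1); /* include the terminating zero */
  pseudohash(buf, n);

  return buf;
}
```

With this helper, `pseudohashOf("CMN_currentInstrTypeEnv", 16)` yields `"CBcxrTypeEnvnst"` and `pseudohashOf("CMN_DES", 16)` returns the input unchanged, in line with the example table above. A symbol table would typically copy the result out of the static buffer before the next call.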