The Shee's lost knowledge: Reversing and Parsing C2 allchemicals.str file format

Alright, we already covered how to obtain an up-to-date list of chemical numbers and names inside our programs by extracting it from the C1 game files.

Obviously this ability has a lot of practical value for C2 too, so let's see how we could do the same thing there.
I haven't found the allchemicals.str file format described anywhere so far, but the format is simple enough that reversing it should be Norn's play for us at this point :)

Our most likely candidate for that information (if you have ever tried opening a couple of Creatures game files in a text editor) is the "allchemicals.str" file from the game directory.
Depending on your country, you might be more interested in the "allchemicalsDE.str" or "allchemicalsFR.str" files, which contain the same information in other languages.

So what do we get by opening the file in a hex editor? (I'm using HxD by the way; it's full of useful features such as diffing files, or browsing the memory of a running process rather than a file.)

The first thing we notice is the various chemical names, each separated from the next by a single byte.
This should look familiar to you if you followed the preceding file format parsing articles.

0A in hex is 10, which perfectly matches the length of the "Loneliness" string.

A quick check confirms that the 1-byte value before each readable string matches its length.
As we've seen before, this is what the CDN calls "Cstring" entries, and they are pretty easy to read out.
There doesn't seem to be any kind of additional ID or index per entry whatsoever, which is consistent with our knowledge of Creatures file formats. We can safely assume that those entries are read out sequentially and numbered as we go.
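To make the layout concrete, here is a minimal sketch that decodes such length-prefixed entries and numbers them in reading order. It is written for Python 3, the sample bytes are hand-built rather than copied from the real file, and `read_entries` is just an illustrative helper name.

```python
import io
import struct

def read_entries(stream):
    """Yield each length-prefixed entry from a binary stream until EOF."""
    while True:
        prefix = stream.read(1)
        if not prefix:  # no more bytes: end of data
            return
        (length,) = struct.unpack("B", prefix)
        yield stream.read(length)

# Hand-built sample: 0x04 "Pain", then 0x0A "Loneliness"
sample = io.BytesIO(b"\x04Pain\x0aLoneliness")
entries = list(read_entries(sample))

# No ID is stored with an entry: its number is simply its position
for number, name in enumerate(entries):
    print(number, name.decode("ascii"))
```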

So, what else is in there if we consider all those text entries understood?

If we jump to the end of the file, we can see that it ends abruptly at the "Antigen7" entry.
There doesn't seem to be any chemical description included further down in this file, as was the case with the C1 chemicals.txt file. That's one less thing for us to worry about.

The batch of 0's in the middle of the file looks like a collection of empty records.
We can confirm that by firing up the C2 genetics kit, and going to the biochemistry tab.
Sure enough, it shows that a whole set of chemicals is unused between "Upatrophin" and "Histamine A", which pretty much confirms our guess.
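Under that reading, each 00 byte is simply an entry of length zero, so it still takes up a chemical number. A small sketch on made-up bytes (the names "A" and "B" and the size of the gap are invented for illustration; Python 3):

```python
import io

def read_names(stream):
    """Read 1-byte-length-prefixed entries until the stream runs out."""
    names = []
    while True:
        prefix = stream.read(1)
        if not prefix:
            break
        names.append(stream.read(prefix[0]))  # prefix[0] is the length as an int
    return names

# "A", three empty records (00 00 00), then "B"
names = read_names(io.BytesIO(b"\x01A\x00\x00\x00\x01B"))

# Empty records keep their slot, so "B" ends up as number 4, not 1
unused = [number for number, name in enumerate(names) if name == b""]
print(names)
print(unused)
```

This is why the run of zeros cannot just be skipped as padding: dropping it would shift the numbers of every chemical that follows.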

...

What's left then?
If we jump right at the beginning of the file, we can see there are 2 unknown bytes of data before the first entry:

"<Nothing>" is 9 bytes long, but what's "00 01"?

It would have been a reasonable place to look for information such as a number of entries, file size or some kind of header.

There aren't many ways to interpret "00 01" though. Whichever way you choose to read it (as 2 bytes or as one word, in little or big endian order...), it doesn't seem to contain any relevant information.
Also, it's always the same value across the localised variants of the file (so it's definitely not a "language" flag).
We can safely assume it's either unused or some kind of header meant to identify the file type (which is also how the openc2e developers seem to have treated it).
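For reference, the candidate readings of those two bytes can be listed with a few lines of Python 3. This is only a sketch of the interpretations mentioned above, using the bytes in the order they appear in the hex editor:

```python
import struct

header = bytes([0x00, 0x01])  # the two bytes, in file order

as_bytes = struct.unpack("BB", header)          # two separate 8-bit values
little_endian = struct.unpack("<H", header)[0]  # one 16-bit word, low byte first
big_endian = struct.unpack(">H", header)[0]     # one 16-bit word, high byte first

print(as_bytes, little_endian, big_endian)  # (0, 1) 256 1
```

Whether any of these values lines up with something in the file, such as the number of entries, is easy to check by comparing it against the length of the parsed list.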

Is that all?
It seems so :)
The file doesn't seem to contain anything more than a 2-byte header and a succession of "Cstring" type entries. (I'd still rather call those Pascal strings... but whatever...)
We already know how to read those out in Python from our preceding articles, so the whole parsing of the file should be pretty quick, as most of this is code we've already used:

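import struct

def readword(readfromfile):
    # 16-bit unsigned value, read as little endian whatever the host CPU is
    return struct.unpack("<H", readfromfile.read(2))[0]

def readCstring(readfromfile):
    # One entry: a 1-byte length followed by that many bytes of text
    byte = readfromfile.read(1)
    if not byte:  # nothing left to read: end of file
        return None
    strlength = struct.unpack("B", byte)[0]
    if strlength == 0xff:  # long string: the real length follows as a word
        strlength = readword(readfromfile)
    # Decode so that names are text on both Python 2 and 3
    # (latin-1 is an assumption; it maps every byte to a character)
    return readfromfile.read(strlength).decode("latin-1")

Chemicals = []
with open("allchemicals.str", "rb") as chemfile:
    Header = readword(chemfile)  # the 2 unknown bytes, read and ignored

    entry = readCstring(chemfile)
    while entry is not None:
        Chemicals.append(entry)
        entry = readCstring(chemfile)

for num, chem in enumerate(Chemicals):
    # skip unused chemicals, which are either empty strings or the chemical
    # number itself (that's how the genkit names unknown chemicals)
    if chem != "" and chem != str(num):
        print(str(num) + ": " + chem)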

You can get the example script here.

This of course produces the expected result:

0: <nothing>
1: Pain
2: Need for Pleasure
3: Hunger
4: Coldness
5: Hotness
6: Tiredness
7: Sleepiness
8: Loneliness
9: Crowded
10: Fear
11: Boredom
12: Anger
13: Sex Drive
14: Injury
....

Comparing our results to a known reference (the C2 genetics kit) shouldn't bring any surprises:

And chemical names can now be easily retrieved from the list by indexing them by number:

>>> print(Chemicals[1])
Pain
>>> print(Chemicals[72])
Glycogen
>>> print(Chemicals[170])
Alcohol
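Since some slots are unused, a lookup may want to fall back to the bare number, the way the genkit labels unknown chemicals. A possible helper, assuming names have been loaded as text into a list (the function name `chemical_name` and the shortened list with its artificial gap are made up for this example):

```python
def chemical_name(chemicals, number):
    """Return the name stored for a chemical number, or the number itself
    as text when the slot is out of range or empty."""
    if 0 <= number < len(chemicals) and chemicals[number]:
        return chemicals[number]
    return str(number)

# Shortened, made-up list with an artificial empty slot at number 3
chemicals = ["<nothing>", "Pain", "Need for Pleasure", ""]

print(chemical_name(chemicals, 1))   # Pain
print(chemical_name(chemicals, 3))   # 3
print(chemical_name(chemicals, 99))  # 99
```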

It also works on "la liste des produits chimiques" und "die Chemikalienliste":

1: Douleur                1: Schmerz
2: Besoin de plaisir      2: Genußbedürfnis
3: Faim                   3: Hunger
4: Sensation de froid     4: Kälte
5: Sensation de chaleur   5: Hitze
6: Fatigue                6: Müdigkeit
7: Endormissement         7: Schläfrigkeit
8: Solitude               8: Einsamkeit
9: Bondé                  9: Beengtheit
10: Peur                  10: Angst
11: Ennui                 11: Langeweile
...
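One thing to watch with the localised files is the accented characters. Each entry is raw bytes, so turning it into text means picking an encoding. The sketch below assumes a single-byte Western European code page (Windows-1252); that is a guess based on the names above, not something the file states, and the byte strings are hand-written rather than read from the game files (Python 3):

```python
# Hand-written byte strings standing in for entries read from the file
raw_entries = [b"Bond\xe9", b"Gen\xdfbed\xfcrfnis", b"K\xe4lte"]

# cp1252 is an assumption; each accented letter occupies exactly one byte
names = [raw.decode("cp1252") for raw in raw_entries]

for raw, name in zip(raw_entries, names):
    # The length prefix counts bytes, which here equals the character count
    print(len(raw), name)
```

With a single-byte encoding the 1-byte length prefix and the number of characters agree, which would not hold for a multi-byte encoding such as UTF-8.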

If you have been following my last articles, you should already know where all of this is going.
We've acquired the ability to extract the list of chemical concentrations found inside a creature at death, and we can now match chemical numbers to meaningful names from up-to-date game data.

Whoever guessed that the next article will be about writing a C2 autopsy tool wins a carrot!

Friday, January 24, 2014
