Writing a FUSE filesystem for viewing Ren'Py archives
Ren’Py is an open source visual novel engine used by a lot of games, including Slay the Princess and Doki Doki Literature Club.
Often, Ren’Py VNs store game assets in the engine’s own archive format, and while extractors for it exist, I would much rather mount rpa archives as virtual filesystems. Disk space is cheap, but taking up no extra space at all is still the better option, especially since I rarely look at assets, and only when I’m curious about some detail.
On Linux, FUSE provides an option to write file systems that operate in user space, so that’s what I’ll be using in this post.
First things first, I should probably familiarize myself with rpa archives, so it’s time to clone the Ren’Py codebase. Some git grepping and finding files later, I’m looking at launcher/game/archiver.py, which seems promising!
Ren’Py archives, which have the file extension rpa, start, like most formats, with a magic value, which seems to be the ASCII string RPA-3.0, followed by the offset of the zlib-compressed file index and the encryption key, both written as hexadecimal text, and finally a newline character. Here I should note that Ren’Py’s own documentation states that this format is meant to prevent casual copying, but isn’t very secure.
It’s already possible to start writing some code, to parse this first part of the file:
|
|
Here is the output from running it on a random archive I have lying around, confirming it works:
RPA-3.0 0000000041fc7a26 42424242
compressed_index_offset=41fc7a26 xor_key=42424242
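For readers who want something concrete, here is a minimal sketch of how such a header line could be parsed. It is not the code from the tool itself: the names rpa_header and parse_rpa_header are made up for illustration, and it assumes the three fields are separated by single spaces, as in the output above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the two values found in the header line. */
struct rpa_header {
    uint64_t index_offset; /* where the zlib-compressed index starts */
    uint32_t xor_key;      /* key used to obfuscate offsets and sizes */
};

/* Parses "RPA-3.0 <hex offset> <hex key>\n".
 * Returns 0 on success and -1 if the line does not look like an RPA-3.0 header. */
static int parse_rpa_header(const char *line, struct rpa_header *out)
{
    static const char magic[] = "RPA-3.0 ";
    const size_t magic_len = sizeof magic - 1;

    if (strncmp(line, magic, magic_len) != 0)
        return -1;

    /* Both fields are hexadecimal numbers written as text. */
    if (sscanf(line + magic_len, "%" SCNx64 " %" SCNx32,
               &out->index_offset, &out->xor_key) != 2)
        return -1;

    return 0;
}
```

Feeding it the line shown above should yield an index offset of 0x41fc7a26 and a key of 0x42424242.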
Next, it’s time to decompress the file index, which Ren’Py compresses with a plain call to Python’s zlib.compress() function. Since most Linux distros already ship zlib, I won’t try to roll my own here; maybe in another post. This is what I ended up with after an hour or so of coding[1], based on the zlib documentation and usage example:
|
|
This isn’t the nicest API; I should of course leave the fprintf calls to the caller, but this works for now.
Ren’Py actually compresses a pickle.dumps() of the index, using the highest pickle protocol supported by the Python version Ren’Py runs on. This is convenient for them, as they can just call pickle.loads(zlib.decompress(file_index)), but for our purposes it means further processing. The pickled data is essentially a dictionary of lists, where the dictionary is keyed by file name and each list contains exactly one tuple. The tuples are triplets of (offset ^ secret_key, file_size ^ secret_key, b"").
I am not sure why the trailing empty bytestring is needed, but we’ll have to handle it as well. You can also see that the previously extracted key is used to XOR the values.
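Undoing the obfuscation is a single XOR per field. A minimal sketch follows; rpa_entry and decode_entry are hypothetical names, not taken from the actual tool.

```c
#include <stdint.h>

/* Hypothetical decoded index entry. */
struct rpa_entry {
    uint64_t offset; /* position of the asset inside the archive */
    uint64_t size;   /* length of the asset in bytes */
};

/* XOR is its own inverse, so applying the key a second time
 * recovers the values Ren'Py obfuscated when writing the index. */
static struct rpa_entry decode_entry(uint64_t raw_offset, uint64_t raw_size,
                                     uint32_t key)
{
    struct rpa_entry e = { raw_offset ^ key, raw_size ^ key };
    return e;
}
```

With the key 0x42424242, a stored offset of 0x42413E10 decodes to 0x37C52.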
Ren’Py 8.4.1, the latest version available at the time of writing, utilizes Python 3.12. The latest pickle data format was introduced in Python 3.8, so that is my target.
The closest I could find to a specification of the actual opcodes was CPython’s implementation. There’s also pickletools.dis, which prints the opcodes of a pickled buffer:
|
|
Disassembling a data dump, I can confirm that my particular archive uses protocol version 5:
|
|
However, take a look at this snippet of my data dump:
66: \x8c SHORT_BINUNICODE 'images/e1/d1_drive.webp'
91: \x94 MEMOIZE (as 5)
92: ] EMPTY_LIST
93: \x94 MEMOIZE (as 6)
94: J BININT 1111571984
99: J BININT 1111801854
104: h BINGET 3
106: \x87 TUPLE3
107: \x94 MEMOIZE (as 7)
108: a APPEND
109: \x8c SHORT_BINUNICODE 'images/e1/e1i1.webp'
130: \x94 MEMOIZE (as 8)
131: ] EMPTY_LIST
132: \x94 MEMOIZE (as 9)
133: J BININT 1112029277
138: J BININT 1111674630
143: h BINGET 3
145: \x87 TUPLE3
146: \x94 MEMOIZE (as 10)
147: a APPEND
148: \x8c SHORT_BINUNICODE 'images/e1/e1i10.webp'
170: \x94 MEMOIZE (as 11)
171: ] EMPTY_LIST
172: \x94 MEMOIZE (as 12)
173: J BININT 1112098102
178: J BININT 1112155806
183: h BINGET 3
185: \x87 TUPLE3
186: \x94 MEMOIZE (as 13)
187: a APPEND
The order of the data is always the same: file name, offset, file size. This means we can skip everything that isn’t a SHORT_BINUNICODE or a BININT and extract the pieces of data in this exact order. In other words, the only opcodes we care about are the ones that represent an integer or a string of characters. I’ll list all the opcodes we might care about, as I haven’t bothered to check whether Ren’Py places any restrictions on asset sizes:
|
|
But of course I’m only going to implement BININT and SHORT_BINUNICODE[3]. A BININT opcode is followed by a 32-bit signed integer (least-significant byte first), and SHORT_BINUNICODE is followed by a single-byte length and then the actual string (these aren’t null-terminated). Anyway, here is what I came up with after two hours of digging and coding:
|
|
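As an illustration of how the arguments of these two opcodes can be decoded, here is a minimal sketch. The function names are hypothetical and it is not the tool's actual parser. It relies on the opcode bytes shown in the disassembly above ('J' and 0x8c) and on BININT's argument being a 4-byte, little-endian, signed integer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode bytes as they appear in the disassembly above. */
enum { OP_BININT = 'J', OP_SHORT_BINUNICODE = 0x8c };

/* Decodes the 4-byte little-endian signed argument that follows a BININT
 * opcode. `p` points at the first argument byte. */
static int32_t read_binint(const unsigned char *p)
{
    uint32_t v = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    int32_t s;
    memcpy(&s, &v, sizeof s); /* reinterpret the bits as a signed value */
    return s;
}

/* Copies the string that follows a SHORT_BINUNICODE opcode into `out` and
 * NUL-terminates it. `p` points at the length byte and `avail` is the number
 * of bytes left in the buffer. Returns the number of bytes consumed
 * (1 + length), or 0 if the input is truncated or `out` is too small. */
static size_t read_short_binunicode(const unsigned char *p, size_t avail,
                                    char *out, size_t out_cap)
{
    if (avail < 1)
        return 0;
    size_t len = p[0];
    if (avail < 1 + len || out_cap < len + 1)
        return 0;
    memcpy(out, p + 1, len);
    out[len] = '\0';
    return 1 + len;
}
```

Because the length is a single byte, a 256-byte output buffer is always large enough for a SHORT_BINUNICODE string.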
The code just parses byte-by-byte, but even on an i5-3340, it feels instantaneous, since we are only talking about kilobytes of data. Of course, it might be a fun challenge to try and vectorize it. Anyway, we can confirm that it works with the following dirty snippet:
|
|
This correctly dumps an image from the archive[4].
Now it’s time to turn this into a FUSE file system! Thankfully, FUSE file systems do not have to support every file system operation, which is good, because we only want to allow reading. The documentation consists of the library’s example directory and the Doxygen page for its API. I also used the manual page as a reference. The gist of it is that we have to fill out a struct fuse_operations with all the file system operations we support, and then call fuse_main.
Now, one issue is, file systems are hierarchical, but I just stored everything in a flat array, rather than a directory tree structure, which would make actually showing a directory hierarchy rather tricky. I’ve also never had to implement a directory structure, so it’s about time I got my hands dirty. This will of course explode the number of allocations, but the few extra milliseconds at startup shouldn’t matter. Here’s how I ended up implementing the function which actually populates our directory structure:
|
|
It’s not pretty, and I’ll be honest, I did spend an hour or so with the GDB and printf combo debugging segmentation faults[5]. Each node can be either a regular file or a directory, so, as lots of C codebases do, I used a union and a kind variable (is_dir). The actual algorithm simply processes each component of a file path (as read from a SHORT_BINUNICODE) and creates any nodes that don’t exist yet. All the nodes created are directories, except for the last one.
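The algorithm described above can be sketched roughly as follows. This is an illustrative reimplementation under my own assumptions, not the code from the repository: the struct layout and the names new_node, find_child and add_path are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout: a tagged union selected by is_dir. */
struct node {
    char *name;
    int is_dir;
    union {
        struct { struct node **children; size_t count; } dir;
        struct { uint64_t offset; uint64_t size; } file;
    } u;
};

static struct node *new_node(const char *name, size_t len, int is_dir)
{
    struct node *n = calloc(1, sizeof *n);
    if (!n)
        return NULL;
    n->name = malloc(len + 1);
    if (!n->name) {
        free(n);
        return NULL;
    }
    memcpy(n->name, name, len);
    n->name[len] = '\0';
    n->is_dir = is_dir;
    return n;
}

static struct node *find_child(const struct node *dir, const char *name, size_t len)
{
    for (size_t i = 0; i < dir->u.dir.count; i++) {
        struct node *c = dir->u.dir.children[i];
        if (strlen(c->name) == len && memcmp(c->name, name, len) == 0)
            return c;
    }
    return NULL;
}

/* Inserts a relative path such as "images/e1/a.webp" below `root`, creating
 * missing directories on the way. Returns the file node, or NULL on error. */
static struct node *add_path(struct node *root, const char *path,
                             uint64_t offset, uint64_t size)
{
    struct node *cur = root;
    while (*path) {
        size_t len = strcspn(path, "/");
        if (len == 0) { path++; continue; }      /* skip empty components */
        int last = (path[len] == '\0');
        struct node *child = find_child(cur, path, len);
        if (!child) {
            struct node **grown = realloc(cur->u.dir.children,
                                          (cur->u.dir.count + 1) * sizeof *grown);
            if (!grown)
                return NULL;
            cur->u.dir.children = grown;
            child = new_node(path, len, !last);  /* only the last one is a file */
            if (!child)
                return NULL;
            grown[cur->u.dir.count++] = child;
        }
        if (last) {
            if (child->is_dir)
                return NULL;                     /* name already used by a directory */
            child->u.file.offset = offset;
            child->u.file.size = size;
            return child;
        }
        if (!child->is_dir)
            return NULL;                         /* a file cannot have children */
        cur = child;
        path += len + 1;
    }
    return NULL;
}
```

Growing the child array one element at a time keeps the sketch short; a real implementation might prefer to grow it geometrically.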
Of course, we must also change unpickle_index[6]:
|
|
We also need a way to traverse this tree:
|
|
This function is already libfuse-aware, as all of our file system operations will be passed absolute paths in which the root "/" is the file system’s mount point, like "/images/ch1/background.jpeg". The loop is mostly a simpler version of add_node_to_tree’s. Whenever we encounter a component that isn’t a child of the current node, we return NULL.
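A lookup along those lines might look like the sketch below. To keep it self-contained it uses a simplified, hypothetical node struct (no union) and the made-up name lookup; the traversal logic is the part that matters.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical node: just enough to demonstrate the traversal. */
struct node {
    const char *name;
    int is_dir;
    struct node **children;
    size_t count;
};

/* Resolves an absolute path, where "/" refers to `root` itself.
 * Returns NULL as soon as a component is not a child of the current node. */
static struct node *lookup(struct node *root, const char *path)
{
    struct node *cur = root;
    while (*path) {
        size_t len = strcspn(path, "/");
        if (len == 0) { path++; continue; }   /* step over a '/' separator */
        if (!cur->is_dir)
            return NULL;                      /* files have no children */

        struct node *next = NULL;
        for (size_t i = 0; i < cur->count; i++) {
            const char *n = cur->children[i]->name;
            if (strlen(n) == len && memcmp(n, path, len) == 0) {
                next = cur->children[i];
                break;
            }
        }
        if (!next)
            return NULL;
        cur = next;
        path += len;
    }
    return cur;
}
```

The linear scan over children is fine for archives with a modest number of entries per directory; a sorted array or hash table would be the next step if it ever showed up in a profile.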
And now we can finally wire up libfuse! I’ll start by writing a really basic file system, so I can iterate on that. But first, I have to modify my argument handling. libfuse handles a bunch of command-line options, like -o allow_root to allow the root user to access the files the program exposes, as well as -f to keep the program in the foreground.
|
|
I found libfuse’s hello example a good reference for the argument handling. We specify the RPA file with --archive=, and we also get -f to run the file system in foreground mode, and -d for FUSE-level debugging! Note that libfuse only redirects stderr to the terminal, so when doing any sort of printf debugging, we must use fprintf(stderr, ...) instead.
Now onto the file system… We must fill out a structure of file operation handlers (rpa_ops in the above snippet). Luckily, as I’ve mentioned above, our callbacks only get passed paths that we can already process, which makes our job easier. Based on the example code from the manual page, a basic file system must implement readdir, read, open, and getattr. As far as I understand, getattr handles stat() calls and the like, and the rest are self-explanatory :)
To be able to view directory contents, we need readdir and getattr. Here’s how I implemented them:
|
|
Pretty straightforward so far. Our readdir handler gets a fuse_fill_dir_t, which is a function pointer that we must call to expose each directory entry. After adding the mandatory(?) . and .. paths[7], we just look up the directory in our internal tree and then add all of its entries.
For getattr, we set some attributes. I don’t know if st_nlink must be at least 1, but I set it to that just to be safe, because the example code also had such a statement[8]. I also store the RPA file’s block size in archive_blocksize when stating the archive, so that programs can use it for reading chunks of data. As our FUSE filesystem is read-only, I set directory permissions so that they are readable and executable (this is needed for stat, I believe), and regular file permissions to be read-only. If there’s an error, we must return negative errno values, so for non-existing entries we return -ENOENT.
Now, it’s time to implement the actual file opening and reading.
|
|
Our open handler only has to return 0 if all goes well; the file descriptors are handled by the kernel. We also make sure not to allow anything but read-only open calls. Handling reads is also straightforward: we just seek to the asset and read at most sz bytes. And with that done, our file system works!
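Two small pieces of that logic are easy to get subtly wrong, so here is a sketch of them in isolation. The names check_open_flags and clamp_read are hypothetical, and returning -EACCES for non-read-only opens is my assumption, matching what libfuse’s hello example does.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 for read-only opens and a negative errno value otherwise. */
static int check_open_flags(int flags)
{
    return (flags & O_ACCMODE) == O_RDONLY ? 0 : -EACCES;
}

/* Computes how many bytes a read of `want` bytes at offset `off` may return
 * from an asset that is `asset_size` bytes long, so that a read near the end
 * of one asset never spills into the next asset stored in the archive. */
static size_t clamp_read(uint64_t asset_size, uint64_t off, size_t want)
{
    if (off >= asset_size)
        return 0;                         /* reading at or past the end */
    uint64_t left = asset_size - off;
    return want < left ? want : (size_t)left;
}
```

The actual data would then be read from the archive at the asset’s offset plus off, for example with pread().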
This way I no longer have to unpack gigabyte-large files just to inspect some assets! It was also a nice exercise in string handling, plus I’ve never written a FUSE file system before. Overall, I’m happy with this weekend hack. The program’s source code is available on GitHub under the GPLv3 license. Enjoy!
[1] I have not written a serious C program in probably half a year, so I am quite rusty.
[2] These opcodes are taken from the CPython codebase.
[3] After writing this post, I did some further testing, and my limited implementation ended up biting me. This will of course be fixed in the actual repository of the tool.
[4] Note to self: take screenshots while writing code, not afterwards. If you read further, you’ll see that the flat array is eventually replaced with a tree-like structure, so that snippet no longer works for me, and I’m too lazy to check out a previous commit, so you’ll just have to take my word for it.
[5] Not sure if using Rust would have made this easier, as afaik you need unsafe {} for these kinds of data structures unless you want to bring in a dependency, but you’re welcome to correct me, as I’m very much a novice in Rust.
[6] I noticed some unused variables remained in the code. Oops…
[7] I’m not sure if this is the correct way to handle these paths, as I just wrote it like it was in the example code, but it seems to work.
[8] My assumption is that a file is internally a hard link to its data, but as always, please correct me if I’m wrong.