...making Linux just a little more fun!

Populating a Filesystem with Random Data

By René Pfeiffer

Occasionally, I test filesystems. Especially since Ext4 was taken into the Linux kernel, I've set up a couple of partitions and used them for storing data. Sometimes, you don't want to copy existing data on new filesystems when testing. Why not use random data? Now, where is the tool for doing that? I asked The Answer Gang, and so got another excuse for coding a tool.

Why Random Data?

One of the easier methods of filling a filesystem is to use download tools, extract the content of GNU/Linux distributions (do a standard installation and use the root filesystem), or simply copy parts of /dev/zero by using dd. You can even put backups on the filesystem, and thus fill it sensibly. All these methods work, but copying means you have to take it from somewhere. There's always a source and a destination. I don't want a source; I want to create data out of thin air and just use a single target. Besides, when doing testing of filesystems and something goes wrong, you probably don't want to publish parts of this filesystem in the bug report with your backup data on it. In addition, I just want to give the limits in terms of directory depth, number of files/directories, and maximum number of bytes per file and be done with it.

Using a Bash Script

Soon after my question was sent to the mailing list, Ben answered with a Bash script.

#!/bin/bash
# Created by Ben Okopnik on Wed Jul 16 18:04:33 EDT 2008

########    User settings     ############
MAXDIRS=5
MAXDEPTH=2
MAXFILES=10
MAXSIZE=1000
######## End of user settings ############

# How deep in the file system are we now?
TOP=`pwd|tr -cd '/'|wc -c`

populate() {
	cd $1
	curdir=$PWD

	files=$(($RANDOM*$MAXFILES/32767))
	for n in `seq $files`
	do
		f=`mktemp XXXXXX`
		size=$(($RANDOM*$MAXSIZE/32767))
		head -c $size /dev/urandom > $f
	done

	depth=`pwd|tr -cd '/'|wc -c`
	if [ $(($depth-$TOP)) -ge $MAXDEPTH ]
	then
		return
	fi

	unset dirlist
	dirs=$(($RANDOM*$MAXDIRS/32767))
	for n in `seq $dirs`
	do
		d=`mktemp -d XXXXXX`
		dirlist="$dirlist${dirlist:+ }$PWD/$d"
	done

	for dir in $dirlist
	do
		populate "$dir"
	done
}

populate $PWD

This gets the job done. It uses Bash's built-in pseudo-random-generator (PRNG), then recursively creates and descends into the directory until the maximum depth is reached. The shell is perfectly suited for this. You can also do this in Perl, Python, or any other scripting language.

Using C++

The reasons for writing a C++ version of the script are purely educational. It doesn't hurt to write code. Step by step, you can learn things and improve, which isn't a bad thing. My main intention was to try the program option parsing library of the Boost project. The algorithm is already laid out in the shell script, so all we have to do is to get the user's options, and create the files and directories.

Program Options

The command-line options usually consist of a couple of switches. Some take an argument, some don't. Very often, there's a help option presenting a little help text, describing all possible options and their defaults. The Boost program options library lets you do this very easily. You create a options_description object, and simply describe what you want to have.

popt::options_description desc("Options for fspopulate");
desc.add_options()
    ("debug,d", popt::value<unsigned int>(&opt_debug)->default_value(0),
     "Set debug level of the code.")
    ("dirlevels,D", popt::value<unsigned int>(&opt_dirlevels)->default_value(16),
     "Set maximum number of directory levels.")
    ("help,h", "Show a short help message with explanations.")
    ("logfile,l", popt::value(&opt_logfile), "Log file for recording information.")
    ("maxfilesize", popt::value<unsigned int>(&opt_max_filesize)->default_value(INT_MAX/2),
     "Maximum size of created files.")
    ("maxnamelen", popt::value<unsigned short>(&opt_max_namelength)->default_value(100),
     "Maximum length of file and directory names.")
    ("numdirs", popt::value<unsigned int>(&opt_number_directories)->default_value(512),
     "Maximum number of created directories.")
    ("numfiles", popt::value<unsigned int>(&opt_number_files)->default_value(1024),
     "Maximum number of created files.")
    ("path,p", popt::value(&opt_path), "Path to directory that is to be populated.")
    ("quiet,q", popt::bool_switch(&opt_quiet), "Be quiet about the whole operation.")
;

This sets all options, default values, and data types, and provides a little help string literal. As you can see, it is possible to provide multiple switches such as --debug and -d for setting the debug level. All we have to do now is to parse the command-line vector. You do this by creating a variables_map object and using the store/notify methods.

popt::store(popt::parse_command_line(argc, argv, desc), vm);
popt::notify(vm);

That's it. Now you can access the command-line options through the vm object. Checking for the presence of an option and extracting the value can be done like this:

// Check for help option.
if ( vm.count("help") > 0 ) {
    cout << desc << endl;
    return(rc);
}
// Extract debug level
if ( (vm.count("debug") > 0) or (vm.count("d") > 0) ) {
    opt_debug = vm["debug"].as<unsigned int>();
}

In addition, the Boost program options library allows you to parse the options from a configuration file.

Filenames and Paths

Filenames and paths are basically strings with some extra properties. This is no problem, as long as you write code for a single system. If you are interested in writing portable code, you have to deal with directory separators ("/" or "\"), drive letters, and other peculiarities. The Boost filesystem library helps us to handle filenames and paths transparently. You just use a path object, and the library takes care of the local path translation.

The object path represents a path. The object is aware of the valid directory separator your operating system uses. It knows about concepts such as root directory, and you can easily extract the base name of a file by calling the leaf() method. (You can do more, there is a list showing the path decomposition table that shows what you can extract and how you do it.) You can also use iterators to recursively browse a filesystem. In addition, the library overloads operators, so that constructing path objects is as easy as writing newpath = arg_path / "foo" / "bar". People using script languages will probably yawn at this point. Well, now you can do the same in C++.

Twisting Randomness

Now, we need a decent random source for our code. Using /dev/random is out of the question; this resource is too precious to create random numbers, since the size of the entropy pool is limited. /dev/urandom sounds better, but reading from it also depletes the kernel's entropy pool, albeit much more slowly. Imagine we populate a filesystem with gigabytes of data. This would leave next to no entropy in the pool, although /dev/urandom doesn't block and continues to supply random numbers. If an application simultaneously reads from /dev/random, it has to wait for the entropy pool to fill again. There has to be another way. We really don't need entropy for our tasks, since pseudo-randomness is more than enough for our case. We could use C's lrand48() function, but there are other options. The Mersenne Twister PRNG (=Pseudo RaNdom Generator) is quite fast, and features implementations in C. There's even a version that takes advantage of the hardware optimisations of certain CPUs. Let's try that.

We link our C++ code with the code from the SIMD-oriented Fast Mersenne Twister implementation. It's very easy to do. The code provides the usual functions for initialising the PRNG state and extracting random numbers in 32-bit/64-bit integer or floating point format.

The include file SFMT/SFMT.h in the sample code lists all functions. When using fill_array32() or fill_array64(), you have to be careful. If you use one of the gen_randXX() functions before the fill_arrayXX() functions, you have to reseed the PRNG state, or the code will abort with an exception. This is due to the algorithm, but it took me one hour to spot the hint in the source code comments.

The code can take advantage of the Streaming SIMD Extensions 2 (SSE2) instruction set of modern CPUs. SIMD means Single Instruction, Multiple Data, and its commands can use 128-bit registers that allow faster vector operations. The SSE2 instructions are also very useful if you do floating point arithmetic, since they handle the floating point data differently than does the CPU's FPU. The Mersenne Twister uses vectors, so SSE2 can speed up the internal computation. You just have to compile the code with -DSSE2 if your CPU has SSE2. (You can check this by inspecting /proc/cpuinfo.) If you deal with SIMD code, be careful to align your data structures properly. In SSE2 mode, all bitwise block pointers must be 16-byte aligned to avoid crashes. That's why we use memalign() for allocating memory.

The Mersenne Twister algorithm can use different periods for its pseudo-random numbers. Usually the period is given by the Mersenne exponent (MEXP in the C source) based on a Mersenne prime number. The prime is represented as 2MEXP-1. If you inspect the C source, you will see a list of possible Mersenne primes. They can be chosen at compile time, by setting MEXP accordingly. (SFMT-params.h looks for the symbol and includes the appropriate header file.)

Yes, clearly this is overkill, but, again, this is a nice opportunity to create example code, in case you ever really need these features in other projects.

Recursively Populating

The core of the C++ code is a single function that basically does the same as does Ben's shell script. The only parameters are the initial path and the directory depth. The function calls itself recursively, when entering a new subdirectory. It keeps track of the depth level, and returns as soon as the limit is reached.

namespace fs = boost::filesystem;

void populate( fs::path path, unsigned int level ) {
    unsigned int i;
    unsigned int depth_level;
    string dirname;
    string fullpath;
    unsigned int nr_directories;
    unsigned int nr_files;
    unsigned int size;

    fullpath = path.root_directory() + path.relative_path().string();
    if ( opt_debug > 0 ) {
        cout << "DEBUG: Entering <" << fullpath.c_str() << ">" << endl;
    }
    if ( chdir( fullpath.c_str() ) != 0 ) {
        cerr << "ERROR: Cannot chdir into directory <" << fullpath.c_str()
             << "> (" << strerror(errno) << ")" << endl
             << "ERROR: Level " << level << endl;
        return;
    }
 
    // Keep track of directory depth level
    depth_level = level+1;

    // Create files in current directory
    nr_files = (gen_rand32() % opt_number_files)+1;
    for ( i=1; i<=nr_files; i++ ) {
        size = gen_rand32() % opt_max_filesize;
        if ( ! create_random_file( size, opt_max_namelength ) ) {
            cerr << "ERROR: Cannot create file (" << strerror(errno)
                 << "). Aborting." << endl;
            return;
        }
    }

    // Check for depth, we only create directories when not scratching the depth limit.
    if ( depth_level < opt_dirlevels ) {
        // Create random number of directories
        nr_directories = (gen_rand32() % opt_number_directories)+1;
        for ( i=1; i<=nr_directories; i++ ) {
            // Create name and directory
            dirname = create_random_string(opt_max_namelength);
            if ( mkdir( dirname.c_str(), S_IRWXU|S_IRWXG|S_IROTH|S_IXOTH ) != -1 ) {
                // Populate directory
                fs::path newpath = path / dirname;
                if ( opt_debug > 0 ) {
                    cout << "DEBUG: New path <"
                         << newpath.root_directory() + newpath.relative_path().string()
                         << ">" << endl;
                }
                populate( newpath, depth_level );
                // Change to upper directory again. This is important since populate() chdirs into
                // a deeper directory and we can't climb up again if we don't do a second chdir()
                // after the function returns.
                if ( chdir( fullpath.c_str() ) != 0 ) {
                    cerr << "ERROR: Cannot chdir into directory <"
                         << fullpath.c_str() << "> (" 
                         << strerror(errno) << ")" << endl;
                }
            }
            else {
                cerr << "ERROR: Cannot create directory (" << strerror(errno) 
                     << "). Aborting." << endl;
                return;
            }
        }
    }

    return;
}

You can find the full source along with a Makefile and the SFMT code in a downloadable archive. Please study the source, since I have only given the key ideas on the code. The Makefile might need some tweaking, because I used the latest Boost library (1.35 at the time of writing this article). I only used basic features, so you should be fine with older Boost libraries. I added some variants for the CFLAGS and LDFLAGS to the Makefile.

Be careful about the limits when trying the code. Being too generous results in really large amounts of data written to the filesystem. Consider the directory limit. Ext3 has a subdirectory limit of 32768, so it's probably not wise to test-drive the full range of the directory limit option, unless you have some terabytes to spare.

Notes

Of course, the shell script is fine, and you really don't need SSE2 for this task. But TMTOWTDI isn't the privilege of Perl alone. ;-)

And please test Ext4. It's a great filesystem, and it needs more feedback: Without feedback and users, code cannot improve. Maybe fspopulate can help you with testing.

Useful Resources


Talkback: Discuss this article with The Answer Gang


Bio picture

René was born in the year of Atari's founding and the release of the game Pong. Since his early youth he started taking things apart to see how they work. He couldn't even pass construction sites without looking for electrical wires that might seem interesting. The interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 byte RAM and a 4096 byte operating system, forcing him to learn assembler before any other language.

After finishing school he went to university in order to study physics. He then collected experiences with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS and finally GNU/Linux on a PC in 1997. He is using Linux since this day and still likes to take things apart und put them together again. Freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.

Since 1999 he is offering his skills as a freelancer. His main activities include system/network administration, scripting and consulting. In 2001 he started to give lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware and talking to network equipment he is fond of scuba diving, writing, or photographing with his digital camera. He would like to have a go at storytelling and roleplaying again as soon as he finds some more spare time on his backup devices.


Copyright © 2008, René Pfeiffer. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 153 of Linux Gazette, August 2008

Tux