
Tool to populate a filesystem?

René Pfeiffer [lynx at luchs.at]


Wed, 16 Jul 2008 23:46:06 +0200

Hello!

I am playing with 2.6.26, e2fsprogs 1.41, and ext4, just to see what ext4 can do and what workloads it can handle. Do you know of any tools that can populate a filesystem with a random number of files filled with random data, stored in a random number of directories? I know that some benchmarking tools do a lot of file/directory creation, but I'd like to create a random filesystem layout with data, so I can use the usual tools such as tar, rsync, cp, and others to do the things you usually do when setting up, backing up, and restoring servers.

Curious, René.




Ben Okopnik [ben at linuxgazette.net]


Wed, 16 Jul 2008 18:59:37 -0400

On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:

> Hello!
> 
> I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> can do and what workloads it can handle. Do you know of any tools that
> can populate a filesystem with a random amount of files filled with
> random data stored a random amount of directories? I know that some
> benchmarking tools do a lot of file/directory creation, but I'd like to
> create a random filesystem layout with data, so I can use the usual
> tools such as tar, rsync, cp and others to do things you usually do when
> setting up, backuping and restoring servers.

That would be a fairly easy shell script: a loop that a) creates a random number of files of random size, b) creates a random number of directories, c) dives into all the subdirectories that were created, and d) repeats the process. The only thing is that you would have to set some hard limits: 1) how deep do you want to go, 2) max number for directories and files, and 3) max file size. Recursive functions of this sort will run away from you very quickly if you don't watch them.

Here's a quick hack that should do it (not very well tested, but should work OK). Again, you'll need to set the vars at the top as per your requirements.

#!/bin/bash
# Created by Ben Okopnik on Wed Jul 16 18:04:33 EDT 2008
 
########    User settings     ############
MAXDIRS=5
MAXDEPTH=2
MAXFILES=10
MAXSIZE=1000
######## End of user settings ############
 
# How deep in the file system are we now?
TOP=`pwd|tr -cd '/'|wc -c`
 
populate() {
	cd "$1" || return
 
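	# $RANDOM ranges over 0..32767, so this scales it to 0..$MAXFILES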
	files=$(($RANDOM*$MAXFILES/32767))
	for n in `seq $files`
	do
		f=`mktemp XXXXXX`
		size=$(($RANDOM*$MAXSIZE/32767))
		head -c $size /dev/urandom > "$f"
	done
 
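	# Stop recursing once we are $MAXDEPTH levels below the starting dir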
	depth=`pwd|tr -cd '/'|wc -c`
	if [ $(($depth-$TOP)) -ge $MAXDEPTH ]
	then
		return
	fi
 
	unset dirlist
	dirs=$(($RANDOM*$MAXDIRS/32767))
	for n in `seq $dirs`
	do
		d=`mktemp -d XXXXXX`
		dirlist="$dirlist${dirlist:+ }$PWD/$d"
	done
 
	for dir in $dirlist
	do
		populate "$dir"
	done
}
 
populate "$PWD"
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




René Pfeiffer [lynx at luchs.at]


Thu, 17 Jul 2008 01:16:56 +0200

On Jul 16, 2008 at 1859 -0400, Ben Okopnik appeared and said:

> On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:
> > [...]
> > I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> > can do and what workloads it can handle. Do you know of any tools that
> > can populate a filesystem with a random amount of files [...]
> 
> That would be a fairly easy shell script: a loop that a) creates a
> random number of files of random size, b) creates a random number of
> directories, c) dives into all the subdirectories that were created, and
> d) repeats the process.

I thought you would say something like that. :) It seems I was thinking in terms of tools rather than of what one would need to do to solve this problem. I'll try your script tomorrow and see what it does. I am curious to see how long it takes a Bash script to create the files.

> The only thing is that you would have to set some hard limits: 1) how
> deep do you want to go, 2) max number for directories and files, and
> 3) max file size. [...]

Yes, and it's wise to start with low numbers. 32768 came to mind as an initial default, but 32768^2, if you use it for both the max number of directories and the max number of files, is a bit of overkill. :)

Thanks, René.




Ben Okopnik [ben at linuxgazette.net]


Wed, 16 Jul 2008 19:48:48 -0400

On Thu, Jul 17, 2008 at 01:16:56AM +0200, René Pfeiffer wrote:

> On Jul 16, 2008 at 1859 -0400, Ben Okopnik appeared and said:
> > On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:
> > > [...]
> > > I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> > > can do and what workloads it can handle. Do you know of any tools that
> > > can populate a filesystem with a random amount of files [...]
> > 
> > That would be a fairly easy shell script: a loop that a) creates a
> > random number of files of random size, b) creates a random number of
> > directories, c) dives into all the subdirectories that were created, and
> > d) repeats the process.
> 
> I thought you would say something like that. :) 

You know me so well. :)

> It seems I was rather
> thinking in tools than in what one would need to do to solve this
> problem. I'll try your script tomorrow and see what it does. I am
> curious to see how long it takes for a Bash script to create the files.

Running it repeatedly with

MAXDIRS=5
MAXDEPTH=5
MAXFILES=100
MAXSIZE=1000

gives me times ranging from 1/3 of a second (no directories created) to 13.5 seconds. The maximum for that depth would probably still be under a minute. The shell isn't the fastest thing in the world, but there's not a whole lot of calculation going on here.
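
For reference, I got those numbers by just wrapping the script in 'time' - something like the lines below, where 'populate.sh' and the scratch directory are hypothetical names; use whatever you've picked:

mkdir /tmp/popdir && cd /tmp/popdir   # assumed scratch directory - adjust to taste
time bash /path/to/populate.sh        # assumed script name - adjust to taste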

> > The only thing is that you would have to set some hard limits: 1) how
> > deep do you want to go, 2) max number for directories and files, and
> > 3) max file size. [...]
> 
> Yes, and it's wise to start with low numbers. As initial defaults 32768
> came to my mind, but 32768^2 if you use it for both max number for
> directories and files is a bit of an overkill. :)

[laugh] And would take a good long while, as well - no matter what language you used. A tree that's 32768 directories deep, with a max of 32768 subdirectories and the same max of files in each one would be... ridiculously large - 32768^32768 + 32768^32767 and so on (or something of that sort. It's aggravating to be this stupid at math, but you either use it or you lose it.) A structure that's 5 dirs deep and 10 wide would have a maximum of over 100k directories - 100k+10k+1k+100+10 - and that's before even counting the files.
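
For the record, the exact maximum number of directories for branching factor B and depth D is the geometric sum B + B^2 + ... + B^D. A quick check with bc for the 5-deep, 10-wide case:

echo '10 + 10^2 + 10^3 + 10^4 + 10^5' | bc
# prints 111110 - just over 100k, as above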

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




René Pfeiffer [lynx at luchs.at]


Thu, 17 Jul 2008 14:49:58 +0200

On Jul 16, 2008 at 1948 -0400, Ben Okopnik appeared and said:

> [...]
> Running it repeatedly with
>
> MAXDIRS=5
> MAXDEPTH=5
> MAXFILES=100
> MAXSIZE=1000
>
> gives me times ranging from 1/3 of a second (no directories created) to
> 13.5 seconds. The maximum for that depth would probably still be under a
> minute. [...]

Sounds reasonable. I don't think I will create lots of filesystems once I have one for testing.

> > [...]
> > Yes, and it's wise to start with low numbers. As initial defaults 32768
> > came to my mind, but 32768^2 if you use it for both max number for
> > directories and files is a bit of an overkill. :)
>
> [laugh] And would take a good long while, as well - no matter what
> language you used. A tree that's 32768 directories deep, with a max of
> 32768 subdirectories and the same max of files in each one would be...
> ridiculously large - 32768^32768 + 32768^32767 and so on (or something
> of that sort. It's aggravating to be this stupid at math, [...]

No, it's not. Physicists usually assume π = e = 3 unless they need more exact results. ;)

> [...] A structure that's 5 dirs deep and 10 wide would
> have a maximum of over 100k directories - 100k+10k+1k+100+10 - and
> that's before even counting the files.

Well, big filesystems reach this size. I just counted the contents of our backup server, which holds copies of 8 production systems. It uses 132 GB out of 1.2 TB (freshly installed in May). The filesystem holds 1,419,834 files and 88,564 directories. One of the external disks that holds incremental backups (realised through hard links and separate directories) holds even more files and directories. So it's not that rare to run into big filesystems, at least for backup servers. 'find . -type f | wc -l' takes ages (I've been waiting for 30 minutes now and I still have no file count).

It's a shame people don't delete data anymore. ;)

Best, René.




Ben Okopnik [ben at linuxgazette.net]


Thu, 17 Jul 2008 13:06:16 -0400

On Thu, Jul 17, 2008 at 02:49:58PM +0200, René Pfeiffer wrote:

> On Jul 16, 2008 at 1948 -0400, Ben Okopnik appeared and said:
> > [...]
> > Running it repeatedly with
> > 
> > MAXDIRS=5
> > MAXDEPTH=5
> > MAXFILES=100
> > MAXSIZE=1000
> > 
> > gives me times ranging from 1/3 of a second (no directories created) to
> > 13.5 seconds. The maximum for that depth would probably still be under a
> > minute. [...]
> 
> Sounds reasonable. I don't think I will create lots of filesystems once
> I have one for testing.

Since I used 'mktemp' to create the files and the directories, you can run this script multiple times to give yourself even more granular control. It'll 'back off and redo' if the directory or filename already exists.
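
E.g., something like this - just a sketch, assuming the script was saved as 'populate.sh' and you're sitting at the top of your test tree:

for i in 1 2 3; do bash /path/to/populate.sh; done   # each pass scatters more random files/dirs over the same tree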

> > [...] A structure that's 5 dirs deep and 10 wide would
> > have a maximum of over 100k directories - 100k+10k+1k+100+10 - and
> > that's before even counting the files.
> 
> Well, big filesystems reach this size. 

That was sort of my point, backwards. :) Setting 'MAXDEPTH' to much more than 5 will give you ridiculous runtimes and numbers.

> I just counted the content of our
> backup server that holds copies of 8 productions systems. It uses 132 GB
> out of 1.2 TB (freshly installed in May). The filesystem holds 1,419,834
> files and 88,564 directories. 

So, maybe MAXDEPTH=5 and MAXDIRS=11? That gives you a maximum of 177,155 directories, or an average of 88,577. For files, you could do MAXFILES=32 and MAXSIZE=200000.
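
A quick sanity check with bc:

echo '11 + 11^2 + 11^3 + 11^4 + 11^5' | bc
# prints 177155

so the arithmetic holds up.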

> One of the external disks that hold
> incremental backups (realised through hard links and separate
> directories) holds even more files and directories. So it's not that rare to run
> into big filesystems, at least for backup servers. $(find . -type f |
> wc -l) takes ages (I waited for 30 minutes now and I still have no files
> count).

I suspect that 'du -s' would be a lot faster.

> It's a shame people don't delete data anymore. ;)

[laugh] Drive space is cheap, right? And Shakespeare's entire life's work fits into 5MB or so. What's the big deal?

Yeah, we've got serious information pollution going on - and no way to really stop it, given the associated psychological factors. We'll just have to scrape along somehow... just like we've managed for the past few million years.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 08:35:56 -0400

On Sat, Jul 19, 2008 at 12:42:18AM +1200, Chris Bannister wrote:

> On Thu, Jul 17, 2008 at 01:16:56AM +0200, René Pfeiffer wrote:
> > On Jul 16, 2008 at 1859 -0400, Ben Okopnik appeared and said:
> > > On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:
> > > > [...]
> > > > I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> > > > can do and what workloads it can handle. Do you know of any tools that
> > > > can populate a filesystem with a random amount of files [...]
> > > 
> > > That would be a fairly easy shell script: a loop that a) creates a
> > > random number of files of random size, b) creates a random number of
> > > directories, c) dives into all the subdirectories that were created, and
> > > d) repeats the process.
> > 
> > I thought you would say something like that. :) It seems I was rather
> 
> I was expecting a Perl script. :-)

[laugh] I generally try to avoid driving screws with a hammer, or pounding nails in with a screwdriver. My rule of thumb is, "if it's got a lot of filesystem operations and not a lot of calculation or processing, it's a shell script; otherwise, it's Perl."

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Chris Bannister [mockingbird at earthlight.co.nz]


Sat, 19 Jul 2008 00:42:18 +1200

On Thu, Jul 17, 2008 at 01:16:56AM +0200, René Pfeiffer wrote:

> On Jul 16, 2008 at 1859 -0400, Ben Okopnik appeared and said:
> > On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:
> > > [...]
> > > I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> > > can do and what workloads it can handle. Do you know of any tools that
> > > can populate a filesystem with a random amount of files [...]
> > 
> > That would be a fairly easy shell script: a loop that a) creates a
> > random number of files of random size, b) creates a random number of
> > directories, c) dives into all the subdirectories that were created, and
> > d) repeats the process.
> 
> I thought you would say something like that. :) It seems I was rather

I was expecting a Perl script. :-)

-- 
Chris.
======
"One, with God, is always a majority, but many a martyr has been burned
   at the stake while the votes were being counted."  -- Thomas B. Reed




René Pfeiffer [lynx at luchs.at]


Fri, 18 Jul 2008 18:15:48 +0200

On Jul 18, 2008 at 0835 -0400, Ben Okopnik appeared and said:

> On Sat, Jul 19, 2008 at 12:42:18AM +1200, Chris Bannister wrote:
> > [...]
> > I was expecting a Perl script. :-)
>
> [laugh] I generally try to avoid driving screws with a hammer, or
> pounding nails in with a screwdriver. My rule of thumb is, "if it's got
> a lot of filesystem operations and not a lot of calculation or
> processing, it's a shell script; otherwise, it's Perl."

What about awk? ;)

So, now I checked your Bash script. It works, and it was a wonderful template for getting slightly insane and coding a population tool in C++ with an SSE2-capable Mersenne Twister PRNG linked in for some extra-fast randomness. The latest test run already filled an NFS share used for temporary data. :) I'll polish the source, write some text glue, and submit it for LG's "Getting Decently Distracted with Simple Problems, Screws and Hammers" column.

Best, René.

P.S.: You might have guessed that I tried avoiding some real problems, such as repairing a site with PHP code that can't properly deal with Unicode strings. ;)




Ben Okopnik [ben at linuxgazette.net]


Fri, 18 Jul 2008 16:55:36 -0400

On Fri, Jul 18, 2008 at 06:15:48PM +0200, René Pfeiffer wrote:

> On Jul 18, 2008 at 0835 -0400, Ben Okopnik appeared and said:
> > On Sat, Jul 19, 2008 at 12:42:18AM +1200, Chris Bannister wrote:
> > > [...]
> > > I was expecting a Perl script. :-)
> > 
> > [laugh] I generally try to avoid driving screws with a hammer, or
> > pounding nails in with a screwdriver. My rule of thumb is, "if it's got
> > a lot of filesystem operations and not a lot of calculation or
> > processing, it's a shell script; otherwise, it's Perl."
> 
> What about awk? ;)

There's supposed to be an actual use for that stuff??? :)

Yeah, I use it once in a very rare while - when I have a very specific and very narrow task to do (e.g., printing a 'field' on a line that matches a regex) and feel too lazy to type the few extra characters.

# Print the 3rd field in colon-delimited file 'foo' where the line
# begins with 'X'
awk -F: '/^X/{print $3}' foo
perl -F: -walne'print $F[2] if /^X/' foo

Other than that... well, keeping backup tools in your mental toolbox is the Unix way. Besides, I have to teach people about the stuff.

> So, now I checked your Bash script. It works and it was a wonderful
> template for getting slightly insane and coding a population tool in C++
> with a SSE2-capable Mersenne Twister PRNG linked in for some extra fast
> randomness. The latest test run already filled a NFS share used for
> temporary data. :) I'll polish the source, write some text glue and
> submit it for LG's "Getting Decently Distracted with Simple Problems,
> Screws and Hammers" column.

See, this is why I like robust tools. "...so then, I used these Vise-Grips to hammer on the chain coupler and broke it loose, which prevented the fission reaction and saved the world from a nuclear winter." You never know how someone is going to creatively misuse your code. :)

> P.S.: You might have guessed that I tried avoiding some real problems,
> such as repairing a site with PHP code that can't properly deal with 
> Unicode strings. ;)

PHP and Unicode strings. Isn't this like that door in the movie, with all the chains and warnings and skulls/crossbones on it that the "spear carrier" type decides to open while everybody in the audience is going "NO! NO! DON'T DO IT!!!"... only to hear gristly crunching and squishing sounds a moment later?

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




René Pfeiffer [lynx at luchs.at]


Fri, 18 Jul 2008 23:18:57 +0200

On Jul 18, 2008 at 1655 -0400, Ben Okopnik appeared and said:

> On Fri, Jul 18, 2008 at 06:15:48PM +0200, René Pfeiffer wrote:
> > [...]
> > What about awk? ;)
>
> There's supposed to be an actual use for that stuff??? :)
>
> Yeah, I use it once in a very rare while - when I have a very specific
> and very narrow task to do (i.e., printing a 'field' on a line that
> matches a regex) and feel too lazy to type the few extra characters.

That's about the only occasion when I use awk, too.

> [...]
> > I'll polish the source, write some text glue and
> > submit it for LG's "Getting Decently Distracted with Simple Problems,
> > Screws and Hammers" column.
>
> See, this is why I like robust tools. "...so then, I used these
> Vise-Grips to hammer on the chain coupler and broke it loose, which
> prevented the fission reaction and saved the world from a nuclear
> winter." You never know how someone is going to creatively misuse your
> code. :)

Where's your sense of adventure? Do you know why engineers, and not physicists, run nuclear power plants? The physicists would only replace the circuit breakers with copper rods and start the experiments. :)

> [...]
> PHP and Unicode strings. Isn't this like that door in the movie, with
> all the chains and warnings and skulls/crossbones on it that the "spear
> carrier" type decides to open while everybody in the audience is going
> "NO! NO! DON'T DO IT!!!"... only to hear gristly crunching and squishing
> sounds a moment later?

Exactly, and after I read http://www.phpwact.org/php/i18n/utf-8 I gave up all hope. It's a mess. I think the PHP developers shouldn't run nuclear power plants either.

Best, René.




Chris Bannister [mockingbird at earthlight.co.nz]


Sun, 20 Jul 2008 11:22:51 +1200

On Fri, Jul 18, 2008 at 08:35:56AM -0400, Ben Okopnik wrote:

> On Sat, Jul 19, 2008 at 12:42:18AM +1200, Chris Bannister wrote:
> > On Thu, Jul 17, 2008 at 01:16:56AM +0200, René Pfeiffer wrote:
> > > On Jul 16, 2008 at 1859 -0400, Ben Okopnik appeared and said:
> > > > On Wed, Jul 16, 2008 at 11:46:06PM +0200, René Pfeiffer wrote:
> > > > > [...]
> > > > > I am playing with 2.6.26, e2fsprogs 1.41 and ext4, just to see what ext4
> > > > > can do and what workloads it can handle. Do you know of any tools that
> > > > > can populate a filesystem with a random amount of files [...]
> > > > 
> > > > That would be a fairly easy shell script: a loop that a) creates a
> > > > random number of files of random size, b) creates a random number of
> > > > directories, c) dives into all the subdirectories that were created, and
> > > > d) repeats the process.
> > > 
> > > I thought you would say something like that. :) It seems I was rather
> > 
> > I was expecting a Perl script. :-)
> 
> [laugh] I generally try to avoid driving screws with a hammer, or
> pounding nails in with a screwdriver. My rule of thumb is, "if it's got
> a lot of filesystem operations and not a lot of calculation or
> processing, it's a shell script; otherwise, it's Perl."

Nice tip.

I've been marvelling over:

	http://www.stonehenge.com/merlyn/PerlJournal/
	http://www.stonehenge.com/merlyn/UnixReview/
	http://www.stonehenge.com/merlyn/WebTechniques/

Good reading.

-- 
Chris.
======
"One, with God, is always a majority, but many a martyr has been burned
   at the stake while the votes were being counted."  -- Thomas B. Reed




Ben Okopnik [ben at linuxgazette.net]


Sat, 19 Jul 2008 22:13:13 -0400

On Sun, Jul 20, 2008 at 11:22:51AM +1200, Chris Bannister wrote:

> On Fri, Jul 18, 2008 at 08:35:56AM -0400, Ben Okopnik wrote:
> > 
> > [laugh] I generally try to avoid driving screws with a hammer, or
> > pounding nails in with a screwdriver. My rule of thumb is, "if it's got
> > a lot of filesystem operations and not a lot of calculation or
> > processing, it's a shell script; otherwise, it's Perl."
> 
> Nice tip.
> 
> I've been marvelling over:
> 
> 	http://www.stonehenge.com/merlyn/PerlJournal/
> 	http://www.stonehenge.com/merlyn/UnixReview/
> 	http://www.stonehenge.com/merlyn/WebTechniques/
> 
> Good reading.

Oh, Randal is one of the amazing people in the Perl community. Nice guy, brilliant coder, excellent teacher... one of the pillars [no pun intended], really. I've got a whole lot of respect for the man.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *

