The Answer Guy 33: Shuffling Lines in a File
"Linux Gazette...making Linux just a little more fun!"
Shuffling Lines in a File
From David Stanaway on the
Linux Programmers Support Team mailing list
on 20 Sep 1998
Now I'm trying to shuffle the order of the lines in a text file
without reading in the whole file... Does anyone have any advice, code,
etc on this? If I can read in the whole file, this is simple, but I might
want to shuffle a file several megs long.
What do you mean by shuffle?
I think he means something like: randomly or arbtrarily
reorder the lines of the file without reading the whole
thing into RAM/core.
I think the approach I'd take is to lock the file from
access by whatever programs and/or processes are intended to
read the data out of it.
Then I'd "index" the file --- search through it finding all
of the line boundary offsets and their lengths. I'd then
use an standard shuffling techniques on that index file.
The problem with "shuffling" a normal text file on line
boundaries is the variable record lengths. So we create a
table of offsets and lengths to those --- and all of the
offset/length pairs are of a fixed size.
So I could use the index file and "shuffle" it with the
following psuedo code:
- open index file
- while read index file entry (readbuf)
- pick a random place to put it
load the "place to put it" entry (writebuf)
swap these entries in read and write buf.
write both buffers
If the intent is to shuffle the files by some other criteria
(arbitrary vs. random) when you'd modify the above algorithm
accordingly. If the criteria for resequencing has to do
with the data in the files (i.e. your "sorting" the file)
you'd have a bit more work ahead of you.
... actually I'd optimize this a bit by read x entries into
a buffer, for looping through that, and maintain a few write
bufs into random locations into the file. For example I
might load 100 entries in the read buffer and up to ten
unique randomly selected write buffers. For each of the 100
read buffer entries I'd randomly select among the open write
buffers (1 to 10) and randomly select a place in that buffer
to put it). At the end of the for loop I'd write everything
back out, read the next read buff, select more write buffs,
and so on until the end of the file.
Every entry in the index file will have been exchanged with
some random entry at least once --- and the average will be
two. There is a small chance that a given entry would be
swapped out of and back into the same location (which is
usually a good feature of a shuffling algorithm).
Then I'd open the original text file and the shuffled index
file and I'd walk through the shuffle file sequentially
reading offset/length pairs and using them to seek into the
text file and copy to a new file. After each seek I'd do
one sanity check --- it there should be a newline there, and
as I was copying I'd do another, there should be no newlines
between my offset and the end of my length. I'd abend with
an error message if either sanity check failed, or if any
seek failed (the original file was shortened while I was
shuffling).
Finally I'd mv the new file back into place.
This algorithm assumes that you have files with variable
length records delimited by newlines. It also assumes that
you are not disk space constrained (that you have at least
enough room to make one full copy of the file to be shuffled
+ enough for an index file. Oddly enough the index file
could, in some degenerate circumstances be several times the
size of the original file. (that happens if all of the
lines in the old file were only zero or one characters long
and that your offsets and lengths are 32 bits each.
Note that I chose to use a file for the index rather than
RAM. If I'm guaranteed that the file will have a
"reasonable" number of lines I can build that in memory ---
thus simplifying the code somewhat. I chose the method that
I describe so that you could as easily shuffle
multi-gigabyte files as multi-megabyte.
The whole program could probably run in less than a 100K and
work on any size file that was supported by your OS.
You could also look at the sources for the GNU 'sort'
utility. I handles arbitrarily large inputs (using
sequences of temp files which then merged together).
If you open a file for reading, the only space it takes up is the read
buffer, so if you read a line at a time, the memory usage depends on how
you are shuffling.
If you wanted to reverse the file, you could jsut be writing the lines
you read to another file.
[deletia]
Then you may like to read the source file from the tail first. I don't
know how to do this in C, or C++, but it is possible in Java.
There is a program called tac ("cat" backwards) which does
exactly this. I'm sure it's written in C and the sources
can be found at any good GNU or BSD software archive.
You really need to say more about what you mean by <Shuffle>
David Stanaway
I think the term is sufficiently unambiguous.
Shuffle: to resequence. to place a group of objects into
some arbitrary or random order.
The problem at hand is a classic CS homework assignment. It
has quite a bit to do with the variable length nature of the
objects to be sorted. We can't do this with "in place"
editing (arbitrary seeks and writes into the orginal file)
because the record we're trying to move might overwrite two
or more record fragments at its destination.
When you are editing a file (the whole thing being in
memory) there are ways that the editor's buffer handling
handles the issue --- look at the sources to 'vi' or
some other smaller, simpler editor and find out how they
"delete a line" in terms of their internal data structures.
These don't work well for files since you might end up
re-writing from the current offset to the end of the file
for each replacement.
If the lines are of a fixed length it is much easier, we can
skip the indexing step and we can, if we wish, shuffle the
file "in place" --- without the copying. Naturally we'll
still want to lock the file (or move it to someplace where
other processes and programs won't be giving us concurrency
fits).
Copyright © 1998, James T. Dennis
Published in Linux Gazette Issue 33 October 1998