Re-compress your gzipp'ed files to bzip2 using a Bash script (HOWTO) LG #123

By Dave Bechtel

If you were incredibly lucky (like me), perhaps you received an external USB hard drive for Christmas. Or perhaps you have one lying around already, with plenty of free space. And perhaps you also read the recent Slashdot article about compression software and have lots of fairly sizable gzipped files lying about.

After reading the comments in that article, I was dismayed to learn that my compression tool of choice (gzip) has no error-recovery capabilities. While I deem it to be the best all-around for quick backups with a decent compression ratio, gzip will choke if it gets a data error on restore - and there's something to be said for data integrity.

So, having this nice shiny new USB external drive and some time on my hands, I wrote a Bash utility script to re-compress gzip files to bzip2, using the external drive. It takes an order of magnitude longer to compress, but at least I'll save some space and have a hope of recovering the compressed data if things go wrong (bzip2 compresses in independent blocks, and its companion tool bzip2recover can salvage the undamaged ones)... Right??

My particular external drive is a 120-gig that came factory-formatted as a single FAT32 partition. Now, any Linux guru worth their salt knows that this thing practically begs to be customized, since FAT32 has a 2GB (Linux) or 4GB (Windows) file size limit - depending on who's writing to it.

So, I fired up my Knoppix HD install and repartitioned it. Nothing fancy, just good old fdisk.

Here's how it looks now:

$ fdisk -l /dev/sdb

Disk /dev/sdb: 255 heads, 63 sectors, 14593 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sdb1   *         1         1      8032   83  Linux
/dev/sdb2             2     14593 117210240    f  Win95 Ext'd (LBA)
/dev/sdb5             2        18    136552   82  Linux swap
/dev/sdb6            19      4999  40009882    c  Win95 FAT32 (LBA)
/dev/sdb7          5000      5622   5004247   83  Linux
/dev/sdb8          5623     14593  72059557   83  Linux

(I did make a note of the fact that the factory-default was one big type "c", in case I needed to go back to that.)
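
As a quick sanity check on the listing, a partition's size can be worked out from its cylinder range and the "Units" line. The sketch below does this for /dev/sdb6; the variable names are just for illustration, and the figures are copied from the fdisk output above.

```shell
#!/bin/bash
# Size of /dev/sdb6 from the fdisk listing: cylinders 19..4999 inclusive.
unit_bytes=$((16065 * 512))        # bytes per cylinder, from the "Units" line
cylinders=$((4999 - 19 + 1))       # start and end cylinders are both counted
size_bytes=$((cylinders * unit_bytes))
echo "$cylinders cylinders = $size_bytes bytes (about $((size_bytes / 1000000000)) GB)"
```

The result agrees with the "Blocks" column, which counts 1K blocks.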

Notice the 40GB FAT32 partition. In my other life (sssshhh!) I run Windows 2000 Professional - and was forcibly reminded that everything after Windows ME has a 32GB partition size limit for formatting FAT32. Note that the limitation is on formatting, not accessing. This is by design, and Microsoft has publicly admitted it.

After going through several free Windows tools for formatting and repartitioning (and running into a brick wall), I eventually gave up on getting Windows 2000 to format the thing. The vendor has a utility on their website to restore the drive to factory-default partitioning, but that doesn't really help my intended use of the drive. I could have formatted it in Windows 98, but that's no fun - and it would need a separate driver for the OS to recognize the drive.

So, rather than give up a perfectly usable 8GB, good old Linux to the rescue again:

$ mkdosfs -F 32 -v -n wdfat40 /dev/sdb6

and reboot.

Presto! Windows 2000 recognizes the drive just fine now, and it passes all the chkdsk tests. And for all you dual-booters out there, a wonderful utility exists called Ext2IFS ( http://www.fs-driver.org/ ). This allows NT-based systems like Windows 2000 to access ext2/ext3 partitions just like a regular drive - read/write, so no need for NTFS!

The Linux partitions were formatted like so:

mke2fs -j -c -m1 -v /dev/sdbX

Here are the /etc/fstab entries I created for the drive, BTW:

/dev/sdb6  /mnt/wdfat40  auto defaults,noauto,noatime,user,suid,noexec,uid=dave 0 0
/dev/sdb7  /mnt/wdlinux  ext3 defaults,noauto,noatime,rw 0 0
/dev/sdb8  /mnt/wdvast  ext3 defaults,noauto,noatime,rw 0 0

Note the "uid=dave" in that first line. That's so my non-root user account will have write access to the drive by default.

Now onto the good part - the "rezip" Bash script.

At first, I started out by writing a fairly basic script with a simple function call and manually-entered filenames. Then I sat down and took another look at it - and practically rewrote it from scratch, with some features that occurred to me after several test runs.

rezip currently features:

  • Uses a simple text file of paths and filenames for input -- so you can save the results of "find" to a file, run rezip, and the files will be re-compressed one at a time, with a running log and no user intervention (as long as there's free space on the destination drive.) Example:
    $ find /mnt/bkps -name \*.gz > ~/rezipp-files.txt && rezip
    
  • Automatically sorts the files to process by size, so the biggest files are last. This allows more work to get done up front. (Believe me, this is a consideration when your fastest computer is a 900MHz AMD Duron.)
  • Skips files less than 50MB in size (user-defined)
  • Recreates existing directory structure on the external drive and leaves the original .gz file in place
  • By default, does not overwrite existing .bz2 files so previous work doesn't get run over. This feature was added after I found a bug where ^C won't stop the script right away, and several hours of .bz2 output were lost. :(

    Note: if you abort the script and then re-run it, you have to manually delete the last (partial) .bz2 file it was working on, or that will be skipped as well. This is where the log comes in handy. :)

  • Heavily commented and fairly easy-to-understand (I hope!) source code
  • Generates a log file, including start/end times per-file
  • ...And last but not least, rezip is released under a GPL license. :)
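
For readers who want a feel for the moving parts before opening the real script, the features above could be sketched roughly as follows. This is NOT the rezip source: the function names (rezip_list, recompress_one), the destination path, and the assumption that "skipsize" is measured in KB are all mine, for illustration only.

```shell
#!/bin/bash
# Hypothetical sketch of the behaviour listed above, not the actual rezip code.
skipsize=${skipsize:-50000}               # skip files smaller than this many KB (assumed unit)
destroot=${destroot:-/mnt/wdvast/rezip}   # assumed destination on the external drive
logfile=${logfile:-rezip.log}

# Recompress one .gz file to .bz2 under $destroot, mirroring its directory path.
recompress_one() {
    local src=$1
    local dest="$destroot/${src%.gz}.bz2"
    if [ -e "$dest" ]; then                # never overwrite earlier work
        echo "SKIP (exists): $dest" >> "$logfile"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    echo "START $(date) $src" >> "$logfile"
    gzip -dc "$src" | bzip2 -c > "$dest"   # the original .gz is left in place
    echo "END   $(date) $dest" >> "$logfile"
}

# Read a list of paths (one per line) and process them smallest first.
# Paths containing tabs or newlines are not handled by this sketch.
rezip_list() {
    local list=$1 size file
    while IFS= read -r file; do
        # Emit "size-in-KB<TAB>path" for every listed file that still exists.
        [ -f "$file" ] && printf '%s\t%s\n' "$(( $(wc -c < "$file") / 1024 ))" "$file"
    done < "$list" | sort -n | while IFS=$'\t' read -r size file; do
        if [ "$size" -lt "$skipsize" ]; then
            echo "SKIP (small): $file" >> "$logfile"
        else
            recompress_one "$file"
        fi
    done
}
```

With these definitions loaded, something like ' rezip_list ~/rezipp-files.txt ' would walk the list produced by the "find" example above. Note that this sketch makes no attempt to deal with the ^C problem described below.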

-- KNOWN BUG(s):

  • The PROPER way to kill "rezip" when it is running is to press Ctrl-Z, then type
    $ kill %jobnumber
    
    -- Example:
    ^Z
    [1]+  Stopped                 rezip
    $ kill %1
    [1]+  Terminated              rezip
    

    If you DON'T do it that way, trust me - wacky things can happen. For example, the script will skip to the next file while gzip/bzip2 keeps running in the background. Don't use ^C.
  • The logger function (logecho) has trouble echoing stars (" * "), even when they are quoted.
  • The log file can get fairly large after several runs. If you want to reset it, either "rm" it or truncate it to 0 length with
    $ >rezip.log
    
    WARNING: If .gz files that were listed in rezipp-files.txt are deleted/moved between runs, you MUST re-do the "find" before re-running. Otherwise, unexpected results will probably occur.
  • I tried adding a feature to log when a recompress fails, after a test run encountered a bad .gz file. (This was a pain, and required several re-runs with a short, known-bad gzip file, looking up things in the bash man page, and much experimentation.) It logs the error now, but fails to notify the user that the job failed.
  • To create a known-bad .gz file of your own to test:
    $ dd if=any-gz-file-more-than-20MB.gz of=KNOWNBAD.gz bs=1M count=21
    
    and redo your "find" to include it.
    This creates a .gz file that is a partial copy of the complete one, and will cause gzip to abend with "Unexpected end of file." Set the "skipsize" variable to 20000 and run rezip, and it should log the error. If you can fix the script so that it notifies the user as well, let me know. ;-)
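
One possible approach to the notification problem, offered as a sketch rather than a patch to rezip: in a pipeline like "gzip -dc | bzip2 -c", the shell normally reports only bzip2's exit status, so a gzip failure goes unnoticed. Bash's PIPESTATUS array holds the status of every stage. The function name recompress_checked and the choice to delete the partial output are my assumptions, not something taken from the script.

```shell
#!/bin/bash
# Hypothetical helper: recompress SRC to DEST and report any failure loudly.
# Returns 0 on success, 1 if either gzip or bzip2 failed.
recompress_checked() {
    local src=$1 dest=$2 status
    gzip -dc "$src" | bzip2 -c > "$dest"
    status=("${PIPESTATUS[@]}")        # copy at once; the next command resets it
    if [ "${status[0]}" -ne 0 ] || [ "${status[1]}" -ne 0 ]; then
        # stderr reaches the terminal even when stdout is redirected to a log
        echo "FAILED: $src (gzip=${status[0]} bzip2=${status[1]})" >&2
        rm -f "$dest"                  # drop the partial .bz2 so a re-run retries it
        return 1
    fi
}
```

Running it against the KNOWNBAD.gz file described above should print a FAILED line and return non-zero, which a calling loop could count and summarize at the end of the run.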

During the course of writing the script, I had hard-coded most of the defaults, such as the size of files to skip, the log file name, etc. These were eventually changed to be variables before the script was published for LG - so that you, the end-user, can have More Control (TM) over its actions. ;-)

I encourage everyone to READ THE SOURCE CODE before running rezip. You may find it handy to view it in an editor that colorizes or highlights executable syntax, such as ' mcedit ' or ' jstar '.

Comments, feature requests, bug reports, etc., are welcome.

( Don't forget to ' chmod +x rezip ' and put it somewhere in your $PATH - /usr/local/bin is suggested. )

Bio: Born in 1972, Dave Bechtel grew up programming in Basic with Apple ][e's, TI99 4/A, IBM PC (640K!) and a Tandy 1000SX, none of which actually had hard drives -- 360K floppy only. And we LIKED IT! ;-)

Eventually left BASIC behind, and moved on to programming in REXX and Bash.

Got interested in Linux around 1997. Started with Red Hat and went on to SuSE, tried several other distros and a *BSD or two, and has now settled on Knoppix/Debian/Ubuntu, in roughly that order. Currently living in Lake Zurich, IL.

Likes: Computers, motorcycles, Linux, reading and watching sci-fi (currently Star Trek TOS, Stargate, and Battlestar Galactica)


Copyright © 2006, Dave Bechtel. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 123 of Linux Gazette, February 2006
