towards better estimations of "size in iso file" (free code)

Posts: 148
figosdev
Joined: 29 Jun 2017
#1
antix 17.b1 "full" has an iso that is 847,249,408 bytes in size. there is debate about whether it could be smaller. my interest in file sizes should make it obvious which side of that debate im on; but this is a technical post, not a political one.

of those 847,249,408 bytes, all but 22m are in a squashfs file called linuxfs. so this is really about that file, and the files that are in it.

if you decompress the file, you will find out how much space everything in it would take up once installed on your machine. the installation doesnt technically need to copy every file from there; but if it did, those files would take up 2,584,420,683 bytes, or 2.41g. squashfs gets the compressed file to just under 1/3 of the actual size.
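if you want to check those numbers yourself, something like this should do it as root (the mount points here are just examples i made up):

Code:

mkdir -p /mnt/iso /mnt/linuxfs
mount antiX-17.b1_386-full.iso /mnt/iso
du -b /mnt/iso/antiX/linuxfs     # compressed size of the squashfs
mount /mnt/iso/antiX/linuxfs /mnt/linuxfs
du -sb /mnt/linuxfs              # total uncompressed size of everything inside
umount /mnt/linuxfs /mnt/iso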

but not every file compresses the same; you cant (necessarily) pick 100 files at random, divide their size by 3 and take that off the cd size.

you can remove some files, compress the whole thing with mksquashfs, and then you will have the actual size-- but every time you do that, it takes 15-25 minutes to compress everything. wouldnt it be nice if we could do better? ive spent a few hours (more than 5) trying to figure out how. perhaps searching the internet would have been a better use of that time, but i enjoy experimenting-- i learn stuff that i can sometimes put to immediate use.
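for reference, that slow rebuild step is essentially this (directory and file names here are made up):

Code:

mksquashfs /root/edited-tree /root/linuxfs-new -comp xz   # the 15-25 minute step
du -b /root/linuxfs-new                                   # the only way to get the real compressed size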

i actually use mksquashfs, and when i do i use xz compression; i think thats typical. my first idea was to find out if compressing the files with xz would let us use tar or xz or unxz with --list...

nope. it tars ALL the files, then compresses that-- we only get the compressed size of the entire thing, and we already have that information; its the size of linuxfs (or basically the size of the iso.)
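to show what i mean (file names here are made up):

Code:

tar -cJf files.tar.xz -C /mnt/linuxfs .   # one xz stream over the whole tree
xz --list files.tar.xz                    # only reports on that single compressed stream
tar -tvf files.tar.xz                     # lists every file, but with UNCOMPRESSED sizes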

unsquashfs has -lls to list files without decompressing, but that doesnt help either. the size of libxul.so from -lls:

73,274,760

and the size if we mount linuxfs and du -b the same file:

73,274,760 ...thats the uncompressed size again.
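for example (the path to libxul.so is from memory, so treat it as an illustration):

Code:

unsquashfs -lls linuxfs | grep libxul.so             # lists without extracting, but sizes are uncompressed
mount linuxfs /mnt/linuxfs
du -b /mnt/linuxfs/usr/lib/firefox-esr/libxul.so     # the same uncompressed number again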


now suppose we mounted the iso, then mounted linuxfs, and used xz (with the default settings) to compress each file in it INDIVIDUALLY. how long would that take?

about an HOUR AND A HALF. thats on an i-series intel processor, not a core2. it might go faster on a solid-state drive, or with everything loaded into ram first.

the actual size of the linuxfs file: 824,418,304 bytes

and the total if each file inside it is run through xz -c | wc -c to get a compressed bytecount, and those counts are tallied?

886,843,744. using similar compression of INDIVIDUAL files, our total is off by 59.5m-- not bad; at first i thought id compressed with the wrong setting. (the fairly safe xz default is -6... it can go higher.)
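the per-file measurement is nothing fancy-- its just this, repeated for every file (the path is an example again):

Code:

f=/mnt/linuxfs/usr/lib/firefox-esr/libxul.so
du -b "$f"              # uncompressed size
xz -6 -c "$f" | wc -c   # estimated compressed size; -6 is the xz default level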

i havent tried compressing each file individually with mksquashfs. however, i did try running tar -cvJf on the whole thing, to find out if that was more efficient. i figured it would be (but it doesnt give us the information we need.)

the same files tarred and compressed for comparison: 698,450,164. now someones wondering what we get if we compress the antix 17 iso for a faster download. ok, i will do that for you:

xz antix17.iso ; du -b antix17.iso.xz   # antix17.iso is a copy of the antix 17 iso -- result: 834,423,476

a savings of only 12.2m. weird, huh? so to review:

linuxfs: 824,418,304

linuxfs unsquashed and tarred into one file, then compressed with xz: 698,450,164 (an entirely useless number for this.)

linuxfs unsquashed and then each file processed individually with xz (so we can estimate compressed size in linuxfs): 886,843,744

in linuxfs there are 131,430 files. on average, we have OVER-estimated the compressed size of each file by nearly 475 bytes (886,843,744 - 824,418,304 = 62,425,440 bytes, spread across 131,430 files).

practically speaking, that means that if we take our table of estimated compressed sizes and SUBTRACT the sizes of files we intend to delete, we OVER-estimate the savings and underestimate the final compressed size.

but it also suggests that if we take our table of estimated compressed sizes and ADD the remaining sizes (other than the ones we intend to delete) then we UNDER-estimate the savings and overestimate the final compressed size.

so if youre looking for a MUCH faster, more accurate way to estimate the iso size after compression, take the number 22m (23068672) and add the estimated compressed sizes of the files from this table that you plan to keep in the iso (theres a sketch of that sum below, after the table):

Code:

compressed      uncompressed    compressed total file
493132          1265272         493132           /root/squashdu/linuxfs/bin/bash
261604          621700          754736           /root/squashdu/linuxfs/bin/btrfs
130780          297444          885516           /root/squashdu/linuxfs/bin/btrfs-calc-size
142084          326180          1027600          /root/squashdu/linuxfs/bin/btrfs-convert
130928          297444          1158528          /root/squashdu/linuxfs/bin/btrfs-debug-tree
129164          293348          1287692          /root/squashdu/linuxfs/bin/btrfs-find-root
140940          322084          1428632          /root/squashdu/linuxfs/bin/btrfs-image
130208          297444          1558840          /root/squashdu/linuxfs/bin/btrfs-map-logical
128684          293348          1687524          /root/squashdu/linuxfs/bin/btrfs-select-super
131236          301764          1818760          /root/squashdu/linuxfs/bin/btrfs-show-super
128748          293348          1947508          /root/squashdu/linuxfs/bin/btrfs-zero-log
130588          297444          2078096          /root/squashdu/linuxfs/bin/btrfstune
13212           34480           2091308          /root/squashdu/linuxfs/bin/bunzip2
335072          625828          2426380          /root/squashdu/linuxfs/bin/busybox
13212           34480           2439592          /root/squashdu/linuxfs/bin/bzcat
988             2140            2440580          /root/squashdu/linuxfs/bin/bzdiff
2092            4877            2442672          /root/squashdu/linuxfs/bin/bzexe
1688            3642            2444360          /root/squashdu/linuxfs/bin/bzgrep
13212           34480           2457572          /root/squashdu/linuxfs/bin/bzip2

and you should get a pretty accurate guess there.
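heres a sketch of that sum, assuming youve saved the table to a file called sizes.txt and listed the full paths you plan to keep (one per line, exactly as they appear in the table) in keep.txt-- both file names are made up:

Code:

# sum column 1 (estimated compressed size) for every kept path, then add the 22m
awk -F'\t+' 'NR==FNR { keep[$0]=1; next } keep[$4] { sum += $1 } END { print sum + 23068672 }' keep.txt sizes.txt

the table fields are separated by tabs, which is why awk splits on tabs there.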

is this overkill? it depends on how many times youve created a squashfs file only to have too large an iso. if you want to guess how many files you need to get rid of to make an iso fit on a cd, this could get you closer, faster.

but it still takes one person / machine running this script for an hour and a half or more to get the table.

and no, i dont have the entire table; the first time i ran it, i didnt redirect the output to a file. but i timed it!

and here is the script, with newlines added here for readability. i normally keep it as a one-liner in the bash history and grep for it as needed.

Code:

iso="antiX-17.b1_386-full.iso" ; mkdir /root/squashdu ; mkdir /root/squashdu/linuxfs ; mkdir /root/squashdu/iso ; mount"$iso" /root/squashdu/iso/ ; mount /root/squashdu/iso/antiX/linuxfs /root/squashdu/linuxfs/ ; d=$(date) ; tot=0 ; echo -e"compressed\tuncompressed\tcompressed total" ; for p in $(find /root/squashdu/linuxfs -type f | cat -A | tr ' ' '^' | sed"s/\$$//g") ; do pf="$(echo $p | tr '^' ' ')" cs=$(xz"$pf" -c 2> /dev/null | wc -c 2> /dev/null) ; uc=$(du -b"$pf" 2> /dev/null | cut -f 1 2> /dev/null) ; tot=$(($tot+cs)) ; echo -e"$cs\t\t$uc\t\t$tot\t\t$pf"  ; done ; echo ; echo"start: $d" ; echo -n"complete:" ; date ; umount /root/squashdu/linuxfs/ ; umount /root/squashdu/iso/ ; rmdir /root/squashdu/linuxfs/ /root/squashdu/iso/ ; rmdir /root/squashdu/ #### license: creative commons cc0 1.0 (public domain) http://creativecommons.org/publicdomain/zero/1.0/ 

this entire post is in the public domain, if you want to flatter me by posting it online somewhere.

oh, and its entirely possible someone else has done this. i havent checked, but i tried a lot of obvious alternatives. if you know another, post it here!
Posts: 1,308
BitJam
Joined: 31 Aug 2009
#2
I'm very impressed with your work. Your results look useful.

As you've noted elsewhere, compressing something that has already been compressed is usually a net loss. I saved space in the live initrd.gz by using uncompressed fonts instead of the standard compressed ones. One way of seeing this: algorithmic information theory tells us that once something is perfectly compressed, it (of course) cannot be compressed further. On top of that, most forms of compression involve the construction of a "code table", which is overhead space-wise. So when you recompress something, you don't make significant savings on the data but you still add a new code table, which is why you often lose.

The reason using tar resulted in much better compression than mksquashfs is that tar can use just one code table. If mksquashfs did this, the resulting file system would be much slower. Tar saves space by doing things wholesale, using one code table, while squashfs is better at doling out files one by one on a retail basis by using many code tables spread out through the file.
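A quick way to see the wholesale-vs-retail difference for yourself (any reasonably compressible file will do; I'm picking /etc/services purely as an example):

Code:

cp /etc/services a ; cp /etc/services b
cat a b | xz -c | wc -c                               # one stream over both copies
echo $(( $(xz -c a | wc -c) + $(xz -c b | wc -c) ))   # two separate streams; the total is noticeably larger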

I'm surprised that your estimate by compressing each file individually is as close as it is to the size of the entire squashfs file. The reason squashfs wins (size-wise) compared to individual compression again goes back to the code tables. It needs to put code tables throughout the squashfs file (for retail speed) but it is not forced to make a code table for every file.

BTW: the construction of code tables also affects the speed of compression. Making a code table is usually very slow compared to compressing or decompressing data once a code table has been generated (which can be very fast). This is why compressing each file individually takes so much longer. I think the settings for better compression translate to working harder at making a better code table.

Something else to look at, perhaps, is compressing each "leaf" directory (a directory that does not contain any subdirectories). I wonder if the sum of these results would end up being closer to the size of the squashfs file. Of course, this won't be very informative for /usr/bin or /usr/lib, but it may give better information for files and directories under /usr/share, which often have a lot of smaller, highly compressible files.
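A rough sketch of what I mean, with the mount point made up:

Code:

# one tar|xz stream per "leaf" directory; files living in non-leaf directories are not counted by this sketch
total=0
while IFS= read -r d; do
    [ -n "$(find "$d" -mindepth 1 -maxdepth 1 -type d)" ] && continue   # skip directories that have subdirectories
    sz=$(tar -cf - -C "$d" . 2>/dev/null | xz -c | wc -c)
    total=$((total + sz))
done < <(find /mnt/linuxfs -type d)
echo "$total"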

PS: I recently built a table with the installed size of every package in Debian/antiX. This took over 6 hours to run!
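For anyone who wants a rough version of that without the 6-hour run, dpkg already records its own Installed-Size per package (it may not match my table exactly, but it is a quick approximation):

Code:

# dpkg's Installed-Size field is in KiB; show the 20 largest packages
dpkg-query -W -f='${Installed-Size}\t${Package}\n' | sort -rn | head -n 20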
Posts: 148
figosdev
Joined: 29 Jun 2017
#3
I'm very impressed with your work. Your results look useful.
im flattered and i hope youre right.

As you've noted elsewhere, compressing something that has already been compressed is usually a net loss. I saved space in the live initrd.gz by using uncompressed fonts instead of the standard compressed ones.
thats awesome and makes perfect sense.

One way of seeing this: algorithmic information theory tells us that once something is perfectly compressed, it (of course) cannot be compressed further. On top of that, most forms of compression involve the construction of a "code table", which is overhead space-wise. So when you recompress something, you don't make significant savings on the data but you still add a new code table, which is why you often lose.
0 byte file uncompressed: 0 bytes. compressed? 32 bytes!
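easy to check, too:

Code:

: > empty.bin            # make a 0-byte file
du -b empty.bin          # 0
xz -c empty.bin | wc -c  # 32 -- pure container overhead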

The reason using tar resulted in much better compression than mksquashfs is that tar can use just one code table.
i figured. i learned about this stuff while it was running, and i was reading the man page/options for xz. it explains more than any man page ive looked at before, and it does it clearly without going too far, i think; id like to shake the hand of the person who wrote it.

btw even though i say "i figured", i wasnt sure, and its nice to have confirmation. also 6 hours! woof.