Classic Shell Scripting - Arnold Robbins [163]
Example 10-2. Finding matching file contents
#! /bin/sh -
# Show filenames with almost-certainly identical
# contents, based on their MD5 checksums.
#
# Usage:
# show-identical-files files
IFS='
'
PATH=/usr/local/bin:/usr/bin:/bin
export PATH
md5sum "$@" /dev/null 2> /dev/null |
awk '{
count[$1]++
if (count[$1] == 1) first[$1] = $0
if (count[$1] == 2) print first[$1]
if (count[$1] > 1) print $0
}' |
sort |
awk '{
if (last != $1) print ""
last = $1
print
}'
Here is what its output looks like on a GNU/Linux system:
$ show-identical-files /bin/*

2df30875121b92767259e89282dd3002  /bin/ed
2df30875121b92767259e89282dd3002  /bin/red

43252d689938f4d6a513a2f571786aa1  /bin/awk
43252d689938f4d6a513a2f571786aa1  /bin/gawk
43252d689938f4d6a513a2f571786aa1  /bin/gawk-3.1.0
...
We can conclude, for example, that ed and red are identical programs on this system, although they may still vary their behavior according to the name that they are invoked with.
Files with identical contents are often links to each other, especially when found in system directories. show-identical-files provides more useful information when applied to user directories, where it is less likely that files are links and more likely that they're unintended copies.
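To tell whether two files reported as identical are links to each other rather than separate copies, one can compare their inode numbers. The following is a minimal sketch and not part of show-identical-files: the function name same_file is hypothetical, and the comparison is meaningful only when both files reside on the same filesystem.

```sh
#! /bin/sh -
# same_file FILE1 FILE2
# Succeed if both names refer to the same underlying file, judged by
# inode number. ls -L follows symbolic links, so a symbolic link and
# its target compare as the same file.
# Assumption: both files are on the same filesystem, because inode
# numbers are unique only within one filesystem.
same_file() {
    inode1=$(ls -Ldi -- "$1" 2> /dev/null | awk '{ print $1 }')
    inode2=$(ls -Ldi -- "$2" 2> /dev/null | awk '{ print $1 }')
    [ -n "$inode1" ] && [ "$inode1" = "$inode2" ]
}
```

With this function defined, a test such as same_file /bin/ed /bin/red && echo linked would report whether the two names share one file; a nonzero exit status for a pair with matching checksums suggests that the files are separate copies.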
Digital Signature Verification
The various checksum utilities provide a single number that is characteristic of the file, and is unlikely to be the same as the checksum of a file with different contents. Software announcements often include checksums of the distribution files so that you have an easy way to tell whether the copy that you just downloaded matches the original. However, checksums alone do not provide verification: if the checksum were recorded in another file that you downloaded with the software, an attacker could have maliciously changed the software and simply revised the checksum accordingly.
The solution to this problem comes from public-key cryptography, where data security is obtained from the existence of two related keys: a private key, known only to its owner, and a public key, potentially known to anyone. Either key may be used for encryption; the other is then used for decryption. The security of public-key cryptography lies in the belief that knowledge of the public key, and text that is decryptable with that key, provides no practical information that can be used to recover the private key. The great breakthrough of this invention was that it solved the biggest problem in historical cryptography: secure exchange of encryption keys among the parties needing to communicate.
Here is how the private and public keys are used. If Alice wants to sign an open letter, she uses her private key to encrypt it. Bob uses Alice's public key to decrypt the signed letter, and can then be confident that only Alice could have signed it, provided that she is trusted not to divulge her private key.
If Alice wants to send a letter to Bob that only he can read, she encrypts it with Bob's public key, and he then uses his private key to decrypt it. As long as Bob keeps his private key secret, Alice can be confident that only Bob can read her letter.
It isn't necessary to encrypt the entire message: instead, if just a file checksum is encrypted, then one has a digital signature. This is useful if the message itself can be public, but a way is needed to verify its authenticity.
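The sign-then-verify exchange between Alice and Bob can be sketched with a few shell functions. This sketch swaps in the openssl command purely for illustration, because it can create a throwaway key pair in one step; it is not the tool that this section goes on to use, and every function and file name here is hypothetical. Conceptually, sign_file plays Alice's role (a checksum of the file is signed with the private key) and verify_file plays Bob's (the signature is checked with the public key).

```sh
#! /bin/sh -
# Illustration only: uses the openssl command as a stand-in tool;
# all function and file names are hypothetical.

# make_keypair PRIVATE PUBLIC
# Create an RSA private key and extract the matching public key.
make_keypair() {
    openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
        -out "$1" 2> /dev/null &&
    openssl pkey -in "$1" -pubout -out "$2"
}

# sign_file PRIVATE FILE SIGNATURE
# Compute the SHA-256 checksum of FILE and sign it with the private
# key, writing the result to SIGNATURE and leaving FILE unchanged.
sign_file() {
    openssl dgst -sha256 -sign "$1" -out "$3" "$2"
}

# verify_file PUBLIC FILE SIGNATURE
# Succeed only if SIGNATURE was made from FILE's current contents
# with the private key that matches PUBLIC.
verify_file() {
    openssl dgst -sha256 -verify "$1" -signature "$3" "$2" > /dev/null 2>&1
}
```

Because only the checksum is signed, the file itself stays readable by anyone, yet any change to its contents after signing causes verify_file to fail.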
Several tools for public-key cryptography are implemented in the GNU Privacy Guard[13] (GnuPG) and Pretty Good Privacy[14] (PGP) utilities. A complete description of these packages requires an entire book; see Chapter 16. However, it is straightforward to use them for one important task: verification of digital signatures. We illustrate only GnuPG here, since it is under active development and it builds more easily and on more platforms than PGP.
Because computers are increasingly under attack, many software archives now include digital signatures that incorporate