The first step is to generate some useful statistics regarding the size and distribution of messages on a typical mail system. To this end, I analyzed the logs from over 5 million messages generated or received at an ISP. The message sizes from those logs (as recorded in the "info" lines written by qmail) were grouped into bins based on the base-10 log of their size, rounded to the nearest tenth. The following graph shows the resulting distribution.

The spike at 5.3-5.4 (messages of roughly 200 to 250 KB) corresponds to messages carrying the SirCam virus. The spike at 2.7 (about 500 bytes) is caused by a program, run by a client of the ISP, that generates frequent short status messages. Both are atypical outliers and are ignored for statistical purposes.
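The binning described above can be sketched in a few lines of Python. This is only an illustration of the technique, not the script used for the analysis; the function names and the sample sizes are made up for the example.

```python
import math
from collections import Counter

def log_bin(size_bytes):
    """Return the base-10 log of a message size, rounded to the nearest tenth."""
    return round(math.log10(size_bytes), 1)

def histogram(sizes):
    """Count messages per log10 bin, skipping sizes that have no logarithm."""
    return Counter(log_bin(s) for s in sizes if s > 0)

# Hypothetical sizes in bytes, chosen only to exercise a few bins.
sizes = [501, 2000, 2100, 199526, 251189]
for bin_value, count in sorted(histogram(sizes).items()):
    print(f"{bin_value:.1f} {count}")
```

Plotting the count for each bin against the bin value gives a histogram on a logarithmic size axis, which is the form of the graph above.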
The random function that most closely models the resulting distribution is:

    size = mean / -log(1 - random)

where random is a uniform random function producing results greater than or equal to 0 and less than 1, and mean is the desired mean of the distribution. The following graph overlays the raw data above with simulated data generated by this function with mean := 2000.
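As a minimal sketch, the function could be written in Python as follows. It assumes that log is the natural logarithm, which the text does not state; the name message_size and the fixed seed are choices made for this example. A draw of exactly 0 would make the denominator zero, so the sketch redraws in that case.

```python
import math
import random

def message_size(mean, rng=random):
    """Draw one simulated size using size = mean / -log(1 - random)."""
    while True:
        u = rng.random()                # uniform in [0, 1)
        denom = -math.log(1.0 - u)      # assumption: natural logarithm
        if denom > 0.0:                 # u == 0 gives 0; redraw to avoid dividing by zero
            return mean / denom

rng = random.Random(42)  # fixed seed so that runs are repeatable
samples = sorted(message_size(2000, rng) for _ in range(100_000))
print("median:", round(samples[len(samples) // 2]))
```

Under the natural-log assumption, the median of the generated sizes is mean / ln 2, or about 2885 bytes for mean = 2000. The parameter therefore behaves as a scale factor: the arithmetic average of a sample is dominated by rare, very large values and will not settle near 2000.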

The raw numbers extracted from the qmail log files can be downloaded here. Only the "bytes" column from the "info msg" lines has been reproduced, for brevity and to protect our clients' confidentiality, and the results have been sorted to improve compression.
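For readers who want to build a comparable data set from their own logs, the extraction might look like the sketch below. It assumes the usual qmail-send line layout, "info msg NNN: bytes NNN from <sender> qp NNN uid NNN", optionally preceded by a timestamp. The sample lines are invented for illustration and are not taken from the data set.

```python
import re

# Assumed layout of a qmail-send "info" line; a timestamp prefix may precede it.
INFO_RE = re.compile(r"info msg \d+: bytes (\d+)")

def extract_sizes(lines):
    """Yield the byte count from each "info msg" line; other lines are skipped."""
    for line in lines:
        match = INFO_RE.search(line)
        if match:
            yield int(match.group(1))

# Made-up log lines, for illustration only.
sample_log = [
    "@400000003b4a39c23294b13c new msg 93869",
    "@400000003b4a39c23294b904 info msg 93869: bytes 2343 from <user@example.com> qp 18695 uid 404",
    "@400000003b4a39c2337b2154 starting delivery 1: msg 93869 to local user@example.com",
]
print(sorted(extract_sizes(sample_log)))  # prints [2343]
```

Sorting the extracted sizes, as in the last line, discards the order in which messages arrived and leaves only the size distribution.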