I was recently asked: can eDiscovery systems prove the integrity of the email that they discover? If so, how?
Email differs from loose files in that emails are contained within a database such as Microsoft Exchange or IBM Lotus Notes. Loose files are typically hashed using a mathematical algorithm, and matching hash values show that the data on the source system and the data captured are the same. A message in a mailbox, however, does not directly correspond to a single file that could be hashed. In some cases a mailbox is contained in a single file, such as a PST created on export or when mailbox segmentation is implemented on an IBM Lotus Notes mailbox database, but most often one or more databases contain the messages for many different users. These databases are binary files that cannot be read in a text editor. Furthermore, many systems deduplicate their databases, so that an email held by several mailboxes is stored once and the other copies are replaced with pointers to it.
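To make the loose-file case concrete, here is a minimal sketch of hashing a source file and its captured copy and comparing the results. SHA-256 and the function names `sha256_of_file` and `copies_match` are my own choices for illustration, not the method of any particular eDiscovery product.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source: Path, captured: Path) -> bool:
    """True when the source file and the captured copy hash identically."""
    return sha256_of_file(source) == sha256_of_file(captured)
```

The same approach works for an exported PST, since it is a single file, but it cannot be applied to one message inside a shared mail database.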
So, to answer the question, systems that harvest email must interface with the email system database. They verify the integrity of the email they collect by querying the database for mailbox metadata, such as the message count and byte count, and comparing that metadata with what was actually retrieved, to ensure that all relevant messages were collected and that the byte counts match. Additionally, each message contains a header that identifies the source, destination, timestamp, sender, recipient, and the path the email followed from source to destination. This information is also helpful in validating the email, and a message can be compared to other copies of the same message to confirm the content. For example, a recipient's copy could be compared to the version on the sender's machine, or a message sent to multiple recipients could be compared across the copies retrieved from each recipient's mailbox.
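The two checks described above can be sketched roughly as follows. This is an illustration under assumptions, not how any specific product works: `MailboxStats` is a hypothetical stand-in for whatever counts the mail server reports, messages are assumed to be available as raw RFC 5322 bytes, and the list of headers compared is my own choice. Headers that each mail relay adds, such as `Received`, are deliberately left out, since they legitimately differ between the sender's and the recipient's copy.

```python
from dataclasses import dataclass
from email import message_from_bytes
from email.message import Message
from typing import Iterable


@dataclass(frozen=True)
class MailboxStats:
    """Hypothetical mailbox summary as reported by the mail server."""
    message_count: int
    byte_count: int


def summarize(raw_messages: Iterable[bytes]) -> MailboxStats:
    """Count the messages and total bytes that were actually retrieved."""
    messages = list(raw_messages)
    return MailboxStats(len(messages), sum(len(m) for m in messages))


def collection_is_complete(reported: MailboxStats,
                           raw_messages: Iterable[bytes]) -> bool:
    """True when the retrieved counts equal the counts the server reported."""
    return summarize(raw_messages) == reported


# Assumption: these headers should be identical in every copy of a message.
COMPARED_HEADERS = ("Message-ID", "From", "To", "Date", "Subject")


def body_bytes(msg: Message) -> bytes:
    """Concatenate the decoded payload of every non-container part."""
    return b"".join(
        part.get_payload(decode=True) or b""
        for part in msg.walk()
        if not part.is_multipart()
    )


def same_message(copy_a: bytes, copy_b: bytes) -> bool:
    """Compare two copies of a message by stable headers and body content."""
    first, second = message_from_bytes(copy_a), message_from_bytes(copy_b)
    headers_match = all(
        first.get(name) == second.get(name) for name in COMPARED_HEADERS
    )
    return headers_match and body_bytes(first) == body_bytes(second)
```

In practice the server-reported byte count may be measured differently from the length of an exported message, so a real tool would need to compare like with like. The sketch only shows the shape of the comparison.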