Disk forensics is the science of extracting forensic information from hard disk images. There are a number of standard techniques which PyFlag supports. We use the file pyflag_stdimage_0.1.gz for this tutorial. This file is included in the standard tutorial samples.
One of the most powerful techniques in forensic disk analysis is hash comparison. Typically we load a hash database into PyFlag to quickly classify files. The largest public hash database is the NSRL database maintained by NIST (National Institute of Standards and Technology). NIST makes periodic updates to the database and distributes the data in ISO format.
PyFlag has a utility script which will load the NSRL database into the PyFlag MySQL database:
mic@dell:~/pyflag$ ./utilities/nsrl_load.sh
Usage: nsrl_load.py path_to_nsrl_directory
An NSRL directory is one of the CDs, and usually has in it NSRLFile.txt,NSRLProd.txt.
To load hashes into the database, mount the ISO somewhere and point this script at that location. Note that the NSRL currently contains over 25 million entries and takes several hours to load into PyFlag.
Note: This feature is optional, but skipping this step may change the results of some of the following examples. It is certainly possible to run PyFlag without loading the NSRL. You might want to skip this step if downloading the NSRL poses a problem due to its size.
The acquisition phase is usually where the image is first obtained, either during the execution of a warrant or during the incident response phase. The most common method for acquiring images is to boot the target machine into a Linux operating system, for example using Knoppix or Helix. The Linux kernel will identify the device and make it available via a device node in the /dev/ filesystem. A full discussion of forensic acquisition using a Linux system is outside the scope of this document.
The most common case is when the target disk is an IDE HDD. The user then needs to identify which raw device node the drive is attached to. The following example shows a machine with a CD-RW drive on /dev/hda and an IDE disk on /dev/hdc. The kernel identifies the IDE drive as having 6 partitions, accessible via /dev/hdc1 to /dev/hdc6:
mic@dell:~/pyflag$ dmesg
...
hda: QSI CD-RW/DVD-ROM SBW242U, ATAPI CD/DVD-ROM drive
Using anticipatory io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: IC25N060ATMR04-0, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
hdc: max request size: 1024KiB
hdc: 117210240 sectors (60011 MB) w/7884KiB Cache, CHS=16383/255/63, UDMA(100)
 /dev/ide/host0/bus1/target0/lun0: p1 p2 p3 p4 < p5 p6 >
...
The kernel allows access to the raw HDD through its device nodes. In this example we provide a test HDD image to work on, but in practice the same steps may be taken using the raw device node (for example /dev/hdc, the IDE disk in the listing above) in place of pyflag_stdimage_0.1.
Hard disk drives are typically extremely large, with sizes of several hundred gigabytes being common. Manipulating such large images is very inconvenient, particularly since most systems never use much of the available space, leaving most of the disk full of runs of zeros. For this reason, most forensic packages provide an image file format with some form of compression [1].
PyFlag supports a convenient format called sgzip (seekable gzip), based on the gzip compressor. This format is designed to be used in a variety of forensic applications. In this example, we will use sgzip to acquire the image:
mic@dell:~$ ~/pyflag/bin/sgzip < pyflag_stdimage_0.1 > /var/tmp/demo/image.sgz
Wrote 300 blocks of 32768 bytes = 9 Mb total
There are many variations on the above command line, e.g.:
dd if=pyflag_stdimage_0.1 | sgzip > /var/tmp/demo/image.sgz
sgzip pyflag_stdimage_0.1 && mv pyflag_stdimage_0.1.sgz /var/tmp/demo/
Note that it is still possible to use sgzip to image over the network. Imaging over the network is a useful technique when the target machine is located remotely or it is not possible to physically remove its disks. The standard way for remote imaging is:
ssh root@target dd if=/dev/hdc > image.dd
Where target is the remote machine to be imaged, with the desired disk being /dev/hdc. The image will be encrypted over the ssh tunnel and be written to a file called image.dd.
The sgzip format is a stream format which can be sent over pipes. It is most useful in conjunction with ssh, to compress the image as it is being acquired remotely:
ssh root@target dd if=/dev/hdc | ~/pyflag/bin/sgzip > image.sgz
The above technique does not require anything special to be installed on the remote machine, other than a functioning Unix-like kernel which provides access to the raw device [2]. The sgzip compression is performed on the local machine (the one initiating the ssh operation).
Once the image has been acquired, it may be loaded into PyFlag. Loading an image performs some initial analysis on it.
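The idea behind a seekable compressed image can be sketched in a few lines of Python. This is only an illustration of block-wise compression, not the actual sgzip file format: the function names are hypothetical, and the only detail taken from the text is the 32768-byte block size reported by sgzip above. Each block is compressed independently, so a read at an arbitrary offset only needs to decompress the blocks it touches.

```python
import zlib

BLOCK_SIZE = 32768  # block size reported by sgzip in the example above


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data in independent fixed-size blocks; return the list of blocks."""
    blocks = []
    for start in range(0, len(data), block_size):
        blocks.append(zlib.compress(data[start:start + block_size]))
    return blocks


def read_at(blocks, offset, length, block_size=BLOCK_SIZE):
    """Read `length` bytes at `offset`, decompressing only the blocks needed."""
    out = bytearray()
    first = offset // block_size
    last = (offset + length - 1) // block_size
    for i in range(first, min(last, len(blocks) - 1) + 1):
        out += zlib.decompress(blocks[i])
    skip = offset - first * block_size
    return bytes(out[skip:skip + length])


# A mostly empty "disk": long runs of zeros followed by a little data.
image = b"\x00" * (10 * BLOCK_SIZE) + b"evidence"
blocks = compress_blocks(image)
compressed_size = sum(len(b) for b in blocks)
```

Because the sample image is almost entirely zeros, compressed_size is a small fraction of len(image), while read_at(blocks, 10 * BLOCK_SIZE, 8) still returns the trailing data without decompressing the first ten blocks.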
In this fictitious example, we suspect Tony Pistone of killing Don Vitto, the famous godfather. Here is what we know:
First we create a new case to store the analysis data in.
We now want to let PyFlag know that the image we wish to use is an sgzip image, located in the upload directory. PyFlag can handle many different types of hard disk images natively, without needing to convert them to a single native format. PyFlag has a number of drivers for different formats, which present a unified data source to the application as a whole. The concept of a data source is central to PyFlag.
After selecting the type of the datasource, and the parameters required, PyFlag allows the user to name the source. Later, during the analysis, it is possible to always refer to the image by that name.
Note that we select sgzip as the image driver, and the image file is taken from the upload directory. The offset in this image is 0 since the image is that of a single partition. If the image were of the entire hard disk, the offset would need to be calculated from the partition table.
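For a whole-disk image, the offset calculation mentioned above can be sketched as follows. This is a minimal example that assumes a classic MBR partition table and 512-byte sectors; the function name is hypothetical, and the text does not state whether PyFlag expects the offset in bytes or in sectors, so the sketch returns bytes and leaves any conversion to the reader.

```python
import struct

SECTOR_SIZE = 512  # assumption: the usual sector size for IDE disks


def partition_offsets(mbr):
    """Return (partition number, type byte, start offset in bytes) for each
    non-empty primary partition entry in the first sector of a disk."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR: boot signature missing")
    result = []
    for i in range(4):
        # The four 16-byte partition entries start at byte 446.
        entry = mbr[446 + 16 * i:446 + 16 * (i + 1)]
        ptype = entry[4]
        # Bytes 8-11: starting LBA, bytes 12-15: sector count (little-endian).
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0 and num_sectors != 0:
            result.append((i + 1, ptype, start_lba * SECTOR_SIZE))
    return result
```

In practice the first 512 bytes of the disk image would be read from the file and passed to partition_offsets. Logical partitions inside an extended partition (p5 and p6 in the dmesg listing earlier) are chained through further tables and are not handled by this sketch.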
Finally we name the source as test.
When PyFlag loads a new image it does the following:
In the above figure we can see PyFlag's Load FileSystem menu. We are able to choose the IO Source to use, the scanners that will be invoked, and finally the filesystem driver to use. PyFlag uses the magic signature of the filesystem to suggest which filesystem driver is most appropriate (in this case the ext2 driver).
Note: The filesystem hint is useful for indicating whether PyFlag has been given a valid IO source. For example, if we have entered the offset incorrectly when selecting an IO source, the magic will not match any known filesystem and PyFlag's hint would be data.
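The kind of signature check behind such a hint can be sketched for ext2. This is not PyFlag's own detection code (PyFlag relies on a general magic database); it is a small illustration that relies on the ext2 on-disk layout, where the superblock starts 1024 bytes into the partition and holds the 16-bit little-endian magic number 0xEF53 at byte 56. The function name is hypothetical.

```python
import struct

EXT2_SUPERBLOCK_OFFSET = 1024  # superblock starts 1024 bytes into the partition
EXT2_MAGIC_OFFSET = 56         # position of s_magic inside the superblock
EXT2_MAGIC = 0xEF53


def guess_filesystem(image, offset=0):
    """Return "ext2" if an ext2 superblock magic is found at `offset`, else "data"."""
    pos = offset + EXT2_SUPERBLOCK_OFFSET + EXT2_MAGIC_OFFSET
    if len(image) >= pos + 2:
        (magic,) = struct.unpack_from("<H", image, pos)
        if magic == EXT2_MAGIC:
            return "ext2"
    return "data"
```

With the correct offset the function returns "ext2"; with a wrong offset the two bytes read are unrelated data and the result falls back to "data", which mirrors the behaviour described in the note above. The same magic number is also used by ext3 and ext4, so a match alone does not distinguish between them.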
The following is a brief overview of some of the more important available Scanners. For a full discussion of each scanner, consult the PyFlag manual:
For our example, we shall choose the Linux ext2 driver and, for now, accept the default scanners. Once we submit this form, the terminal will display some detailed progress information:
Current thread is Thread-1
Set file to read from as /var/tmp/demo/test_image.dd.sgz
Will shell out to run /home/mic/pyflag/pyflag/..//bin//dbtool -t test -d create blah
Will shell out to run /home/mic/pyflag/pyflag/..//bin//iowrapper -i sgzip -o filename=/var/tmp/demo/test_image.dd.sgz,offset=0 /home/mic/pyflag/pyflag/..//bin//dbtool -t test -f linux-ext2 test
found thread Thread-1
found thread Thread-1
Set file to read from as /var/tmp/demo/test_image.dd.sgz
Loading Directory Entries
Loading Inode Entries
Loaded 200 of 2560 Inodes
...
Loaded 2200 of 2560 Inodes
Loaded 2400 of 2560 Inodes
f?case=pyflag
Debug: Will invoke the following scanners: [<TypeScan.TypeScan instance at 0x405b19ec>, <Unallocated.DeletedScan instance at 0x405b15ec>, <HashComparison.MD5Scan instance at 0x405b132c>, <IEIndex.IEIndex instance at 0x405b152c>, <RegScan.RegistryScan instance at 0x405b136c>, <PstFile.PstScan instance at 0x405b1eec>, <LogicalIndex.Index instance at 0x405b1cac>]
Will shell out to run /home/mic/pyflag/pyflag/..//bin//pasco -t test -g create
Debug: Handling inode D12 = /NTUSER.DAT, mime type: application/x-winnt-registry, magic: Windows NT registry file
Will shell out to run /home/mic/pyflag/pyflag/..//bin//regtool -f /var/tmp/results//case_demo/test_D12 -t reg_test -p '/NTUSER.DAT'
We can see PyFlag loading the filesystem inodes first, then reporting which scanners will be used. The inodes which are handled are then shown with their types. In the above example, when PyFlag encounters a "Windows NT registry file", it invokes regtool to load the registry contained in that file.
Once the loading is finished, we are able to "Analyse this data".
First we would like to browse the file system to see what PyFlag has deduced about it. The figure below shows the virtual filesystem as it initially appears.
We can see a number of interesting files. The first file to take our interest is called "DonVittos_private_key.txt". We wish to see the content of this file, so we click on the inode link (D23). Sometimes it is better to right click the link and select "Open in new tab" to follow a line of investigation without disturbing the original one. This way we can close the new tabs when we have finished with them without losing our place in the current tab. Note also that we can highlight entries in tables by clicking on them, which can be used to mark the files we have already looked at.
The "View File" screen is divided into tabs, with the first tab showing statistics about the file. We wish to see the contents of the file, hence we click on the Hexdump tab. This is shown below.
The file appears to be a DSA private key. It is likely that Tony used this key to gain access to Don Vitto's Linux laptop and was able to view his calendar. This explains how Tony knew about the family meeting. To confirm this is the Don's private key we can download the file and take it to the Don's laptop to check. By clicking on the download tab, our browser allows us to save the file onto our USB key.
We notice some more files, mainly image files which we can view. We find an interesting photo of Linus Torvalds trying on some virtual reality gear, as well as many disturbing photos of Linus getting dunked. This discovery is telling us that Tony is a Linux fan as well, and a collector of quality photography.
We want to closely examine all documents on the system. Clicking on a folder icon lets us navigate the tree, whereas clicking on one of the directory names takes us to a table view of the filesystem. Although the tree view is more intuitive for browsing the filesystem, the table view is far more powerful, since it allows us to search the filesystem:
The table view is divided into a number of interesting areas. At the top of the table we see the configure and save icons. The configure icon allows us to hide some columns in the table (in case it is too wide to fit on the screen at once), and the save icon allows us to export the contents of the table in CSV format. This is most useful after applying some filters, so that we are left with a small subset of the total data.
We see that it is possible to enforce filters on the data, to restrict the view to a small subset of the total data. In the above case we only see files which have "/Documents and Settings" in their path. Note that the % character is the wildcard.
We can click on column headers to sort by that column. Note that the column used for sorting is highlighted. Each row has either a white or a gray background, and rows with a common value in the sorted column share the same colour. This makes it easy for us to visually inspect similar rows.
Scrolling down to the bottom of the screen shows the group by links and search text entry boxes. The group by links allow us to count the number of unique entries in a given column, while the text searches allow us to add a new filter condition to the table.
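The filter and group by operations of the table widget correspond closely to SQL LIKE and GROUP BY clauses. The sketch below illustrates this with Python's built-in sqlite3 module standing in for PyFlag's MySQL backend; the table name, column names and sample rows are made up for the illustration and are not PyFlag's real schema.

```python
import sqlite3

# Illustrative rows only: the inode strings, paths and types are invented.
ROWS = [
    ("D12", "/NTUSER.DAT", "registry"),
    ("D23", "/DonVittos_private_key.txt", "text"),
    ("D40", "/Documents and Settings/Administrator/notes.txt", "text"),
    ("D41", "/Documents and Settings/Administrator/report.doc", "document"),
]


def build_table(rows=ROWS):
    """Create an in-memory table loosely resembling a file listing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (inode TEXT, path TEXT, type TEXT)")
    conn.executemany("INSERT INTO file VALUES (?, ?, ?)", rows)
    return conn


def filter_paths(conn, pattern):
    """Filter on the path column; % matches any run of characters."""
    rows = conn.execute(
        "SELECT path FROM file WHERE path LIKE ? ORDER BY path", (pattern,))
    return [row[0] for row in rows]


def group_by_type(conn):
    """Count the rows for each distinct value of the type column."""
    return dict(conn.execute("SELECT type, COUNT(*) FROM file GROUP BY type"))


conn = build_table()
```

Calling filter_paths(conn, "/Documents and Settings%") keeps only the rows under that directory, filter_paths(conn, "%.doc") keeps file names ending in .doc, and group_by_type(conn) returns one count per file type.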
Since we are searching for documents, we add a filter which displays only filenames ending with .doc, by typing .doc in the text entry below the Filename column. We may also clear the original filter (requiring the files to have "/Documents and Settings" in the path) by clicking on the filter condition.
We find a single hit. The file we find has a name of /Documents and Settings/Administrator/outlook.pst/Sent Items/Email:2097604/document.doc and an inode of D1285|P2097604:1.
What does this strange inode mean? PyFlag has a powerful recursive virtual filesystem (VFS) model. This means that inodes are strings which indicate how a file should be read. PyFlag has many drivers for different types of virtual files, and the output from each driver can be passed (piped) to the input of other drivers.
When PyFlag encounters an inode such as that shown above, it performs the following steps in order to read the file:
The overall result is that we are able to transparently view the Word attachment which was sent in an email stored in the "Sent Items" folder in the "outlook.pst" file. Since this looks like a regular file to PyFlag, we are able to do whatever we want with it transparently: keyword index it, open it, virus scan it, and so on.
In this case we wish to view this Word document. We click on download and view the document in OpenOffice.org. We have found the secret document Tony mailed to his accomplice.
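The way such an inode string chains drivers together can be sketched as follows. This is an illustration rather than PyFlag's implementation: it assumes only what the inode strings themselves show, namely that stages are separated by | and that each stage consists of a one-letter driver name followed by an argument. The function names and the drivers mapping are hypothetical.

```python
def parse_inode(inode):
    """Split a VFS inode string into a list of (driver letter, argument) stages."""
    stages = []
    for part in inode.split("|"):
        if not part:
            raise ValueError("empty stage in inode %r" % inode)
        stages.append((part[0], part[1:]))
    return stages


def open_inode(inode, drivers, source):
    """Pipe `source` through each stage in turn.

    `drivers` maps a driver letter to a callable taking (data, argument) and
    returning the data handed to the next stage.
    """
    data = source
    for letter, argument in parse_inode(inode):
        data = drivers[letter](data, argument)
    return data
```

For the inode above, parse_inode("D1285|P2097604:1") yields the stages ("D", "1285") and ("P", "2097604:1"): the output of the first driver becomes the input of the second, which is what allows arbitrarily deep nesting.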
We have just seen that PyFlag can treat virtual files as real files, and we can operate on them as though they were really there. The question posed now is how do these files get there in the first place.
When PyFlag loads a case, it scans the filesystem with a variety of scanners. The scanners discover new virtual files as they analyse specific files, and add those to the VFS. After they are added, it is possible to search those virtual files transparently.
We suspect that Tony is hiding things inside compressed files. Normally scanning of compressed files is disabled, due to the processing overhead it may impose, but since this is a small filesystem, we would like to scan the entire filesystem again, this time for zip files and viruses.
We browse the filesystem again, and click on the root directory (/). We now click on the magnifying glass symbol at the top, and a popup window appears asking us which scanners we would like to run on this directory and its subdirectories.
In this case, we choose to scan for viruses, zip and gz files. The screen refreshes and we see an updated version of the filesystem.
The updated view shows virtual directories for the zip files, with the files inside them as virtual files. Note that the inodes assigned to these virtual files look like D15|Z1, indicating that the Sleuthkit inode 15 is passed through the Zip VFS driver, which extracts file 1 from it. It is now possible to view individual files in the archive as if they were real files.
We know that Tony has many secrets. We need to find all occurrences of such secrets in the image. A common forensic technique is keyword searching, so we will search for the keyword "secret".
Since keyword searching is a separate report, we click on the home button to get back to the main menu, and select Search Indexed Keywords. We type the word "secret" and assuming we had that word in our dictionary (see preparation section above), we find these hits:
Note that hits were found in very unusual places. For example, we find a hit in Inode D1285|P2097316:0 which represents the body of an email. We also find it in D1285|P2097412:1|Z0 representing a zip file which was attached to an email in Tony's Inbox.
We can see that the VFS model gives us recursiveness and a great reach. The reach allows us to reach deep into successively packed files and decode deeply nested content. If we were to simply perform a grep on the image we would not have made any of these hits at all. This is because the PST file is encoded, while the rest of the data is compressed.
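The difference between a plain grep and a recursive search can be demonstrated with Python's standard zipfile module. The sketch below is only an illustration of the principle, not PyFlag's scanner code; the helper names, the sample text and the starting label "D15" are made up, and the "|Z<n>" suffixes merely mimic the inode notation used above.

```python
import io
import zipfile


def make_zip(name, content):
    """Build a zip archive in memory holding a single compressed member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, content)
    return buf.getvalue()


def search_recursive(data, keyword, path):
    """Return the paths at which `keyword` occurs, descending into zip archives."""
    hits = []
    if keyword in data:
        hits.append(path)
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for index, name in enumerate(zf.namelist()):
                member_path = "%s|Z%d" % (path, index)
                hits += search_recursive(zf.read(name), keyword, member_path)
    return hits


text = b"the secret meeting is at noon. " * 50
inner = make_zip("notes.txt", text)
outer = make_zip("inner.zip", inner)  # a zip file packed inside another zip file
```

A direct byte search, b"secret" in outer, finds nothing because the text is stored in compressed form, whereas search_recursive(outer, b"secret", "D15") reports a hit two levels down.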
PyFlag uses ClamAV as a virus scanner. If we chose the virus scanner, we will now be able to see the hits made for viruses.
Note: This depends on having an up to date virus pattern file. ClamAV can be updated by running the freshclam script periodically. This script may be in a separate package from the main engine, so check your distribution's packages.
The virus scan reports that one virus was detected: "Trojan.NTRootKit.044", a Windows NT rootkit. Clearly Tony indulges in the practice of hacking in his spare time!
Internet Explorer maintains a set of history files recording the URLs users have browsed to. This list helps determine users' browsing habits. When PyFlag comes across an IE History file, it loads it into a central table, which makes browsing of IE history easy.
In this case we shall look at the history files found on this image. We click on the home link, and then select IE history. We can find Eddie's (Tony's brother) history file.
In order to see what terms Eddie has searched for, we can restrict the URLs to those with the word google in them. As can be seen in the figure below, Eddie searched for real vnc, possibly as a means of hacking into Don Vitto's computer.
A very powerful forensic technique is hash comparison. This technique involves comparing the MD5 hash of each file with a large database of hashes (in our case the NSRL). This comparison allows for the positive identification of each file.
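Once the history URLs are in a table, pulling out the search terms themselves is straightforward. The sketch below uses Python's standard urllib.parse module and assumes that Google search URLs carry the query in a q parameter; the function name and the sample URLs in the usage note are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def google_search_terms(urls):
    """Return the search terms found in the Google URLs of a history list."""
    terms = []
    for url in urls:
        parsed = urlparse(url)
        if "google" in parsed.netloc:
            # parse_qs decodes "+" and %xx escapes back into plain text.
            terms.extend(parse_qs(parsed.query).get("q", []))
    return terms
```

Given a history containing a URL such as http://www.google.com/search?q=real+vnc&hl=en, the function returns the decoded term "real vnc" and ignores URLs from other hosts.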
The NSRL is simply a means of identifying each file. There is no determination of whether the file is good or bad. That determination is purely subjective and depends on the case.
For example, in some cases having the pgp program installed on a machine may indicate evil doings. Or having MSVC++ installed on a SOE may indicate it was used to compile exploits etc.
The determination of which software is appropriate needs to be made with the case in mind. Do not assume that just because Back Orifice is installed the suspect did anything wrong, as they may have been using it legitimately.
The usual hash comparison exercise follows this pattern:
The NSRL is a powerful tool which allows us to reduce the number of suspect files we need to analyse, making our job more efficient.
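The core of a hash comparison can be sketched with Python's hashlib module. This is a simplified illustration, not PyFlag's MD5Scan scanner: the known-hash table is a small in-memory dictionary rather than the NSRL database, and the function names and sample data are hypothetical.

```python
import hashlib


def md5_of(data):
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def classify(files, known_hashes):
    """Split files into those identified by the hash database and the rest.

    `files` maps a path to the file's contents; `known_hashes` maps an MD5 hex
    digest to a description of the file it identifies.
    """
    known = {}
    unknown = []
    for path, data in files.items():
        digest = md5_of(data)
        if digest in known_hashes:
            known[path] = known_hashes[digest]
        else:
            unknown.append(path)
    return known, sorted(unknown)
```

The files left in the unknown list are the ones that still need manual attention. As noted above, a match only identifies a file; whether its presence matters depends on the case.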
In this case we know that Tony is an avid photographer, so we shall look for all files which have the word "image" in their types:
We can click on the link to see those Exif files in the image (Exif images are typically those taken by digital cameras):
As we can see, there is a file named /_deleted_/D12 which we have not seen before. The _deleted_ virtual directory collects all the files which have allocated inodes but no directory entries. These are typically files which have been deleted, but whose inode structure is still intact.
The deleted file shows Caesars Palace, Las Vegas. Could this have been taken by Tony prior to the hit?
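The rule used to populate the _deleted_ directory amounts to a set difference between the inodes that exist and the inodes that some directory entry refers to. The sketch below illustrates the idea with plain Python data structures; the function name and the sample inode numbers in the usage note are hypothetical, and a real filesystem driver would read both lists from the image.

```python
def orphaned_inodes(allocated_inodes, directory_entries):
    """Return the inodes which no directory entry points at.

    `allocated_inodes` is an iterable of inode numbers; `directory_entries`
    maps each path in the filesystem to the inode number it refers to.
    """
    referenced = set(directory_entries.values())
    return sorted(set(allocated_inodes) - referenced)
```

For example, with inodes 2, 11, 12 and 23 present and directory entries pointing only at 2, 11 and 23, the function returns [12], which would then be presented as /_deleted_/D12.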
We view the hexdump of this image:
The photo shows the familiar signature of a digital camera, namely a Digital Camera FinePix 3800 Ver1.00 made by FUJIFILM. We know that Tony owns this exact same camera. The photo was taken on 2003:07:30 13:17:32, the day of Don Vitto's death.
This deleted file places Tony at the scene of the crime. After confronting Tony with this evidence, he broke down in tears claiming that he is an apprentice hit man, sent by Carlo Rizzi: "It's curtains, curtains for me!!!" said Tony.
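Camera signatures like this stand out in a hexdump because they are stored as plain text inside the binary file. A minimal way to pull such text out automatically, similar in spirit to the Unix strings utility, is sketched below; the function name and the sample bytes are made up, and a proper Exif parser would be needed to interpret the fields reliably.

```python
import re


def printable_strings(data, min_len=4):
    """Return the runs of at least `min_len` printable ASCII characters in `data`."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.group().decode("ascii") for match in re.finditer(pattern, data)]


# A made-up fragment resembling the start of a camera JPEG.
sample = b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00FUJIFILM\x00FinePix\x002003:07:30 13:17:32\x00"
```

Running printable_strings(sample) returns the marker, maker, model and timestamp strings in the order in which they appear in the data.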
We have seen how to use PyFlag on a fictitious investigation. We have learned how to use the table widget, which recurs in many areas of PyFlag. Proper use of the table widget allows us to perform powerful searches, leading to fast, efficient investigations.
We have seen how the VFS extends the reach of the forensic investigator, making it possible to locate files carried within other files recursively. We are able to perform complex pre-indexed keyword searches inside files which are normally encoded such that the keywords do not appear in their raw byte representation (e.g. compressed files).