
Disk Forensics

Disk forensics is the science of extracting forensic information from hard disk images. There are a number of standard techniques which PyFlag supports. We use the file pyflag_stdimage_0.1.gz for this tutorial. This file is included in the standard tutorial samples.


One of the most powerful techniques in forensic disk analysis is hash comparison. Typically we load a hash database into PyFlag to quickly classify files. The largest public hash database is the NSRL (National Software Reference Library) database maintained by NIST (National Institute of Standards and Technology). NIST makes periodic updates to the database and distributes the data in ISO format.

PyFlag has a utility script which will load the NSRL database into the PyFlag MySQL database:

mic@dell:~/pyflag$ ./utilities/
Usage: path_to_nsrl_directory

An NSRL directory is one of the CDs, and usually contains NSRLFile.txt and NSRLProd.txt.

To load hashes into the database, mount the ISO somewhere and point this script at that location. Note that the NSRL currently contains over 25 million entries, and takes several hours to load into PyFlag.


This feature is optional, and skipping this step may change some of the following examples. It is certainly possible to run PyFlag without loading the NSRL. You might want to skip this step if downloading the NSRL poses a problem due to its size.


The acquisition phase is usually where the image is first obtained, during the execution of a warrant or the incident response phase. The most common method for acquiring images is to boot the target machine into a Linux operating system, for example using Knoppix or Helix. The Linux kernel will identify the device and make it available via a device node in the /dev/ filesystem. A full discussion of forensic acquisition using a Linux system is outside the scope of this document.

The most common case is when the target disk is an IDE HDD. The user then needs to identify which raw device node the drive is attached to. The following example shows a machine with a CD-RW drive on /dev/hda and an IDE disk on /dev/hdc. The kernel identifies the IDE drive as having 6 partitions, accessible via /dev/hdc1 to /dev/hdc6:

mic@dell:~/pyflag$ dmesg
Using anticipatory io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: IC25N060ATMR04-0, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
hdc: max request size: 1024KiB
hdc: 117210240 sectors (60011 MB) w/7884KiB Cache, CHS=16383/255/63, UDMA(100)
  /dev/ide/host0/bus1/target0/lun0: p1 p2 p3 p4 < p5 p6 >

The kernel allows access to the raw HDD through these device nodes. In this example we provide a test HDD image to work on, but in practice the same steps may be taken using the raw device node (/dev/hdc in the listing above) in place of pyflag_stdimage_0.1.

Hard disk drives are typically extremely large, with sizes of several hundred gigabytes being common. Manipulating such large images is very inconvenient, particularly since most systems never use much of the available space, leaving most of the disk full of runs of zeros. For this reason, most forensic packages provide an image file format with some form of compression [1].

[1] Standard stream compressors like zip or gzip are inappropriate for this task, since they make it too slow to seek in the stream, having to decompress the entire stream from the start each time. Typically, forensic compression formats rely on compressing small blocks, so random seek/read operations can be completed in a short time.
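To illustrate why compressing small blocks keeps seeking cheap, here is a minimal Python sketch. It is not the sgzip file format: the function names and the in-memory list of blocks are assumptions made for illustration only. Each block is compressed independently, so a read at an arbitrary offset decompresses only the blocks it touches.

```python
import zlib

BLOCK_SIZE = 32768  # illustrative block size


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data in independent blocks and return the compressed blocks."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_at(blocks, offset, length, block_size=BLOCK_SIZE):
    """Read `length` bytes at `offset`, decompressing only the blocks needed."""
    out = bytearray()
    while length > 0:
        index, skip = divmod(offset, block_size)
        if index >= len(blocks):
            break  # read past the end of the image
        chunk = zlib.decompress(blocks[index])[skip:skip + length]
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        length -= len(chunk)
    return bytes(out)
```

A single gzip stream would instead have to be decompressed from the beginning to reach the same offset.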

PyFlag supports a convenient format called sgzip (seekable gzip), based on the gzip compressor. This format is designed to be used in a variety of forensic applications. In this example, we will use sgzip to acquire the image:

mic@dell:~$ ~/pyflag/bin/sgzip < pyflag_stdimage_0.1 > /var/tmp/demo/image.sgz
Wrote 300 blocks of 32768 bytes = 9 Mb total

There are many variations on the above command line, e.g.:

dd if=pyflag_stdimage_0.1 | sgzip > /var/tmp/demo/image.sgz


sgzip pyflag_stdimage_0.1 && mv pyflag_stdimage_0.1.sgz /var/tmp/demo/

It is also possible to use sgzip to image over the network. Imaging over the network is a useful technique when the target machine is located remotely or it is not possible to physically remove its disks. The standard way to image remotely is:

ssh root@target dd if=/dev/hdc > image.dd

Here target is the remote machine to be imaged, and the desired disk is /dev/hdc. The image is encrypted in transit through the ssh tunnel and written to a file called image.dd.

The sgzip format is a stream format which can be sent over pipes. It is most useful in conjunction with ssh, to compress the image as it is being acquired remotely:

ssh root@target dd if=/dev/hdc | ~/pyflag/bin/sgzip > image.sgz

The above technique does not require anything special to be installed on the remote machine, other than a functioning Unix-like kernel which provides access to the raw device [2]. The sgzip compression operation is performed on the local machine (the one initiating the ssh operation).

[2] The remote machine does not need to run Linux in that case; for example it can be an HP-UX or Solaris box, although the raw devices may then be named with a different convention.

Loading the image into PyFlag

Once the image has been acquired, it may be loaded into PyFlag. Loading an image performs some initial analysis on it.


In this fictitious example, our suspect (Tony Pistone) is believed to have killed Don Vitto, the famous godfather. Here is what we know:

  • Don Vitto was killed outside Caesars Palace in Las Vegas, on July 30, 2003.
  • Tony claims he has never been in Las Vegas, let alone near the palace, in his entire life.
  • An important family meeting was taking place in the palace at the time. We don't know how the suspect found out about it.
  • We acquired Tony's laptop and stored the image in sgzip format.

First we create a new case to store the analysis data in.

We now want to let PyFlag know that the image we wish to use is an sgzip image, located in the upload directory. PyFlag can handle many different types of hard disk images natively, without needing to convert them to a single native format. PyFlag has a number of drivers for different formats, which present a unified data source to the application as a whole. The concept of a data source is central to PyFlag.

After selecting the type of the datasource, and the parameters required, PyFlag allows the user to name the source. Later, during the analysis, it is possible to always refer to the image by that name.


PyFlag Load IO Source dialog.

Note that we select the image driver as sgzip, and the image file is taken from the upload directory. The offset in this image is 0 since the image is that of a partition. If the image were of the entire hard disk, the offset would need to be calculated from the partition table.
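As a rough illustration of calculating an offset from the partition table, the following Python sketch reads the four primary entries of a classic MBR. It assumes 512-byte sectors and does not follow extended partitions (such as p5 and p6 in the dmesg listing earlier); the function name is hypothetical and not part of PyFlag.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size


def partition_offsets(mbr):
    """Return (type, byte_offset) for each used primary entry in an MBR sector."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    result = []
    for i in range(4):
        entry = mbr[446 + 16 * i:446 + 16 * (i + 1)]
        # Entry layout: status(1) CHS-start(3) type(1) CHS-end(3)
        #               LBA-start(4, little-endian) sector-count(4)
        ptype = entry[4]
        lba_start, = struct.unpack_from("<I", entry, 8)
        if ptype != 0:  # type 0 marks an unused entry
            result.append((ptype, lba_start * SECTOR_SIZE))
    return result
```

The byte offset returned for the partition of interest is the value that would be entered in the offset field.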

Finally we name the source as test.

When PyFlag loads a new image it does the following:

  1. The image filesystem is loaded into the database. This loads information about all the inodes and files in the image.
  2. Each file in the filesystem is scanned using all the scanners that were chosen by the user.
  3. If scanners discover new files or add virtual files to the Virtual File System (VFS), these virtual files are also scanned by the same scanners that were chosen previously.


PyFlag's Load filesystem menu

In the above figure we can see PyFlag's Load FileSystem menu. We are able to choose the IO Source to use, the scanners that will be invoked, and finally the filesystem driver to be used. PyFlag uses the magic signature of the filesystem to hint which filesystem driver is most appropriate (in this case the ext2 driver).


The filesystem hint is useful for indicating whether PyFlag is given a valid IO source. For example, if we have entered the offset incorrectly when selecting an IO source, the magic will not match any known filesystems and PyFlag's hint would be data.
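The idea behind such a hint can be sketched in Python as follows. This is not PyFlag's implementation: the function name and the "ntfs" label are assumptions, and only two well-known signatures are checked. The point is that a wrong offset makes the signature checks fail, so the result falls back to data.

```python
import struct


def filesystem_hint(image, offset=0):
    """Guess the filesystem that starts at `offset` in `image` (bytes)."""
    part = image[offset:offset + 2048]
    # ext2/ext3: superblock at byte 1024, 16-bit magic 0xEF53 at byte 56 in it
    if len(part) >= 1024 + 58:
        magic, = struct.unpack_from("<H", part, 1024 + 56)
        if magic == 0xEF53:
            return "linux-ext2"
    # NTFS: OEM identifier "NTFS    " at byte 3 of the boot sector
    if part[3:11] == b"NTFS    ":
        return "ntfs"
    return "data"
```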

The following is a brief overview of some of the more important available Scanners. For a full discussion of each scanner, consult the PyFlag manual:

scan file and record file type (magic)
This Scanner records a magic file type (determined by file header) for each file in the filesystem.
Create VFS nodes for deleted files.
The Scanner searches for deleted files and adds them to a virtual directory called _deleted_. Deleted files are files which have an intact inode, and therefore we know their block allocation, but which are not mentioned in any directory inode. Hence we do not know their original names (since names are stored in directory inodes).
Scan file for viruses
All files are scanned for viruses using ClamAV. Note that you must have clamav installed with a recent pattern file for this to work.
scan file and record file Hash (MD5Sum)
This scanner calculates the MD5 sum of each file, and compares it to the NSRL database.
Load in IE History files
This analyses IE history files encountered on the filesystem.
Load in Windows Registry files
This analyses Windows registry files encountered on the filesystem. Note that registry files may appear in many places, e.g. registry backups, user's Local Machine hives etc.
Recurse into Pst Files
This analyses Outlook PST files encountered on the filesystem. As these files are analysed, virtual files and directories are created for emails and attachments, which are scanned in turn by the other scanners.
Recurse into gziped files
This decompresses gzip files, and creates VFS entries for their data.
Recurse into Zip Files
This decompresses Zip files, and creates VFS entries for all files in the archive. (Note that this may cause the loading to be very slow if there are lots of zip files on the hard disk.)
Scan unallocated space for files.
Unallocated space is space between allocated files. This scanner creates VFS nodes for contiguous runs of unallocated space. These spaces are then carved for other files within them, by searching for header/footer combinations. For example images are carved from unallocated space.
Keyword Index files
This scanner indexes each file it finds for keywords loaded in the dictionary. Note that you must have a dictionary preloaded in order for this to yield useful results.
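The header/footer carving mentioned for the unallocated space scanner can be sketched in Python. This is a simplified illustration, not PyFlag's carver: it looks only for the JPEG start marker (FF D8 FF) and end marker (FF D9), and the function name and size limit are assumptions.

```python
JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"


def carve_jpegs(data, max_size=10 * 1024 * 1024):
    """Yield (start, end) byte ranges in `data` that look like JPEG files."""
    pos = 0
    while True:
        start = data.find(JPEG_HEADER, pos)
        if start < 0:
            return
        end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER), start + max_size)
        if end < 0:
            pos = start + 1  # header with no footer in range; keep looking
            continue
        end += len(JPEG_FOOTER)
        yield start, end
        pos = end
```

A real carver has to cope with fragmented files and with footer bytes that occur inside the image data, which this sketch ignores.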

For our example, we shall choose the Linux ext2 driver and, for now, leave the default scanners selected. Once we submit this form, the terminal will display some detailed progress information:

Current thread is Thread-1
Set file to read from as /var/tmp/demo/test_image.dd.sgz
Will shell out to run /home/mic/pyflag/pyflag/..//bin//dbtool
 -t test -d create blah
Will shell out to run /home/mic/pyflag/pyflag/..//bin//iowrapper
 -i sgzip -o filename=/var/tmp/demo/test_image.dd.sgz,offset=0
 /home/mic/pyflag/pyflag/..//bin//dbtool -t test -f linux-ext2
found thread Thread-1
found thread Thread-1
Set file to read from as /var/tmp/demo/test_image.dd.sgz
Loading Directory Entries
Loading Inode Entries
Loaded 200 of 2560 Inodes
Loaded 2200 of 2560 Inodes
Loaded 2400 of 2560 Inodes
Debug: Will invoke the following scanners: [<TypeScan.TypeScan
 instance at 0x405b19ec>, <Unallocated.DeletedScan instance at
 0x405b15ec>, <HashComparison.MD5Scan instance at 0x405b132c>,
 <IEIndex.IEIndex instance at 0x405b152c>, <RegScan.RegistryScan
 instance at 0x405b136c>, <PstFile.PstScan instance at 0x405b1eec>,
 <LogicalIndex.Index instance at 0x405b1cac>]
Will shell out to run /home/mic/pyflag/pyflag/..//bin//pasco -t test -g create
Debug: Handling inode D12 = /NTUSER.DAT, mime type: application/x-winnt-registry,
 magic: Windows NT registry file
Will shell out to run /home/mic/pyflag/pyflag/..//bin//regtool
 -f /var/tmp/results//case_demo/test_D12 -t reg_test -p 

We can see PyFlag loading the filesystem inodes first, then reporting which scanners will be used. The inodes which are handled are then shown with their types. We can see in the above example that when PyFlag encountered a "Windows NT registry file", it invoked regtool to load the registry in that file.

Once the loading is finished, we are able to "Analyse this data".

Analysing Filesystem Data

First we would like to browse the file system to see what PyFlag has deduced about it. The figure below shows the virtual filesystem as it initially appears.


Initial virtual filesystem view

We can see a number of interesting files. The first file to catch our interest is called "DonVittos_private_key.txt". We wish to see the content of this file, so we click on the inode link (D23). Sometimes it is better to right click the link and select "Open in new tab" to follow a line of investigation without disturbing the original one. This way the new tabs can be closed when we have finished with them, without losing our place in the current tab. Note also that we can highlight entries in tables by clicking on them, which can be used to mark the files we have already looked at.

The "View File" screen is divided into tabs, with the first tab showing statistics about the file. We wish to see the contents of the file, hence we click on the Hexdump tab. This is shown below.


Hexdump display of a file's contents.

The file appears to be a DSA private key. It is likely that Tony used this key to gain access to Don Vitto's Linux laptop and was able to view his calendar. This explains how Tony knew about the family meeting. To confirm this is the Don's private key we can download the file and take it to the Don's laptop to check. By clicking on the download tab, our browser allows us to save the file onto our USB key.

We notice some more files, mainly image files which we can view. We find an interesting photo of Linus Torvalds trying on some virtual reality gear, as well as many disturbing photos of Linus getting dunked. This discovery is telling us that Tony is a Linux fan as well, and a collector of quality photography.

We want to closely examine all documents on the system. Clicking on the folder icon lets us navigate the tree, while clicking on one of the directory names takes us to a table view of the filesystem. Although the tree view is more intuitive for browsing the filesystem, the table view is far more powerful, since it allows us to search the filesystem:


The table view is divided into a number of interesting areas. At the top of the table we see the configure and save icons. The configure icon allows us to hide some columns in the table (in case it is too wide to fit on the screen at once), and the save icon allows us to export the contents of the table in CSV format. This is most useful after applying some filters, so we are left with a small subset of the total data.

We see that it is possible to enforce filters on the data, to restrict the view to a small subset of the total data. In the above case we only see files which have "/Documents and Settings" in their path. Note that the % character is the wildcard.

We can click on column headers to sort by that column. Note that the sorted-by column is highlighted. Each row has either a white or gray background, with rows that share a common value in the sorted column given the same colour. This makes it easy for us to visually inspect similar rows.

Scrolling down to the bottom of the screen shows the group by links and search text entry boxes. The group by links allow us to count the number of unique entries in a given column, while the text searches allow us to add a new filter condition to the table.
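Conceptually, the filter and group by operations behave like SQL LIKE and GROUP BY queries. The Python sketch below uses an in-memory SQLite table as a stand-in; the table name, columns, file types and the /notes.txt row are made up for illustration and do not reflect PyFlag's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (inode TEXT, path TEXT, type TEXT)")
# Made-up sample rows, loosely based on files mentioned in this tutorial
conn.executemany("INSERT INTO file VALUES (?, ?, ?)", [
    ("D12", "/NTUSER.DAT", "registry"),
    ("D23", "/DonVittos_private_key.txt", "text"),
    ("D1285", "/Documents and Settings/Administrator/outlook.pst", "pst"),
    ("D30", "/notes.txt", "text"),
])

# Filter: % is the wildcard, as in the table widget
filtered = conn.execute(
    "SELECT inode, path FROM file WHERE path LIKE ? ORDER BY path",
    ("%/Documents and Settings%",)).fetchall()

# Group by: count how many rows share each value of a column
counts = conn.execute(
    "SELECT type, COUNT(*) FROM file GROUP BY type ORDER BY type").fetchall()
```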

Since we are searching for documents, we add a filter that displays only filenames ending with .doc, by typing .doc in the text entry below the Filename column. We may also clear the original filter (requiring the files to have "documents and settings" in the path) by clicking on the filter condition.


Searching the filesystem for all files ending with .doc

We find a single hit. The file we find has a name of /Documents and Settings/Administrator/outlook.pst/Sent Items/Email:2097604/document.doc and an inode of D1285|P2097604:1.

What does this strange inode mean? PyFlag has a powerful recursive virtual filesystem (VFS) model. This means that inodes are strings which indicate how a file should be read. PyFlag has many drivers for different types of virtual files, and the output from each driver can be passed (piped) to the input of other drivers.

When PyFlag encounters an inode such as that shown above, the following steps are done to be able to read this file:

  1. Take the driver that is registered as D (the Sleuthkit driver), and pass it the special inode number 1285 (the Sleuthkit understands this as an inode number).
  2. Open this file.
  3. Find the driver that is registered as P (the PST file driver), and provide it with the previously opened file as input.
  4. Ask the pst driver to open inode 2097604:1 - a special number the driver uses to refer to a particular object in the pst file.
  5. Take the output from this driver and display it in the GUI.
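The chaining described in these steps can be sketched in Python. The function names and the driver callables are hypothetical; PyFlag's real drivers are considerably more involved. Each stage of the inode string names a driver by its first letter and passes the rest as that driver's argument.

```python
def parse_inode(inode):
    """Split an inode string such as 'D1285|P2097604:1' into stages."""
    stages = []
    for part in inode.split("|"):
        if not part:
            raise ValueError("empty stage in inode %r" % inode)
        stages.append((part[0], part[1:]))  # (driver letter, argument)
    return stages


def open_inode(inode, drivers, source):
    """Feed the output of each stage's driver into the next, left to right."""
    stream = source
    for letter, argument in parse_inode(inode):
        stream = drivers[letter](stream, argument)
    return stream
```

Here drivers is assumed to be a mapping from a letter to a callable taking the previous stream and an argument, so adding support for a new container format only means registering one more entry.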

The overall result is that we are able to transparently view the Word attachment which was sent in an email stored in the "Sent Items" folder in the "outlook.pst" file. Since this looks like a regular file to PyFlag, we are able to do whatever we want with it transparently: keyword index it, open it, virus scan it, etc.

In this case we wish to view this Word document. We click on download and view the document. We have found the secret document that Tony mailed to his accomplice.

The Virtual File System and Scanners

We have just seen that PyFlag can treat virtual files as real files, and we can operate on them as though they were really there. The question posed now is how do these files get there in the first place.

When PyFlag loads a case, it scans the filesystem with a variety of scanners. The scanners discover new virtual files as they analyse specific files, and add those to the VFS. After they are added, it is possible to search those virtual files transparently.

We suspect that Tony is hiding things inside compressed files. Normally scanning of compressed files is disabled, due to the processing overhead it may impose, but since this is a small filesystem, we would like to scan the entire filesystem again, this time for zip files and viruses.

We browse the filesystem again, and click on the root directory (/). We now click on the magnifying glass symbol at the top, and a popup window appears asking us which scanners we would like to run on this directory and its subdirectories.

In this case, we choose to scan for viruses, zip and gz files. The screen refreshes and we see an updated version of the filesystem.


Updated VFS showing newly discovered zip archives as virtual directories.

The updated view shows virtual directories for the zip files, and the files within them as virtual files. Note that the special inodes assigned to the virtual files look like D15|Z1, indicating that the Sleuthkit inode 15 is passed through the Zip VFS driver and file 1 is extracted from it. It is now possible to view individual files in the archive as if they were real files.

Searching for keywords

We know that Tony has many secrets. We need to find all occurrences of such secrets in the image. A common forensic technique is keyword searching, which we use here to locate the keyword secret.

Since keyword searching is a separate report, we click on the home button to get back to the main menu, and select Search Indexed Keywords. We type the word "secret" and assuming we had that word in our dictionary (see preparation section above), we find these hits:


A keyword search for the word secret

Note that hits were found in very unusual places. For example, we find a hit in Inode D1285|P2097316:0 which represents the body of an email. We also find it in D1285|P2097412:1|Z0 representing a zip file which was attached to an email in Tony's Inbox.

We can see that the VFS model gives us recursiveness and a great reach. The reach allows us to reach deep into successively packed files and decode deeply nested content. If we were to simply perform a grep on the image we would not have made any of these hits at all. This is because the PST file is encoded, while the rest of the data is compressed.
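A small Python experiment shows why a raw search misses compressed content. The sample text is made up, and zlib stands in for whatever compression a container on the disk actually uses.

```python
import zlib

plain = b"The secret meeting is at the palace. " * 4
packed = zlib.compress(plain)

# What a plain grep over the raw image would see
found_raw = b"secret" in packed

# What a search sees once the data has been decoded
found_decoded = b"secret" in zlib.decompress(packed)
```

The keyword is only found after decompression, which is why the scanners decode nested files before indexing them.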

Virus Scanning

PyFlag uses ClamAV as a virus scanner. If we selected the virus scanner, we are now able to see the hits made for viruses.


This depends on having an up-to-date virus pattern file. ClamAV can be updated by running the freshclam script periodically. This script may be in a separate package from the main engine, so check your distribution's packages.

The virus scan shows that one virus was detected: "Trojan.NTRootKit.044", a Windows NT rootkit. Clearly Tony indulges in the practice of hacking in his spare time!

IE History

Internet Explorer maintains a list of history files and the URLs users browsed to. This list helps determine users' browsing habits. When PyFlag comes across an IE History file, it loads it into a central table, which makes browsing the IE history easy.

In this case we shall look at the history files found on this image. We click on the home link, and then select IE history. We can find Eddie's (Tony's brother) history file.

In order to see what terms Eddie has searched for, we can restrict the URLs to those with the word google in them. As can be seen in the figure below, Eddie searched for real vnc, possibly as a means of hacking into Don Vitto's computer.


NSRL and Hash comparisons

A very powerful forensic technique is hash comparison. This technique involves comparing the MD5 hash of each file with a large database of hashes (in our case the NSRL). This comparison allows for the positive identification of each file.
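In outline, hash comparison amounts to a dictionary lookup keyed on the MD5 digest. The Python sketch below uses a tiny made-up hash set in place of the NSRL; the function names and the product name are hypothetical.

```python
import hashlib


def md5_of(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def classify(files, known_hashes):
    """Map each file name to the product its MD5 matches, or 'Unknown'."""
    return {name: known_hashes.get(md5_of(content), "Unknown")
            for name, content in files.items()}


# Made-up stand-in for a hash database: digest -> product name
known = {md5_of(b"stock system file"): "Example OS 1.0"}
files = {"/system.dll": b"stock system file", "/photo.jpg": b"private photo"}
result = classify(files, known)
```

Files that end up as Unknown are the ones that deserve closer examination.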


The NSRL is simply a means of identifying each file. There is no determination of whether the file is good or bad. That determination is purely subjective and depends on the case.

For example, in some cases having the PGP program installed on a machine may indicate wrongdoing, or having MSVC++ installed on an SOE (standard operating environment) machine may indicate that it was used to compile exploits.

The determination of which software is appropriate needs to be made with the case in mind. Do not assume that just because Back Orifice is installed the suspect did anything wrong; they may have been using it legitimately.

The usual hash comparison exercise follows this pattern:

  1. Click the group by NSRL Product to see how many files are installed from each potential product.
  2. Note that many products ship identical files, so click on each product to see which files of that product are actually present on the image.
  3. Now we should have a good idea which software packages are installed, even if they were later uninstalled but left remnants behind (quite common with Windows software).
  4. We are usually left with an Unknown category of those files which did not match any hash. These files need to be examined closer, so we click on the Unknown category in the group by screen.
  5. Now we group by file type. This counts how many files there are of each file type.
  6. We can now search for those file types of interest, for example we can see all file types with the word "executable" in their types, which were not identified by the NSRL.

The NSRL is a powerful tool that lets us reduce the number of suspect files we need to analyse, making our job more efficient.

In this case we know that Tony is an avid photographer, so we shall look for all files which have the word "image" in their types:


Searching the image for those files that were not identified by NSRL which are also images.

We can click on the link to see those Exif files in the image (Exif images are typically those taken by digital cameras):


As we see, there is a file named /_deleted_/D12 which we have not seen before. The _deleted_ virtual directory collects all the files which have allocated inodes, but no directory entries. These are typically files which have been deleted, but whose inode structure is still intact.

The deleted file shows Caesars Palace, Las Vegas. Could this have been taken by Tony prior to the hit?

We view the hexdump of this image:


Hexdump of the photo of Caesars Palace found on Tony's drive.

The photo shows the familiar signature of a digital camera, namely a Digital Camera FinePix 3800 Ver1.00 made by FUJIFILM. We know that Tony owns this exact same camera. The photo was taken on 2003:07:30 13:17:32, the day of Don Vitto's death.
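Instead of reading the hexdump by eye, timestamps in the Exif text format (YYYY:MM:DD HH:MM:SS) can be located with a simple byte search. The Python sketch below is a crude illustration that does not parse the Exif structure at all; the function name and the sample bytes are made up.

```python
import re

# Exif stores dates as ASCII text in the form 'YYYY:MM:DD HH:MM:SS'
EXIF_TIME = re.compile(rb"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}")


def exif_timestamps(data):
    """Return Exif-style timestamp strings found anywhere in raw bytes."""
    return [m.group().decode("ascii") for m in EXIF_TIME.finditer(data)]


# Made-up bytes imitating the start of a camera JPEG
sample = (b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00FUJIFILM\x00"
          b"FinePix\x002003:07:30 13:17:32\x00")
stamps = exif_timestamps(sample)
```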

This deleted file places Tony at the scene of the crime. After being confronted with this evidence, he broke down in tears, claiming that he is an apprentice hit man sent by Carlo Rizzi: "It's curtains, curtains for me!!!" said Tony.


We have seen how to use PyFlag on a fictitious investigation. We have learned how to use the table widget which repeats in many areas in PyFlag. Proper use of the table widget allows us to perform powerful searches, leading to fast efficient investigations.

We have seen how the VFS extends the reach of the forensic investigator, who can locate files carried within other files recursively. We are able to perform complex pre-indexed keyword searching inside files which are normally encoded such that the keywords do not appear in their bitwise representation (e.g. compressed files).