This post focuses on using the Linux tool ‘dd’ in the forensic imaging process, along with two other tools that are directly derived from dd and one that is similar in functionality. It also briefly covers the issue of data completeness when preparing to forensically acquire a device.
What is DD?
Data Definition (dd) is a command-line tool primarily used on Unix-like operating systems. It serves a very simple yet useful purpose: to copy data from a specified source to a specified destination. Typically, this is done bit-by-bit, regardless of any file systems or operating systems that may be present.
The dd program is installed by default in most GNU/Linux distributions as part of a package called ‘coreutils’. However, its derivatives, which are shown later in this post, may need to be installed manually. Where necessary, I will include more details about the manual installation of these tools as they are mentioned.
Owing to the many implementations of the Linux operating system, it is not uncommon to find dd installed on non-standard devices, such as those running Android. In addition, the dd tools can be implemented over a network using utilities like cryptcat. However, this post will be focusing on traditional storage device imaging using the dd tool(s) in a local, non-live environment.
Linux Block Devices
Because everything in Linux is technically treated as a file, dd can interact with a plethora of data. Among the most important are the “special” files in Linux, such as block devices like ‘/dev/sda’.
These block devices are of particular interest to us, as they can represent physical drives attached to your host system, ranging from hard drives to optical drives and even NVMe devices. Drives attached to a Linux host system will be assigned a special device file in the ‘/dev/’ directory by the kernel. The naming convention for common device files includes:
Where X is a letter of the alphabet starting from ‘a’ (or a number starting from 0, for floppy drives), denoting the order of the devices. For example, a primary SATA hard drive which boots into Linux will be ‘/dev/sda’, while a secondary SATA hard drive will be ‘/dev/sdb’.
In addition, these block devices will often have similar files denoting partitions for each drive, which is usually done by appending a number to the block device. For example, the first partition of your primary SATA hard drive will be ‘/dev/sda1’ (typically the ‘boot’ partition). However, for the purposes of this post, I am only going to be focusing on the raw device files for the drives themselves and not those of the partitions.
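To make the partition naming concrete, here is a minimal sketch that splits an ‘sdX’-style path into its parent disk and partition number. The function name ‘split_sd_name’ is made up for illustration, and the sketch assumes only the ‘/dev/sdXN’ convention described above; other schemes, such as NVMe names, follow a different pattern and are not handled.

```shell
#!/bin/sh
# Sketch: split an sdX-style device path into parent disk and partition
# number. Only the /dev/sdXN convention is handled here.
split_sd_name() {
    part_path=$1
    # Strip any trailing digits to get the whole-disk device file.
    disk=$(printf '%s\n' "$part_path" | sed 's/[0-9]*$//')
    # Whatever was stripped is the partition number (empty for a whole disk).
    number=${part_path#"$disk"}
    printf 'disk=%s partition=%s\n' "$disk" "${number:-none}"
}

split_sd_name /dev/sda1   # first partition of the first disk
split_sd_name /dev/sdb    # whole second disk, no partition suffix
```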
In a Forensic Environment
In the context of digital forensic investigation, dd and its derivatives can be used to read data from the device file of an attached drive and write this data to a raw image file. Bear in mind that the data you acquire from a device such as a hard drive may not necessarily be complete (see Data completeness section below). The resulting raw image file can then be easily imported into an appropriate analysis suite, or interrogated with other command-line tools.
I personally prefer to use Linux for performing digital forensics whenever I can and I find dd, along with its variants, to be invaluable tools. I would highly recommend using low-level command-line tools like dd to better understand the forensic imaging process before utilizing the well-known commercial tools.
Data Completeness
Understanding the issue of data completeness is fundamental in the forensic acquisition phase, especially when dealing with Hard Disk Drives (HDDs). There are caveats to consider when using imaging tools such as dd, one of them being that they do not have access to ALL of the data stored on a device. In most cases, this relates to ‘hidden areas’ commonly found on hard drives, which are typically inaccessible to the Operating System or the BIOS.
The two most common ‘hidden areas’ of a hard drive are known as the Host Protected Area (HPA) and the Device Configuration Overlay (DCO). The HPA was implemented to allow manufacturers to store diagnostic, monitoring and recovery tools on-disk, whilst the DCO was introduced in ATA-6 and is used by manufacturers to change features between drive models and/or alter the observable capacity of the disk. These two hidden areas are simply sectors on the drive which have been specified as ‘protected’ by the drive’s configuration.
The third and perhaps most important ‘area’ to be aware of is referred to as the Service Area, or System Area. This area can occupy a significant part of the drive’s total capacity and is used to store information such as:
- SMART data
- Defective sector lists (P/G lists)
- Firmware code
- ATA passwords
- Servo information
Accessing the data contained in the Service Area is normally only possible via vendor-owned proprietary commands. However, Todd Shipley demonstrated via a proof-of-concept that it is possible to write data to this area, which has interesting implications for anti-forensics.
It is vital to not only be aware of these hidden areas, but to try to account for them when conducting an investigation. Some commercial forensic software will take measures to deal with these areas, although more so with the HPA/DCO than with the Service Area. It is not uncommon for such software, or even certain write blockers, to check for the existence of HPA/DCO areas before an image is acquired.
There are techniques a Linux-based user (like me) can utilize to ‘remove’ the HPA/DCO areas from a hard drive, and even deal with the Service Area to an extent. However, I will cover this in more detail in a future post.
Finally, while I am on the topic of data completeness, the USB flash drive I will be using to show off the functionality of the dd tools (see Testing Preparation section below) has similar issues. All USB devices contain a hierarchy of data known as ‘descriptors’ which are used to provide information to the host system, mainly to determine appropriate driver(s). This information includes, but is not limited to:
- Vendor and Manufacturer data
- USB device type
- Supported USB versions
- Configuration details
- Serial number of the device
- Number of endpoints
The primary descriptor found on USB drives is known as the ‘device descriptor’, which encompasses the entire device and is at the top of the hierarchy. Although the device descriptor contains forensically relevant information about the USB drive, this data is stored in the Read-Only Memory (ROM) chip and will NOT be imaged when using tools like dd. Luckily, on Linux at least, there are techniques and tools which allow us to read this data, which again, I will cover in a future post alongside the HDD hidden areas.
Testing Preparation
I am going to be using the tools on a Linux system to take an image of an unmounted 1GB USB flash drive, without any additional hardware. In a forensic environment, the drive being imaged would ideally be connected to an appropriate hardware write blocker to preserve data integrity, along with any other procedures being taken to account for data completeness.
You may wonder why the USB device I will be connecting to the Linux host system is assigned a block device with the naming convention ‘/dev/sdX’, considering the device is not connected through a SATA/SCSI interface. This is because USB mass storage devices are driven through the kernel’s SCSI layer. As of kernel version 3.15, Linux also supports a protocol called USB Attached SCSI (UAS) to facilitate the reading/writing of data to USB mass storage devices. In either case, the SCSI command set is used for communicating with the USB device, which is why the block device uses the SCSI naming convention. You can see this process in the ‘dmesg’ output when the USB device is connected to the host system in Figure 1 below:
It is very important to note before I continue that dd is very unforgiving, especially if you enter the incorrect source and destination values. Therefore, if you are unfamiliar with dd I would highly recommend you run it in a controlled environment first, lest you risk corrupting or destroying your data. Always ensure you know how the tools and commands work before you run them in a live environment!
As mentioned previously, the standard dd tool is installed by default on most GNU/Linux distributions under the ‘coreutils’ package.
Using dd on the Linux command-line is very simple and given the block device we want to image is ‘/dev/sdb’, a typical dd command might look like this:
if (input file): This is our source file, which in this case is the block device associated with the USB device (See Figure 1).
of (output file): The destination file, which will be a raw image file consisting of the accessible data on the USB device.
bs (block size): This specifies the size of the data blocks to be copied from the input file in bytes. If this option is not specified, it will default to a block size of 512, analogous to the traditional sector size on a hard drive. In this case, I used a block size of 4k (4096), which was optimal for my setup. Larger block sizes are used for efficiency purposes but I would recommend using smaller block sizes where possible, because if you encounter read errors, you risk zero-filling readable data on a larger block size.
conv (convert): This option is vital if you run the dd command against a disk you suspect of having ‘bad’ or ‘defective’ blocks/sectors. Normally, the dd tool will abruptly terminate the command if a read error is encountered from the source drive, which the ‘noerror’ parameter prevents. However, you will also need to use the ‘sync’ option in conjunction with ‘noerror’, which will pad any unreadable ‘bad’ blocks with zeros in the output file. Bear in mind that should this occur, the resulting image will not match the original drive when hashes are calculated for each. To counter this, you can calculate hashes in specified intervals using the ‘dcfldd’ tool (see DCFLDD section below).
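As a minimal sketch of how these four options fit together, the following uses a small regular file as a stand-in for ‘/dev/sdb’ so that it can be run safely without touching a real drive. The file names are hypothetical; on a real system, ‘if=’ would point at the block device itself.

```shell
#!/bin/sh
# Stand-in for the evidence drive: a 1 MiB file of random data.
# On a real system SRC would be the block device, e.g. /dev/sdb.
SRC=demo_source.img
IMG=demo_image.dd          # hypothetical output name

dd if=/dev/urandom of="$SRC" bs=4k count=256 2>/dev/null

# The imaging command itself, using the options described above.
dd if="$SRC" of="$IMG" bs=4k conv=noerror,sync

# With no read errors and a source size that is a multiple of bs,
# the image is byte-for-byte identical to the source.
cmp "$SRC" "$IMG" && echo "image matches source"
```

Because the stand-in source is an exact multiple of the 4k block size, ‘sync’ has nothing to pad here; on a source with unreadable or short blocks, the padding is what causes the image hash to differ from the original.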
I personally do not use traditional dd for forensic imaging; however, it is very useful when extracting key excerpts of data from a drive. For example, the following dd command will extract the first 512 bytes of the accessible data, known as the Master Boot Record (MBR):
count: This specifies how many blocks, whose size we define with ‘bs’, are to be extracted. In this case, I only required a single block, starting at the beginning of the accessible data. This particular block of data is also referred to as the MBR ‘boot sector’ (0x55aa signature), which contains partition and Operating System information, as well as boot code used by the BIOS.
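A minimal sketch of this kind of extraction is shown below. It builds a tiny stand-in ‘disk’ file with the 0x55 0xAA signature written at bytes 510 and 511, then pulls out the first 512-byte block with ‘bs=512 count=1’. The file names are hypothetical; against a real drive, ‘if=’ would be the block device such as ‘/dev/sdb’.

```shell
#!/bin/sh
# Build a 4-sector stand-in "disk" whose first sector ends with the
# 0x55 0xAA boot signature (bytes 510 and 511).
SRC=demo_disk.img
dd if=/dev/zero of="$SRC" bs=512 count=4 2>/dev/null
printf '\125\252' | dd of="$SRC" bs=1 seek=510 conv=notrunc 2>/dev/null

# Extract the first 512-byte block only. On a real system: if=/dev/sdb
dd if="$SRC" of=mbr.bin bs=512 count=1 2>/dev/null

# Print the last two bytes of the extracted sector in hex.
tail -c 2 mbr.bin | od -An -tx1
```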
A few other common parameters used with dd, along with their function, are described as follows:
skip=X: Where X is a number, this option will exclude X blocks, of the size set by ‘bs’, from the start of the input file. For example, if an input file of 100 blocks is imaged with ‘skip=1’, the resulting output file will be 99 blocks in size, having excluded the first block.
conv=sparse: This option should generally be used to save space on the file system, as any zeroed blocks in the output file won’t be written to disk. See Arch Wiki for more details on sparse files.
status=progress: This option will cause dd to show periodic transfer statistics, such as the number of bytes copied, the elapsed time and the data transfer rate. It is typically used for convenience, but can also help determine optimal block sizes via transfer rates.
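The 100-block example for the skip option can be checked with a short sketch like the one below, which uses a zero-filled stand-in file rather than a real device. The file names are hypothetical.

```shell
#!/bin/sh
# 100 blocks of 512 bytes as the input file.
dd if=/dev/zero of=demo_100.img bs=512 count=100 2>/dev/null

# Skip the first input block; the remaining 99 blocks are copied.
dd if=demo_100.img of=demo_skip.img bs=512 skip=1 2>/dev/null

# Report the output size in 512-byte blocks.
echo "$(( $(wc -c < demo_skip.img) / 512 )) blocks"
```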
DCFLDD
The first tool I will cover that has been forked from the ‘dd’ project is called ‘dcfldd’, which was developed by the Department of Defense Computer Forensic Lab and is considered to be an enhanced version of traditional dd. It boasts notable improvements over the original such as:
- Multiple output file support
- Hashing during data transfer
- Split output file support
- Log file support
- In-built status progress
Bear in mind that dcfldd does not support any output format other than raw, meaning this tool cannot be used to output to forensic formats such as AFF, EWF, E01, etc. In addition, this tool should not be used when dealing with disks you suspect of containing defective sectors, due to a known issue in the tool itself.
Most of the common Linux distributions include dcfldd in their core repositories, so it can be installed very easily from the command line. For a list of commands to help install dcfldd, please check for the appropriate distribution here. Note that Arch Linux and CentOS distributions will require additional repositories to be set up before dcfldd can be installed.
When using dcfldd on the Linux command-line, given the block device we want to image is ‘/dev/sdb’, a typical command would look like this:
The bs and conv parameters have not changed from their usage in dd; please refer to the previous demonstration of dd for more details on these options.
of (output file): In Figure 4, I have specified a second output file option, with a different file name, meaning I end up with two identical images of the source file (/dev/sdb). This may not be useful when dealing with very large datasets, however it does allow an examiner to save an image to different locations if necessary.
hash: This option selects the algorithm, in this case SHA-256, to be used when calculating a cryptographic hash of the input and output files. The hashing algorithms MD5, SHA-1, SHA-256, SHA-384 and SHA-512 are currently supported within dcfldd. I would not recommend using MD5 or SHA-1, as practical collision attacks have been demonstrated against both.
hashwindow: As mentioned previously, this option will calculate a hash of the data in specified intervals, in this case, every 100MB of data. This can be seen in the output file specified with the option below.
hashlog: A good example of the logging functionality of dcfldd, this option designates a separate file which will store the calculated hashes we specified previously. As shown in Figure 4, this file contains a hash value for each 100MB of data, including the value for the whole data at the end. To check that this last hash value matched the source device, I ran sha256sum against the block device.
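The sketch below shows one way these options could be combined, again against a small stand-in file instead of ‘/dev/sdb’. The file names and the 256 KiB window (given in bytes) are illustrative choices only, and the dcfldd command runs only if the tool is installed. The loop underneath is not dcfldd itself; it is a rough coreutils-only approximation of what windowed hashing computes, namely one hash per window followed by a hash of the whole input.

```shell
#!/bin/sh
# Stand-in source: 1 MiB of random data (a real run would use /dev/sdb).
SRC=demo_hash_source.img
dd if=/dev/urandom of="$SRC" bs=4k count=256 2>/dev/null

if command -v dcfldd >/dev/null 2>&1; then
    # Two output files, SHA-256 per 262144-byte window, hashes sent to a log.
    dcfldd if="$SRC" of=copy1.dd of=copy2.dd bs=4k \
        hash=sha256 hashwindow=262144 hashlog=dcfldd_hash.log
fi

# Rough equivalent of the windowed hashing using only coreutils:
# hash each 256 KiB window in turn, then hash the whole input.
window=262144
size=$(( $(wc -c < "$SRC") ))
i=0
: > window_hashes.txt
while [ $(( i * window )) -lt "$size" ]; do
    h=$(dd if="$SRC" bs="$window" skip="$i" count=1 2>/dev/null \
        | sha256sum | cut -d' ' -f1)
    echo "window $i: $h" >> window_hashes.txt
    i=$(( i + 1 ))
done
echo "total: $(sha256sum < "$SRC" | cut -d' ' -f1)" >> window_hashes.txt
cat window_hashes.txt
```

The value of per-window hashes is that a single unreadable region only invalidates the hash of the window that contains it, leaving the remaining windows verifiable.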
The dcfldd tool contains many other options and I would recommend reading through the man page if you want to take full advantage of its functionality. Like traditional dd, dcfldd also contains the options ‘count’, ‘skip’ and ‘status’, except that the status option operates with a simple on/off parameter instead (e.g. status=on).
Like dd, I do not personally use dcfldd for forensic acquisition, primarily due to the reported issues it has with defective sectors. However, the ability to calculate a hash value at specified intervals can prove very useful in some circumstances.
The second derivative of dd that I am covering is ‘dc3dd’, which was developed by the Department of Defense Cyber Crime Center. It is very syntactically and functionally similar to the previous tool dcfldd. However, there are some slight differences between the two, the most notable being that the ‘conv=noerror,sync’ option and the progress bar are built into dc3dd by default. Additionally, dc3dd allows for automatic hash verification, which is a very useful feature not found in the other dd tools.
Again, most of the common Linux distributions include dc3dd in their core repositories, so it can be installed very easily from the command line. For a list of commands to help install dc3dd, please check the appropriate distribution here. Note that CentOS will require either the EPEL, Repoforge or CERT Forensic repositories to be set up beforehand.
On the Linux command-line, dc3dd offers plenty of options for forensic examiners. Given the block device we want to image is ‘/dev/sdb’, a typical dc3dd command would look like this:
hof (hash output file): This option will calculate a hash of the specified output file, as well as compare this value to the one calculated for the input file. Should the hashes match, the command will output ‘[ok]’ next to the hash values in STDOUT.
log (log file): This option will write the contents of STDOUT to a specified file. This is useful because if you plan on using dc3dd multiple times, you can write to the same log file each time, as it will not be overwritten.
hlog (hash log): Specifies a file where the hash value comparison is written to. If the hash verification is written to STDOUT, it will appear here and in the file specified by ‘log=’.
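As a rough sketch of how ‘hof’, ‘log’ and ‘hlog’ might be used together, the following runs dc3dd against a small stand-in file if the tool is installed, and otherwise falls back to plain dd so the rest of the sketch still runs. All file names are hypothetical. The final comparison is a manual version of the check that ‘hof’ automates: hash the source, hash the image, and compare.

```shell
#!/bin/sh
SRC=demo_dc3dd_source.img      # stand-in for /dev/sdb
IMG=demo_dc3dd_image.dd        # hypothetical output name
dd if=/dev/urandom of="$SRC" bs=4k count=256 2>/dev/null

if command -v dc3dd >/dev/null 2>&1; then
    # hof= hashes the output and compares it with the input hash;
    # log= and hlog= receive the run log and the hash results.
    dc3dd if="$SRC" hof="$IMG" hash=sha256 hash=sha512 \
        log=dc3dd.log hlog=dc3dd_hash.log
else
    # Fallback so the sketch still runs without dc3dd installed.
    dd if="$SRC" of="$IMG" bs=4k 2>/dev/null
fi

# Manual version of the same check: hash both sides and compare.
src_hash=$(sha256sum < "$SRC" | cut -d' ' -f1)
img_hash=$(sha256sum < "$IMG" | cut -d' ' -f1)
if [ "$src_hash" = "$img_hash" ]; then
    echo "[ok] $img_hash"
else
    echo "[MISMATCH] source=$src_hash image=$img_hash"
fi
```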
As seen before in dcfldd, I manually specified which hash algorithms I wanted the tool to use with the ‘hash=’ option. I used two (SHA-256 and SHA-512), with the same rationale that MD5 and SHA-1 are broken. It is worth noting that dc3dd has different names for some options seen in the previous tools, which I have listed as follows:
count = cnt
Will read a specified amount of blocks from the input file. The size of these blocks can be altered with the ‘ssz’ option below.
bs = ssz
The default block size in dc3dd is 512, but this can be manually overridden using ‘ssz’. This option will still accept suffixed values like ‘4k’ (4096).
skip = iskip/oskip
Here skip is split into two options for input and output. ‘iskip’ will specify the amount of blocks to skip at the start of the input file and ‘oskip’ will specify the same but for the output file.
The dc3dd tool is an excellent choice for forensic examiners due to its hash verification and advanced logging features. In the forensic imaging process, I personally use a combination of this tool and the next one: ‘ddrescue’.
As a side note, dc3dd is the imaging tool utilised in Bruce Nikkel’s ‘sfsimage’ program, which I highly recommend checking out here.
The final tool I will be covering, ‘ddrescue’, is technically not a derivative of ‘dd’, but functions in a very similar way and is very useful for forensic imaging, despite being considered a ‘data recovery’ tool. It was developed as part of the GNU project and is not to be confused with ‘dd_rescue’, which ddrescue is considered to be an improvement upon. Because ddrescue is primarily focused on data recovery, it is the ideal tool to utilise on devices that are suspected to contain ‘bad’ blocks.
Like the previous tools, ddrescue is fairly easy to install on most Linux distributions due to its inclusion in their core repositories. For a list of distributions with appropriate commands and instructions needed to install ddrescue, please refer to this resource.
Despite being more oriented towards data recovery, ddrescue still provides options which will prove useful for forensic practitioners. As before, assuming the device we want to image is ‘/dev/sdb’, a typical ddrescue command would look like this:
Note how we do not need to specify the (if) and (of) options for the input and output files respectively as seen in the other tools.
-d / --idirect
This option specifies direct disk access for the input file and will bypass the kernel cache. Note that not all systems support direct disk access and ddrescue will warn you if your system does not.
Map file: This is the third parameter of ddrescue and, despite being optional, it is highly recommended that you use one. Note that the map file does not need to obey any naming convention like ‘*.map’ and can be named however you wish. As shown in Figure 6, the map file will show important information about the imaging process, specifically whether there were any read errors. In this instance, ddrescue did not encounter any read errors, which is denoted by the ‘+’ symbol next to the hexadecimal value in the map file. Additionally, should the imaging process be interrupted for any reason, the map file will keep track of the recovered data and, as long as the same map file is specified, the imaging can resume where it left off.
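To illustrate how the map file can be read, the sketch below totals the bytes recorded against each status. The map file here is hand-written with hypothetical values rather than produced by a real run, and it includes one ‘-’ (bad area) entry purely to show the contrast with ‘+’. On a real system, the imaging step that would generate such a file is shown in the leading comment; the file names there are also hypothetical.

```shell
#!/bin/sh
# On a real system the imaging step would be something like:
#   ddrescue -d /dev/sdb image.dd image.map
# Below, a small hand-written map file (hypothetical values) stands in
# for the one ddrescue would produce.
cat > demo.map <<'EOF'
# Mapfile. Hypothetical example for illustration.
# current_pos  current_status  current_pass
0x00100000     +               1
#      pos        size  status
0x00000000  0x000FF000  +
0x000FF000  0x00001000  -
EOF

# Sum the bytes per status: '+' is recovered data, '-' is a bad area.
good=0; bad=0; other=0; seen_status=0
while read -r pos size status rest; do
    case $pos in '#'*|'') continue ;; esac
    # The first non-comment line is the overall status line; skip it.
    if [ "$seen_status" -eq 0 ]; then seen_status=1; continue; fi
    case $status in
        +) good=$(( good + $size )) ;;
        -) bad=$(( bad + $size )) ;;
        *) other=$(( other + $size )) ;;
    esac
done < demo.map
echo "recovered=$good bad=$bad other=$other"
```

A non-zero ‘bad’ total is the cue that the image is incomplete and that further recovery passes against the same map file may be worthwhile.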
Should ddrescue detect any bad sectors on the disk you are imaging, there are steps you can take to recover them; however, this is outside the scope of this post and is something I will cover in more detail in the future.
The documentation for ddrescue is very robust and well worth reading through if you have time. From the manual I have picked out some features that I found particularly interesting:
- Specifying the option ‘-R’ will read the input file in reverse passes. Like many of the options, this is used mainly to maximise recovered data on bad disks.
- Should ddrescue encounter bad sectors on the input file you are imaging, it will not write zeros to the output file in their place like the other dd tools will do.
- The physical block size will be dynamically decreased to maximise recovered data should ddrescue encounter bad sectors on the input file.
- Any interface (ATA, SATA, SCSI, etc.) supported by your kernel can be used with ddrescue.
- The ‘-i’ option can be used to specify a starting position on the input file. The option defaults to offset 0 if not specified.
Firstly, I went off on a bit of a tangent regarding data completeness at the beginning of this post. However, I think it is very important for forensic practitioners to understand exactly what data they are acquiring from a device and consider that it may not always be ‘forensically complete’.
All the tools covered in this post have their own strengths and weaknesses, so personal preference will be the biggest factor in deciding which you want to use. However, in my opinion, I would always utilise ddrescue for imaging disks whenever possible due to its focus on recovering as much readable data as possible. Of course, you may not necessarily be in a position to choose one over the other, which is why I emphasise learning how each of them works.
I am aware that there are many other uses/options for the tools covered in this post, but I wanted to show that fundamentally, the forensic imaging process can be completed with command-line tools on Linux. Finally, this is by no means an exhaustive list of imaging tools/commands fit for every possible scenario you may come across; the tools covered here are meant to aid in learning the low-level processes of forensic imaging.
Thank you for reading and I hope this post has taught you something new!
While doing extra research for the content covered in this post, I found these online resources to be particularly insightful. I have also added a brief description of each one for convenience:
I mentioned them before, but these two papers by Todd Shipley and Bryan Door give great insight into the issue of data completeness:
The book ‘Practical Forensic Imaging’ by Bruce Nikkel is an excellent resource covering open-source tools in all areas of forensic acquisition.
The Forensic Examiners Introduction to Linux by Barry J. Grundy (LinuxLEO) is another great resource covering the forensic imaging process using Linux tools.