Last updated April 24, 2015
What is ZFS?
ZFS is a modern open-source
file system with advanced features to improve reliability. It is primarily used
on Linux (http://www.zfsonlinux.org) and by open source NAS systems such as FreeNAS (http://www.freenas.org)
and NAS4Free (http://www.nas4free.org).
ZFS reliability
The ZFS developers cite reliability as one of the main design
criteria. To that end, the following features were implemented:
- Copy-on-write semantics.
- Checksums for both metadata and data.
- Redundant metadata.
- Journaling.
Copy-on-write serves two main purposes. First, it allows
quick creation of volume snapshots. Second, it guarantees file system
consistency across system failures such as a power outage. (One may argue that consistency
can also be achieved by a well-designed journal log.)
Checksums protect data integrity. Data corruption caused by
problems such as bit rot can be detected and even corrected in mirrored or
RAID-Z configurations. ZFS provides end-to-end checksums for both data and
metadata.
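As a simplified illustration, assuming the SHA-256 checksum option and a hypothetical verification helper: the checksum of a block is stored in the parent block pointer rather than next to the block itself, so corruption is caught the next time the block is read.

import hashlib

# Minimal sketch of end-to-end checksum verification.  The checksum of a block
# lives in the parent block pointer, not alongside the block, so a block that
# was silently corrupted on disk fails verification when it is read back.
def verify_block(block_data, checksum_from_parent_bp):
    # SHA-256 is one of the checksum algorithms ZFS supports; fletcher is another.
    return hashlib.sha256(block_data).digest() == checksum_from_parent_bp

In a mirrored or RAID-Z pool, a failed verification can then be repaired from the redundant copy or from parity.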
File data locations are stored in a metadata structure
called the block pointer. All block pointers are redundantly stored.
The ZFS intent log (ZIL) provides journaling so that an
update to the file system can be redone if a system failure occurs before the update
has been fully committed to storage.
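A generic replay model, not the actual ZIL record format, might look like the following; the record fields here are assumptions for illustration:

# Generic intent-log replay sketch: records newer than the last transaction
# group known to be safely on disk are re-applied after a crash.
def replay_intent_log(log_records, last_synced_txg, apply_update):
    for record in log_records:
        if record["txg"] > last_synced_txg:   # not yet committed to the main pool
            apply_update(record)              # redo the update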
Current state
ZFS development is currently driven by the open source group
OpenZFS (http://www.open-zfs.org).
Although ZFS was originally developed for Solaris, it is currently used
primarily on FreeBSD and Linux (http://www.zfsonlinux.org).
FreeNAS (http://www.freenas.org)
and NAS4Free (http://www.nas4free.org) are two popular lines of NAS products
incorporating ZFS as the main file system.
What is RAID-Z?
RAID-Z protects data against physical drive failure by
storing data redundantly among multiple drives. RAID-Z is similar to standard
RAID but is integrated with ZFS. In standard RAID, the RAID layer is separate
from and transparent to the file system layer. The file system is not aware of
the underlying RAID storage scheme and uses the RAID storage as it does a
single drive. The file system writes to and reads from a virtual single drive. The
RAID layer maps data blocks to physical hard drive blocks.
In RAID-Z the two layers become one. ZFS operates directly
on physical blocks. Parity data is computed separately for each data extent
following the RAID-Z scheme. The virtual address and length of the extent determine its
physical location.
Single, double and triple parity
RAID-Z supports single (RAID-Z1), double (RAID-Z2) or triple
(RAID-Z3) parity or no parity (RAID-Z0). Reed-Solomon is used for double and
triple parity.
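Single parity is a plain XOR of the data sectors in an extent; a minimal sketch (double and triple parity add Reed-Solomon-style coefficients and are not shown):

# Single (RAID-Z1 style) parity: byte-wise XOR of all data sectors in the extent.
def xor_parity(data_sectors):
    parity = bytearray(len(data_sectors[0]))
    for sector in data_sectors:
        for i, byte in enumerate(sector):
            parity[i] ^= byte
    return bytes(parity)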
ZFS and RAID-Z recovery issues
QueTek® programmers have developed
advanced techniques to recover ZFS and RAID-Z data. The difficulties we have
encountered are discussed below.
Long chain of control blocks
ZFS contains cascaded chains of metadata objects that must be
followed in order to reach the data. In a typical ZFS configuration a data
recovery program has to do the following (a code sketch follows the list):
- Read and parse the name-value pair list in the
vdev label.
- Choose the most current uberblock.
- Read the Meta Object Set (MOS) using the main block pointer in the uberblock.
- Read the Object Directory to determine the
position in the MOS of the DSL directory.
- Read the DSL directory to determine the position in the MOS of the DSL dataset.
- Note that at this point if the dataset is
nested, the previous two steps are repeated for each nesting level.
- Read the DSL dataset to determine the location
(not in the MOS) of the File System dataset.
- Read the File System dataset to determine the
location of the File System Objects (FSO) array.
- Read the Master Node in the FSO array to
determine the position in the FSO array of the root node.
- From the root node traverse the file system
tree.
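The sketch below condenses this walk into code. Every reader callback in it (read_vdev_label, read_object_set, and so on) is a hypothetical stand-in for a real on-disk parser; the point is how many hops separate the vdev label from the first file.

# A structural sketch of the walk above.  All reader arguments are
# caller-supplied callbacks; none of them is a real ZFS API.
def locate_file_system_root(read_vdev_label, pick_latest_uberblock,
                            read_object_set, read_object, read_dnode,
                            OBJECT_DIRECTORY_ID=1, MASTER_NODE_ID=1):
    label = read_vdev_label()                           # step 1: name-value pair list
    uberblock = pick_latest_uberblock(label)            # step 2: most current uberblock
    mos = read_object_set(uberblock["rootbp"])          # step 3: Meta Object Set (MOS)
    obj_dir = read_object(mos, OBJECT_DIRECTORY_ID)     # step 4: Object Directory
    dsl_dir = read_object(mos, obj_dir["root_dataset"])        # step 5: DSL directory
    dsl_dataset = read_object(mos, dsl_dir["head_dataset"])    # step 6: DSL dataset
    # Steps 5 and 6 repeat once per nesting level for nested datasets.
    fs_dataset = read_object_set(dsl_dataset["bp"])     # step 7: File System dataset (outside the MOS)
    master_node = read_dnode(fs_dataset, MASTER_NODE_ID)        # step 8: Master Node
    root_dnode = read_dnode(fs_dataset, master_node["ROOT"])    # step 9: root dnode
    return root_dnode                                   # step 10: traverse the tree from here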
Integration of RAID layer and file system storage layer
In conventional RAID, the RAID layer is separate from the
file system layer. The latter maps a virtually contiguous space of a file to
(potentially noncontiguous) extents in the volume space. Similarly the RAID
layer maps the virtually contiguous space of a volume to physical hard drive
blocks.
ZFS maps virtual file space directly to physical blocks.
When ZFS metadata is lost, a recovery program faces three
difficult tasks:
- Distinguish data from parity data.
- Determine sector sequence for sequential
scanning.
- Match parity data to data.
Distinguish data from parity data
Parity data repeats in a fixed pattern in RAID 5. For
example, on a member drive of a 4-drive RAID 5 with a 64 KB block size and backward
symmetric rotation, a parity block follows every three data blocks. The mapping
from physical to logical sector numbers is depicted below; the second to fifth
columns contain the volume logical sector numbers:
Physical sector | Disk 0       | Disk 1       | Disk 2       | Disk 3
0-127           | 0-127        | 128-255      | 256-383      | Parity block
128-255         | 512-639      | 640-767      | Parity block | 384-511
256-383         | 1024-1151    | Parity block | 768-895      | 896-1023
384-511         | Parity block | 1152-1279    | 1280-1407    | 1408-1535
Figure 1
For example, volume logical sector 640 is mapped to physical
sector 128 on Disk 1.
Because the parity rotation follows a fixed pattern, a data recovery program can
determine whether a specific sector contains data or parity data.
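A minimal sketch of that computation for the 4-drive, backward-symmetric layout in figure 1 (the helper and constants below are ours, for illustration only):

SECTORS_PER_BLOCK = 128        # 64 KB blocks, 512-byte sectors
NUM_DISKS = 4

# Map a volume logical sector to (disk, physical sector) for the 4-drive,
# backward symmetric RAID 5 layout of figure 1.
def locate(logical_sector):
    block = logical_sector // SECTORS_PER_BLOCK
    offset = logical_sector % SECTORS_PER_BLOCK
    stripe = block // (NUM_DISKS - 1)                   # 3 data blocks per stripe
    index = block % (NUM_DISKS - 1)                     # position within the stripe
    parity_disk = (NUM_DISKS - 1 - stripe) % NUM_DISKS  # parity rotates backward
    disk = (parity_disk + 1 + index) % NUM_DISKS        # data follows the parity drive
    return disk, stripe * SECTORS_PER_BLOCK + offset

print(locate(640))   # (1, 128): logical sector 640 is on Disk 1, physical sector 128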
In RAID-Z the size of an extent varies between 1 and 256
sectors (assuming 512 bytes per sector). The length and location of an extent
are stored in the Data Virtual Address (DVA). For example a 4-drive,
single-parity RAID-Z may store a 10-sector extent in the pattern below. The
second to fifth columns contain the virtual sector numbers of the data extent:
Physical sector | Disk 0 | Disk 1 | Disk 2 | Disk 3
1000            |        |        | Parity | 0
1001            | 4      | 7      | Parity | 1
1002            | 5      | 8      | Parity | 2
1003            | 6      | 9      | Parity | 3
Figure 2
For example sector 7 of the extent is mapped to physical
sector 1001 on Disk 1.
For another extent the mapping can be as follows:
Physical sector | Disk 0 | Disk 1 | Disk 2 | Disk 3
1000            |        |        |        | Parity
1001            | 0      | 4      | 7      | Parity
1002            | 1      | 5      | 8      | Parity
1003            | 2      | 6      | 9      | Parity
1004            | 3      |        |        |
Figure 3
Or for another extent:
Physical sector | Disk 0 | Disk 1 | Disk 2 | Disk 3
1000            | Parity | 0      | 4      | 7
1001            | Parity | 1      | 5      | 8
1002            | Parity | 2      | 6      | 9
1003            | Parity | 3      |        |
Figure 4
Sector 1001 on Disk 2 may be a data or parity sector,
depending on the context.
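The ambiguity can be reproduced with a simplified layout helper. It is our own idealization: all columns start on the same row (the row alignment in figures 2 and 3 differs slightly) and skip-sector padding is ignored.

import math

# Column-major single-parity layout: one parity column followed by contiguous
# data columns, wrapping around the drives.  Simplified for illustration only.
def raidz1_columns(parity_disk, start_sector, data_len, ndisks=4):
    data_cols = min(data_len, ndisks - 1)
    rows = math.ceil(data_len / data_cols)
    layout = {}                                   # (disk, physical sector) -> content
    for r in range(rows):
        layout[(parity_disk, start_sector + r)] = "Parity"
    sector = 0
    for c in range(data_cols):
        disk = (parity_disk + 1 + c) % ndisks
        for r in range(rows):
            if sector == data_len:
                break
            layout[(disk, start_sector + r)] = sector
            sector += 1
    return layout

extent_a = raidz1_columns(parity_disk=2, start_sector=1000, data_len=10)  # compare figure 2
extent_b = raidz1_columns(parity_disk=3, start_sector=1000, data_len=10)  # compare figure 3
print(extent_a[(2, 1001)], extent_b[(2, 1001)])   # 'Parity' for one extent, data for the other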
Determine sector sequence for sequential scanning
In figure 2 sector 1003 on Disk 0 is followed by sector 1001
on Disk 1. But in figure 3 it is followed by sector 1004 on Disk 0. Without the
extent context, a program cannot determine the sector sequence.
Match parity data to data
In a RAID-Z stripe, the data blocks and corresponding parity
block may not start on the same physical sector number. In figure 2, sector
1000 on Disk 2 contains the parity data for sector 1000 on Disk 3 and sector
1001 on Disk 0 and Disk 1. In figure 3, sector 1000 on Disk 3 contains the
parity data for sector 1001 on Disk 0, Disk 1 and Disk 2. In figure 4, sectors 1000
to 1002 on Disk 0 contain the parity data for the same sector numbers on all
the other drives. Sector 1003, however, contains the parity data only for sector
1003 on Disk 1.
Without ZFS extent metadata, it is extremely difficult to
match a parity block to the corresponding data blocks.
Long scan
The File Scavenger® long scan is used when file system
metadata is incomplete. Metadata is usually structured in a tree topology. If a
tree is partially corrupted, many branches may not have a path to the root. The
File Scavenger® quick scan starts from the root and traverses all branches and
sub-branches until the entire tree is scanned. This scan will miss branches
disconnected from the root. A long scan examines every sector to look for
disconnected branches.
Performing a long scan on RAID-Z is extremely difficult
because of the issues discussed in the previous section. For example:
- When the program sees a sector containing
metadata, it must determine whether it is actual metadata or merely a parity artifact.
- If a metadata object spans multiple sectors, the
program cannot easily determine the correct sector sequence to read the entire
object.
Missing drive
A missing drive in RAID 5 can be completely rebuilt using
the remaining drives regardless of the file system status. However in RAID-Z
each stripe is a data extent. The missing drive can be completely rebuilt only
if all extent metadata is intact. For a corrupted ZFS volume, rebuilding a
missing drive is very difficult.
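For contrast, a minimal sketch of the RAID 5 case, where one missing block is simply the XOR of the corresponding blocks on the surviving drives:

# A missing RAID 5 block (data or parity) is the XOR of the corresponding
# blocks on every surviving drive; no file system metadata is needed.
def rebuild_missing_block(surviving_blocks):
    rebuilt = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return bytes(rebuilt)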
File undeletion
Files are ultimately stored in the File System Objects (FSO)
array of a dataset. Both the file and the parent folder are dnodes
in the array. The parent dnode contains the filename.
The file dnode contains the file extents, size,
dates, etc.
When a file is deleted ZFS updates the FSO extent containing
the file dnode and the parent dnode
extent containing the filename. Thanks to copy-on-write the original extents
still exist on the hard drive. A data recovery program must find the two
extents and match the file dnode to the correct
filename item in the parent folder extent. The program can reconstruct the
complete path because the path to the parent folder is still valid if the
parent folder has not been deleted.
Matching
the file dnode to the correct filename is not an easy
task. The filenames in the parent dnode are indexed
by the position of the file dnode in the FSO array.
The recovered file dnode does not contain its
position in the array. Matching the folder dnode
extent to the correct parent folder dnode is also
difficult.
Presently File Scavenger® does not offer a general solution
to ZFS undeletion due to the complexity of the tasks
involved. Our staff can perform undeletion on a
fee-based, case-by-case basis.
Raw recovery
Raw recovery is a method where files are recovered based on data
patterns instead of file system metadata. Raw recovery is used in the absence
of metadata. The results are usually unstructured files with a generic filename.
We will discuss this in detail by contrasting ZFS with NTFS. In NTFS the metadata
for a file is stored in a FILE record. The metadata includes the filename,
dates and location of the data. A FILE record is stored separately from the
actual file data.
If an NTFS volume is corrupted but the FILE records are
still intact, a data recovery program can look for FILE records and use the
metadata to recover the corresponding files.
When a FILE record is lost, the corresponding file may still
be intact but its name and location are not known. This is when raw recovery
comes into play. Many types of files contain identifiable header patterns. For
example a bitmap file (extension .bmp) starts with a header that contains a
special signature and the size of the file. Upon detecting the bitmap signature
in a sector, a program knows the sector is the beginning of a bitmap file. With
the file size in the header, the program knows where the file ends, assuming
the file data is contiguous. (If the file data is fragmented, raw recovery is
not possible and file carving techniques must be used. See http://www.quetek.com/data_recovery_techniques.htm#file_carving)
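A minimal sketch of such a signature test for bitmap files (the 512-byte sector size and the helper name are assumptions):

import struct

# A .bmp file starts with the signature "BM" followed by the total file size
# as a 32-bit little-endian integer at byte offset 2.
def bmp_file_size(sector):
    if len(sector) < 6 or sector[:2] != b"BM":
        return None                         # not the start of a bitmap file
    (file_size,) = struct.unpack_from("<I", sector, 2)
    return file_size                        # bytes to recover, if the data is contiguous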
RAID-Z raw recovery is very difficult. Upon detecting a file
header, a data recovery program must determine the correct sector sequence to
end of file. As discussed in previous sections, this is very difficult when file
system metadata is incomplete.
Single parity RAID-Z versus RAID 5 performance comparison
We will compare the performance of single parity RAID-Z to RAID
5. The latter is by far the most popular RAID configuration.
No "write hole"
RAID-Z is immune to the RAID 5 "write hole" where a stripe may become
inconsistent if data is not completely written to one or more drives due to an
interruption such as loss of power. In RAID 5 a write operation involves
writing the data and the corresponding parity data. At the minimum that requires
writing to two drives. If power loss occurs and one drive is not written to,
the data and parity data are left out of sync.
RAID-Z protects data with copy-on-write. The new data is first
written to a new location. Then the reference to the data is updated in an
atomic operation (i.e., an operation that is either performed completely or not
performed at all).
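A minimal in-memory model of that sequence (entirely hypothetical, only to show where the atomicity lies):

# Copy-on-write sketch: write the new data to a fresh location first, then
# flip the single reference.  A crash before the flip leaves the old,
# consistent state untouched.
class CowStore:
    def __init__(self):
        self.blocks = {}        # location -> data
        self.next_location = 0
        self.root = None        # the one reference updated atomically

    def write(self, data):
        location = self.next_location          # allocate a new location (B)
        self.next_location += 1
        self.blocks[location] = data           # step 1: write the data at B
        self.root = location                   # step 2: atomically repoint the reference (Z) to B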
One may wonder if ZFS on RAID 5 is vulnerable to the write
hole. The following events occur when data at location A is modified:
- ZFS reads the data at location A. (This data is referenced
by the metadata at another location Z.)
- The data is modified in memory.
- ZFS allocates location B and writes the modified
data to location B.
- ZFS updates the metadata at location Z to reference
the new location B.
In RAID 5 data is updated on a per stripe basis because each
stripe contains a parity block that must stay in sync with the data. In the
example above both A and B may be in the same stripe. An incomplete write may
affect both A and B; therefore the vulnerability still exists. Location Z may
also be in the same stripe.
In RAID-Z the parity data is maintained per data extent. In
the example above A and B are two different extents. An incomplete write only
affects B.
Therefore ZFS on RAID 5 storage is still vulnerable to the
write hole. RAID-Z wins.
Read-modify-write cycle
In RAID 5 data is stored as blocks striped across all
drives. One block per stripe holds parity data. When even a small change is
made to one block, the RAID controller must update the corresponding parity
block. In order to compute the new parity block, the RAID controller must read all
data blocks in the stripe. An example of a stripe on a 4-drive RAID 5 is depicted
below:
Disk 0  | Disk 1  | Disk 2  | Disk 3
Block 0 | Block 1 | Block 2 | Parity block
Assuming block 0 is modified, the
sequence of operations is as follows (a short code sketch follows the list):
- Read block 0.
- Modify block 0 in memory.
- Read block 1 and block 2.
- Compute a new parity block.
- Write the updated block 0 to Disk 0 and the new parity
block to Disk 3.
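A sketch of the sequence above (read_block is a hypothetical callback that returns the named block of the current stripe):

# RAID 5 small-write sequence from the list above: re-read the other data
# blocks, recompute parity, then write the modified block and the parity block.
def update_block0(new_block0, read_block):
    block1 = read_block(1)                     # read block 1 from Disk 1
    block2 = read_block(2)                     # read block 2 from Disk 2
    parity = bytes(a ^ b ^ c for a, b, c in zip(new_block0, block1, block2))
    return [(0, new_block0), (3, parity)]      # writes as (disk, data) pairs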
In RAID-Z the parity of each data extent is independently
computed. It is not necessary to read other extents. RAID-Z also wins here.
Reading small data extents
In RAID 5 data extents up to the RAID block size may be stored
entirely on one drive. Reading a small extent may require only one read.
RAID-Z stripes data on as many drives as possible with the
smallest stripe size being a sector. Only a one-sector extent is stored
entirely on one drive. A two-sector extent is striped across two drives. Larger
extents are striped across more drives up to n-1 (n is the total number of
drives). For example in a 6-drive RAID-Z configuration, a 5-sector extent may
be striped across 5 drives as shown below:
Disk 0        | Disk 1   | Disk 2   | Disk 3   | Disk 4   | Disk 5
Parity sector | Sector 0 | Sector 1 | Sector 2 | Sector 3 | Sector 4
Reading small extents requires significantly more reads in
RAID-Z, especially with a large number of drives. RAID 5 wins hands down.
Parity overhead
RAID 5 uses the equivalent of one drive for parity data.
For example, a 4-drive RAID 5 uses one-fourth of its total capacity for parity data,
i.e., one parity block for every three data blocks, or 33% overhead.
RAID-Z overhead equals that of RAID 5 in the best-case scenario,
when the size of a data extent is a multiple of the number of drives less one.
In the example above an extent of 3 (or 6, 9, etc.) sectors has 33% overhead. Other
extent sizes incur more overhead. A 10-sector extent requires 4 parity sectors,
or 40% overhead. In the worst case a one-sector extent requires one sector for
parity data, or 100% overhead. The typical ZFS extent size is 128 sectors. In a
4-drive RAID-Z configuration that is 34% overhead.
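The arithmetic behind these figures, using the text's definition of overhead as parity sectors divided by data sectors (the helper is ours):

import math

# Single-parity RAID-Z overhead relative to data: one parity sector per row
# of up to (ndisks - 1) data sectors.
def raidz1_overhead(extent_sectors, ndisks):
    parity_sectors = math.ceil(extent_sectors / (ndisks - 1))
    return parity_sectors / extent_sectors

print(f"{raidz1_overhead(3, 4):.0%}")     # 33%
print(f"{raidz1_overhead(10, 4):.0%}")    # 40%
print(f"{raidz1_overhead(1, 4):.0%}")     # 100%
print(f"{raidz1_overhead(128, 4):.0%}")   # ~34%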
Another source of overhead in RAID-Z is the inability to use a single
free sector: each extent requires at least two sectors, one for
data and one for parity.
Therefore a typical single-parity RAID-Z incurs about 1%
additional overhead compared to RAID 5.
128 KB extent size
The maximum ZFS extent size (or block size in ZFS
terminology) is 128 KB. With copy-on-write a data extent is written to a new
location when it is modified even by the smallest amount. A smaller extent size
reduces the time taken to write it to a new location. However this increases
fragmentation. Naturally contiguous data (such as a file being copied in whole from
another volume) may become fragmented.
At first glance 128 KB seems to be a bottleneck for reading large
files. In practice ZFS can store a large file in contiguous extents so that
reading a large chunk of data requires only one read per drive. However, if the
file is modified, copy-on-write will relocate any modified extents, thus
causing fragmentation.
Fragmentation is an issue of copy-on-write rather than
RAID-Z. ZFS on a single drive faces the
same problem.