Discussion:
EonStorage and the 4k drives
Adrian Saileanu
2011-03-25 19:02:14 UTC
Hello,

I recently bought four Western Digital 2TB hard drives, and for two days I have been reading everything I can find about partition alignment and ZFS on these new drives, which have an internal sector size of 4096 bytes.

Do any of you have experience with ZFS pools on 4K drives?

I found this page http://www.solarismen.de/archives/12-Modified-zpool-program-for-newer-Solaris-versions.html where the author modified the zpool tool (in zpool_vdev.c), but users are reporting some issues when creating a pool with ZPOOL_CONFIG_ASHIFT = 12 (2^12 = 4096).
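For reference, ashift is the base-2 logarithm of the smallest I/O size ZFS will issue to a vdev, so 512-byte sectors give ashift 9 and 4096-byte sectors give ashift 12. A quick sketch of the relationship in plain shell arithmetic:

```shell
# ashift is log2 of the sector size: 2^9 = 512, 2^12 = 4096.
sector=4096
ashift=0
n=$sector
while [ "$n" -gt 1 ]; do
  n=$((n / 2))
  ashift=$((ashift + 1))
done
echo "sector=$sector ashift=$ashift"   # sector=4096 ashift=12
```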
--
This message posted from opensolaris.org
Adrian Saileanu
2011-03-25 21:04:29 UTC
I'm doing some tests at the moment, after copying all the zpool versions from http://www.solarismen.de/archives/12-Modified-zpool-program-for-newer-Solaris-versions.html. Both zpool-s10u8 and zpool-s10u9 work after symlinking the libzfs library:

1344 -rwxr-xr-x 1 root bin 673388 Apr 4 2010 libzfs.so.1
2 lrwxrwxrwx 1 root root 11 Mar 25 12:27 libzfs.so.2 -> libzfs.so.1

I created three different pools on three hard disks:
#zpool create zpool-t1p0 /dev/dsk/c1t1d0
#./zpool-s10u8 create zpool-t2p0 /dev/dsk/c1t2d0
#./zpool-s10u9 create zpool-t3p0 /dev/dsk/c1t3d0

#zdb
zpool-t1p0:
    version: 22
    name: 'zpool-t1p0'
    state: 0
    txg: 15
    pool_guid: 73485483957774418
    hostid: 13571568
    hostname: 'eon1'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 73485483957774418
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 928407176351616095
            path: '/dev/dsk/c1t1d0s0'
            devid: 'id1,***@SATA_____WDC_WD20EARS-00M_____WD-WCAZA270/a'
            phys_path: '/***@0,0/pci8086,***@1f,2/***@1,0:a'
            whole_disk: 1
            metaslab_array: 23
            metaslab_shift: 34
            ashift: 9
            asize: 2000385474560
            is_log: 0
            create_txg: 4
zpool-t2p0:
    version: 22
    name: 'zpool-t2p0'
    state: 0
    txg: 15
    pool_guid: 2601048085308766544
    hostid: 13571568
    hostname: 'eon1'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2601048085308766544
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 12187966011736420873
            path: '/dev/dsk/c1t2d0s0'
            devid: 'id1,***@SATA_____WDC_WD20EARS-00M_____WD-WCAZA271/a'
            phys_path: '/***@0,0/pci8086,***@1f,2/***@2,0:a'
            whole_disk: 1
            metaslab_array: 23
            metaslab_shift: 34
            ashift: 12
            asize: 2000385474560
            is_log: 0
            create_txg: 4
zpool-t3p0:
    version: 22
    name: 'zpool-t3p0'
    state: 0
    txg: 4
    pool_guid: 15215334829979844812
    hostid: 13571568
    hostname: 'eon1'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 15215334829979844812
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8884729500419644159
            path: '/dev/dsk/c1t3d0s0'
            devid: 'id1,***@SATA_____WDC_WD20EARS-00M_____WD-WCAZA465/a'
            phys_path: '/***@0,0/pci8086,***@1f,2/***@3,0:a'
            whole_disk: 1
            metaslab_array: 23
            metaslab_shift: 34
            ashift: 12
            asize: 2000385474560
            is_log: 0
            create_txg: 4

Please notice the "ashift: 12" for zpool-t2p0 and zpool-t3p0.

eon1:153:~#df -k
Filesystem size used avail capacity Mounted on
...
/dev/dsk/c0t0d0s0 7.4G 264M 7.0G 4% /mnt/eon0
swap 1.3G 37M 1.3G 3% /tmp
swap 1.3G 60K 1.3G 1% /var/run
zpool-t1p0 1.8T 21K 1.8T 1% /zpool-t1p0
zpool-t2p0 1.8T 112K 1.8T 1% /zpool-t2p0
zpool-t3p0 1.8T 112K 1.8T 1% /zpool-t3p0

Notice that the used space for zpool-t2p0 and zpool-t3p0 (112K) differs from zpool-t1p0 (21K).
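The difference is consistent with the larger minimum allocation: with ashift=12 every on-disk allocation is rounded up to 4 KiB, so even metadata that would fit in a single 512-byte block consumes a full 4 KiB. A hypothetical illustration of the rounding (the block count and sizes are made up, not measured from these pools):

```shell
# Round each allocation up to the vdev's minimum block (2^ashift).
# 20 hypothetical 512-byte metadata writes under both ashift values:
blocks=20 size=512
for ashift in 9 12; do
  min=$((1 << ashift))
  per=$(( (size + min - 1) / min * min ))
  echo "ashift=$ashift: $((blocks * per)) bytes on disk"
done
```

With ashift=9 the 20 writes occupy 10240 bytes; with ashift=12 they occupy 81920 bytes, an 8x blow-up for sub-4K blocks.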


#zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zpool-t1p0 127M 1.81T 0 1 592 112K
c1t1d0 127M 1.81T 0 1 592 112K
---------- ----- ----- ----- ----- ----- -----
zpool-t2p0 132M 1.81T 0 1 631 125K
c1t2d0 132M 1.81T 0 1 631 125K
---------- ----- ----- ----- ----- ----- -----
zpool-t3p0 132M 1.81T 0 1 637 126K
c1t3d0 132M 1.81T 0 1 637 126K
---------- ----- ----- ----- ----- ----- -----

While running DTrace's iosnoop during a "touch test" in every pool, I can see that the "ashift: 12" pools (zpool-t2p0 and zpool-t3p0) issue writes in multiples of 4096 bytes, while the regular zpool-t1p0 goes down to 512 bytes.

0 1107 W 268196 3072 zpool-zpool-t1p0 <none>
0 1107 W 268204 512 zpool-zpool-t1p0 <none>
0 1107 W 738206868 3072 zpool-zpool-t1p0 <none>
0 1107 W 738206875 1024 zpool-zpool-t1p0 <none>


0 1111 W 738210624 4096 zpool-zpool-t2p0 <none>
0 1111 W 274560 4096 zpool-zpool-t2p0 <none>
0 1111 W 738210632 4096 zpool-zpool-t2p0 <none>
0 1111 W 274568 4096 zpool-zpool-t2p0 <none>

0 1113 W 738210616 4096 zpool-zpool-t3p0 <none>
0 1113 W 274568 4096 zpool-zpool-t3p0 <none>
0 1113 W 738210624 4096 zpool-zpool-t3p0 <none>
0 1113 W 1476403760 8192 zpool-zpool-t3p0 <none>
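The iosnoop columns above are UID, PID, direction, block, size, command, and pathname. One quick way to confirm the effective write granularity is to take the minimum write size seen per pool; a sketch over a few of the sample lines above:

```shell
# Column 5 is the I/O size in bytes; track the smallest write per pool.
awk '{ s = $5 + 0; if (!($6 in min) || s < min[$6]) min[$6] = s }
     END { for (p in min) print p, min[p] }' <<'EOF' | sort
0 1107 W 268204 512 zpool-zpool-t1p0 <none>
0 1107 W 738206868 3072 zpool-zpool-t1p0 <none>
0 1111 W 738210624 4096 zpool-zpool-t2p0 <none>
0 1113 W 738210616 4096 zpool-zpool-t3p0 <none>
EOF
```

This prints 512 as the minimum for zpool-t1p0 and 4096 for the two ashift-12 pools.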

Any thoughts on how this will affect performance and available space?
--
This message posted from opensolaris.org
Andre Lue
2011-03-26 21:55:01 UTC
Hi Adrian,

I've seen a couple of posts where a few people have tried it. I recall Guenther Alka may be using it; maybe reach out to him, or hope he'll chime in.

Here's another:
http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

I figure only time and mileage will tell.
--
This message posted from opensolaris.org
Manojav Sridhar
2011-04-07 15:41:00 UTC
With the current EON and 4K drives, I saw much degraded performance, up to 30% slower reads and writes, so I went back to 512-byte sectors.
--
This message posted from opensolaris.org
Andre Lue
2011-04-07 17:01:20 UTC
Hi Vajonam,

The cases here are slightly different. I think you tested 4K drives with the released binaries, whereas he is looking at compiling and replacing newer zfs/zpool libraries and binaries for use with 4K drives.
--
This message posted from opensolaris.org
Alan
2011-04-08 12:33:19 UTC
I'm using some 2TB drives. Assuming they have 4K sectors, do I need to worry about stability?

(I'm willing to trade some performance for the amount of storage, but obviously not if it means putting data at risk.)
--
This message posted from opensolaris.org
Andre Lue
2011-04-08 13:38:22 UTC
I wouldn't explore it yet. I don't think the returns are yet worth the risk.
--
This message posted from opensolaris.org
Alan
2011-04-08 14:50:17 UTC
I'm about to hit the road for a bit, but wanted to post before I do.

1. Does this mean I should not be using 2TB drives?

2. Is there a command I can run to report the drives' sector size?

3. If it's 4K, is it possible to change it in EON, or is that something the drive controls?


(I found this thread: http://opensolaris.org/jive/thread.jspa?threadID=125702 but haven't had a chance to digest it yet. Leaving it here so it's easily findable.)
--
This message posted from opensolaris.org
Andre Lue
2011-04-08 17:36:51 UTC
1. Does this mean I should not be using 2TB drives?
As far as I know, 2TB drives are fine as long as 4K sectors or 512-byte emulation are not part of the equation.

2. Is there a command I can run to report what byte sectors the drives have?
run format
select a disk number (for example, 0)
type verify
look for "bytes/sector =" in the output

3. If it's 4K, is it possible to change it in EON, or is that something the drive controls?
This would require the recompile and replacement steps listed above. As far as I know, this is not yet an officially approved code change for future releases.
--
This message posted from opensolaris.org
Alan
2011-04-18 00:03:33 UTC
Below is what I get from following the format instructions. (The stuff inside **asterisks** is what I entered.)

######################################################################

coltrane:2:~#format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c1d0 <drive type unknown>
/***@0,0/pci-***@1f,2/***@0/***@0,0
1. c1d1 <drive type unknown>
/***@0,0/pci-***@1f,2/***@0/***@1,0
2. c2d0 <drive type unknown>
/***@0,0/pci-***@1f,2/***@1/***@0,0
3. c2d1 <drive type unknown>
/***@0,0/pci-***@1f,2/***@1/***@1,0
4. c3d0 <drive type unknown>
/***@0,0/pci-***@1f,5/***@0/***@0,0
5. c4d0 <DEFAULT cyl 60796 alt 2 hd 255 sec 252>
/***@0,0/pci-***@1f,5/***@1/***@0,0
Specify disk (enter its number): **5**

format> **verify**
Warning: Primary and backup labels do not match.

Warning: Check the current partitioning and 'label' the disk or use the
'backup' command.

Primary label contents:

Volume name = < >
ascii name = <DEFAULT cyl 60796 alt 2 hd 255 sec 252>
pcyl = 60798
ncyl = 60796
acyl = 2
bcyl = 0
nhead = 255
nsect = 252
Part Tag Flag Cylinders Size Blocks
0 unassigned wm 0 0 (0/0/0) 0
1 unassigned wm 0 0 (0/0/0) 0
2 backup wu 0 - 60795 1.82TB (60796/0/0) 3906750960
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 unassigned wm 0 0 (0/0/0) 0
7 unassigned wm 0 0 (0/0/0) 0
8 boot wu 0 - 0 31.38MB (1/0/0) 64260
9 alternates wm 1 - 2 62.75MB (2/0/0) 128520

######################################################################

Note that the only disk that it identified automatically was disk 5. If I switched to a different disk, I would get something like this:


######################################################################

format> **disk**

Specify disk (enter its number)[0]: **4**


AVAILABLE DRIVE TYPES:
0. DEFAULT
1. other
Specify disk type (enter its number): 0
selecting c3d0
No current partition list
No defect list found
[disk formatted, no defect list found]
No Solaris fdisk partition found.
format> **verify**
WARNING - This disk may be in use by an application that has
modified the fdisk table. Ensure that this disk is
not currently in use before proceeding to use fdisk.
format>

######################################################################

So, the followup questions:

1. Should I be worried that only one of the disks is recognized by this command?

2. I'm not seeing the 'bytes/sector =' output. Am I missing it, or do I need to attack this another way?
--
This message posted from opensolaris.org
Andre Lue
2011-04-18 03:16:46 UTC
eon:/abyss/foo# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c5d0 <ST380013- 5MR089S-0001-74.50GB>
/***@0,0/pci-***@1f,2/***@0/***@0,0
Specify disk (enter its number): 0
selecting c5d0
NO Alt slice
No defect list found
[disk formatted, no defect list found]
/dev/dsk/c5d0s0 is part of active ZFS pool abyss. Please see zpool(1M).


FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
show - translate a disk address
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> verify

Volume name = < >
ascii name = <ST380013- 5MR089S-0001-74.50GB>
bytes/sector = 512
sectors = 156248063
accessible sectors = 156248030
Part Tag Flag First Sector Size Last Sector
0 usr wm 256 74.50GB 156231646
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 156231647 8.00MB 156248030

format>
--
This message posted from opensolaris.org
Andre Lue
2011-04-18 03:35:17 UTC
Another way to try:

fdisk /dev/rdsk/c5d0
Total disk size is 9725 cylinders
Cylinder size is 16065 (512 byte) blocks

Cylinders
Partition Status Type Start End Length %
========= ====== ============ ===== === ====== ===
1 EFI 0 9725 9726 100

SELECT ONE OF THE FOLLOWING:
1. Create a partition
2. Specify the active partition
3. Delete a partition
4. Change between Solaris and Solaris2 Partition IDs
5. Edit/View extended partitions
6. Exit (update disk configuration and exit)
7. Cancel (exit without updating disk configuration)
Enter Selection: 7
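As a sanity check on the fdisk geometry, capacity is cylinders times blocks-per-cylinder times block size; for the 9725-cylinder, 16065-blocks-per-cylinder geometry above:

```shell
# 9725 cylinders * 16065 blocks/cylinder * 512 bytes/block
cyls=9725 blkspercyl=16065 bs=512
bytes=$((cyls * blkspercyl * bs))
echo "$bytes bytes (~$((bytes / 1000000000)) GB)"
```

That works out to roughly 80 GB decimal (74.50 GiB), matching the ST380013 shown by format earlier.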
--
This message posted from opensolaris.org
Patrick
2011-05-03 11:20:47 UTC
Hi,

I have a raidz with 5x WD20EARS running and am still testing how it performs. My CIFS copy speed is 45-75 MByte/s (according to Total Commander) when moving big files like DVD ISOs.

Since I didn't dare to mess around with the edited zpool binary to correct the ashift: what would be the "best practice" setup? For now the speed is good enough for me, but would there be even better performance on the HP N36L? And will "the conversion" stress the drives more, or less?

What are your experiences?

Your new eon-fan...
Patrick
--
This message posted from opensolaris.org
Andre Lue
2011-05-03 16:02:15 UTC
I wouldn't explore it yet.
--
This message posted from opensolaris.org