Discussion:
[EON] Optimizing NAS Performance
Travis T
2011-04-13 01:57:35 UTC
Permalink
Hi all,

I've been running EON for almost a year now, and overall I've been really happy with it. I recently had a user-induced problem (one of four raidz drives was unplugged and required a resilver to bring back online), which led me to realize my build wasn't performing as well as I had hoped.

The resilver took just shy of a week. Once that was done, I began a scrub of the pool. After 3 days, I was up to about 12% and had a power outage during a storm. Restarted the scrub the next day, and it completed a week and a half from when I restarted it. It did find and repair data on one disk, which makes me glad I was using ZFS.

I also have a ZFS pool on a single disk that is running my VMWare virtual disks mapped to an ESXi server via NFS. My VM Guest speeds are pretty slow, and I believe it is due to performance of the EON box. I also experience very slow file transfers at times from the EON box to my windows computers (a mix of virtual servers and physical windows 7 machines). I would like to speed things up to a more acceptable level, but am not too sure where to start.

When initially testing EON, I did some transfer tests using IOZone, but didn't have anything to compare with. Where should I start in trying to optimize this NAS for better throughput to both my VMs and my clients?

Also, I'm using a Cisco enterprise-class gigabit switch for connectivity to my VMWare host and clients, and I rarely see a sustained transfer over 10MB/s to the file server. I just did a quick test of moving 7 movie files (single MKV files) totaling 10.5GB and got a peak transfer rate of 14MB/s reported in Windows. The transfer jumped back and forth from about 30% to 0% of a 1Gb connection but was never stable. See attached screenshot.

Where can I start looking for potential issues?
--
This message posted from opensolaris.org
Andre Lue
2011-04-13 16:13:39 UTC
Permalink
Hi Travist,

The following 2 kernel parameters can be adjusted if resilvering speed is too slow/fast:

zfs_resilver_delay /* number of ticks to delay resilver */
zfs_resilver_min_time_ms /* min millisecs to resilver per txg */

faster:
echo zfs_resilver_delay/W0|mdb -kw
echo zfs_resilver_min_time_ms/W0t3000|mdb -kw

slower:
echo zfs_resilver_delay/W2|mdb -kw
echo zfs_resilver_min_time_ms/W0t300|mdb -kw

Disclaimer: Use at your own risk. I have not used them extensively enough to make recommendations and I would do due diligence before modifying them in a production setting.

I would recommend testing the network and disks in an isolated fashion to verify each point is working as expected. Some posts to help:
http://eonstorage.blogspot.com/2010/12/benchmarking-eon-zfs-nas-performance.html
dladm show-link bge0 -s -i 1

http://eonstorage.blogspot.com/2010/03/whats-best-pool-to-build-with-3-or-4.html

There is also a diskspeed program included in EON to give a rough idea of individual disk read performance (test each of them). Remember, your pool is limited by its slowest member in a RAID-Z. Also, there are many factors that contribute, like workload, IOPS per disk, and more.
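For a rough per-disk check alongside diskspeed, a timed sequential read with dd gives a comparable number. This is a sketch, not the bundled tool; the device path in the usage note is an example (substitute your own from the format output), and raw-device reads need root:

```shell
# Rough sequential-read benchmark for one disk (or any file/device path).
read_speed() {
  dev="$1"      # e.g. /dev/rdsk/c0t2d0s0 (example path)
  count="$2"    # number of 1 MB blocks to read
  start=$(date +%s)
  dd if="$dev" of=/dev/null bs=1048576 count="$count" 2>/dev/null
  secs=$(( $(date +%s) - start ))
  [ "$secs" -lt 1 ] && secs=1     # avoid divide-by-zero on very fast reads
  echo "$dev: ~$(( count / secs )) MB/sec"
}

# Usage (hypothetical device): read_speed /dev/rdsk/c0t2d0s0 256
```

Running it against each pool member in turn makes a slow disk stand out.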

14MB/s does seem on the low side, but is that a sync or async workload? What you are asking has no silver-bullet answer. You have to work thoroughly through the specs of the hardware/software combo and see if you can isolate the problem point, if there is one, or what may be causing the issue.

I would start by listing the hardware and disk specs.

Use the included tools (zpool iostat, dladm, etc.) to verify what's happening.

Hope that helps
--
This message posted from opensolaris.org
Travis T
2011-05-06 01:01:09 UTC
Permalink
Sorry for not following up sooner. I have been out of town for some time.

Not too concerned with resilvering time, but I wonder if the slow times are symptoms of another problem.

I tried your suggestions. I was unsuccessful with the instructions in the first link: wget is not installed, and I'm having trouble getting the binary kit installed. I have downloaded the bin-130a files (a-e) and copied them to my zpool. I'm not sure where to go from here, as the instructions on your site show unzipping a tgz file.

Ran dladm command (see attachment for dump). After starting the dump, I copied a 44GB file over to my zpool. You can see this where the packet count jumps to 15176. The download stalled so I cancelled it.

diskspeed.sh output ->

The current rpm value 0 is invalid, adjusting it to 3600
The current rpm value 0 is invalid, adjusting it to 3600
configured 30358 MB/sec
c0t0d0 144 MB/sec
c0t1d0 117 MB/sec
c0t2d0 19 MB/sec
c0t3d0 141 MB/sec
c0t4d0 21901 MB/sec
c1t0d0 19 MB/sec

This output doesn't seem right.

I'm really having trouble deciphering the output of these commands to pinpoint a problem, so I'm hoping by posting the output, you or someone else more knowledgeable with Solaris than I am can help point me in the right direction.

Hardware specs are as follows:

Motherboard:
MSI 870A-G54 AM3 AMD 870 SATA 6Gb/s USB 3.0 ATX AMD Motherboard

CPU:
AMD Sempron 140 Sargas 2.7GHz Socket AM3 45W Single-Core Processor SDX140HBGQBOX

RAM:
Kingston ValueRAM 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 1333 (PC3 10600) Desktop Memory Model KVR1333D3K2/4GR

Drives:
Western Digital Caviar Blue WD10EALS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal Hard Drive -Bare Drive (x4)

I have one other drive installed that I'm using for my VM virtual disk storage in a single disk configuration, but I don't recall the make/model off hand. If it's important, I can open the case to see.

Travis
--
This message posted from opensolaris.org
Andre Lue
2011-05-06 13:55:24 UTC
Permalink
Hi Travist,

The wget instructions stated you had to have a previous release of the binary kit already installed; otherwise you would have to transfer the files via sftp, smb, etc.

After locating all the files (bin-130aa ... bin-130ae) in /tmp:
http://eonstorage.blogspot.com/2010/06/eon-060-zfs-binary-kit-snv130-released.html
cat bin-130a[a-z] > bin-130.tgz

You now have a complete tar file:
cd /your_zpool
mkdir local
cd local
gzip -dc /tmp/bin-130.tgz | tar -xf -
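As a quick sanity check that the cat reassembly pattern behaves as expected, the same thing can be tried on throwaway files first (file names here are illustrative, not the real kit pieces):

```shell
# Demonstrate split-file reassembly on dummy parts before touching the kit.
printf 'part1' > /tmp/bin-demo.aa
printf 'part2' > /tmp/bin-demo.ab
cat /tmp/bin-demo.a[a-b] > /tmp/bin-demo.joined
cat /tmp/bin-demo.joined    # prints: part1part2
```

The shell glob expands the parts in alphabetical order, which is why the pieces concatenate back in sequence.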

diskspeed has a flaw: if a disk write produces an error and returns quickly (i.e. "writes" the file in a very short time), the file size is divided by that short write time, producing an incorrect answer. The rule of thumb: if the answer looks ridiculously high, the write to that disk produced an error and the result should be discarded, e.g.:

configured 30358 MB/sec
c0t4d0 21901 MB/sec

The other values look OK, except that if c0t2d0 (19 MB/sec) and c1t0d0 (19 MB/sec) are not USB and are part of your pool, they could be slowing things down.
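The arithmetic behind that flaw is easy to reproduce; the numbers below are made up for illustration:

```shell
# A write that errors out and returns almost instantly divides the file
# size by a tiny elapsed time, inflating the MB/sec figure.
size_mb=1024
good_secs=8     # a healthy write taking 8 seconds
bad_secs=1      # an errored write returning in ~1 second
echo "healthy disk: $(( size_mb / good_secs )) MB/sec"   # prints 128 MB/sec
echo "errored disk: $(( size_mb / bad_secs )) MB/sec"    # prints 1024 MB/sec
```

The same size over a near-zero elapsed time is why the "configured" and c0t4d0 lines report impossible throughput.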

Your dladm output shows a peak 1-second value of 43Mb/s, followed by 25.7Mb/s; it looks like the burst of traffic went back to an average of <32Kb/s after that. I don't see anything that looks like an issue here.

e1000g0 5 4580 0 4 548 0
e1000g0 28435 43140750 0 15176 1028282 0
e1000g0 17075 25781940 0 9120 618980 0
e1000g0 20 11698 0 36 36576 0
e1000g0 27 32420 0 16 2208 0

try iostat -En to see if the disks are reporting errors
--
This message posted from opensolaris.org
Travis T
2011-05-07 00:18:36 UTC
Permalink
I followed the instructions on the binary-kit install, and think it is working. I will test more later tonight or tomorrow.

Attached is the iostat output you requested. I don't see anything unusual except for the illegal-request count, and I'm not sure whether or how bad that is.
--
This message posted from opensolaris.org
Travis T
2011-05-10 23:42:19 UTC
Permalink
Andre,

Does the output I posted give you any indication of what may be going wrong?

Also, aside from the performance issues, I'm also having file transfers being interrupted mid-transfer with network unavailable messages in windows. My network is rock solid, so it's got to be something with EON. All of my clients that are windows 7 seem to have this problem. My 2003 boxes all seem to work fine (but slow - they are all VMWare hosts), with no loss of connectivity even with large transfers.

Could this be a compatibility issue with EON and Windows 7? I've heard many rumors of compatibility issues with different NAS software. Any thoughts on this?
--
This message posted from opensolaris.org
Andre Lue
2011-05-11 14:15:14 UTC
Permalink
Hi Travist,

The output posted was related to disks; there were no glaring indicators.

What kind of transfer is this (CIFS, NFS, rsync)? How is it mounted, if at all? Are there any errors or logs? What is the size of the transfer? What is the zpool structure? Can you attach zpool iostat -v 1 output while the transfer is going?

Anything is possible, but I would say it's more a driver or specific-hardware issue, as I've not seen this on hardware I've tested. That is not to say it's not possible. I will need more details on what is going on to be able to help.

Did you do the network test? What were your conclusions?
http://eonstorage.blogspot.com/2010/12/benchmarking-eon-zfs-nas-performance.html
--
This message posted from opensolaris.org
Travis T
2011-05-12 00:55:19 UTC
Permalink
The file transferred during the dladm output was a 44G folder of several video files. Things started smoothly with windows reporting speeds of close to 100MB/s transfer rates. After a few seconds, it dropped down to somewhere between 10 and 20, then I got an error saying network not available (see screenshot).

This is a CIFS transfer, using Windows 7 x64. It's mounted via Windows Explorer as a network drive. No error logs that I'm aware of.

zpool iostat -v 1 attached. The output of this slowed to probably about one every 10-30 seconds when I started the transfer. This transfer started at 87MB/s and was down to 7MB/s when I stopped it. What I included is about 2-3 minutes of the transfer.

I did complete the network test, and I capped out at 687 Mb/s at several different TCP window sizes. Those results are attached as well.
--
This message posted from opensolaris.org
Andre Lue
2011-05-12 13:30:40 UTC
Permalink
> zpool iostat -v 1 attached. The output of this slowed to probably about one every 10-30 seconds when I started the transfer. This transfer started at 87MB/s and was down to 7MB/s when I stopped it. What I included is about 2-3 minutes of the transfer.

When you say "The output of this slowed to probably about one every 10-30 seconds" were you viewing this on a remote ssh session or direct console on EON?

Can you share a picture of your setup to better understand the layout?
--
This message posted from opensolaris.org
Travis T
2011-05-12 13:59:43 UTC
Permalink
I don't have a terminal setup on the console of the EON, this was via ssh (putty).

Attached is a network diagram.
--
This message posted from opensolaris.org
Andre Lue
2011-05-12 20:04:33 UTC
Permalink
Something does seem wrong: "the output of this slowed to probably about one every 10-30 seconds", but zpool iostat was set for an update every second, so there should have been an update every second.

I'm not sure what it is. How long into the transfer does it happen? Is it possible to capture dladm and zpool iostat 1 sec updates of the transfer? Does /var/adm/messages show anything?

Which of these machines are running via ESXi and which are bare metal?
--
This message posted from opensolaris.org
Travis T
2011-05-13 00:57:21 UTC
Permalink
EON and the ESXi server are the only bare metal "servers". All clients are physical machines. EON is the NFS datastore for all virtual machines.

Another thing I've noticed that doesn't seem to be tied to anything else is that occasionally there is a long delay between entering the username and the password prompt during ssh logins. The login prompt appears right away; after inputting the username, it hangs for probably 1+ minutes, then I get the password prompt.

Here's a link to a screencast of exactly what is happening during a transfer from start to finish. dladm and zpool iostat running as well as the windows transfer window.

http://vimeo.com/23663418

Nothing in messages from today other than when I fat-fingered the password when logging in.
--
This message posted from opensolaris.org
Andre Lue
2011-05-13 14:22:25 UTC
Permalink
Hi Travist,

"All clients are physical machines." Did you mean "virtual" machines, referring to the all machines in the wired client VLAN?

Thanks for the video. Something definitely goes wrong around the 27th-29th second (approx 6 secs into the transfer, receiving 89-92Mb/s while simultaneously transmitting 1.9-2.2Mb/s): you can see the dladm output drop, and there is practically no IO showing from zpool iostat (it showed a max of 36 write IOs on predator). The other weird thing is the zpool output not updating at 1-sec intervals.

There is definitely something wrong here but what I do not know. Could be a network driver, network, disk, nfs or other protocol issue. The only thing I can recommend is stripping down to a basic/simple setup to try and isolate where the problem lies. The first thing I'd probably start with is using a simple link where the trunk is. How does a transfer from the laptop perform with a similar dump, dladm, zpool iostat?

The long delay ("hangs for 1+ minutes") between the username and the password prompt is also abnormal. Is this delay also there between when you run the ssh command and when you get the login prompt? What is that delay like? Maybe run "snoop -d e1000g0 -o /tmp/out" from the EON while doing only an ssh session, to capture it and send to me.

Typically, if the delay is between when you issue the ssh command and when you get the login prompt, it is usually a network or DNS (reverse lookup) issue.
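If the stall does turn out to be reverse-lookup related, disabling name lookups in sshd is one common mitigation. A sketch, assuming EON's sshd honors the stock options (Sun SSH and OpenSSH spell the option differently):

```
# /etc/ssh/sshd_config -- use whichever line your sshd understands:
LookupClientHostnames no    # Sun SSH
# UseDNS no                 # OpenSSH equivalent
```

Restart sshd after the change. Fixing the DNS/reverse-lookup entries themselves is the cleaner solution; this only removes the lookup from the login path.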
--
This message posted from opensolaris.org
Travis T
2011-05-13 16:06:21 UTC
Permalink
All clients are in the wired client vlan, and are physical (bare metal/not virtual) clients.

The predator zpool is for vm disk storage, so while I'm wanting to increase performance of that, that's not even on my scope right now.

The drives in the "cargo" zpool are shared via SMB in EON. I don't have any XP-based clients anymore, but I do have a couple of Server 2003 virtual machines. I have heard rumors that Windows 7 boxes sometimes have performance issues with various NAS OSs. I have tried transfers from my virtual Server 2k3 boxes, and while I don't get the errored-out transfers, the speed is also low (which I attribute to the single-disk zpool running 4-5 VMs, which cannot handle that kind of traffic efficiently). This test may indicate there is a problem with protocols, etc., between the EON and the Windows 7 (x64) box.

As for the SSH session, I have never seen a delay in getting the login prompt. The only delay has been between entering the username and the password prompt. Never a delay between entering the password and getting the shell prompt either. I've witnessed the same thing after elevating to root (using "su").

The other thing I noticed is that the delay in the video seems to come just before displaying the c0t2d0 drive, which also had an abnormally low output on diskspeed.sh. Coincidentally, this was also the drive that was inadvertently unplugged and caused me to have to resilver/scrub the zpool. The zpool successfully resilvered and scrubbed with 0 uncorrected errors.

I don't suspect network issues, but I will try to back it up and eliminate anything out of the path that is unnecessary. I'm starting to lean towards a compatibility issue between windows 7 and EON though. Maybe I will try to boot a livecd of linux or something on my desktop to see if any of the problems are duplicated.
--
This message posted from opensolaris.org
Andre Lue
2011-05-13 19:26:36 UTC
Permalink
Hi Travist,

Booting a live CD (Ubuntu, OpenIndiana, Solaris 11 Express) sounds like a good idea (maybe from one of the XBMC boxes), but testing a 44GB or similarly sized transfer may require a spare disk.

Are the XBMC boxes also Win 7?

If you can run the following while ssh-ing in and stop it after you get the prompt
snoop -d e1000g0 -o /tmp/out
--
This message posted from opensolaris.org
Travis T
2011-05-14 03:56:01 UTC
Permalink
Andre,

Tonight, I tried copying a video file (~1GB) from my desktop to my EON box. Timed out halfway through as usual. I then copied this file from my desktop to my server 2k3 vm. After the transfer finished, I tried transferring from the 2k3 box to EON. Same error a couple of minutes in. Tried from my laptop (Windows 7 x86) with same problem.

I just attempted using my desktop while booting from USB Ubuntu 11.04, 32-bit. Transfer of the same 1GB file screamed along until about 420MB were transferred. Error given was connection timed out.

It seems this is a problem with EON (not necessarily software), but I have no idea where to go from here.

FYI, the xbmc boxes are all linux-based, but I have no local storage on them. They boot from a CF drive and pull video files from the EON box. I've noticed problems lately with the performance of these as well. Movies (standard def) stutter or stop playing halfway through.

I just ssh'ed into the EON box, and noticed a delay after entering the password. Maybe not quite as long as some, but definitely delayed. Another oddity is that it shows the last login time and date, which are correct, but the computer name is not correct. It shows a computer that I haven't powered on in months. Not sure if this helps to pinpoint where the problem could be, but thought it would be worth mentioning.

TravisT
--
This message posted from opensolaris.org
Andre Lue
2011-05-15 16:19:46 UTC
Permalink
Hi Travist,

This sounds very much like a network related issue. I notice this board uses a "Realtek 8111DL" which historically has been a problematic nic for opensolaris.

Some people have used the gani driver with better success than the rge driver. The other thing: I notice you mentioned e1000g0, which is an Intel driver. Did you add a NIC?

I would suggest trying the http://sites.google.com/site/eonstorage/begin temporarily on different hardware (maybe one of the XBMC boxes) to rule out the hardware/nic.

It is getting that name from host resolution or DNS, so that should be investigated.

Did you run the snoop capture for ssh as requested?
--
This message posted from opensolaris.org
Travis T
2011-05-16 01:38:53 UTC
Permalink
The nic is an add-on Intel board, because I was having problems with the onboard nic when I initially set the box up. I'm glad you mentioned that, because I think the initial problem I had with the onboard nic was very similar to this. Seems like downloads were slow and would time out. I believe I posted on here and you helped me out then, so I will have to dig that out to see if the symptoms were the same.

The XBMC boxes have no local storage (only a CF card to boot), so I'm not sure what good it would do to run EON on them. From a hardware testing standpoint, I see the benefit, but I have no spare HD's at this point, so I may have to wait on this option.

I found the issue with the name resolution. My DHCP server's credentials had expired, and it was not updating DNS as it handed out leases. The IP of the box I was remoting into the EON from matched a stale name in my DNS server, so that's the name it resolved. That is working like it should now.

I haven't run the snoop yet. I will attempt to do so tonight and post the file.

Any indication of why the output hangs each time just before the disk that failed? Is that coincidence?
--
This message posted from opensolaris.org
Travis T
2011-05-16 01:43:36 UTC
Permalink
Andre,

If you have time, give this thread a quick read. Symptoms seem identical at this point.

http://opensolaris.org/jive/thread.jspa?messageID=492562&#492562

--
This message posted from opensolaris.org
Andre Lue
2011-05-16 04:20:09 UTC
Permalink
That thread seemed resolved with you using an intel nic and transferring north of 750GB, so I'm a little lost here???
--
This message posted from opensolaris.org
Travis T
2011-05-16 04:26:31 UTC
Permalink
It was, but since that one drive was unplugged, it seems to have come back to haunt me. I wonder if the problems were related to the onboard NIC at all, or if something else was acting up. I'd like to mount an NFS share through Windows and see if an unauthenticated transfer has problems. I'm wondering if this could be an authentication problem? No indication of it, but I'd like to rule that piece out since I'm using AD integration.
--
This message posted from opensolaris.org
Andre Lue
2011-05-17 13:31:17 UTC
Permalink
Hi Travist,

It is possible it's AD related, but there are a few possibilities here. To rule them out, I think it would be best to go back to a basic setup and re-introduce the pieces in a controlled fashion.

-Disk/zpool only: as a user logged in from the console, copy a payload greater than or equal to the problem payload within the zpool.
-zpool + networking + various apps: assuming that succeeds, test various applications transferring a similar payload (sftp, cifs/smb, nfs, rsync, etc.)

I know this sounds tedious, but without a controlled approach it's not easy to pinpoint where the problem lies.
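The disk/zpool-only step can be scripted roughly like this (the directory is an example; point it at a dataset on the pool under test):

```shell
# Time a copy entirely within the pool, taking network and protocols out
# of the picture. Sketch only; adjust the payload size to match the
# problem transfer.
copy_test() {
  dir="$1"                                   # e.g. /cargo (example path)
  dd if=/dev/zero of="$dir/payload" bs=1048576 count=64 2>/dev/null
  time cp "$dir/payload" "$dir/payload.copy"
  rm -f "$dir/payload" "$dir/payload.copy"
}

# Usage: copy_test /cargo
```

If this local copy is already slow, the network and CIFS/NFS layers can be ruled out and attention shifts to the disks or controller.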
--
This message posted from opensolaris.org
Travis T
2011-05-28 17:19:14 UTC
Permalink
I've been out of town for the last couple of weeks, so I haven't had a chance to try this until today. I did a copy from my zpool to another folder in the same zpool. Writes were very slow, and I ran zpool iostat -v 1 in another terminal window. The output came once every 8-10 seconds rather than every second. I think something is fundamentally wrong, either at a hardware or software level, within the EON box.
--
This message posted from opensolaris.org
Andre Lue
2011-06-02 04:24:42 UTC
Permalink
Hi Travist,

The zpool iostat -v 1 needs to be run directly from the EON console, not from a remote terminal. If remote, the network or some related factor could still be a variable.

If it is run from the console and still does not update every 1 sec (instead taking the 8-10 secs you described), I would suspect a hardware or driver issue (a disk member or the controller).
--
This message posted from opensolaris.org
Travis T
2011-06-04 15:45:29 UTC
Permalink
Andre,

I was finally able to connect a monitor/keyboard to the console to perform this test. The results were pretty much the same. When running iostat without the -v, the output came at an acceptable rate (~every second). When adding the -v, it seemed to pause for several seconds almost every time at c0t2d0. Again, this is the disk that was disconnected and forced me to resilver the zpool not long before I noticed these problems. I'm not sure if there's any relation, or if there's any way to determine whether there are disk errors.
--
This message posted from opensolaris.org
Manojav Sridhar
2011-06-13 14:43:38 UTC
Permalink
Are your drives 4K or 512K sector size? I have seen some performance issues with 4k disks. I have since moved back to 512k disks.
--
This message posted from opensolaris.org
Travis T
2011-06-14 01:56:07 UTC
Permalink
I think you meant 512-byte and 4KB. The drives currently installed are the WD10EALS drives, which are 512-byte sector drives. I also found that when running df, the output stalls at the disk I had problems with, and it reports the disk temp as 255°C. I think the culprit is that disk, but until I get the data off of the raidz, I'm not willing to pull it back out to test further.

I'm in the process of rsyncing everything over to another box, so after that happens I may be able to investigate a little further.
--
This message posted from opensolaris.org