Discussion:
EON: smb transfers timeout after several minutes
Travis T
2010-07-20 12:11:34 UTC
Just reconfigured my EON to eliminate any adverse effects of things I changed while I was still learning and didn't know better...

Everything is working like it should (as far as I can tell), but when transferring large sets of files, it seems to time out and lose connectivity. The first time I noticed this, I also saw an error message saying that /tmp had run out of space.

After some research, I added "size=4096" in the vfstab file. I have 4G of RAM, so I thought this would be a good place to start. I started the transfers again, and found this morning that they had lost connectivity again. This time there were no messages on screen about /tmp sizes or about the global catalog being unreachable (which happened the first time). This morning the shares seem accessible, without any intervention, from the same Windows boxes that failed to complete their transfers.

As a solaris newb, I'd appreciate at least the direction in which I should look.
--
This message posted from opensolaris.org
Andre Lue
2010-07-20 14:19:42 UTC
Double check that you have zfs swap configured. Running
swap -sh
before and after adding it will help you gauge whether it was added or not.

http://eonstorage.blogspot.com/2009/02/adding-zfs-swap-to-eon.html
--
This message posted from opensolaris.org
Travis T
2010-07-20 23:27:37 UTC
I didn't see that on your site, thanks for the link. I need to dig through the blog posts there more to catch some of the tips/tricks.

I followed the instructions, and created an 8G swap in my zfs raidz pool. I checked via the swap -sh command and see it is active.

galaxy:1:~#swap -sh
total: 38M allocated + 19M reserved = 58M used, 11G available

I'm booting from a 4G USB drive and have 4G of RAM installed.

The original SMB transfer problem is still there. I'm seeing no errors logged on the screen of the EON server. Thoughts?
--
This message posted from opensolaris.org
Andre Lue
2010-07-21 03:33:41 UTC
Hi Travist,

If I understand correctly the transfer ends abruptly before completing and EON continues to function normally?

Can you give more details about your "large sets of files" and "lost connectivity"?

-Was this SMB transfer using samba or CIFS version?
-What was the size of the transfer?
-Does this happen with a smaller transfer? Let's say 1/2, 3/4 etc?
-Does the transfer end at the same point? possibly the same file?
-Were there any messages on the client at point of disconnect? What's client OS?
-Can you list the motherboard model and link to specs?
--
This message posted from opensolaris.org
Travis T
2010-07-21 12:22:13 UTC
Motherboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813130275

The "large transfer" was a folder containing 406G of files and folders, all varying in size. The transfer did not stop at the same folder/file each time, and I've tried multiple different large folders ranging from 100G to almost 1TB.

Single files don't seem to have any problems. I don't remember the exact message received, but it was something to the effect of the path is no longer accessible, although after seeing the message, I could still browse to the share.

The client OSes I've seen this problem with are Server 2003 and Windows 7 64-bit.

I found a thread last night on smallnetbuilder between you and rousch00 and saw the recommendation to use the smb build of eon. I was using CIFS and have already started re-configuring. I tried just copying the x86.eon file over, but many of my configurations were lost. I decided to just start over.

Right now I'm stuck because I'm not sure how to join my MS domain. The 'smbadm' command is not found.
--
This message posted from opensolaris.org
Andre Lue
2010-07-21 14:18:10 UTC
Hmmm, sounds like you have switched to the Samba version. That would explain why you are not finding smbadm (it comes with the CIFS version). For Samba, everything is controlled via smb.conf. If you are not comfortable with editing or learning this file, I would stick with the CIFS version.

Can you try to get the exact message? I'm wondering if the Realtek nic is dropping for a sec.

Can you run this command in a terminal while transferring, and capture about 10 secs before and after the problem/message occurs?
dladm show-link -s -i 1 rge0
--
This message posted from opensolaris.org
Travis T
2010-07-21 21:56:14 UTC
Dre,

Thanks for the help. I gave the SMB image a try. The smb.conf defeated me (for now). I am back to re-installing the CIFS image. I plan to create a full log of all the commands from start to finish so I can do a good writeup for inexperienced EON users like me (I'm getting real good at the basic install - I think this is #6 or 7 now in two weeks!). If you're interested, please let me know.

I hope I will be able to post the results you requested tonight, as I'm trying to move my data over to the EON box ASAP to free up my current file server for VMWare.
--
This message posted from opensolaris.org
Travis T
2010-07-22 03:28:58 UTC
Dre,

I have a detailed log of the install with the cifs image. I ran into a problem with idmap, for which I found what seems to be a bug report. I have documented it in my notes.

It looks like the nic on the eon box is dropping out, and that is what is causing my problems. I happened to be ssh'd into the box from my desktop and had my laptop sitting next to me. I saw the transfer stop and my ssh session froze. I would have suspected my desktop if I hadn't been able to run a ping from my laptop at the same time. Without doing anything, the interface reset and the pings came back up on both remote boxes.

I didn't get a chance to pull the data you requested before finding this. Should I try and buy another nic and ditch the onboard? If so, any suggestions?
--
This message posted from opensolaris.org
Travis T
2010-07-22 04:45:54 UTC
More testing seems to show that the dropouts only occur when writing to a deduped share. I had a transfer going for quite some time, and as soon as I copied something to a dedup-enabled share, the nic dropped out.

Below are the logs you requested. Also attached is a text file with the same info, pasted one after another.

64961 0 23566 0 0 64961 0 23566 0 0
62077 0 22530 0 0 62077 0 22530 0 0
63384 0 23002 0 0 63384 0 23002 0 0
41600 0 15066 0 0 41600 0 15066 0 0
50328 0 18154 0 0 50328 0 18154 0 0
56252 0 20359 0 0 56252 0 20359 0 0
44442 0 15994 0 0 44442 0 15994 0 0
39658 0 14356 0 0 39658 0 14356 0 0
53221 0 24584 0 0 53221 0 24584 0 0
49963 0 26636 0 0 49963 0 26636 0 0
63201 0 33729 0 0 63201 0 33729 0 0
56772 0 30258 0 0 56772 0 30258 0 0
60350 0 32231 0 0 60350 0 32231 0 0
40142 0 21464 0 0 40142 0 21464 0 0
56125 0 30006 0 0 56125 0 30006 0 0
1934 1 1007 0 0 1934 1 1007 0 0
3 0 3 0 0 3 0 3 0 0
4 0 5 0 0 4 0 5 0 0
3 0 2 0 0 3 0 2 0 0
5 0 4 0 0 5 0 4 0 0
3 0 2 0 0 3 0 2 0 0
22 0 5 0 0 22 0 5 0 0
12 0 2 0 0 12 0 2 0 0
9 0 2 0 0 9 0 2 0 0
1 0 2 0 0 1 0 2 0 0
1 0 4 0 0 1 0 4 0 0


rge0 42244 63778779 0 15294 923380 0
rge0 53018 79996708 0 24972 1468465 0
rge0 50009 75368002 0 26654 1553789 0
rge0 63158 95540766 0 33705 1966471 0
rge0 56206 84638248 0 29954 1749504 0
rge0 60733 91822608 0 32441 1895361 0
rge0 40007 60254653 0 21386 1249582 0
rge0 55711 84051928 1 29761 1738799 0
rge0 7 0 0 5 836 0
rge0 3 0 0 3 496 0
rge0 4 0 0 5 1300 0
rge0 3 0 0 2 340 0
rge0 5 0 0 4 1608 0
rge0 3 0 0 2 340 0
LINK IPACKETS RBYTES IERRORS OPACKETS OBYTES OERRORS
rge0 22 0 0 5 982 0
rge0 12 0 0 2 420 0
rge0 9 0 0 2 340 0
rge0 1 0 0 2 340 0
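
For anyone reading a capture like the one above, here is a minimal Python sketch (not part of the original exchange) that finds the sample where the receive counters collapse. It assumes each row is one interval sample in the column order given by the LINK/IPACKETS/RBYTES header; the function names and both byte thresholds are hypothetical, chosen only for illustration.

```python
# A few rows copied from the rge0 capture above.
SAMPLE = """\
rge0 40007 60254653 0 21386 1249582 0
rge0 55711 84051928 1 29761 1738799 0
rge0 7 0 0 5 836 0
rge0 3 0 0 3 496 0
LINK IPACKETS RBYTES IERRORS OPACKETS OBYTES OERRORS
rge0 22 0 0 5 982 0
"""

FIELDS = ("ipackets", "rbytes", "ierrors", "opackets", "obytes", "oerrors")


def parse_rows(text):
    """Turn capture text into a list of dicts, skipping repeated header lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7 or parts[0] == "LINK":
            continue  # blank line, header, or anything unexpected
        row = dict(zip(FIELDS, map(int, parts[1:])))
        row["link"] = parts[0]
        rows.append(row)
    return rows


def find_stall(rows, busy_bytes=1_000_000, idle_bytes=1_000):
    """Index of the first sample where RBYTES falls from busy to idle, else None.

    busy_bytes and idle_bytes are assumed thresholds, not values from the thread.
    """
    for i in range(1, len(rows)):
        if rows[i - 1]["rbytes"] >= busy_bytes and rows[i]["rbytes"] <= idle_bytes:
            return i
    return None


rows = parse_rows(SAMPLE)
stall = find_stall(rows)
if stall is not None:
    print("traffic stopped at sample", stall, "after", rows[stall - 1])
```

In the sample, the last busy row is also the only one with a non-zero IERRORS count, which is the kind of detail worth noting when deciding whether the nic or the driver is at fault.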
--
This message posted from opensolaris.org
Andre Lue
2010-07-22 14:03:50 UTC
Check whether /etc/nsswitch.conf has the line

hosts: files dns mdns
--
This message posted from opensolaris.org
Travis T
2010-07-22 17:39:49 UTC
dre,

After posting earlier, I rebooted and started another transfer to a non-deduped share. When I returned home, I found the system in the same state as last night - unresponsive via the network, could ping rge0 from console but nothing else. /var/adm/messages contained all info from reboot to current. I've included the output of all of the requested commands/files. I removed the "." from the nsswitch file so it would open easier in a browser.

Do you feel that this is a hardware problem or a software/config problem?

Thanks,
Travis
--
This message posted from opensolaris.org
Andre Lue
2010-07-22 19:38:41 UTC
The nss_mdns error can be fixed by changing the entries
hosts: files dns mdns
ipnodes: files dns mdns
to
hosts: files dns
ipnodes: files dns

or by uncommenting the following entry in /mnt/eon0/.exec (I recommend not running the service unless you use it)
/usr/lib/inet/mdnsd
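
The first option (dropping mdns from the hosts and ipnodes lines) can be sketched in Python as below. This is only an illustration working on the file's text: the function name is hypothetical, it assumes the simple "key: source source ..." layout shown above, and it does not write anything to disk. Keep a backup of /etc/nsswitch.conf before editing the real file.

```python
def strip_mdns(conf_text):
    """Return nsswitch.conf text with 'mdns' removed from hosts/ipnodes lines.

    Commented-out lines and all other databases are left untouched.
    """
    out = []
    for line in conf_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in ("hosts", "ipnodes"):
            sources = [s for s in rest.split() if s != "mdns"]
            line = f"{key}: {' '.join(sources)}"
        out.append(line)
    return "\n".join(out) + "\n"


before = "hosts: files dns mdns\nipnodes: files dns mdns\npasswd: files\n"
print(strip_mdns(before), end="")
```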

Right now I have a suspicion that the disconnect from the shares, followed by the non-responsiveness, has to do with the idmap and AD setup. Note this error, "Global Catalog servers not configured/discoverable":
idmapd[250]: [ID 696364 daemon.error] Degraded operation (Global Catalog servers not configured/discoverable). If you are running an SMB server in workgroup mode, or if you're not running an SMB server, then you can ignore this message

See some similar symptoms here:
http://www.nexenta.com/corp/component/fireboard/?func=view&id=745&view=flat&catid=6

http://osdir.com/ml/os.solaris.opensolaris.storage.general/2007-11/msg00383.html
--
This message posted from opensolaris.org
Travis T
2010-07-22 20:29:33 UTC
I'm using rge0. The intel nic didn't appear to be registered (by default at least) during the initial config.

I'll research the mdns to see if it's necessary and disable the svc.

I'm kind of thinking that the nic is dropping and the smb svc is losing its connection w/ AD, thus the message. I don't think the idmap would cause pings to be unroutable anywhere but locally, right? Unfortunately, I don't have the first idea of how to troubleshoot a nic problem like this on a solaris box...

I can test with the intel nic if you don't mind giving a quick lesson on activating it, if you think that may help.
--
This message posted from opensolaris.org
Andre Lue
2010-07-22 20:53:56 UTC
It's possible this rge Realtek driver has some issues with the Realtek hardware. In the past some people have used the gani driver.

It would be good to try the intel nic. Try disabling the rge nic in the bios and see if the e1000g0 intel nic will be recognized (although it threw an error in messages file).

Run ifconfig -a to see if it is recognized.

To enable it manually, you can try
ifconfig e1000g0 plumb
ifconfig e1000g0 ip netmask 255.255.255.x
--
This message posted from opensolaris.org
Travis T
2010-07-23 00:00:07 UTC
The only intel board I have is a pci fast ethernet board. I believe the command listed was for a gig board. The nic wasn't recognized on startup and I got several errors.

failed to plumb rge0
svc:/network/physical:default: Method "/lib/svc/method/net-physical" failed with exit status 96
network/physical:default misconfigured: transitioned to maintenance

Warning: No randomness provider enabled for /dev/random

If there is a different command to plumb the nic I have, I can try it - or if there is a board that you know will work that could be picked up locally I could possibly do that tomorrow.
--
This message posted from opensolaris.org
Andre Lue
2010-07-23 14:10:53 UTC
Not sure what's going on, but your messages file shows 2 nics: an Intel gigE nic (e1000g0, with a failure) and a Realtek nic (rge0, link up)
Jul 22 06:09:09 galaxy e1000g: [ID 801725 kern.warning] WARNING: pci8086,10f0 - e1000g[0] : Identify hardware failed

Jul 22 06:09:05 galaxy mac: [ID 435574 kern.info] NOTICE: rge0 link up, 1000 Mbps, full duplex

For an Intel fast ethernet nic, try the same ifconfig commands with device iprb0 instead of e1000g0

Try booting with the OEM image or the CD if you need to bypass the transitioned to maintenance roadblock.
--
This message posted from opensolaris.org
Travis T
2010-07-24 02:25:52 UTC
Alright. I booted from CD (removed usb drive), disabled onboard nic in bios and installed intel dual port 10/100 nic (p/n 703875-004) into the pci slot. After boot, ifconfig didn't show any physical interfaces. Checked the /var/adm/messages and saw the lines you referenced above. Tried plumbing both e1000g0 and iprb0 and both gave the following error:

ifconfig: cannot open link "e1000g0/iprb0": DLPI link does not exist

Any suggestions?
--
This message posted from opensolaris.org
Andre Lue
2010-07-26 14:40:09 UTC
Can you post the output for
ifconfig -a plumb
--
This message posted from opensolaris.org
Travis T
2010-07-27 00:55:20 UTC
With both the onboard NIC enabled and the Intel NIC installed in the PCI slot...

ifconfig -a

galaxy:1:~#ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
rge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 172.16.1.30 netmask ffff0000 broadcast 172.16.255.255
ether 40:61:86:f5:e3:1a
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

ifconfig -a plumb

galaxy:2:~#ifconfig -a plumb
ifconfig: SIOCSLIFNAME for ip: rge0: already exists

I tried these commands with the intel NIC removed and received the same output as above.
--
This message posted from opensolaris.org
Travis T
2010-07-27 02:48:21 UTC
BTW, I just ordered one of these - hopefully I'll receive it by the end of the week and see if it works.

http://www.newegg.com/Product/Product.aspx?Item=N82E16833106121&cm_re=intel_pro_1000-_-33-106-121-_-Product
--
This message posted from opensolaris.org
Andre Lue
2010-07-27 21:25:10 UTC
That card should work with the e1000g driver.
--
This message posted from opensolaris.org
Travis T
2010-07-28 00:58:48 UTC
Great. Thanks for all the help - hopefully I'll have good news to report in a few days. Once I get everything worked out, if you're interested I'll draft up my notes into a how-to enabling AD authentication from start to finish... just let me know.

Travis
--
This message posted from opensolaris.org
Andre Lue
2010-07-28 16:32:41 UTC
Sure, user experience documentation is always welcome. I'd add it to the user howto guide section. Here's an example I put up from Eoin
http://sites.google.com/site/eonstorage/active-directory-guide
--
This message posted from opensolaris.org
Travis T
2010-07-28 17:05:25 UTC
I've used that guide heavily, and it has helped me greatly. I've got SSH captures from start of install to completion of AD integration. I plan to do a write up explaining each step and also point out some of the important file locations/commands that have helped me troubleshoot along the way. Anyway, once I get it worked up I'll send it to you and you're more than welcome to use what you think is helpful.
--
This message posted from opensolaris.org
Travis T
2010-07-30 18:32:44 UTC
Ok, so I received two intel NICs yesterday, and installed one in my desktop and one in the EON box. I've been transferring files all night (so far about 750G) and things are going pretty smoothly. I had a couple of operator induced errors that set me back a little, but no disconnects from the domain that interrupted file transfers.

One problem I'm having is that file xfers seem to be pretty slow, but as of now I'm attributing that to the windows box I'm transferring from. My speeds tend to spike and drop the whole time, but I'm copying from a single sata drive to a raidz pool of 4 disks. The diskspeed results show ~130 MB/s for each of the EON drives, but across the network I'm not seeing the throughput I think I should be. It could also be related to the gigabit switch I'm running, but unless you have any suggestions I'll try tuning it more later.

Bottom line: I think the rge driver was causing my problems.
--
This message posted from opensolaris.org
Andre Lue
2010-08-03 16:18:32 UTC
Hi Travist,

You can run the following in 2 terminals to see whether the transfer rates make sense given what's happening on the pool and on the nic.
#1 terminal
zpool iostat -v 1

#2 terminal
dladm show-link -s -i 1 e1000g0
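
To compare the two outputs, it helps to turn the per-interval byte counts from dladm into MB/s. A small Python sketch, assuming a one-second interval, 1 MB = 1,000,000 bytes, and a nominal gigabit line rate of 125 MB/s; the function names are hypothetical, and the sample value is an RBYTES figure from the earlier rge0 capture in this thread.

```python
GIGABIT_MB_PER_SEC = 125.0  # 1000 Mbps / 8, ignoring protocol overhead


def mb_per_sec(byte_count, interval_s=1.0):
    """Convert bytes seen in one sampling interval to MB/s (decimal megabytes)."""
    return byte_count / 1_000_000 / interval_s


def percent_of_line_rate(byte_count, interval_s=1.0):
    """Rough share of a gigabit link used during the interval."""
    return 100.0 * mb_per_sec(byte_count, interval_s) / GIGABIT_MB_PER_SEC


sample_rbytes = 95540766  # one RBYTES sample from the rge0 capture
print(round(mb_per_sec(sample_rbytes), 1), "MB/s")
print(round(percent_of_line_rate(sample_rbytes), 1), "% of gigabit")
```

If the nic figure roughly tracks the pool's write bandwidth in zpool iostat, the network and the pool agree, and the bottleneck is more likely on the sending side.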
--
This message posted from opensolaris.org
Travis T
2010-08-04 02:43:23 UTC
I didn't see anything out of the ordinary while performing transfers. It could very well be attributed to the cheap gigabit switch that I'm using as well.

Possibly related though, all of a sudden when running diskspeed.sh I am getting the following:
galaxy:2:~#diskspeed.sh
The current rpm value 0 is invalid, adjusting it to 3600
The current rpm value 0 is invalid, adjusting it to 3600
configured 31657 MB/sec
c0t0d0 140 MB/sec
c0t1d0 148 MB/sec
c0t2d0 134 MB/sec
c0t3d0 130 MB/sec
c0t4d0 1084 MB/sec
c1t0d0 19 MB/sec

Not sure exactly how to decipher these results, specifically for c0t4d0. This disk was added as a single disk zfs device after the others were setup.
--
This message posted from opensolaris.org
Andre Lue
2010-08-05 05:32:01 UTC
Travist,

That script has some shortcomings. I would ignore the readings for c0t4d0 and the current rpm error. It gives an idea of the upper read speed limit of each disk.

Did the dladm output match the transfer rates you were seeing?
--
This message posted from opensolaris.org
Travis T
2010-08-05 14:08:05 UTC
The rates matched. I'll have to run another IOMeter test to see where things stand now that everything has settled in a little.

Last time I ran it from my Windows 7 pro box, I was seeing about 50MB/s write speeds to my 4 disk raidz pool, which was about half of the speeds of my initial testing. Again, those results could have been affected by the windows box.

I don't have any physical boxes capable of gigabit ethernet (only one was converted to my ESXi host), so the only testing I can do will be from a VM on that host. Not too sure what to expect, but I'll be happy to post the results.
--
This message posted from opensolaris.org
Andre Lue
2010-08-05 14:59:50 UTC
Not necessary Travist, as long as things are working ok.

Feel free to send your AD guide to eonstore at gmail dot com whenever you get around to it.
--
This message posted from opensolaris.org
Travis T
2010-07-22 12:52:03 UTC
I haven't rebooted, but I am getting a nss_mdns error (nscd[940]: [ID 131150 user.error]) that has filled the entire messages log. There is nothing in the log other than that.
--
This message posted from opensolaris.org
Andre Lue
2010-07-22 05:13:20 UTC
Can you attach /var/adm/messages if you have not rebooted since the error?
--
This message posted from opensolaris.org