[ale] lsof and a hung system
- Subject: [ale] lsof and a hung system
- From: jkinney at jimkinney.us (Jim Kinney)
- Date: Tue, 20 Oct 2015 12:25:15 -0400
- In-reply-to: <CADvA-d=VhZ36PnQNJSS0APfgkQWdreEuaDL+_RcRUp_T9eQGOg@mail.gmail.com>
- References: <[email protected]> <[email protected]> <CAEo=5Pz0OU_Pf4NcadPxXWTV0zqN+bG7JtXyTSfTcS467n0mtA@mail.gmail.com> <CADvA-d=VhZ36PnQNJSS0APfgkQWdreEuaDL+_RcRUp_T9eQGOg@mail.gmail.com>
Yep. The 10G card driver had oopsed all over itself and wouldn't keep a
connection up. I initially tried stopping the network, unloading the
module, reloading it, and starting the network again, but even that
failed to reset the card completely. I needed to add a sleep 20 between
unloading the module and loading it again. Once the connection was
actually working, I rebooted the system cleanly to lop off the zombies,
and things were happily OK.
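
For anyone who wants to script it, here is a rough sketch of that reset
sequence in Python. It is not the exact set of commands I ran. It assumes
a SysV-style "service network stop/start" init script, and the module
name "fake10g" is a made-up placeholder, so substitute whatever driver
your card actually uses. Only the 20 second pause comes from the
description above. It defaults to a dry run that just prints the steps.

```python
import subprocess

def reset_steps(module, settle_seconds=20):
    """Build the ordered command list for a full NIC driver reset.

    Assumptions (not from the original post): the network is managed by
    a SysV-style 'network' service, and 'module' is the kernel module
    name of the 10G card's driver.
    """
    return [
        ["service", "network", "stop"],    # take the network down
        ["modprobe", "-r", module],        # unload the driver module
        ["sleep", str(settle_seconds)],    # give the card time to settle
        ["modprobe", module],              # load the driver again
        ["service", "network", "start"],   # bring the network back up
    ]

def run_reset(module, settle_seconds=20, dry_run=True):
    """Print each step; execute it only when dry_run is False (needs root)."""
    steps = reset_steps(module, settle_seconds)
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(cmd, check=True)
    return steps

if __name__ == "__main__":
    # "fake10g" is a hypothetical module name used only for illustration.
    run_reset("fake10g", dry_run=True)
```

Calling run_reset("fake10g", dry_run=False) as root would actually run
the commands. The pause between the unload and the load is the part that
made the difference here.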
On Tue, 2015-10-20 at 11:32 -0400, Ed Cashin wrote:
> On Mon, Oct 19, 2015 at 10:58 PM, Jim Kinney <jim.kinney at gmail.com>
> wrote:
> ...
> > Other system with same nfs mounted storage is fine. Storage server
> > is connected to both number crunchers by dedicated, unswitched
> > 10Gbps fiber ethernet.
>
> You mean with direct connections? In that case, the other number
> cruncher's connection could be fine, while the affected system could
> not be able to do networking to the NFS server (for some as yet
> undetermined reason), which could result in the behavior you describe
> if the NFS mount is "hard".
>
> --
> Ed Cashin <ecashin at noserose.net>