
[ale] Linux Cluster Server Room



Thanks

Jonathan

On Tue, 2004-04-20 at 07:11, Jeffrey B. Layton wrote:
> Well, my response is - it depends. How long is long? How important
> is it to you? Can you checkpoint or modify the code to checkpoint?
> Unfortunately, these are questions you have to answer. However, let
> me give you some things I think about.
> 
> We run CFD codes (Computational Fluid Dynamics) to explore
> fluid flow over and in aircraft. The runs can last up to about 48
> hours. Our codes checkpoint themselves, so if we lose the nodes
> (or a node since we're running MPI codes), we just back up to the
> last checkpoint. Not a big deal. However, if we didn't checkpoint,
> I would think about it a bit. 48 hours is a long time. If the cluster
> dies at 47:59 I would be very upset. However, if we're running
> on a cluster with 256 nodes with UPS, and if getting rid of the UPS
> means I can get 60 more nodes, then perhaps I could just run my
> job on the extra nodes and get done faster (reducing the window
> of vulnerability, if you will).
> 
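> To put some very rough numbers on that trade-off (using the figures
> above, and assuming near-linear MPI scaling, which is optimistic):
> 
>   /* Sketch: skip the UPSes, buy extra nodes, finish the job sooner. */
>   #include <stdio.h>
> 
>   int main(void)
>   {
>       double base_nodes  = 256.0;   /* nodes you'd have with UPSes everywhere */
>       double base_hours  = 48.0;    /* wall-clock time of the long run */
>       double extra_nodes = 60.0;    /* nodes bought with the money saved on UPSes */
> 
>       /* idealized scaling: runtime shrinks with the node ratio */
>       double new_hours = base_hours * base_nodes / (base_nodes + extra_nodes);
>       printf("run shrinks from %.0f to about %.1f hours\n", base_hours, new_hours);
>       return 0;
>   }
> 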
> You also need to think about how long the UPSes will last. If you
> need to run 48 hours and the UPS kicks in at about the 24-hour mark,
> will the UPS last the remaining 24 hours? If not, you will lose the job
> anyway (with no checkpointing) unless you get some really big UPSes. So
> in this case, a UPS won't help much. However, it would help if you were
> only a few minutes away from completing a computation and
> just needed to finish (if it's a long run, the odds are this scenario
> won't happen often). If you could just touch a file and have your
> code recognize this so it could quickly checkpoint, then a UPS
> might be worth it (some of our codes do this).
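> 
> The trigger itself is trivial. A bare-bones sketch (do_timestep() and
> write_checkpoint() are just stand-ins for whatever your code really does):
> 
>   #include <stdio.h>
>   #include <unistd.h>    /* access() */
> 
>   static void do_timestep(int step)      { (void)step; /* real solver work */ }
>   static void write_checkpoint(int step) { printf("checkpoint at step %d\n", step); }
> 
>   int main(void)
>   {
>       const char *trigger = "CHECKPOINT_NOW";   /* "touch CHECKPOINT_NOW" requests a save */
> 
>       for (int step = 0; step < 100000; step++) {
>           do_timestep(step);
>           if (access(trigger, F_OK) == 0) {     /* has someone touched the file? */
>               write_checkpoint(step);
>               remove(trigger);                  /* clear the request */
>               break;                            /* or keep running after saving */
>           }
>       }
>       return 0;
>   }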
> 
> Unfortunately, there is no easy answer. You need to figure out
> the answers yourself :)
> 
> Good Luck!
> 
> Jeff
> 
> P.S. Dow - notice my address change. You can talk to me off
> line if you want.
> 
> > I understand your philosophy here, but I have a question.  What if the 
> > calculations are long and costly to restart?  Shouldn't I look at the 
> > value of spent computation that might have to be done over if I lose 
> > power?  The code I am most concerned about running on the cluster may 
> > or may not be checkpointable.  I think it might be, but I know my 
> > users and they won't want power to be an issue with predicting when 
> > their jobs will finish. ;-)
> >
> > Are Best UPSes better performing than Tripplite or APC?  I have 
> > experience with Tripplite, APC, and Liebert so far and have never used 
> > Best.  I like the toughness and quality of the enclosures of the APC 
> > and Liebert units.  I like the quality of all three.  I like the performance 
> > and cost of APC and Tripplite.  Tripplite's cases or enclosures on the 
> > low end aren't as nice as APC's, but when you get to the high-end UPSes 
> > they have nice rack enclosures.  Performance-wise, I haven't been able to 
> > tell a difference between the two.  Heat production leans toward APC 
> > producing less overall.
> >
> > What do you mean by getting the wrong power factor conversion?  Do you 
> > mean getting 120V at 60Hz vs. 220V at 60Hz on the output outlets?
> >
> > I appreciate all this advice!
> > Dow
> >
> >
> >
> > Jeffrey B. Layton wrote:
> >
> >> I'll give you my 2 cents about clusters and UPSes if you wish.
> >>
> >> A good cluster configuration will treat each compute node as
> >> an appliance. You don't really care about it too much and it
> >> doesn't hold any data of any importance. What you care about
> >> is the master node and/or wherever the data is stored. These
> >> machines can each have their own UPS, or a single UPS can cover
> >> all of them (there may be more than one such machine). Then take the cost
> >> savings (if you can) and put them into more nodes, or a better
> >> interconnect (if needed), or a large file system, or a better
> >> backup system, or .... well, you get the picture.
> >>
> >> Putting a UPS on only the important parts of the
> >> cluster will save you money, time, and headaches. However,
> >> if you put a cluster in a server room you can have all power
> >> covered by a single huge UPS and probably a diesel backup
> >> generator as well. This goes back to the purpose of a server
> >> room - to support independent servers, not clusters. While this
> >> is nice and good, it is somewhat wasteful. If you could have
> >> a combination of UPS/Diesel backed power and just regular
> >> conditioned power, that would be more economical. However,
> >> the budgets for clusters (computing) and the budget for facilities
> >> are never really seen as related by management. Even though
> >> they come out of the same overall pot within the company (or
> >> university), management has a tendency to compartmentalize
> >> things for easy managing (and the definite lack of brain power
> >> on the part of most managers). Try arguing that you really
> >> don't need the giant UPS/Diesel combo and you will get IT
> >> managers screaming all sorts of things about you. Sigh.
> >>
> >> Of course, these comments depend on your cluster configuration.
> >> If you are running a global filesystem across all of the nodes,
> >> so that each node has part of the filesystem, then you might
> >> want to think about a good UPS for all of the nodes (try
> >> restoring a 20 TB global filesystem from backup after a
> >> power outage).
> >>
> >> Good Luck!
> >>
> >> Jeff
> >>
> >>> What type of UPS system are you using? Do most install a large UPS 
> >>> system for the entire server room? If so, how much will this cost?
> >>>
> >>> Thanks,
> >>> Chris
> >>>
> >>> -----Original Message-----
> >>> From: Dow Hurst [mailto:dhurst at kennesaw.edu]
> >>> Sent: Monday, April 12, 2004 11:20 AM
> >>> To: ale
> >>> Subject: Re: [ale] Linux Cluster Server Room
> >>>
> >>>
> >>> Thanks Jonathan!  That is exactly the kind of ballpark I needed!  I
> >>> don't need the vendors right now as we are still kicking around ideas.
> >>> If anyone would throw some specs or ideas out there, I'd appreciate it.
> >>> Here is a quick question: is planning for double your planned load a
> >>> good rule?  I would think that would be a good idea.  How about backup
> >>> cooling if the main unit dies?  The firesafe is one I had not thought of.
> >>> Dow
> >>>
> >>>
> >>> Jonathan Glass (IBB) wrote:
> >>>
> >>>> How big are the Opteron nodes?  Are they 1, 2, or 4U?  How big are the
> >>>> power supplies?  What is the maximum draw you expect?  Convert that
> >>>> number to figure out how much heat dissipation you'll need to handle.
> >>>>
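> >>>> The conversion itself is easy; a quick sketch (the per-node wattage
> >>>> here is only a placeholder; plug in your real numbers):
> >>>>
> >>>>   #include <stdio.h>
> >>>>
> >>>>   int main(void)
> >>>>   {
> >>>>       double nodes       = 33.0;     /* worst-case node count */
> >>>>       double watts_each  = 160.0;    /* assumed draw per 1U node */
> >>>>       double heat_btu_hr = nodes * watts_each * 3.412;   /* 1 W = ~3.412 BTU/hr */
> >>>>       double ac_btu_hr   = 3.0 * 12000.0;   /* 1 ton of cooling = 12,000 BTU/hr */
> >>>>
> >>>>       printf("heat load %.0f BTU/hr vs. A/C capacity %.0f BTU/hr\n",
> >>>>              heat_btu_hr, ac_btu_hr);
> >>>>       return 0;
> >>>>   }
> >>>>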
> >>>> I have a 3-ton A/C unit in my 14|15 x 14|15 server room, and the 24-33
> >>>> node cluster I just spec'd out from IBM (1U, Dual Opterons) was rated at
> >>>> a max heat dissipation (is this the right word?) of 18,000 BTU.
> >>>> According to my A/C guy, the 3-ton unit can handle a max of 36,000 BTU,
> >>>> so I'm well inside my limits.  Getting the 3-ton unit installed in the
> >>>> drop-down ceiling, including installing new chilled water lines, was
> >>>> around $20K.
> >>>>
> >>>> I do have sprinkler fire protection, but that room is set to release its
> >>>> water supply independently of the other rooms.  Also, supposedly, the
> >>>> fire sprinkler heads (whatever they're called) withstand considerably
> >>>> more heat than normal ones.  So, the reasoning goes, if it gets hot
> >>>> enough for those to go off, I have bigger problems than just water.
> >>>> Thus, I have a fire safe nearby (in the same bldg...yeah, yeah, I know;
> >>>> off-site storage!) that holds my tapes, and will shortly hold a hardware
> >>>> inventory and admin password list for all my servers.
> >>>>
> >>>> If you want my list of vendors, send me an email off-list, or call my
> >>>> office, and I'll see if I can track down the DPOs for you.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Jonathan Glass
> >>>>
> >>>> On Fri, 2004-04-09 at 17:35, Dow Hurst wrote:
> >>>>
> >>>>> If I needed to take an existing space of 400 square feet w/8'
> >>>>> ceiling (20'x20'x8') and add A/C and fire protection for a server
> >>>>> room, what kind of cost would be incurred?  Sounds like an algebra
> >>>>> problem from high school, doesn't it?  Let's say a full 84" rack of
> >>>>> 4-CPU Opteron nodes and supporting hardware were in the room.  Does
> >>>>> anyone have any ballpark figures they could throw out there?  Any
> >>>>> links I could be pointed to?
> >>>>> Thanks a bunch,
> >>>>> Dow
> >>>>>
> >>>>>
> >>>>> PS.  I'd like some other type of fire protection than sprinkler
> >>>>> heads. ;-)
> >>>>
> 
> 
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale

