[no subject]
On Tue, 20 Apr 2004, Jeffrey B. Layton wrote:
> Well, my response is - it depends. How long is long? How important
> is it to you? Can you checkpoint or modify the code to checkpoint?
> Unfortunately, there are questions you have to answer. However, let
> me give you some things I think about.
>
> We run CFD codes (Computational Fluid Dynamics) to explore
> fluid flow over and in aircraft. The runs can last up to about 48
> hours. Our codes checkpoint themselves, so if we lose the nodes
> (or a node since we're running MPI codes), we just back up to the
> last checkpoint. Not a big deal. However, if we didn't checkpoint,
> I would think about it a bit. 48 hours is a long time. If the cluster
> dies at 47:59 I would be very upset. However, if we're running
> on a cluster with 256 nodes with UPS and if getting rid of UPS
> means I can get 60 more nodes, then perhaps I could just run my
> job on more nodes and get done faster (reducing the window
> of vulnerability if you will).
Jeff touches on an important point here: what happens when you lose one
node? You should think about the hardware's MTBF, how often you will lose
a single node, and what the consequences of that are. If your computations
run for a week without checkpoints and you have a lot of nodes, you will
have to worry about hardware failure as well as power. So good coding
practice involves checkpoints.
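
To put rough numbers on the MTBF point, here is a minimal sketch in
Python. It assumes nodes fail independently at a constant rate (an
exponential model), which is a simplification and not a measurement from
any real cluster. The 50,000-hour node MTBF and the 256-node count are
hypothetical inputs, and the function names are made up for illustration.

```python
import math

def cluster_mtbf_hours(node_mtbf_hours, nodes):
    """Mean time between failures of *any* node in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    failure rates simply add up.
    """
    return node_mtbf_hours / nodes

def survival_probability(node_mtbf_hours, nodes, run_hours):
    """Probability that no node fails during a run of run_hours."""
    return math.exp(-nodes * run_hours / node_mtbf_hours)

# Hypothetical inputs, chosen only to illustrate the arithmetic.
NODE_MTBF_HOURS = 50_000.0
NODES = 256
WEEK_HOURS = 7 * 24

print(f"cluster MTBF: {cluster_mtbf_hours(NODE_MTBF_HOURS, NODES):.1f} hours")
print(f"P(week-long run sees no node failure): "
      f"{survival_probability(NODE_MTBF_HOURS, NODES, WEEK_HOURS):.2f}")
```

With these made-up inputs the cluster as a whole loses a node roughly
every eight days, and a week-long run without checkpoints finishes
untouched less than half the time.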
At the risk of getting flamed: have you considered alternative
multiprocessor machines from Sun, SGI and the like? These systems have
great reliability and let you do things like put 60 GB of RAM on one machine.
> You also need to think about how long the UPSes will last. If you
> need to run 48 hours and the UPS kicks in at about 24 hours, will
> the UPS last 24 hours? If not, you will lose the job anyway (with
> no checkpointing) unless you get some really big UPSes. So in this
> case, UPS won't help much. However, it would help if you were
> only a few minutes away from completing a computation and
> just needed to finish (if it's a long run, the odds are this scenario
> won't happen often). If you could just touch a file and have your
> code recognize this so it could quickly checkpoint, then a UPS
> might be worth it (some of our codes do this).
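
The "touch a file" trick Jeff describes could look something like the
sketch below. This is not his group's code; the file names, the shape of
the state, and the choice to stop right after checkpointing are all
assumptions made for illustration. The loop body is a stand-in for one
solver iteration.

```python
import os
import pickle

def checkpoint_requested(sentinel_path):
    """True if an operator (or a UPS script) has touched the sentinel file."""
    return os.path.exists(sentinel_path)

def save_checkpoint(state, checkpoint_path):
    """Write the state to a temp file, then rename it into place.

    The rename means a crash mid-write never clobbers the last good
    checkpoint.
    """
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)

def run(steps, sentinel_path, checkpoint_path):
    """Run up to `steps` iterations; checkpoint and stop if asked to.

    Returns (state, stopped_early).
    """
    state = {"step": 0, "value": 0.0}
    for step in range(steps):
        state["value"] += 1.0      # stand-in for real work
        state["step"] = step + 1
        if checkpoint_requested(sentinel_path):
            save_checkpoint(state, checkpoint_path)
            os.remove(sentinel_path)   # acknowledge the request
            return state, True
    return state, False
```

A UPS monitoring script would then only need to create the sentinel file
when the power goes out; the job notices it at the end of the current
iteration, so the UPS has to last at least one iteration plus the time to
write the checkpoint.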
Most power problems where I used to work were very brief. I don't know
what things are like here in Georgia, or whether or not you have
backup generators, but a UPS that gives you 30 seconds will get you
through a lot of tough spots and will save you from losing your
computations because of a ten-second power outage. If you want to ride
over major blackouts, a small UPS and a generator will be more cost
effective than a large UPS, but again, what's the point when your node
MTBF is on the same order as the frequency of power outages?
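
As a back-of-the-envelope check on that last question, the sketch below
compares how often hardware interrupts a job with how often power does.
Every number in it is hypothetical (node MTBF, node count, outages per
year, and the share of outages short enough for a small UPS to cover),
and it assumes independent node failures at a constant rate.

```python
HOURS_PER_YEAR = 24 * 365

def hardware_failures_per_year(node_mtbf_hours, nodes):
    """Expected node failures per year across the whole cluster."""
    return nodes * HOURS_PER_YEAR / node_mtbf_hours

def fraction_removed_by_ups(hardware_per_year, outages_per_year,
                            share_of_outages_covered):
    """Share of all job interruptions that a UPS would prevent."""
    total = hardware_per_year + outages_per_year
    return outages_per_year * share_of_outages_covered / total

# Hypothetical inputs, for illustration only.
hardware = hardware_failures_per_year(node_mtbf_hours=50_000.0, nodes=256)
saved = fraction_removed_by_ups(hardware, outages_per_year=24,
                                share_of_outages_covered=0.9)

print(f"hardware interruptions per year: {hardware:.1f}")
print(f"fraction of interruptions a UPS removes: {saved:.2f}")
```

With these made-up numbers the UPS removes a bit under a third of the
interruptions. The rest are node failures, which only checkpoints
address.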
bjorn
</pre>
<!--X-Body-of-Message-End-->
<!--X-MsgBody-End-->
<!--X-Follow-Ups-->
<hr>
<ul><li><strong>Follow-Ups</strong>:
<ul>
<li><strong><a name="00820" href="msg00820.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> dhurst at kennesaw.edu (Dow Hurst)</li></ul></li>
</ul></li></ul>
<!--X-Follow-Ups-End-->
<!--X-References-->
<ul><li><strong>References</strong>:
<ul>
<li><strong><a name="00736" href="msg00736.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> ChrisColeman at mail.clayton.edu (Chris Coleman)</li></ul></li>
<li><strong><a name="00743" href="msg00743.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> laytonjb at charter.net (Jeffrey B. Layton)</li></ul></li>
<li><strong><a name="00786" href="msg00786.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> dhurst at kennesaw.edu (Dow Hurst)</li></ul></li>
<li><strong><a name="00807" href="msg00807.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> laytonjb at charter.net (Jeffrey B. Layton)</li></ul></li>
</ul></li></ul>
<!--X-References-End-->
<!--X-BotPNI-->
<ul>
<li>Prev by Date:
<strong><a href="msg00814.html">[ale] ipv6 dns requests???</a></strong>
</li>
<li>Next by Date:
<strong><a href="msg00816.html">[ale] ipv6 dns requests???</a></strong>
</li>
<li>Previous by thread:
<strong><a href="msg00817.html">[ale] Linux Cluster Server Room</a></strong>
</li>
<li>Next by thread:
<strong><a href="msg00820.html">[ale] Linux Cluster Server Room</a></strong>
</li>
<li>Index(es):
<ul>
<li><a href="maillist.html#00815"><strong>Date</strong></a></li>
<li><a href="threads.html#00815"><strong>Thread</strong></a></li>
</ul>
</li>
</ul>
<!--X-BotPNI-End-->
<!--X-User-Footer-->
<!--X-User-Footer-End-->
</body>
</html>