
[no subject]



[...]


    ChangingLINKS> Can spammers figure out how to harvest the email
    ChangingLINKS> from that? Sure.

    ChangingLINKS> Will they harvest it? I strongly doubt it. In
    ChangingLINKS> short: Spammers are programmers, have access to
    ChangingLINKS> programmers (or buy ready made spamming
    ChangingLINKS> software). And writing such code would take *time*
    ChangingLINKS> which increases cost. Aside from the virus/

It took me all of about 5 minutes (start ~1109, working ~1116) to get
this, which works for a single page; most of that time was spent
remembering how to use HTML::TokeParser (and a quick trip to the
facilities :).


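#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple qw( get );
use HTML::TokeParser ();

# Page to scan, given as the only command-line argument.
my $url = shift
  or die "usage: $0 URL\n";

my $content = get( $url )
  or die "Can't fetch $url\n";

my $stream = HTML::TokeParser->new( \$content )
  or die "Can't create TokeParser: $!\n";

# Walk the token stream looking for the first anchor with a mailto: href.
while( my $t = $stream->get_token ) {
  if( $t->[0] eq 'S' and $t->[1] eq 'a'
      and ( $t->[2]->{href} || '' ) =~ /^mailto:/ ) {
    # The link text holds the munged address, e.g. "user at host".
    my $addr = $stream->get_trimmed_text( "/a" );
    $addr =~ s/\s+at\s+/\@/;   # undo the " at " obfuscation
    print $addr, "\n";
    last;                      # stop after the first match
  }
}

exit 0;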

__END__
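
The only "decoding" step in the script above is a single substitution.
As a minimal sketch of that step on its own (the helper name unmunge
and the example.org addresses are made up for illustration; nothing
here touches the network):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: undo the
# "user at host" rewriting by replacing the first whitespace-delimited
# "at" with "@". Text without such an "at" is returned unchanged.
sub unmunge {
    my ($text) = @_;
    $text =~ s/\s+at\s+/\@/;
    return $text;
}

# Made-up sample inputs in the munged form.
print unmunge($_), "\n"
    for 'someone at example.org', 'list-admin at example.org';
```

An obfuscation that one regex reverses adds essentially nothing to the
cost of harvesting, which is the point of the timing above.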


In under an hour I could have this spidering in parallel (in about
four to five hours I could have something that spreads requests
across any number of endpoints so it doesn't look like spidering).
And all with just off-the-shelf components.  I could probably do the
same in Ruby, again with pretty much off-the-shelf components, in
about the same time (well, probably about 2 hours, since I'm still
working on my Ruby-fu).  Serious harvesters will have someone on
retainer who could do the same in the same timeframes (not to mention
that they probably have much of the spidering infrastructure in place
already).

-- 
Fletch                | "If you find my answers frightening,       __`'/|
fletch at phydeaux.org   |  Vincent, you should cease askin'          \ o.O'
                      |  scary questions." -- Jules                =(___)=
                      |                                               U


</pre>
</body>
</html>