[no subject]
- date: Sat Apr 17 11:36:49 2004
- from: fletch at phydeaux.org (Fletch)
- in-reply-to: <[email protected]> (ChangingLINKS com's message of "Sat, 17 Apr 2004 02:27:44 -0500")
- references: <[email protected]> <[email protected]> <1082169701.10025.1.camel@blue> <[email protected]>
- subject: [ale] (no subject) SPAM talk...
[...]
ChangingLINKS> Can spammers figure out how to harvest the email
ChangingLINKS> from that? Sure.
ChangingLINKS> Will they harvest it? I strongly doubt it. In
ChangingLINKS> short: Spammers are programmers, have access to
ChangingLINKS> programmers (or buy ready made spamming
ChangingLINKS> software). And writing such code would take *time*
ChangingLINKS> which increases cost. Aside from the virus/
It took me all of about 5 minutes (started ~1109, working ~1116) to get
this, which will work for one page; most of that time went to remembering
how to use HTML::TokeParser (and a quick trip to the facilities :).
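#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw( get );
use HTML::TokeParser ();

my $url = shift @ARGV
  or die "usage: $0 URL\n";

my $content = get( $url )
  or die "Can't fetch $url\n";

my $stream = HTML::TokeParser->new( \$content )
  or die "Can't create TokeParser: $!\n";

while( my $t = $stream->get_token ) {
  ## Start tags arrive as [ 'S', $tag, \%attr, ... ]; anchors without
  ## an href fall back to '' so the match doesn't warn
  if( $t->[0] eq 'S' and $t->[1] eq 'a'
      and ( $t->[2]->{href} || '' ) =~ /^mailto:/ ) {
    ## The link text is the obfuscated "user at host" form
    my $addr = $stream->get_trimmed_text( "/a" );
    $addr =~ s/\s+at\s+/\@/;
    print $addr, "\n";
    last;   ## stop at the first address on the page
  }
}

exit 0;

__END__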
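For anyone who wants to poke at the token loop without fetching anything, here is a minimal offline sketch of the same idea. It is not part of the script above: the `harvest_addresses` helper and the sample markup (with its example.org addresses) are made up for illustration, and it assumes HTML::TokeParser (from the HTML-Parser distribution) is installed. Unlike the script above, it collects every mailto link instead of stopping at the first one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser ();

# Hypothetical helper (not in the original post): return every address
# found in mailto: links of an HTML string, undoing the " at " rewrite.
sub harvest_addresses {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new( \$html )
        or die "Can't create TokeParser\n";
    my @found;
    while ( my $t = $stream->get_token ) {
        # Start tags arrive as [ 'S', $tag, \%attr, ... ].
        next unless $t->[0] eq 'S' and $t->[1] eq 'a';
        my $href = $t->[2]{href};
        next unless defined $href and $href =~ /^mailto:/i;
        # Text up to the closing </a> is the "user at host" form.
        my $addr = $stream->get_trimmed_text('/a');
        $addr =~ s/\s+at\s+/\@/;
        push @found, $addr;
    }
    return @found;
}

# Made-up sample markup in the style of a list-archive header.
my $sample = <<'HTML';
<ul>
<li><em>from</em>: <a href="mailto:alice@example.org">alice at example.org</a> (Alice)</li>
<li><a href="index.html">Thread index</a></li>
<li><em>cc</em>: <a href="mailto:bob@example.org">bob at example.org</a></li>
</ul>
HTML

print "$_\n" for harvest_addresses($sample);
```

Run as-is, this should print the two sample addresses in `user@host` form and skip the non-mailto link, which is the whole point: the " at " substitution costs one regex to undo.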
In under an hour I could have this spidering in parallel (in about
four to five hours I could have something which would spread requests
out to come from any number of endpoints to make it not look like
spidering). And all with just off the shelf components. I probably
could do the same in Ruby, again with pretty much off the shelf
components, in about the same time (well, probably about 2 hours since
I'm still working on my Ruby-fu). Serious harvesters will have
someone on retainer who could do the same in the same timeframes (not
to mention that they probably have much of the spidering
infrastructure in place already).
--
Fletch | "If you find my answers frightening, __`'/|
fletch at phydeaux.org | Vincent, you should cease askin' \ o.O'
| scary questions." -- Jules =(___)=
| U
</pre>
</body>
</html>