[linux-elitists] web server software for tarpitting?
Greg Folkert
greg@gregfolkert.net
Tue Feb 12 12:07:16 PST 2008
On Tue, 2008-02-12 at 10:33 -0800, Gerald Oskoboiny wrote:
> * Evan Prodromou <evan@prodromou.name> [2008-02-12 12:37-0500]
> >On Sun, 2008-02-10 at 23:06 -0800, Gerald Oskoboiny wrote:
> >> The other day we posted an article [1] about excessive traffic
> >> for DTD files on www.w3.org: up to 130 million requests/day, with
> >> some IP addresses re-requesting the same files thousands of times
> >> per day. (up to 300k times/day, rarely)
> >>
> >> The article goes into more details for those interested, but the
> >> solution I'm thinking will work best (suggested by Don Marti
> >> among others) is to tarpit the offenders.
> >
> >...and not punish everybody else, right?
>
> Right, just punish those who are abusive.
>
> >> W3C's current traffic is something like:
> >>
> >> - 66% DTD/schema files (.dtd/ent/mod/xsd)
> >> - 25% valid HTML/CSS/WAI icons
> >> - 9% other
> >
> >It sounds like W3C has been having a problem satisfying its promises,
> >then. When you publicize an URL, like a DTD or schema, you're giving
> >some tacit permission to use that URL.
>
> Yes, but a single IP address re-fetching the same URL thousands
> or hundreds of thousands of times a day seems excessive.
>
> >It seems to me the way to solve your problem is to:
> >
> > 1. Clarify and publicize best practises for using W3C resources
> > into a server use policy. How often is it OK to hit a W3C-hosted
> > DTD? Once a day? Once an hour? Once a minute?
>
> Yeah, we'll have to figure something out there.
>
> > 2. For absolutely terrible bad-behavers, block them by IP number --
> > or return a brief-as-possible HTTP 403 response with a link to
> > your server use policy. It sounds like a quick way to cut down
> > on your traffic and save some headaches.
>
> We have been doing this since May 2006 to no effect.
>
> Every 10 minutes, a cron job wakes up and scans the logs for the
> previous 10 minutes, and any IPs who requested the same resource
> more than 500 times in 10 minutes, or who made more than 6000
> requests in 10 minutes get blocked from the entire site for the
> next 24 hours with custom responses depending on the abuse:
>
> http://www.w3.org/Help/abuse/re-reqs
> http://www.w3.org/Help/abuse/fast-reqs
>
> But that hasn't accomplished much and we're still getting
> hammered so we're looking at tarpitting.

I would suggest using a reverse proxy with caching turned on, AND also a
simple rewrite-map script in that proxy that checks the number and
frequency of requests and redirects offenders to your policy page.

> > 3. Build a content-distribution network (CDN) to free up your
> > servers for the important stuff. You could either pony up the
> > cash for a commercial CDN, or you could use W3C's goodwill in
> > the Web community to put together a free and informal system of
> > mirrors.
>
> We do have an automatic mirroring system and it's easy to add
> more mirrors, but it seems silly to scale up to handle traffic
> that doesn't have much business being there in the first place
> (in my opinion. Others on staff think we should just serve all
> these requests as quickly as we can.)
>
> >The whole tarpit thing sounds too smart by half. I think a more direct
> >approach is more ethical, and also sets a good example for other Web
> >publishers.
>
> I'd expect most medium/large sites have some kind of defensive
> measures in place to deal with abuse. Google and Wikipedia block
> all access from generic user-agents like Java/x and
> Python-urllib/x.

Here is an example script written in Perl. It uses a separately compiled
Perl only because I wanted to keep the mod_perl stuff apart from the
system Perl. I am forced to use CentOS at work, and updates have killed
things as far as content providing goes.

Call it /usr/local/bin/chkip_abuse.pl:
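
#!/usr/local/perl5.8.8/bin/perl
# RewriteMap helper: reads one IP per line on STDIN and answers
# 1 (over the limit) or 0 (still allowed) on STDOUT.
use strict;
use warnings;
use lib '/usr/local/perl5.8.8/lib';
use Cache::FileCache;

# Turn off output buffering; Apache waits for each answer line.
$| = 1;

# Requests allowed per IP before it gets flagged.
my $allowed = 30;

# Counters are kept under /tmp/abuse and expire after 1800 seconds.
my $cache = Cache::FileCache->new({
    cache_root         => '/tmp',
    namespace          => 'abuse',
    default_expires_in => 1800,
});

while (my $ip = <STDIN>) {
    chomp $ip;
    # A missing or expired entry counts as zero.
    my $attempts = $cache->get($ip) || 0;
    $attempts++;
    $cache->set($ip, $attempts);
    if ($attempts > $allowed) {
        print "1\n";
    } else {
        print "0\n";
    }
}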
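
Here is the snippet for the reverse proxy's httpd.conf (it assumes
RewriteEngine On is already set):

# Start the helper as an external rewrite map program.
RewriteMap chkip_abuse prg:/usr/local/bin/chkip_abuse.pl
# Only check requests for the URLs being abused.
RewriteCond %{REQUEST_URI} ^/some-dtd-url
# Ask the helper about this client; it answers 1 when over the limit.
RewriteCond ${chkip_abuse:%{REMOTE_ADDR}} =1
# Refuse the request with 403 Forbidden.
RewriteRule . - [F]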

Now, given that this is a quick hack, it may or may not work perfectly,
and it DOES NOT do any cleanup in /tmp. I use something similar against
distributed denial of service attacks on Apache servers and it QUELLS
them nicely (especially in a proxy setup!), though there I allow 5
requests per 300 seconds for the URIs in question.
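
To try the one-line-in, one-line-out behaviour without Apache, here is a
minimal sketch of the same counting logic in bash. It is a stand-in, not
the Perl script: count_hits is a made-up name, it assumes bash 4 or later
(for associative arrays), and it keeps its counts in memory with no
expiry, unlike the 1800-second cache used above.

```shell
#!/bin/bash
# Sketch only: hypothetical stand-in for the counting logic.
# Usage: count_hits ALLOWED < list-of-ips
# Reads one IP per line; prints 1 once that IP has been seen more than
# ALLOWED times, otherwise 0. One answer line per input line.
count_hits() {
    local allowed=$1
    local ip n
    declare -A seen    # local to the function; needs bash 4+
    while IFS= read -r ip; do
        n=$(( ${seen[$ip]:-0} + 1 ))
        seen[$ip]=$n
        if [ "$n" -gt "$allowed" ]; then
            echo 1
        else
            echo 0
        fi
    done
}
```

For example, printf '%s\n' 192.0.2.1 192.0.2.1 192.0.2.1 | count_hits 2
prints 0, 0 and then 1, since only the third request goes over the
limit. Piping a list of addresses into the Perl script by hand should be
a similar way to check it before wiring it into Apache.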

My cleanup script is a cron job, run as needed, usually every couple of
hours:

/usr/local/abuse_clean.sh
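
#!/bin/bash
# Purge stale cache files and empty directories under /tmp/abuse.
# Run with the argument 'count' to report what would be removed.

# Stop if the cache directory is missing, so find never runs elsewhere.
cd /tmp/abuse || exit 1

if [ "$1" = 'count' ]
then
    debug='yes'
fi

if [ "$debug" = 'yes' ]
then
    echo "Files to remove:"
    # -atime +1: last accessed at least two days ago (GNU find rounding).
    find . -type f -atime +1 -print | wc -l
    echo "Empty dirs to remove:"
    find . -mindepth 1 -type d -empty -print | wc -l
else
    find . -type f -atime +1 -print0 | xargs -0 -r rm
    # Three passes, so parents emptied by the previous pass go too.
    find . -mindepth 1 -type d -empty -print0 | xargs -0 -r rmdir
    find . -mindepth 1 -type d -empty -print0 | xargs -0 -r rmdir
    find . -mindepth 1 -type d -empty -print0 | xargs -0 -r rmdir
fi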

This purges all files older than 2 days, plus any empty directories. So,
yeah, go ahead and complain this ain't clean or what have you... it
works, and has quite effectively quelled up to 10M hits a day to a
Missionary Ministry that I host.
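
One detail worth checking on a scratch directory is how find counts
days. With GNU find, -atime +1 matches files last accessed at least two
days ago, while a bare -atime 2 only matches the window between two and
three days. A minimal sketch, assuming GNU find and GNU touch;
list_stale is a made-up helper name, not part of the cleanup script:

```shell
#!/bin/bash
# Sketch only: hypothetical helper for trying the age test safely.
# Prints the files under directory $1 whose last access time is at
# least two days old (GNU find drops the fractional part of the age).
list_stale() {
    find "$1" -type f -atime +1 -print
}

# Example on a throwaway directory (GNU touch syntax):
#   d=$(mktemp -d)
#   touch "$d/fresh"
#   touch -a -d '5 days ago' "$d/old"
#   list_stale "$d"     # lists only the file named old
```

A file aged five days this way shows up with -atime +1 but not with
-atime 2, which is the difference between "older than 2 days" and
"exactly 2 days old".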

Now... I just KNOW someone out there is going to complain. But this could
really be useful for this purpose.

--
greg@gregfolkert.net
PGP key 1024D/B524687C 2003-08-05
Fingerprint: E1D3 E3D7 5850 957E FED0 2B3A ED66 6971 B524 687C
Alternate Fingerprint: 09F9 1102 9D74 E35B D841 56C5 6356 88C0
Alternate Fingerprint: 455F E104 22CA 29C4 933F 9505 2B79 2AB2