All bloggers detest spam and splogs. Spammers attempt to co-op your website into advertising for them while stealing your bandwidth, and splogs cut right to the chase by stealing your words and reposting them with their own ads. We’ve got plugins like Akismet, Spam-Karma that stop comment spam while other plugins like AntiLeech and Digital Fingerprint help you fight the rebologging splogs.

All these plugins are great at helping to defend your website from the nefarious types, but the greatest tools in the fight against splogs and spam are elbow grease and a discerning mind.

One thing I know most webmasters (including myself) don’t do enough of is sift through our weblogs. Website logs collected at the server level can tell us a myriad of things but most importantly, they can help us pinpoint who is stealing our bandwidth (via out of control robots and comment spammers) and who is stealing our content. A tiny bit of hard work, the ability to follow directions, some familiarity with excel, and a keen eye for the out of place is all you need to figure out who is stealing from you by looking at your logs. Let me show you.

I use AWstats, a log analysis program that many hosts provide to their customers to watch trends in website activity. It is probably one of the best log analysis packages available for free (open source). However, AWstats doesn’t make it easy to look at the data from different angles, make notes, and generally manipulate it in anyway.

Excel on the other hand is great at manipulation, but short on log analysis. Using the two together it is possible to create a formidable anti-splog and spam defence using htaccess rules. I’ll show you how I use AWstats to summarize the data, and Excel to manipulate and sort through the data to highlight the most flagrant abusers. Please note that although anytime is better than never, this analysis is best done when you have a good pool of data to work with, probably near the end of the month.

Step 1. Log into AWStats and look at the Hosts summary. It shows you the top 25 web addresses that connect to your website ranked according to the number of pages viewed. Pages refer to the HTML (or XHTML, PHP, etc…) files on a website and do not include images or other files requested as a result of loading a “Page” (like js,css… files). For example, if you are viewing this page on MaxPower (and not an RSS reader) your IP will show up in my logs with 1 page view (unless of course you’ve browsed around, then you’ll have more page views).

Shown below is an example AWstats output from my log files. The area highlighted in red has been enlarged and is superimposed on top of the image. Listed first is the IP, the number of pages loaded, the number of hits, the total amount of bandwidth that IP has used, and the date they last visited.

AWStats

Step 2. Click on the ‘Full list’ of hosts. While the top 25 addresses give an ok summary of what is going on, we need to look at the big picture and filter based on a different set of criteria other than pages. To do that, click on the ‘Full list’ link and let the new page load. You’ll now be looking at the top 1000 IP’s sorted by page loads.

Step 3. Copy and Paste the full list of hosts into a texteditor and save the file. Place your mouse just next to the first entry, click and hold down, and select the entire table. Copy (CTRL-C) the selected text and paste (CTRL-V) it into NotePad, PSPad, or whatever text editor you like. Save the file.

Select and Copy

Step 4. Import the text file into Excel. Start Excel and open the text file you just saved. The text import wizard should appear. Make sure it is set to tab delimited and click finish. You should now have a replica of the table found in AWstats.

Step 5. Download and open the handy dandy AWstats analysis WorkBook. To simplify this process on a month to month basis, I created a template file in excel to do all the computing and format the data for easier viewing. Download the file below and open it in excel.

AWstats analysis Workbook V1

This Workbook contains 5 worksheets: INSTRUCTIONS, RAWSTATS, CALCULATIONS, FILTER ANALYSIS, and SUMMARY STATS & RULES. The INSTRUCTIONS worksheet contains a much shorter version of these instructions on this page. The RAWSTATS worksheet is where you will paste the raw data from the excel spreadsheet created in step 4. Both the RAWSTATS page and the CALCULATIONS page exist to format and manipulate the data in order to get it ready for manual sorting and sifting which takes place in the FILTER ANALYSIS and SUMMARY STATS & RULES worksheets.

Step 6. Copy the rawdata to the AWstats Analysis Workbook. You should now have two excel files open, the data previously imported via the text import wizard, and the AWstats Analysis Workbook. Select columns A through E from the text imported AWstats excel table and copy them (CTRL-C). In the AWstats Analysis Workbook, open the RAWSTATS worksheet, place your cursor in the highlighted teal box at B2 (where it says PASTE HERE) and push copy (CTRL-V). The adjacent cells should immediatly fill themselves in based on the formulas already contained in the page. You’ll know you’ve done this right because the RAWSTATS worksheet will no longer have any colour.

Once you have pasted the data in, there is no need to edit the RAWSTATS or CALCULATIONS worksheets — doing so may screw the rest of the Workbook up.

Step 7. Become familiar with the AWstats-Analysis Workbook. Within the the FILTER ANALYSIS worksheet there are 5 columns to sort by: IP, P/H Ratio, BandWidth (MB), WHOIS, and FLAG for Ban. The IP refers to the origin of the network activity sent to your webserver, the Bandwidth is the amount of activity in MB, and the WHOIS entry for each IP takes you to the domaintools.com lookup page for that IP.

The P/H Ratio is the ratio computed between the number of pages (the HTML files on a website) loaded to the number hits. Hits include all the images and other files requested as a result of loading a “Page” (like js,css… files). A ratio of 1 would indicate that the number of pages loaded and the number of hits are equal while a ratio of 0.2 would indicate that for every page loaded many other files are served up as well.

The last column, FLAG for Ban, is used to create htaccess rules to ban that IP from ever receiving data from your website again. If a 1 is placed in the row corresponding to any given IP, the appropriate htaccess rule is generated in the SUMMARY STATS & RULES worksheet. Also on the SUMMARY STATS & RULES worksheet you will find the top 25 Bandwidth hogs (and a corresponding link to the whois page of each).

Bandwidth FilterStep 8. Find and identify potential splogs and spams based on Bandwidth and the P/H ratio. Now its time to get to work. Go to the FILTER ANALYSIS worksheet and select the arrow pull down just above the BandWidth column. Select Custom…, and in the menu box that opens select ‘is greater than’ and in the adjacent box type ‘10′ (or a value that you think best captures the top bandwidth users). Click OK and the entire table will now only show IP’s whose network traffic is greather than 10 MB. Also notice that the arrow pull down has been shaded blue to indicate that the data in that column is being filtered.

The P/H ratio that corresponds to my browsing habits on my site is 0.61. Since I am the administrator, I don’t stop to look at the pictures too often and spend more time browsing around in the WordPress admin. I can safely assume that a normal Internet visitor browsing my site would have a ratio equal or lower to this. Therefore, if I filter the data by the P/H Ratio at a value equal or greater than 0.7, I can also assume that the remaining unfiltered rows of data represent spolgs, spammers, or bots.

Step 9. Seperate the friendlies from the unknown. This step deserves to be flushed out further in it’s own post. Take what follows as a rough guideline. At this point, the data has been filtered twice, once to include only IP’s that have used more than 10 MB, and another for IP’s whose P/H ratio is greater than 0.7. On this list will probably be IP’s corresponding to major search engines like Google. Use the whois link to try and determine what company or organization is attached to each IP.

A splogger or spammer will have a very high P/H ratio because they simply grab one of two files: if they are a splogger than they will probably just download your RSS feed file. If a spammer, they will only contact your comments script.

Both are no good. Flag entries you feel confident about being bad for your webserver as ‘1′ in the Flag for Ban column.

I’ll have more on sifting through the data to determine who is doing what in part 2

Step 10. Ban the bad splogs, bots, and spammers. Once you’ve identified a few candidates for permanent ban from your webserver, go to the SUMMARY STATS & RULES Worksheet. Scroll down past the top 25 bandwidth hogs and you’ll find a set of htaccess rules generated to match the sites you selected in the other worksheet to ban. To implement these rules you’ll need to download your current .htaccess file from your webserver and append the new entries to it. If you don’t have an htaccess file you can create your own by copying the rules and pasting them into a blank textfile. Save the textfile as htaccess.txt, upload it to the server, and rename it on the server as .htaccess. [If your host uses something other than apache, you're on your own.]

Got ideas to make this better? Please share! Leave a comment or contact me.

This post has 9 comments.

  1. [...] How To: Check Your Logs for Spammers and Splogs using AWstats and Excel (Part 1) (tags: blog reference tutorials howto) Veröffentlicht von Dominic Abgelegt unter Linksammlung [...]

  2. Surfer
    31 Oct 06
    12:58 pm

    Hey

    Just pop up for saying its helpful as i was looking how to take em to excel and i didnt think to paste them to word pad first grrr.

    Found here via, Yahoo with ‘how to download awstats logs’ and u were 3th ;)

    Good luck with your site

  3. m0n
    08 Nov 06
    1:33 pm

    Excellent work — I recently eliminated a very large bandwidth leak thanks to the methods and tools presented in this article. So here are over 350 Million monthly “thank you’s” for your efforts!

    Cheers!

  4. [...] How To: Check Your Logs for Spammers and Splogs using AWstats and Excel [...]

  5. Kostas
    12 May 07
    4:42 am

    SOOOOOO many thanaks for this tool you developed and shared with everybody.
    It is excellent.
    I only have a small problem that I would greatly appreciate if you can help. I dont get any data for the generated Htaccess rules. I do got a lost of 25 bandwidth hogs though.
    Is there something I am doing wrong?

    Cheers

  6. udigis
    10 Jun 07
    2:11 am

    Thanks for the wonderful tool, and the detailed instructions. We’ve since eliminated many unnecessary visits. Our P/H ratio is much lower than 0.7, somewhere near 0.2.

  7. [...] How To: Check Your Logs for Spammers and Splogs using AWstats and Excel [...]

  8. Thanks for sharing this…I am new to stat analysis and this is an excellent resource!

  9. Catch 22
    14 Sep 09
    10:09 am

    Thanks for your instructions and workbook. I used it with OpenOffice and it worked fine. Saving AWstats to a text file is unnecessary. In OpenOffice, CTRL+ C at AWstats, CTRL+ V in the spreadsheet, will do the trick.

    Thx.