Category Archives: Project 14 – Improving SHIVA Spampot

Improving SHIVA Spampot – Week 10 Summary

Hi all,

Hope everyone's doing fine. This week, after completing the global config implementation, attention shifted to writing the installer. And if we leave some bugs aside, it's almost done! I will be testing it this week.

What was done this week?

I started by analyzing the complete process of setting up SHIVA and preparing a list of the things that are to be automated. We chose bash as the installer's language, as we will be executing a lot of shell commands and copying files. An option to select whether a person wants to use a local database or not was also added. So, the installation script starts by testing for the packages that are required by SHIVA. Previously, the exit status of the "which" command was used for this, but then I came across another useful command, "dpkg -s 'package_name'". After checking for prerequisites, the script asks the user whether they want to set up local databases or not. After that, it sets up the SMTP receiver part and finally the analyzer part. Here's an initial screenshot of the script running on an Ubuntu 12.04 LTS virtual machine, followed by a rough sketch of the prerequisite check.

[Screenshot: installer script running on Ubuntu 12.04 LTS]
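The installer itself is a bash script; the snippet below is only a Python illustration of the same prerequisite check, i.e. relying on the exit status of "dpkg -s <package>". The package list is an assumption, not the installer's actual list.

```python
# Illustration of the check the bash installer performs: "dpkg -s <package>"
# exits with status 0 only when the package is installed.
import os
import subprocess

REQUIRED_PACKAGES = ["python-dev", "exim4", "mysql-server"]  # assumed names

def is_installed(package):
    """Return True if dpkg reports the package as installed."""
    with open(os.devnull, "w") as devnull:
        status = subprocess.call(["dpkg", "-s", package],
                                 stdout=devnull, stderr=devnull)
    return status == 0

missing = [pkg for pkg in REQUIRED_PACKAGES if not is_installed(pkg)]
if missing:
    print("Please install these packages first: %s" % ", ".join(missing))
```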

 

Since SHIVA currently targets Debian-based distros, we tested this on Ubuntu 12.04 LTS, Ubuntu 13.04, and Linux Mint 15. The only problem we faced was that we had to create some directories, like queue, raw spam, attachment, etc., ourselves. That will be fixed this week.

This week’s plans:

This week's plans are to test the script further in different environments, and maybe on other distros too. The directory-creation part will also be automated this week. Till then, have a good time.

Happy coding! :)

 

Improving SHIVA Spampot – Week 9 Summary

Hi all,

Hope everyone's doing fine and busy coding. As stated in last week's post, I have been AFK a bit. This past week, I completed the global configuration implementation and also added some clean-up code which runs every time SHIVA is [re]started.

What was done this week?

As stated in last week's post, after implementing the global configuration in SHIVA, I was facing some analyzer speed issues. I was unable to locate the problem at first, as there were not many changes that could affect the speed. Then I started digging through the log files to see which step was taking time. And there was our culprit: the relay code. I discussed this issue with my mentor, who confirmed that it was something he had faced in the past too. I disabled relaying and SHIVA worked as expected. I will need to dig through exim's logs to see what's causing the delay in relaying.

After that, while checking the size of the queue directory, I realized that 'ls -l' was showing the queue directory (i.e. the directory where all spam is dumped) as not empty, even though it was. Strange! I started digging through Google to see if others have been facing the same issue, and it seems they have. I found this related question on Server Fault – Monotonic growth of Linux directory size/block count. I wasn't able to find any solution, except for deleting and re-creating the queue directory. Suggestions are welcome; feel free to ping me on Twitter.

This week, I also added some code that cleans up the old data, if any. What this clean-up code does is delete any previous records that might exist in the temporary database. This is necessary because any previous record(s) in the database might clash with the new data that is to be pushed into it. It then also deletes and re-creates the queue directory. That's all for this week.
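A minimal sketch of that kind of clean-up, assuming MySQLdb for the temporary database; the table names and the queue path are illustrative, not SHIVA's actual schema or layout.

```python
import os
import shutil
import MySQLdb

QUEUE_DIR = "queue"  # assumed location of the spam dump directory

def cleanup(host, user, passwd, db):
    # Remove stale records left over from a previous run so they cannot
    # clash with the data collected in the next cycle.
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cursor = conn.cursor()
    for table in ("spam", "urls", "ips"):  # illustrative table names
        cursor.execute("DELETE FROM " + table)
    conn.commit()
    conn.close()

    # Delete and re-create the queue directory so no old spam files linger.
    if os.path.isdir(QUEUE_DIR):
        shutil.rmtree(QUEUE_DIR)
    os.makedirs(QUEUE_DIR)
```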

Happy coding!

Improving SHIVA Spampot – Week 8 Summary

Hello friends,

First of all, apologies for the late post. I have been travelling a bit and therefore was AFK. As stated in last week's post, work on adding SMTP AUTH to Python's smtpd library continued. I contacted the author of secure_smtpd, Benjamin E. Coe, for permission to use his code in the project, which he happily granted. :)

What was done this week:

Completed the work on SMTP AUTH. While testing it, I noticed some bugs and fixed them too. I will be updating the GitHub repo soon. After completing this work, I started working on the global configuration file. The expected end result is that the user should be able to configure SHIVA from one place, which will make it easier to deploy and configure. I started by making a list of the configuration options to be added to the config file. After that, I wrote a config file having the various options. The next task was to make the project's code read the configuration from the file. For parsing the config file, we're using Python's ConfigParser library. It's very easy to use and implement. I faced some issues during the process, but finally was able to get it to work. :D
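A minimal sketch of reading options with Python 2's ConfigParser; the file name, section names, and option names here are illustrative, not SHIVA's actual configuration layout.

```python
import ConfigParser

config = ConfigParser.SafeConfigParser()
config.read("shiva.conf")  # assumed name of the global config file

# [database] and [analyzer] are assumed sections for this example.
db_host = config.get("database", "host")
db_user = config.get("database", "user")
relay_limit = config.getint("analyzer", "relaylimit")

print("Relaying at most %d mails per scheduler run" % relay_limit)
```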

Plans for this week:

There are some bugs in the global configuration implementation that are affecting the speed of the analyzer (very badly). I need to look deeper into the code to find the culprit. This week's plans are to fix this bug, along with some other bugs that are minor now but might turn out to be critical in the long run.

Happy coding!

Improving SHIVA Spampot – Week 7 Summary

Hello all,

Hope everyone's keeping well and enjoying the work. :) This past week we had our mid-term evaluations, and since you're reading this blog, you can guess that I've passed. :D Hope that everyone else succeeded too. As you may expect, since we had our evaluations this week, most of the week was spent refactoring code, adding comments and doc-strings, and discussing post-mid-term plans. The week started with completing the lengthy blog post in which everything that had been achieved was written up. After that, the 'shivamaindb' module was added, which correlates data between the main and temporary databases, and changes were made to the .sql file of the main database. This updated code was then pushed to the GitHub repo. After that, I discussed the mid-term evaluation procedure with my mentor, completed the evaluation questionnaire provided by Carol, and submitted it. Then came the hardest part: waiting for the result. I received an email from Google at 00:40 IST on 3rd August saying that I had passed the evaluations. *Happy times*

What was done this week?

After the evaluations were over, future plans were discussed. The need for SMTP authentication was felt so that we could have more control over our SMTP relay. I started digging into Python's smtpd library for SMTP AUTH options, and after digging through the internet and forums, found that the smtpd library is very basic and doesn't support SMTP authentication or SSL/TLS. Well, this was quite a shock. I searched some more and found a library, secure_smtpd by Benjamin E. Coe, that adds SSL/TLS and authentication support to the default smtpd library. So, I modified the SHIVA receiver and changed the code to use secure_smtpd instead of smtpd. After the change, I restarted the receiver and tested authentication. It was working fine, but then we found one problem.

The secure_smtpd library also forks processes. We were using it from the lamson code, so when we checked processes using "ps -el | grep -i 'lamson'", we found that more than 5 instances of lamson were running. When we tried to stop the receiver, it killed the parent process, but the child processes were still there. This doesn't seem good for the project. It has now been agreed that the default smtpd lib of Python will be modified and the SMTP AUTH code from secure_smtpd will be used in it. I have started working on it; a rough sketch of the idea follows.
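A hedged sketch of extending Python 2's stock smtpd with an AUTH PLAIN verb, in the spirit of borrowing secure_smtpd's AUTH handling; this is an illustration, not the actual SHIVA receiver code, and the port and responses are assumptions.

```python
import base64
import asyncore
import smtpd

class AuthSMTPChannel(smtpd.SMTPChannel):
    """SMTPChannel that answers "AUTH PLAIN <base64>" and logs credentials."""

    def smtp_AUTH(self, arg):
        # smtpd dispatches verbs to methods named smtp_<VERB>, so defining
        # smtp_AUTH is enough to make the channel accept the AUTH command.
        if not arg or not arg.upper().startswith('PLAIN '):
            self.push('504 Unrecognized authentication type')
            return
        try:
            # The PLAIN payload is base64("\0username\0password").
            _, username, password = base64.b64decode(arg.split(' ', 1)[1]).split('\0')
        except (TypeError, ValueError):
            self.push('501 Malformed AUTH input')
            return
        print('AUTH attempt: %s / %s' % (username, password))
        self.push('235 Authentication successful')

class AuthSMTPServer(smtpd.SMTPServer):
    def handle_accept(self):
        pair = self.accept()
        if pair is not None:
            conn, addr = pair
            AuthSMTPChannel(self, conn, addr)

    def process_message(self, peer, mailfrom, rcpttos, data):
        print('Message from %s (%d bytes)' % (mailfrom, len(data)))

if __name__ == '__main__':
    AuthSMTPServer(('0.0.0.0', 2525), None)  # assumed listening port
    asyncore.loop()
```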

Another challenge was to find a real SMTP server that implements the AUTH feature, so that SHIVA can mimic its behaviour and look legitimate if someone connects using telnet. I found one such server at relay.plus.net and have started work on it. It will be done this week.

Plans for this week:

  • Completing the AUTH implementation
  • Starting work on the global configuration file

Happy coding! Have fun. :)

Improving SHIVA Spampot – Week 6 Summary

Hello all,

Hope everyone's enjoying, 'cos I most certainly am. This last week, as stated in the last post, work was done on the script that correlates the spam data between the temporary database and the main database. The main objective of this script is to check whether the spams stored in the temporary database already exist in the main database or not. If a spam doesn't exist, it's added to the main database; if it does, various processing takes place. So, I started working on it, and guess what? It's done! :D

This completes the project's mid-term objectives. So, let's have a recap of what was planned till mid-terms and what has been achieved. This is going to be a long post.

Work Done:

  • Improving Intelligence of Core Engine

Status: Done!

SSDEEP hash comparison was added to SHIVA. This integration made spam distinction very accurate. Before this, MD5 was used to distinguish between old and new spam, which failed for obvious reasons (spammers tweak a word or two in each copy). After integrating fuzzy hashing, we noticed that only distinct spams are now being stored in our database, thereby reducing redundant data. For example, in the last 24 hours, SHIVA received close to 1.85 million spams. The analyzer analyzed them in parallel, and when we checked the main database, the statistics were as follows:

  • Total spams – 1847091
  • Distinct Spams – 34
  • No. of distinct URLs – 41
  • No. of distinct IPs – 365
  • Size of database – 976 KB

Needless to say, the intelligence has improved; this reduced both the database size and data redundancy. A minimal sketch of the fuzzy-hash comparison behind it is below.
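A tiny illustration of how ssdeep makes the distinction possible: `ssdeep.compare()` scores two hashes from 0 (unrelated) to 100 (identical), so near-duplicate spams can be collapsed into one record. The sample texts and the match threshold of 90 are assumptions, not SHIVA's actual values.

```python
import ssdeep

spam_a = "Dear friend, claim your prize now at http://example.com/win"
spam_b = "Dear friend, claim your prize today at http://example.com/win"

hash_a = ssdeep.hash(spam_a)
hash_b = ssdeep.hash(spam_b)

score = ssdeep.compare(hash_a, hash_b)
if score >= 90:
    print("Looks like a recurring spam (score %d)" % score)
else:
    print("Looks like a new spam (score %d)" % score)
```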

  • Benchmarking spam analysis speed

Status: Ongoing.

Benchmarking the analyzer speed is an ongoing process: changes are made, speed is measured, then more changes and more benchmarks. This is an important part of the project, as it shows whether we have made improvements or not. Currently, the speed ranges between 3.5k and 6k spams per minute, depending upon the number of distinct spams received in that specific hour.

  • Redesigning the DB schema

Status: Almost Done!

In week 4, it was decided that we'll stick with MySQL only. Instead of one, SHIVA now uses two databases: a temporary and a main database. The reason behind keeping two databases is to improve the analyzer's efficiency. This is done by maintaining temporary records on an hourly basis: after an hour, those records are dumped into the temporary database and the analyzer returns to analyzing records for the next hour. In the background, a script starts that fetches records from the temporary database and matches them against the records in the main database. If no match is found, we save that record into the main database; if a match is found, we increase the spam and relay counters, update the first-seen/last-seen dates, and then search that spam for new content. The content we search for is:

  • New attachments
  • New URLs
  • New source IP
  • New SHIVA sensor ID

If new content is found, we update the respective spam's data in the main database. A condensed sketch of this correlation step is below.
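This sketch assumes MySQLdb and the ssdeep bindings; the table and column names, and the match threshold, are illustrative, not SHIVA's actual main/temporary schema.

```python
import ssdeep
import MySQLdb

MATCH_THRESHOLD = 90  # assumed ssdeep score above which spams count as the same

def correlate(temp_db, main_db):
    temp_cur = temp_db.cursor()
    main_cur = main_db.cursor()

    temp_cur.execute("SELECT id, ssdeep_hash, source_ip FROM spam")
    main_cur.execute("SELECT id, ssdeep_hash FROM spam")
    known = main_cur.fetchall()

    for temp_id, temp_hash, source_ip in temp_cur.fetchall():
        match = None
        for main_id, main_hash in known:
            if ssdeep.compare(temp_hash, main_hash) >= MATCH_THRESHOLD:
                match = main_id
                break
        if match is None:
            # Genuinely new spam: copy the record into the main database.
            main_cur.execute(
                "INSERT INTO spam (ssdeep_hash, source_ip) VALUES (%s, %s)",
                (temp_hash, source_ip))
        else:
            # Old spam: bump the counter and dates, then look for new content.
            main_cur.execute(
                "UPDATE spam SET counter = counter + 1, last_seen = NOW() "
                "WHERE id = %s", (match,))
    main_db.commit()
```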

Designing the temporary database schema was pretty easy and simple, and we kept the main database almost the same as it was before. We might still need to restructure our main DB schema; we're planning to seek an industry expert's advice on that matter.

  • HPfeeds integration

Status: Done!

SHIVA is now able to send feeds to the HPfeeds channels. SHIVA can be configured to send the following data:

  • URLs
  • Spamming IPs
  • Raw spam files
  • Attachments from the spam

Plans are to create 3 channels, on which SHIVA’s data will be published, depending upon its type. The planned channels are:

  • A channel that’ll have parsed data, i.e. URLs, spamming IPs, spam counter, etc
  • A channel that’ll have raw spam, for researchers interested in raw spam file.
  • And a channel, for spam attachment files.

Data will be sent to these channels on an hourly basis, or at whatever interval the user sets in the shivascheduler module. A bare-bones sketch of the publishing side is below.
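This sketch assumes the classic hpfeeds Python client (`hpfeeds.new()`/`publish()`); the broker address, ident, secret, and channel names are placeholders, not real values.

```python
import json
import hpfeeds

HOST, PORT = "hpfriends.honeycloud.net", 20000  # assumed broker address
IDENT, SECRET = "my-ident", "my-secret"         # placeholders

hpc = hpfeeds.new(HOST, PORT, IDENT, SECRET)

# Parsed data (URLs, IPs, counters) goes to one channel as JSON...
parsed = {"type": "url", "url": "http://example.com/spam-link", "counter": 3}
hpc.publish("shiva.parsed", json.dumps(parsed))

# ...while raw spam files are published as-is on another channel.
with open("sample.eml", "rb") as fh:
    hpc.publish("shiva.raw", fh.read())
```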

Hpfeeds simple parser in action.

[Screenshot: hpfeeds parser output]

Hpfeeds code – https://github.com/RahulBinjve/shiva/tree/master/ShivaAnalyzer/lib/python2.7/site-packages/lamson/hpfeeds

  • Modifying Code, Adding New Modules, and Refactoring:

Status: Done

Along with the above stated work, there were many other things that were done. These are:

  • As per the suggestions of Lukas, the code was re-indented, identifiers were renamed, and the code was restructured to comply with the PEP 8 – Style Guide for Python Code standards. Thanks a lot for the valuable suggestion. Now I understand that standardizing code makes it easier for other people to contribute.
  • Some of the old modules that were no longer in use were removed from the code, and most of the remaining modules were either modified or written from scratch to match the new algorithm used for spam detection and database handling.
  • Unused code and comments were removed from SHIVA. Scaffolding code was also removed, and the rest was re-factored according to PEP 8 – Style Guide for Python Code.

Therefore, everything that was in mid-term plans (and even more than what was planned) has been achieved. Now, waiting for the evaluations. :) *Fingers Crossed*

Github Repository – RahulBinjve/shiva

EDIT/UPDATE: Added a screenshot showing hpfeeds at work. Code updated; please refer to the above-mentioned repository.

Improving SHIVA Spampot – Week 5 Summary

Hi all,

Hope everyone’s keeping fine and enjoying the time. This week was very productive and I was able to achieve one of the objectives of the project. :D Read on to know what we achieved.

Continuing from last week's work, I started improving the temporary list and database implementation. Then it was decided to set the backend script's work aside for some days and concentrate on the hpfeeds/hpfriends integration.

hpfeeds is a lightweight authenticated publish-subscribe protocol that supports arbitrary binary payloads. hpfriends is an evolution of the hpfeeds system. With hpfriends, users can share interesting channels with their friends, who in turn can redistribute data they receive.

I started with reading the source code of hpfeeds on GitHub, then its documentation, and studying the implementation in Dionaea. Hpfeeds turned out to be a very easy-to-implement and useful protocol. I'd like to personally thank Mark Schloesser for creating such an awesome protocol to share data easily. Hpfeeds requires an ident and secret key to subscribe and publish on the various channels; these were provided by my mentor within 1-2 hours, and a channel was also created for testing purposes. So, I started by writing a small script to send information and binary payloads too. After that, I wrote a parsing script that parses data according to the type of payload (a "data-type" field is sent along with the data).

 

Flow of the hpfeeds’ code:

Fetch data from database and local filesystem -> Create a dictionary -> Encode it to JSON using json.dumps() -> Publish on channel -> Receive on parsing script -> Decode from JSON using json.loads() -> Process according to the “type” field.

This was the objective the project achieved. After that, I wrote the full-fledged publish and subscribe scripts and integrated the publish script into SHIVA's code. The publish script is executed every xx hours (just like the module which pushes data into the temporary database). The parsing script is very basic: it just dumps raw spam samples and attachments into a local directory and prints URLs and spamming IPs on stdout. It is provided as an example, in case someone is willing to write a full-fledged parser; a rough sketch of it follows.
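This sketch of the subscribing side assumes the classic hpfeeds Python client (`subscribe()`/`run()`); the broker address, credentials, and channel name are placeholders, and the "type" handling mirrors the flow described above.

```python
import json
import hpfeeds

HOST, PORT = "hpfriends.honeycloud.net", 20000  # assumed broker address
IDENT, SECRET = "my-ident", "my-secret"         # placeholders
CHANNEL = "shiva.parsed"                        # placeholder channel name

hpc = hpfeeds.new(HOST, PORT, IDENT, SECRET)

def on_message(identifier, channel, payload):
    # Decode the JSON published by SHIVA and act on the "type" field.
    record = json.loads(payload)
    if record.get("type") == "url":
        print("URL seen in spam: %s" % record["url"])
    elif record.get("type") == "ip":
        print("Spamming IP: %s" % record["ip"])

def on_error(payload):
    print("Error from broker: %r" % payload)
    hpc.stop()

hpc.subscribe(CHANNEL)
hpc.run(on_message, on_error)
```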

Four types of data are being published, as of now:

1. URLs

2. IPs

3. Raw spam samples

4. Spam attachments

The code was also modified to meet the PEP 8 – Style Guide for Python Code standards. The current code of SHIVA can be found in my GitHub repository, RahulBinjve/shiva.

This week's plans are to complete SSDEEP's implementation, i.e. completing the script that will correlate the data between the temporary and main databases. This is going to be one tough week, as the midterm evaluation is coming.

Improving SHIVA Spampot – Week 4 Summary

Hello all,

Four weeks have passed, and this past week the project witnessed some serious brainstorming. As stated in last week's blog, the way SSDEEP was implemented/integrated into SHIVA turned out to be unsuccessful due to a speed constraint. Even though the implementation saved a lot of space and eliminated almost all the recurring spams (of the same type), the speed at which SHIVA analyzed mail was unacceptable. So, as stated in last week's blog, alternative ways were searched for. The week started with studying Solr and writing some test cases for integrating it with SHIVA, but soon we realized that Solr won't be of much help in this part of SHIVA. So, again, bummer! The ability to differentiate is the crux of the whole project, and everything depends upon successfully analyzing and eliminating recurring mails.

Then we came up with another idea. Here’s the detailed plan.

What was done this week? (Long text follows)

The core of the idea is to maintain a temporary database on an x-hourly basis. All the spams received in x hour(s) will be maintained in a list of dictionaries, where each dictionary holds the information of one unique spam. This will make the SSDEEP comparisons fast, as we won't need to fetch hashes from the database every time a spam is received. Also, since records are maintained on an x-hourly basis, there will be less data. Those hourly records will also be pushed into a temporary database. After x hours have passed, a worker script will start that fetches data from the temporary database, cross-checks it against the original database, and pushes only the (overall) genuine spams. The database and the list will then be flushed, so that data for the next x hour(s) can be collected. Since all the unique mails are stored in main memory first, a length-based comparison is also done: only the spams that are of comparable length are compared using SSDEEP.

High level flow of code,

A spam arrives -> len(spam) is calculated -> Only the hashes of stored spams having a comparable length (for testing, within -10% to +10%) are compared against the new spam

For example,

Let len(spam) = 788 chars. The comparable-length range is then between int(788 * 0.90) and int(788 * 1.10), i.e. 709 and 866. Therefore, only hashes of unique spams having a length in this range will be compared.

This effectively decreases the number of comparisons.
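A small sketch of that length pre-filter plus SSDEEP comparison; the in-memory structure and the 90%/110% window follow the text above, while the field names in each dictionary and the threshold are illustrative.

```python
import ssdeep

# One dictionary per unique spam seen in the current x-hour window (assumed shape).
unique_spams = [
    {"len": 788, "ssdeep": ssdeep.hash("some previously seen spam body ..."), "counter": 5},
]

def is_old_spam(body, threshold=90):
    """Return True if `body` fuzzily matches a stored spam of comparable length."""
    body_len = len(body)
    low, high = int(body_len * 0.90), int(body_len * 1.10)
    new_hash = ssdeep.hash(body)
    for record in unique_spams:
        if low <= record["len"] <= high:          # length pre-filter
            if ssdeep.compare(new_hash, record["ssdeep"]) >= threshold:
                record["counter"] += 1            # recurring spam
                return True
    return False

print(is_old_spam("a freshly received spam body ..."))
```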

 

For the above purpose, 3 modules were entirely replaced or, we can say, re-written: a module that identifies new or old spams, a module that pushes new spams into the list and the temporary database, and a module that processes old spams, searching for any new content. A temporary database to accompany the new modules was also designed and implemented along with them. This weekend, the plan is to test the efficiency of this new technique. So, finally, I can say that this week was very productive.

 

All the best everyone. See you, next week! :)

 

Improving SHIVA Spampot – Week 3 Summary

Hello friends,

It's been 3 weeks, and I hope that all of you are enjoying this time just like me. This past week was a bit less productive than the previous two. This week, SSDEEP was implemented in SHIVA for differentiating between old spam, i.e. spam already existing in the database, and spam that is new. After that, SHIVA was put to real-world testing.

Spams started to come in, and mails were also sent manually. We stopped the analyzer part for a while so that enough mails could be collected in the mail directory for benchmarking. After a while, the mail directory showed around 128k spams. Those were enough for testing the analyzer's speed, i.e. the speed after deploying SSDEEP. First, the receiver was stopped so that no new spams would come in, and after that the analyzer was started. One thing to note here is that previously the speed of analyzing spams was 6.5k-7k per minute.

So, the analyzer started analyzing mails. A stopwatch was started in parallel with the analyzer, and the mail directory was checked regularly. It took around 40 minutes and 6 seconds before the mail directory was empty. So, statistics-wise:

Statistics

Number of spams – 129,000 (approx.)

Time taken by the analyzer – 40 minutes 6 seconds, i.e. 40.10 minutes

Spams analyzed per minute – 129,000 / 40.10 = 3217 spams per minute (approx.)

Distinct IPs of spammers – 725

As we can deduce from the above, the speed has dropped by half. Well, this is disappointing, but it came with good news too: only distinct mails were being stored in our database. When the database was checked, it had only 10 distinct spams, which is a little reassuring. We had different IPs and other metadata, but content-wise there were only 10 distinct spams. We keep a counter of how many times a specific spam has been received, and even though the number of distinct spams was 10, those 10 spams were sent from 725 distinct IPs.

Since the speed of analyzing is not satisfactory, other ways to implement SSDEEP, and other algorithms that could be used, were discussed. It was then that we came to know about Solr.

"Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search." – from the project's page.

We are planning to index every spam's content, and then, after matching the content, only the hashes that are most likely to match will be fetched and compared. This will lower the number of comparisons that SSDEEP needs to do, even if we have thousands of distinct hashes in our database. Since Solr is a relatively new technology for me, I started with reading the documentation and understanding how it can be used with Python. By the end of the week, I had a working test script that pushes data to Solr; it uses the "requests" module of Python to interact with Solr (a rough sketch is below). This week, the plans are to implement and test the new distinct-spam detection technique using Solr, and also to benchmark the analyzing speed. *Fingers crossed*
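This is a hedged sketch of pushing a spam document into Solr with the requests module; the update URL follows Solr's standard JSON update handler, but the exact endpoint, core, and field names here are assumptions, not the actual test script.

```python
import json
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json?commit=true"  # assumed

doc = {
    "id": "spam-0001",                # assumed unique key field
    "content": "Dear friend, claim your prize now ...",
    "ssdeep": "3:ABCD...:EFGH...",    # placeholder hash
}

response = requests.post(
    SOLR_UPDATE_URL,
    data=json.dumps({"add": {"doc": doc}}),
    headers={"Content-Type": "application/json"},
)
print(response.status_code)
```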

Improving SHIVA Spampot – Week 2 Summary

Hello friends,

The second week of coding has come to an end, and from the project's point of view, it went very well.

At the start of the week, we discussed the various pros and cons of using a NoSQL system over our existing RDBMS. A lot of digging around the web was done to see if MySQL has the capacity to handle a large amount of data. The initial research found that MySQL, if properly designed and optimized, can handle large amounts of data. So, for now, we decided to stick with MySQL. After that, some new functionalities were added to the existing SHIVA codebase:

1. Saving headers of the spam.

2. Integrating Shiva Scheduler into the main code.

3. Integrating SSDEEP into SHIVA.

In Detail:

1. Saving headers of the spam.

Till now, different fields of a spam were extracted/parsed, but the header was not one of them. This past week, the ability to parse the headers and save them was added to SHIVA. Now one can analyse the mail header information too.
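A tiny illustration of header parsing with Python's standard email module; how SHIVA actually stores the parsed headers is not shown here, and the sample message is made up.

```python
import email

raw_spam = (
    "From: spammer@example.com\r\n"
    "To: victim@example.org\r\n"
    "Subject: You won!\r\n"
    "X-Mailer: BulkSender 1.0\r\n"
    "\r\n"
    "Claim your prize now.\r\n"
)

msg = email.message_from_string(raw_spam)
for name, value in msg.items():       # all headers, in order
    print("%s: %s" % (name, value))
print("Body: %s" % msg.get_payload())
```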

2. Integrating Shiva Scheduler into the main code.

Shiva Scheduler is the module of SHIVA which resets the relay counter after x hour(s). This module is important because, after the specified number of mails has been relayed in x hour(s), relaying stops; the relay counter must be set back to 0 so that new spams received in the next hour can be relayed. APScheduler is being used to schedule this task. Previously, it needed to be invoked separately, but now it starts as soon as our analyzer comes to life.
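A minimal sketch of scheduling that reset, assuming the APScheduler 2.x API that was current at the time (Scheduler/add_interval_job); the reset function, counter, and 1-hour interval are illustrative.

```python
import time
from apscheduler.scheduler import Scheduler

relay_counter = {"count": 0}  # stand-in for SHIVA's real relay counter

def reset_relay_counter():
    relay_counter["count"] = 0
    print("Relay counter reset; relaying can resume")

sched = Scheduler()
sched.add_interval_job(reset_relay_counter, hours=1)
sched.start()  # in SHIVA this would start together with the analyzer

while True:        # keep the demo process alive so the job can fire
    time.sleep(60)
```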

3. Integrating SSDEEP into SHIVA.

This was the main highlight of the week. Spammers send similar spam in bulk quantities, say 1,000-10,000 at a time. Those spams are almost the same, but to avoid detection, spammers change a word or two in each spam. Due to this, MD5 checksums failed. Therefore, we started digging around for different ways to identify almost-similar data. We found two ways of accomplishing the task:

  • Fuzzy string comparison, using different string-distance algorithms like Levenshtein distance, Jaro distance, Jaro-Winkler distance, etc. We also came upon a pretty good Python library having all these algorithms, jellyfish.
  • Fuzzy hashing, i.e. computing and comparing SSDEEP hashes of the spams.

So, test cases were written to check the various algorithms (a rough sketch of the kind of test used is below). SSDEEP's results were satisfactory, but it had some issues comparing hashes of small or pattern-less data. Still, for now, it has been decided to give SSDEEP a chance. SSDEEP was integrated into the main SHIVA code and tested inside the internal network. Changes to the database schema, in accordance with SSDEEP, were also made. Now, today, on Monday, the plan is to give SHIVA a real-life run. Let's see how it performs. *fingers crossed*
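A sketch of the kind of side-by-side test case written to compare the two approaches, assuming the jellyfish string-distance functions and the ssdeep bindings; the sample spams and the scores they produce are illustrative only.

```python
import jellyfish
import ssdeep

spam_a = u"Dear friend, claim your prize now at http://example.com/win"
spam_b = u"Dear friend, claim your prize today at http://example.com/win"

# String-distance approach: a small edit distance suggests the same spam.
print("Levenshtein distance: %d" % jellyfish.levenshtein_distance(spam_a, spam_b))
print("Jaro-Winkler score:   %.2f" % jellyfish.jaro_winkler(spam_a, spam_b))

# Fuzzy-hashing approach: compare SSDEEP hashes instead of the full texts.
score = ssdeep.compare(ssdeep.hash(spam_a.encode("utf-8")),
                       ssdeep.hash(spam_b.encode("utf-8")))
print("SSDEEP match score:   %d" % score)
```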

That’s all for this past week. Happy coding and have a productive week. :)

Improving SHIVA Spampot – Week 1 Summary

Hello all,

I'm Rahul Binjve, and I'm working on SHIVA (Spam Honeypot with Intelligent Virtual Analyzer). The first week of the GSoC coding period is over, and I hope that, like me, you too are enjoying the time.

Week 1 Summary:

Well, at the start of the week, I, with the help of my mentor, implemented and set up the spampot once again. After it was set up, we tested it, but it wasn't relaying any mail. That left us wondering what was wrong now. We checked the config files and everything, but nope, everything was fine and it still wasn't relaying. So, I started to dig into the Mail Transfer Agent's (MTA) logs, i.e. exim's logs. That's when I realized that all mail providers were dropping our mail, because it originated from my residential, dynamic IP. Bummer!!
Due to this issue, not a single mail was relayed and I couldn't see it working for real. To cope with the issue, I started working from my mentor's organization. Now, sitting in their network, where a static IP in an isolated DMZ has been provided to me to set up an instance of SHIVA, I was able to relay some mails hassle-free. And the fun began.

Along with configuring SHIVA, debugging errors, and understanding the application's code flow, I also got an idea of how spammers work and what their methodologies are. The bulk mails that we received on our controlled open relay helped me understand the working of spammers in a better way. Also, after analysing our DB, I completely understood why we need a better way to differentiate between old and new spam. Fuzzy hashing is the hope.

After this, in the later part of the week, we discussed the need for using a NoSQL/document-based DB for the project and started looking for options. If we decide on this big switch, we have two choices: the very popular MongoDB, and Cassandra. We've planned to make our choice by the end of this week, and then I shall start the whole migration process. We might have to change the order of deliverables, because the rest of the things depend upon our choice of DB too.
This Week’s Plan:

  • Read more and more about the applicability of RDBMS and NoSQL DBs.
  • Read the comparisons and use-cases for MongoDB and Cassandra.
  • Decide what to use by end of this week.
  • Start writing test cases for the selected DB.
  • Prepare design for DB.
  • Start the whole big switch.

Hoping to have a productive week! :)