Hope everyone’s enjoying, ‘cos I most certainly am. This last week, as stated in the last post, work was done on the script that correlates the spam data between the temporary database and the main database. The main objective of this script is to check whether the spam stored in the temporary database already exists in the main database. If it doesn’t exist, it’s added to the main database; if it does, various processing takes place. So, I started working on it, and guess what? It’s done! 😀
This completes the project’s mid-term objectives. So, let’s have a recap of what was planned up to the mid-term and what has been achieved. This is going to be a long post.
SSDEEP hash comparison was added to SHIVA. This integration made spam distinction very accurate. Before this, MD5 was used to distinguish between old and new spam, which failed for obvious reasons: changing a single byte yields a completely different MD5, so near-identical spams were counted as new. After integrating fuzzy hashing, only distinct spams are stored in our database, thereby reducing redundant data. For example, in the last 24 hours, SHIVA received close to 1.85 million spams. The analyzer processed them in parallel, and when we checked the main database, the statistics were as follows:
- Total spams – 1,847,091
- Distinct spams – 34
- Distinct URLs – 41
- Distinct IPs – 365
- Database size – 976 KB
Needless to say, the intelligence gathered has improved: both the database size and data redundancy are drastically reduced. A minimal sketch of the fuzzy-hash check is shown below.
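Here’s a rough illustration of how such a fuzzy-hash deduplication check can work, assuming the python-ssdeep bindings; the threshold and function names are illustrative, not SHIVA’s exact implementation:

```python
import ssdeep  # python-ssdeep bindings

MATCH_THRESHOLD = 75  # illustrative cut-off: 0 = no similarity, 100 = identical


def is_known_spam(body, known_hashes):
    """Return True if body fuzzily matches any previously seen spam."""
    new_hash = ssdeep.hash(body)
    for old_hash in known_hashes:
        if ssdeep.compare(new_hash, old_hash) >= MATCH_THRESHOLD:
            return True  # near-duplicate of an existing spam
    known_hashes.append(new_hash)  # distinct spam: remember its hash
    return False
```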
Benchmarking the analyzer’s speed is an ongoing process: changes are made, speed is measured, then more changes, more benchmarks. This is an important part of the project, as it shows whether we actually made improvements or not. Currently, speed ranges between 3.5k and 6k spams per minute, depending upon the number of distinct spams received in that specific hour.
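For the curious, a throughput measurement can be as simple as the following; `analyze()` here is a hypothetical stand-in for the analyzer’s entry point, not SHIVA’s actual API:

```python
import time


def benchmark(spams, analyze):
    """Report analyzer throughput in spams per minute."""
    start = time.time()
    for raw in spams:
        analyze(raw)
    elapsed = max(time.time() - start, 1e-9)  # guard against zero division
    print("%.1f spams/minute" % (len(spams) * 60.0 / elapsed))
```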
Status: Almost Done!
In week 4, it was decided that we’ll stick to MySQL only. Instead of one database, SHIVA now uses two: a temporary and a main database. The reason behind keeping two databases is to improve the analyzer’s efficiency. This is done by maintaining temporary records on an hourly basis. After an hour, those records are dumped into the temporary database and the analyzer returns to analyzing records for the next hour. In the background, a script starts that fetches records from the temporary database and matches them against the records in the main database. If no match is found, we save that record into the main database; if a match is found, we increase the spam and relay counters, update the first-seen/last-seen dates, and then search that spam for new content. The content we search for is:
- New attachments
- New URLs
- New source IPs
- New SHIVA sensor IDs
If new content is found, we update the respective spam’s record in the main database. A simplified sketch of this correlation pass follows.
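The sketch below models records as plain dicts and matches on an exact fuzzy-hash key for brevity; in SHIVA the records live in MySQL tables and matching uses ssdeep comparison, so treat this as an outline rather than the actual code:

```python
def correlate(temp_records, main_db):
    """Merge one hour's temporary records into the main database (dict model)."""
    for rec in temp_records:
        match = main_db.get(rec["fuzzy_hash"])
        if match is None:
            main_db[rec["fuzzy_hash"]] = rec      # new spam: store it
            continue
        match["counter"] += rec["counter"]        # bump spam/relay counters
        match["last_seen"] = rec["last_seen"]     # first_seen stays as-is
        for field in ("attachments", "urls", "ips", "sensor_ids"):
            # merge any content not seen before
            match[field] = sorted(set(match[field]) | set(rec[field]))
    return main_db
```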
Designing the temporary database schema was pretty easy and simple. We kept the main database schema almost the same as before, though we might need to restructure it. We’re planning to seek an industry expert’s advice on that matter.
SHIVA is now able to send feeds to HPFeeds channels. SHIVA can be configured to send the following data:
- Spamming IPs
- Raw spam files
- Attachments from the spam
There are plans to create three channels on which SHIVA’s data will be published, depending upon its type. The planned channels are:
- A channel for parsed data, i.e. URLs, spamming IPs, spam counters, etc.
- A channel for raw spam, aimed at researchers interested in the raw spam files.
- A channel for attachment files extracted from the spam.
Data will be sent to these channels on an hourly basis, or at whatever interval the user sets in the shivascheduler module.
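For anyone who hasn’t used HPFeeds before, publishing looks roughly like this with the reference hpfeeds Python client; the broker address, credentials, and channel names below are placeholders, not SHIVA’s actual configuration:

```python
import json

import hpfeeds  # reference hpfeeds Python client

# Placeholder broker address and credentials.
hpc = hpfeeds.new("hpfeeds.example.org", 10000, "shiva-sensor", "s3cret")

# Publish parsed data (URLs, IPs, counters) as JSON on its own channel.
parsed = {"urls": ["http://example.com/x"], "ips": ["203.0.113.7"], "counter": 12}
hpc.publish("shiva.parsed", json.dumps(parsed))

# Publish a raw spam file on a separate channel.
with open("sample.eml", "rb") as f:
    hpc.publish("shiva.raw", f.read())

hpc.close()
```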
HPFeeds simple parser in action.
HPFeeds code – https://github.com/RahulBinjve/shiva/tree/master/ShivaAnalyzer/lib/python2.7/site-packages/lamson/hpfeeds
Along with the work stated above, many other things were done. These are:
- As per Lukas’s suggestions, the code was re-indented, renamed, and restructured to comply with the PEP 8 style guide for Python code. Thanks a lot for the valuable suggestion. Now I understand that standardizing code makes it easier for other people to contribute.
- Some of the old modules that were no longer in use were removed from the code, and most of the remaining modules were either modified or rewritten from scratch to match the new algorithms used for spam detection and database handling.
- Unused code and stale comments were removed from SHIVA. Scaffolding code was also removed, and what remained was refactored according to PEP 8 as well.
Therefore, everything that was in the mid-term plans (and even more than what was planned) has been achieved. Now, waiting for the evaluations. *Fingers crossed*
Github Repository – RahulBinjve/shiva
EDIT/UPDATE: Added a screenshot showing HPFeeds at work. Code updated; please refer to the repository mentioned above.