Category Archives: Project 3 – Thug Distributed Task Queuing

Thug Distributed Task Queuing – Final Blog Post

Hi Everyone,

This is the final blog post about the Thug Distributed Task Queuing project. Here I will describe the distributed feature that we have added to the existing Thug project, which makes the analysis of URLs easier and more efficient.

Project Overview:

Previously Thug worked as a stand-alone tool and did not provide any way to distribute URL analysis tasks to different workers. For the same reason, it could not analyze how attacks differ according to users' geolocation (unless it was given a set of differently geolocated proxies to use). With this project we solve both problems by creating a centralized server that is connected to all the Thug instances running across the globe and distributes URLs to them (potentially according to geolocation analysis requirements). The clients then consume the tasks distributed by the centralized server, process them, and store the results in a database.

Server:

The server handles all the clients (workers) and can distribute URLs on the basis of client geolocation: if we want to check how a URL behaves in a particular country, we put that URL in that country's queue, and a client connected from that country processes the URL and sends back the result. This way we can not only distribute URLs among clients running all over the world but also analyze attacks targeting particular countries.

These are screenshots of Flower (the Celery monitoring tool) showing workers processing tasks:

Workers connected from India and processing tasks:

[Screenshot from 2013-09-29 13:26:34]

Description of the tasks that are running or completed:

[Screenshot from 2013-09-29 13:30:26]

Worker:

Workers are the clients, i.e. Thug instances running all over the world. Each worker is connected to 2 types of queues: the generic queue and its nation queue (for example, an Indian client is connected to the India queue). Whenever the server puts URLs in the queues, the workers connected to those queues consume the URLs, process them, and send the results back to the server for further processing.
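To make the two-queue idea concrete, here is a minimal in-memory sketch. It is not the project's code: the `Broker` class, the queue name `generic`, and the rule that a worker checks its nation queue before the generic one are all assumptions made for illustration. In the real setup RabbitMQ and Celery do this work.

```python
from collections import defaultdict, deque

GENERIC_QUEUE = "generic"  # hypothetical name for the generic queue


def queues_for_worker(country):
    """Queues a worker in `country` listens on: generic plus its nation queue."""
    return [GENERIC_QUEUE, country]


class Broker:
    """Tiny in-memory stand-in for the message broker (illustration only)."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def put(self, url, queue=GENERIC_QUEUE):
        """Server side: place a URL on the generic queue or on a nation queue."""
        self.queues[queue].append(url)

    def consume(self, country):
        """Worker side: next URL for a worker in `country`, or None if idle."""
        # Assumption: the nation queue is checked first, then the generic one.
        for name in (country, GENERIC_QUEUE):
            if self.queues[name]:
                return self.queues[name].popleft()
        return None


broker = Broker()
broker.put("http://example.com/a")                  # any worker may take this
broker.put("http://example.com/b", queue="India")   # only Indian workers

india_first = broker.consume("India")   # the India-only URL
italy_first = broker.consume("Italy")   # the generic URL
```

With Celery itself, a worker can subscribe to several queues at once through the `-Q` option of the worker command, given as a comma-separated list.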

 

Architecture:

 

Development:

Here I want to describe the optimizations I have worked on and am still working on. I made 2 other prototypes in which I tried some optimizations; both also reside in the GitHub repo. In the 1st prototype I tried to distribute URLs according to each client's system performance, i.e. if a client's system is very fast, we give it more URLs than the others. This was done using Redis: each worker updates its performance value in a Redis Sorted Set at a regular interval (every 2 minutes, for example), and whenever the server wants to distribute URLs it queries the Sorted Set and allocates URLs to the clients with the highest performance values (a higher value means a better system). This way we might get quicker responses from the clients, but a problem occurred: it was difficult to combine this with distributing URLs according to geolocation.
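The performance-based allocation can be sketched roughly as follows. This is only an illustration: a plain dictionary stands in for the Redis Sorted Set, and the class name `PerformanceBoard`, the worker ids, the scores, and the round-robin rule over the top clients are all assumptions, not the prototype's actual logic.

```python
class PerformanceBoard:
    """In-memory stand-in for a Redis Sorted Set of client scores."""

    def __init__(self):
        self._scores = {}

    def update(self, client_id, value):
        # Similar in spirit to ZADD: insert or overwrite the client's score.
        self._scores[client_id] = value

    def best_first(self):
        # Similar in spirit to ZREVRANGE: highest score first.
        return sorted(self._scores, key=self._scores.get, reverse=True)


def allocate(urls, board, top_n=2):
    """Hand URLs round-robin to the `top_n` best-performing clients."""
    clients = board.best_first()[:top_n]
    if not clients:
        raise ValueError("no clients have reported a performance value")
    plan = {client: [] for client in clients}
    for index, url in enumerate(urls):
        plan[clients[index % len(clients)]].append(url)
    return plan


board = PerformanceBoard()
board.update("worker-in", 0.9)   # hypothetical worker ids and scores
board.update("worker-it", 0.4)
board.update("worker-us", 0.7)

plan = allocate(["u1", "u2", "u3"], board)
```

Note that nothing in this scheme knows where a client is located, which is one way to see why it was hard to combine with geolocation-based queues.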

The 2nd prototype's optimization was very simple: we just increased the prefetch value of clients with a better system performance value, so clients with better systems prefetch, and therefore process, more URLs than the others.
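One possible mapping from a performance value to a prefetch value is sketched below. The function name, the assumption that the performance value lies between 0.0 and 1.0, and the bounds 1 and 8 are all made up for illustration; the prototype may compute this differently.

```python
def prefetch_multiplier(performance, low=1, high=8):
    """Map a performance value in [0.0, 1.0] to an integer prefetch value."""
    # Clamp out-of-range readings so the result always stays in [low, high].
    performance = min(max(performance, 0.0), 1.0)
    return low + round(performance * (high - low))


slow_client = prefetch_multiplier(0.0)   # weakest client prefetches the least
fast_client = prefetch_multiplier(1.0)   # strongest client prefetches the most
```

In Celery the resulting number would go into the worker's prefetch multiplier setting (`CELERYD_PREFETCH_MULTIPLIER` in the Celery 3.x naming), so a faster client reserves more tasks at a time.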

 

That’s all I wanted to share about my project. All in all, this was a super exciting summer, and I enjoyed and learned a lot by participating in GSoC.

I want to thank everyone who helped me in completing my project:

First and most important is Angelo Sir (mentor), who helped me a lot even when he was busy and answered each and every one of my dumb queries. Thanks a lot, sir, he really is an amazing guy :)

Then I want to thank Sebastian Sir (backup mentor) and Kevin Sir. I had some great discussions with Sebastian Sir, which helped me a lot with the project, and Kevin Sir worked as an unofficial mentor: he helped me a lot in working with Celery and advised me a lot while I was implementing the project.

I also want to thank David Sir for organizing and managing the Honeynet GSoC so well, and Tan Kean Siong Sir for starting an introduction mailing list that gave students a platform to introduce themselves.

Let’s always keep on working!

The ThugD GitHub repo can be found at https://github.com/Aki92/Thug-Distributed.

More details and documentation about the project can be found at http://aki92.github.io/Thug-Distributed/.

Thanks,
Akshit Agarwal

Thug Distributed Task Queuing Week-10,11 Blog

Week-10,11 Blog:

Finally, in the last 2 weeks we tested the project in a real-world situation, and as expected everything went well. The features we tested worked perfectly, but we still have to do more rigorous testing of all the features. It all became possible thanks to Angelo Sir, so again thanks a lot to him for helping me test our project. The setup was something like this:

  • The server was located in Italy.
  • 2 workers (clients) were running from India and Italy.
  • We tested both generic and geolocation-based distribution, and it worked very smoothly without any major problems.

Some small bugs in the code were fixed, after which it worked well. We also faced a small problem in connecting the worker to the message broker running on the server, but overall it was really very easy to do things with Celery.

Last week I also worked on optimizing our code and made some changes to it. As there is still time left, we are thinking of adding some new feature, and we are currently looking for a good one to add. We will also be testing and finalizing the code in the coming weeks.

Works done Last week (20/08 – 02/09):

  • Tested the project in a real-world situation where the server resided in Italy and the workers at different geolocations.
  • Optimized the code.
  • Discussion on adding new features.

Plans for Next week:

  • Add some new feature.
  • Testing all the features of the project.
  • Finalizing Code.

Thug Distributed Task Queuing Week-8,9 Blog

Week-8,9 Blog:

These two weeks were not as productive as some previous weeks. We wanted to test our project in the real world, for which we requested a virtual machine on the HP servers, but we didn’t get a quick response. Meanwhile we worked on adding command-line functionality to the server side of the project so that tasks can be distributed easily and efficiently. We successfully added it: the server's command line supports both the options already present in Thug and some new options necessary for the Thug Distributed project. We also added some unique options to ease distribution, such as distributing URLs with a priority-based user agent (web browser), distributing a single URL to multiple queues, and distributing a bunch of URLs to multiple queues. For a very large number of URLs and queues, we also added an option to read queues and URLs from files.

I also tried out New Relic and deployed the Thug agent on it to check what functionality we can get from it. In addition, I worked on optimizing the distribution efficiency: I am now changing the Prefetch Multiplier value in Celery at run time based on client system performance, assuming that we will have more URLs than clients running Thug instances.

Currently our main work is to run our project in a real-world situation where the server resides in one place and the clients in another, so through this post I want to request HP to please reply to the ticket opened by Angelo Sir for a VM on the HP servers.
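The server-side command line described in this post could look roughly like the sketch below. Every flag name here (`-U`, `-Q`, `--url-file`, `--queue-file`, `-u`) is hypothetical and chosen only for illustration, as is the user agent value; the real options are listed in the project documentation.

```python
import argparse


def build_parser():
    """Command-line parser for the distributing server (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Distribute URLs to Thug workers")
    parser.add_argument("-U", "--url", nargs="+", default=[], help="URLs to distribute")
    parser.add_argument("-Q", "--queue", nargs="+", default=["generic"], help="target queues")
    parser.add_argument("--url-file", help="file with one URL per line")
    parser.add_argument("--queue-file", help="file with one queue name per line")
    parser.add_argument("-u", "--useragent", help="browser personality to emulate")
    return parser


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def plan_tasks(args):
    """Expand the parsed options into (url, queue, useragent) tuples."""
    urls = list(args.url) + (read_lines(args.url_file) if args.url_file else [])
    queues = read_lines(args.queue_file) if args.queue_file else list(args.queue)
    # Every URL goes to every requested queue.
    return [(url, queue, args.useragent) for url in urls for queue in queues]


args = build_parser().parse_args(
    ["-U", "http://example.com", "-Q", "India", "Italy", "-u", "winxpie60"]
)
tasks = plan_tasks(args)
```

Here one URL sent to two queues yields two tasks, one for the India queue and one for the Italy queue, both carrying the same user agent.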

 

Works done Last week (06/08 – 19/08):

  • Added Command Line Argument functionality for Distributing tasks.
  • Worked to optimize the distribution efficiency.
  • Hosted the Thug agent on New Relic to try it out fully.

Plans for Next week:

  • Testing working prototype by using a Virtual Machine at HP Servers.
  • Work on optimizing current prototype.
  • Add some new required features.

Thug Distributed Task Queuing Week-7 Blog

Week-7 Blog:

This week I was unable to do a lot of work because, as planned, most of the time was spent submitting the midterm evaluation and just waiting for results :P. In the end it all went very smoothly and I passed the evaluation. I spent the rest of the time finding out how we can improve our project and make it easily available to other people. While searching, I came across the world’s most popular platform for volunteer computing, BOINC. I also tried Nagios and New Relic so that we can use them to monitor our network, but I am still unable to work with them properly. I could not get in contact with the mentors due to their busy schedules, so this week, too, I was unable to test the project in the real world. I will try my best to do it in the coming week under the mentors' supervision.

Works done Last week (29/07 – 05/08):

  • Finalized the documentation.
  • Researched how to increase the scope of the project.
  • Tried Nagios and New Relic.

Plans for Next week:

  • Testing working prototype.
  • Discussing future of project with mentors and different things that can be added into it.
  • Work on optimizing current prototype.

Thug Distributed Task Queuing Week-6 Blog

Week-6 Blog:

This week started nicely, as the Pydot problem was solved easily. The problem was in PyParsing version 2.0.1, which does not include “nocomma”; after installing PyParsing 1.5.7 the problem was solved and Thug started working properly. :) Finally I was able to play with Thug.

Then I tested the prototype with small (5-8 nodes) and big (15-20 nodes) networks on localhost. While testing, minor changes were made to get it working properly.

Then we changed the format of the Thug options passed by the server to remote Thug instances from a Python dictionary to JSON so that it works efficiently. Some other minor changes were made to finalize the code for the midterm evaluation.
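As a small illustration of that change, the options can be serialized to a JSON string on the server and decoded again on the worker. The option names and values below are hypothetical and not the actual set of Thug options.

```python
import json

# Hypothetical options chosen on the server for one analysis task.
thug_options = {
    "useragent": "winxpie60",
    "referer": "http://www.example.com/",
    "verbose": True,
}

# Server side: serialize to a plain string that can travel in a task message.
payload = json.dumps(thug_options, sort_keys=True)

# Worker side: decode the string back into a dictionary before calling Thug.
received = json.loads(payload)
```

A JSON string is language-neutral and easy to log or inspect in a monitoring tool, which is a common reason to prefer it over a pickled Python object.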

The next task was to write documentation using Sphinx (awesome thing), which can be found here: Thug Distributed Documentation.

As the midterm evaluation starts tomorrow, I would like to summarize the work we have done so far in our project.

Summary:

We have built 2 working prototypes of the Thug Distributed project, which are ready to be tested in the real world.

Prototypes:

1. Without Considering Client Capability: This is the simple implementation of the Thug Distributed Task Queuing project, and it works very well. This prototype distributes URLs efficiently. Its only drawback is that it doesn’t consider client capability, i.e. how fast a client can process a URL, and distributes URLs as if all clients were the same.

In this implementation we used Celery as the task distributor, RabbitMQ as the message broker, and Redis as the backend. Two types of queues are created, generic and geolocation-based, into which tasks are fed according to the server's needs. At the other end, whenever a client runs a Thug instance it gets connected to 2 queues: the generic queue and its geolocation-based queue (the “India” queue if the client is in India). Whenever URLs are fed into the queues, they are distributed automatically to the clients connected to that particular queue.

Its code can be found here.

2. Considering Client Capability: This implementation is a bit more complicated and still needs some optimization, but it considers client capability. This prototype distributes URLs according to client capability, i.e. better clients get more URLs to process than weaker ones. It has some major drawbacks: a separate queue is created for every client, the distribution of tasks is done manually based on the clients' performance values, which is not very optimized, and it adds the overhead of updating and querying performance values on the Redis server.

In this implementation we used Celery as the task distributor, RabbitMQ as the message broker, Redis as the backend, and a Redis Sorted Set for maintaining the clients' performance values. Whenever a client runs a Thug instance, it connects to the Redis server and regularly updates its system performance value in the Sorted Set, so that the performance values of all clients stay in sorted order. Whenever the server wants to distribute URLs, it queries the Redis server and gets back the list of clients ordered by performance value. It then distributes the task to the best client and refreshes the client list by querying the Redis server every second; this distribution mechanism will be optimized further.

Its code can be found here.

Whatever we have done so far in our project would not have been possible without Angelo Sir and Sebastian Sir. Angelo Sir is a great person to work with; he was always there to help me. Due to the time difference I was not able to interact much with Sebastian Sir, but he was also there whenever I needed his help, and I had some good discussions with him. Very happy with my mentors. :)

Works done Last week (24-28/07/2013):

  • Tested Code after integrating Thug on localhost with bigger network.
  • Finalizing code for evaluation.
  • Making documentation for Project.

Plans for Next week:

  • Getting Project Evaluated.
  • Celebrate if Evaluation is Passed. ;)
  • Integrate HPFeeds & Mnemosyne for getting results back on Server from clients.
  • Testing the code in a real-world situation (unable to do this last week).

 

Sorry for some non-professional answers.

Thug Distributed Task Queuing Week-5 Blog

Week-5 Blog:

This week I spent most of my time fighting with Ubuntu to install Google’s V8 and PyV8 so that I could run Thug on my system. As Angelo Sir told me, there is a problem with Ubuntu's Boost library that makes it hard to install Google V8, and Thug doesn’t work without it, so it is a basic necessity for running Thug. I spent almost 2-3 days just trying to install Google V8 :( , but finally I overcame it and successfully installed all the libraries except Pydot, which does not work properly after installation :P . My main aim this week was to understand the ThugAPI code and how thug.py calls ThugAPI, so that I could modify the thug.py code and integrate it into my Thug Distributed code.

I was able to understand both the ThugAPI and thug.py code because the code is so simple, thanks to Angelo Sir :). Then I made a copy of thug.py and, after some changes, I was able to call Thug from my Celery code. Finally Thug is integrated into the Thug Distributed code, and now we will be able to test the Thug Distributed project in the real world.

Works done Last week (16-23/07/2013):

  • Understanding Thug Code and running it on my system.
  • Integrate Thug into ongoing Thug Distributed Project.
  • Tested code after integrating Thug on small network.

Plans for Next week:

  • Figure out the problem in PyDot installation, so that Thug runs properly on my system.
  • Testing code with a bigger network.
  • Final Touch Ups for mid term evaluation.
  • Testing code in a real world situation.

Thug Distributed Task Queuing Week-4 Blog

Week-4 Blog:

This was a nice week, as I was able to implement the idea of “distributing tasks among clients according to performance value” using Redis. I then spent some time testing this new implementation and found that it worked fine, but the results were no better than those of our previous implementation. In this implementation the distribution was done by my own code and not by AMQP, which made it slow, and I had also made some mistakes in the code. After a discussion with Angelo Sir, we decided that we should first integrate the Thug project with our Thug Distributed project so that we can do some real-world tests. As we now have prototypes of both types of distribution, with and without performance values, we will spend some time testing them and then optimize them.

Apart from this, while testing the previous code I found that a new queue was being created for every task, which would put a lot of load on the main server. With help from the Celery IRC channel I learned that this was due to the AMQP backend, which publishes a message for every result and therefore creates a queue for every task. So I switched to the Redis backend, as there is no such problem of creating individual queues in Redis; I now have a combination of AMQP (broker) and Redis (backend). This decreased the load on the main server a lot. In an upcoming version of Celery there will be an RPC AMQP backend that does not create individual queues, so if it turns out to be better than Redis, we will switch to it.
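The broker and backend combination can be expressed in a Celery configuration module along these lines. The setting names are the Celery 3.x ones; the host name, credentials, and database number are placeholders, not the project's real values.

```python
# celeryconfig.py (sketch; host names and credentials are placeholders)

# RabbitMQ (AMQP) carries the task messages from the server to the workers.
BROKER_URL = "amqp://guest:guest@server.example.org:5672//"

# Redis stores task results, so no per-task result queue is created on RabbitMQ.
CELERY_RESULT_BACKEND = "redis://server.example.org:6379/0"
```

Keeping results in Redis separates the two concerns: RabbitMQ only has to route tasks, while result storage and expiry are handled by Redis.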

Works done Last week (8-15/07/2013):

  • Implemented the “distributing tasks among clients according to performance value” idea using Redis.
  • Testing both the implementations made till now.
  • Improved backend, performance related code in previously made prototype and tested it with a bigger network.

Plans for Next week:

  • Integrating Thug project with our code.
  • Testing code after integrating Thug.
  • Testing and improving the code of “distributing tasks among clients according to performance value”.

Thug Distributed Task Queuing Week-3 Blog

Week-3 Blog:

This week was not so productive, and less coding was done than in the previous 2 weeks, as most of the time was spent searching for a way to distribute tasks according to the clients' performance values. We finished implementing the generic and geolocation-based queue idea by Wednesday and then started working on distributing tasks among clients and on finding a performance value for each client. But when we explored Celery further, we learned that we can’t integrate the idea of “distributing tasks according to performance” directly into Celery.

I then worked on and discussed techniques to implement this feature. One solution was to create an intermediary where all clients would update their performance values; this intermediary would consume tasks from the queue fed by the main server and distribute them among the clients according to those values. But I was afraid this might deteriorate our performance rather than improve it. After getting help from other professionals, I got an idea in which the intermediary is eliminated and Redis is used instead: clients update their performance values in Redis, and whenever the main server wants to distribute tasks it just queries the Redis server for the list of clients (sorted by performance value), which is then used to distribute the tasks.

I also got to know about 2 new tools for monitoring network infrastructure, Nagios and New Relic, which can help us in future to monitor our whole network or the clients. I also thought that if we built a distributed environment at the same place where the server resides, the dependency on the clients to process the tasks would be eliminated, but then another issue would arise: we would not be able to analyze differences in attacks related to geolocation. So there are many things that could be included in our project. I still have to discuss all the above ideas with my mentor to learn their pros and cons and the feasibility of including them in the project.

Works done Last week (1-7/07/2013):

  • Implementing generic and geolocation based queue idea.
  • Searching about how to implement our idea of distributing tasks according to client performance.
  • Worked on finding performance value of each client.

Plans for Next week:

  • Testing queues implementation performance.
  • Implementation of idea “distributing tasks among clients according to performance value”.
  • Studying about Redis.
  • Discussing and Analyzing different techniques of new features.

Thug Distributed Task Queuing Week-2 Blog

Week-2 Blog:

Week 2 turned out to be very productive and crucial, as until the start of the 2nd week we had been unable to make the changes to the architecture of our project. This week we spent a lot of time discussing and analyzing the architecture changes, as a good architecture leads to a great project. After a lot of discussion and analysis we finally came up with a good architecture that is simple, efficient and extensible. Great thanks to Angelo Sir for the idea. :)

Idea: Clients (Thug instances) will send periodic updates about their system status (performance) to the main server and will also be connected to two types of queues: generic and geolocation-based. The main server will keep putting URLs (tasks) into the generic and/or geolocation-based queues as needed, and at the other end the clients will consume those URLs, process them, and store the results in a database to be analyzed later. Here we will use the extensibility of Celery and try to embed our functionality into Celery's built-in functionality.

The idea will be optimized further, if possible, once we have finished implementing this functionality.

I also want to share a tool called Flower for monitoring and administering Celery clusters, which will greatly extend the capability of our project.

Works done Last week (23-30/06/2013):

  • Studying Celery.
  • Discussing and analyzing the Architecture changes.
  • Implemented functionality to find the client's geolocation using IPv4 & IPv6 addresses and the Team Cymru IP to ASN mapping service.
  • Implemented functionality on the client side to automatically join the geolocation-based queue, or create one if it doesn’t exist.
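The geolocation lookup can be sketched as below, assuming the DNS interface of the Team Cymru service is used (the project may query the service differently). The function names are hypothetical, the sample TXT record is just an example of the service's "ASN | prefix | country | registry | date" layout, and no network request is made here.

```python
import ipaddress


def cymru_query_name(ip):
    """Build the DNS name whose TXT record describes the origin of `ip`."""
    address = ipaddress.ip_address(ip)
    if address.version == 4:
        suffix, zone = ".in-addr.arpa", "origin.asn.cymru.com"
    else:
        suffix, zone = ".ip6.arpa", "origin6.asn.cymru.com"
    # reverse_pointer yields the reversed octets (or nibbles) plus an arpa suffix.
    reversed_labels = address.reverse_pointer[: -len(suffix)]
    return reversed_labels + "." + zone


def parse_origin_record(txt):
    """Split an 'ASN | prefix | CC | registry | date' record into a dict."""
    asn, prefix, country, registry, allocated = [
        field.strip() for field in txt.split("|")
    ]
    return {
        "asn": asn,
        "prefix": prefix,
        "country": country,
        "registry": registry,
        "allocated": allocated,
    }


query_name = cymru_query_name("216.90.108.31")

# Sample answer text; a real worker would obtain it with a DNS TXT lookup.
record = parse_origin_record("23028 | 216.90.108.0/24 | US | arin | 1998-09-25")
nation_queue = record["country"]  # the worker would join this queue
```

The country code from the record can then serve directly as the name of the geolocation-based queue the worker joins.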

Plans for Next week:

  • Testing and Improving the already implemented functionalities.
  • Finding Parameters and Algorithm for Performance Value of Clients running Thug Instances, so that distribution is done efficiently.
  • Implement the Algorithm for finding Performance Value.

Thug Distributed Task Queuing Week-1 Blog

Week-1 Blog:

The project started with a great week, as I came across some very good software like RabbitMQ and Celery, which has increased the capability of our project a lot.

Works done Last week (17-23/06/2013):

  • Started learning about RabbitMQ from book: RabbitMQ in Action.
  • Implemented a simple Prototype with a Single Queue for distributing tasks and Callback Queues for getting results back, using the RabbitMQ. (Prototype using RabbitMQ)
  • Started learning about Celery from its documentation.
  • Implemented the same Prototype as above using Celery + RabbitMQ where Celery worked for Distributing tasks and RabbitMQ as Message Broker. (Prototype using Celery)
  • Measured CPU usage and free memory for calculating the performance value, which will be computed on each system running a Thug instance.
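A performance value built from CPU and memory readings might be computed as in this sketch. The formula, the equal weighting, and the function name are assumptions for illustration only; the readings themselves would come from whatever system monitoring the worker uses.

```python
def performance_value(cpu_idle_percent, mem_free_percent, cpu_weight=0.5):
    """Combine idle CPU and free memory (both 0-100) into a score in [0, 1]."""
    for reading in (cpu_idle_percent, mem_free_percent):
        if not 0 <= reading <= 100:
            raise ValueError("readings must be percentages between 0 and 100")
    # Weighted average of the two readings, scaled down to the range 0..1.
    score = cpu_weight * cpu_idle_percent + (1 - cpu_weight) * mem_free_percent
    return score / 100.0


# A machine with 80% idle CPU and 40% free memory.
value = performance_value(80, 40)
```

A higher value means the machine has more spare capacity, so it can be given more URLs.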

Plans for Next week:

  • Analyzing and discussing a new idea to check connected systems(running thug instances) capability by which we will be free from using Performance Value.
  • Implementing a Prototype of the new idea and comparing its performance with the old one.
  • Studying Celery and RabbitMQ in more depth to improve the prototype.
  • Try to run the Prototype using Thug API on client side.