Friday, March 4, 2016

Polymorphic and anonymous functions in Scala

Anonymous functions in Scala are nameless, lightweight functions which can be passed around like any other value. They are also called function literals, lambda functions, lambda expressions or simply lambdas. (x, y) => x + y is an anonymous function which takes two variables as input and returns their sum. Another common lambda form uses an underscore, e.g. _ + 1, which denotes an increment function.
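For instance, a quick sketch of both forms (the names add and inc are only for illustration; the functions themselves are nameless):

  val add = (x: Int, y: Int) => x + y // anonymous function taking two Ints
  val inc = (_: Int) + 1              // underscore shorthand for x => x + 1
  println(add(2, 3))                  // prints 5
  println(inc(41))                    // prints 42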
Monomorphic functions are functions which work on a single concrete type. Polymorphic functions are generic and work across different data types. The following example shows the use of anonymous functions, a monomorphic function, a polymorphic function and tail recursion in Scala.

object Demo{
def main(args: Array[String]): Unit = {
    val arrD = new Array[Double](4)
    arrD(0) = 2.0
    arrD(1) = 4.0
    arrD(2) = 6.0
    arrD(3) = 7.0
    println(binarySearch(arrD, 7.0))
    println(binSearch[Double](arrD, 7.0, (a: Double, b: Double) => a > b))
}

// Monomorphic binary search with tail recursion
def binarySearch(ds: Array[Double], key: Double): Int = {
    @annotation.tailrec
    def go(low: Int, mid: Int, high: Int): Int = {
      if (low > high) -mid - 1 // not found: encode the last midpoint as a negative value
      else {
        val mid2 = (low + high) / 2
        val d = ds(mid2)
        if (d == key) mid2
        else if (d > key) go(low, mid2, mid2 - 1)
        else go(mid2 + 1, mid2, high)
      }
    }
    go(0, 0, ds.length - 1)
  }


  // Returns the index of the searched key if found, else a
  // negative number if the key doesn't exist in the array.
  // Takes an anonymous function/lambda as a parameter for comparison.
  // This is a generic method which works for any data type.
  def binSearch[X](data: Array[X], key: X, gt: (X, X) => Boolean): Int = {
    @annotation.tailrec
    def go(low: Int, mid: Int, high: Int): Int = {
      if (low > high) -mid - 1
      else {
        val mid2 = (low + high) / 2
        val d = data(mid2)
        val greater = gt(d, key)
        if (!greater && !gt(key, d)) mid2 // neither greater nor smaller, so it matches
        else if (greater) go(low, mid2, mid2 - 1)
        else go(mid2 + 1, mid2, high)
      }
    }
    go(0, 0, data.length - 1)
  }
}
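Since binSearch is generic, the same method works unchanged for other element types. For example, from inside main you could search a sorted array of strings (the names here are just illustrative):

    val names = Array("Alice", "Bob", "Carol")
    println(binSearch[String](names, "Bob", (a: String, b: String) => a > b)) // prints 1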

Let me know if you have any comments.

Thursday, March 3, 2016

Higher order functions in Scala

A function that takes another function as an argument is called a higher order function. Like any other parameter, the function parameter is given a type: Int => Int indicates that f expects an integer as input and returns an integer (........, f: Int => Int). A complete example is shown below which uses a higher order function and tail recursion.

object Demo{
def main(args: Array[String]): Unit = {
    println(formatResult("Nth fibonacci number where value of N is", 5, fibonacci))
    println(formatResult("Factorial of", 7, factorial))
 
  }
  // nth fibonacci number using tail recursion
  def fibonacci(n: Int): Int = {
    @annotation.tailrec
    def nextNum(num1: Int, num2: Int, n: Int): Int = {
      if (n <= 0) num2 + num1 // <= 0 (instead of == 0) also handles n = 1 safely
      else nextNum(num2, num2 + num1, n - 1)
    }
    nextNum(0, 1, n - 2)
  }

  def factorial(n: Int): Int = {
    @annotation.tailrec
    def fact(num: Int, acc: Int): Int = {
      if (num <= 0) acc
      else fact(num - 1, num * acc)
    }
    fact(n, 1)
  }

  def formatResult(name: String, n: Int, g: Int => Int): String = {
    val msg = "The %s %d is %d"
    msg.format(name, n, g(n))
  }
}
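Since formatResult is a higher order function, you can also pass an anonymous function to it directly, without defining a named method first:

    println(formatResult("square of", 4, x => x * x)) // prints: The square of 4 is 16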

Tail Call Recursion in Scala

Recursion is resource intensive and may lead to a stack overflow when processing a huge data set. Tail recursion lets the compiler convert a recursion into a simple iteration. Pure functions come in handy here because their calls can be replaced by their results using the substitution model and referential transparency. Scala is a very powerful language because such features help remove the side effects of functions and let us treat functions as objects. Following are a few examples which showcase tail recursion:

// nth fibonacci number using tail recursion
  def fib(n: Int): Int = {
    @annotation.tailrec
    def nextNum(num1: Int, num2: Int, n: Int): Int = {
      if (n <= 0) num2 + num1 // <= 0 (instead of == 0) also handles n = 1 safely
      else nextNum(num2, num2 + num1, n - 1)
    }
    nextNum(0, 1, n - 2)
  }


// factorial function
  def factorial(n: Int): Int = {
    @annotation.tailrec
    def fact(num: Int, acc: Int): Int = {
      if (num <= 0) acc
      else fact(num - 1, num * acc)
    }
    fact(n, 1)
  }
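For contrast, here is a naive factorial which is not tail recursive: the multiplication happens after the recursive call returns, so a stack frame is kept for every level, and @annotation.tailrec would reject it:

  // NOT a tail call: the result of the recursive call is still
  // multiplied by n, so deep inputs can overflow the stack
  def factorialNaive(n: Int): Int =
    if (n <= 0) 1 else n * factorialNaive(n - 1)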

Awaiting your comments, and the examples where you tried to convert a recursion to a tail recursion.

Monday, January 25, 2016

Data Security using SSL.


Clickstream is a term for the data fetched from a web server, where the data is generated by the traffic on the website. There is a lot of useful information, and there are patterns which can be analysed to make sense out of the raw data. Security of this data is very important, and this is where SSL/TLS comes in.
SSL stands for Secure Sockets Layer and TLS for Transport Layer Security.

POODLE attack and the end of SSL 3.0.
The POODLE attack is similar to the BEAST attack. Through this attack the attacker can gain access to cookies and private data of the user. Because of such incidents, HIPAA (Health Insurance Portability and Accountability Act) guidance says to stop using SSL 3.0 for all health related websites.

Every website which wants to use SSL has to have an SSL certificate.
Thawte and VeriSign are two well-known certificate authorities which issue SSL certificates to websites for a stipulated timeframe.
An SSL certificate is essentially a public/private key pair for that particular website: the certificate carries the public key, while the private key stays on the server.
If the server doesn't trust the client, client-side SSL certificates are used and the server has to verify them.
Once the client and the server trust each other, the client generates a symmetric key and chooses the cipher to be used.
This symmetric key, or password, is then encrypted with the server's public key and sent to the server. Only the server can decrypt this key.
The rest of the data can be transmitted using that key and the chosen cipher.

Keys used for SSL are typically 2048-bit for the public/private key pair, while the symmetric ciphers use 128-bit keys, to make it more secure.
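To see these pieces from code, here is a minimal sketch in Scala using the JDK's javax.net.ssl API (the host example.com is only illustrative):

import javax.net.ssl.{SSLSocket, SSLSocketFactory}

object SslHandshakeDemo {
  def main(args: Array[String]): Unit = {
    // the default factory trusts the CA certificates shipped with the JVM
    val factory = SSLSocketFactory.getDefault.asInstanceOf[SSLSocketFactory]
    val socket = factory.createSocket("example.com", 443).asInstanceOf[SSLSocket]
    socket.startHandshake() // certificate verification and symmetric key exchange happen here
    val session = socket.getSession
    println("Protocol: " + session.getProtocol + ", Cipher: " + session.getCipherSuite)
    socket.close()
  }
}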

Wednesday, January 28, 2015

Splunk: Listen to your Data


Hunk™: Splunk Analytics for Hadoop


Hunk helps us to explore, analyse and visualize data stored in Hadoop. It is fast compared to the rest of the lot. Hunk is the next evolution in big data analytics: it helps convert data stored in Hadoop clusters into strategic assets. Deriving value and insights out of the huge data sets stored by organisations has proven to be a challenge.

Hunk works on all major distributions of Apache Hadoop, including those from Amazon Web Services, Cloudera, Hortonworks, IBM, MapR and Pivotal.
Others giving Splunk tough competition are Logstash, Kibana, Graylog2, LogLogic, LogRhythm and Elasticsearch, which provide an increasingly useful alternative to commercial log analysis tools.
Although Splunk is a wonderful log analysis tool, there are a lot of open source alternatives and competitors. If you cannot afford the high price of Splunk, you can get open source and free log analysis tools which provide almost the same functionality. I have listed below 20 free and open source alternatives and competitors of the Splunk log analysis tool.
1. Scribe - Real time log aggregation used in Facebook
Scribe is a server for aggregating log data that's streamed in real time from clients. It is designed to be scalable and reliable. It is developed and maintained by Facebook. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.
        
2. Logstash - Centralized log storage, indexing, and searching
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. Logstash comes with a web interface for searching and drilling into all of your logs.
      
3. Octopussy - Perl/XML Logs Analyzer, Alerter & Reporter
Octopussy is a log analyzer tool. It analyzes logs, generates reports and alerts the admin. It has LDAP support to maintain a users list, exports reports by email, FTP & SCP, can generate scheduled reports and uses RRDtool to generate graphs.
      
4. Awstats - Advanced web, streaming, ftp and mail server statistics
AWStats is a powerful tool that generates advanced web, streaming, ftp or mail server statistics graphically. It can analyze log files from all major servers: Apache log files, WebStar, IIS and a lot of other web, proxy, wap and streaming servers, mail servers and some ftp servers. This log analyzer works as a CGI or from the command line and shows you all the information your log contains in a few graphical web pages.
      
5. nxlog - Multi platform Log management
 nxlog is a modular, multi-threaded, high-performance log management solution with multi-platform support. In concept it is similar to syslog-ng or rsyslog but is not limited to unix/syslog only. It can collect logs from files in various formats, receive logs from the network remotely over UDP, TCP or TLS/SSL . It supports platform specific sources such as the Windows Eventlog, Linux kernel logs, Android logs, local syslog etc.
      
6. Graylog2 - Open Source Log Management
Graylog2 is an open source log management solution that stores your logs in ElasticSearch. It consists of a server written in Java that accepts your syslog messages via TCP, UDP or AMQP and stores them in the database. The second part is a web interface that allows you to manage the log messages from your web browser. Take a look at the screenshots or the latest release info page to get a feeling of what you can do with Graylog2.
      
7. Fluentd - Data collector, Log Everything in JSON
Fluentd is an event collector system. It is a generalized version of syslogd which handles JSON objects for its log messages. It collects logs from various data sources and writes them to files, databases or other types of storage.
      
8. Meniscus - The Python Event Logging Service
Meniscus is a Python based system for event collection, transit and processing at large scale. Its primary use case is large-scale cloud logging, but it can be used in many other scenarios including usage reporting and API tracing. Its components include Collection, Transport, Storage, Event Processing & Enhancement, Complex Event Processing, and Analytics.
      
9. lucene-log4j - Log4j file rolling appender which indexes log with Lucene
lucene-log4j solves a recurrent problem that production support teams face whenever a live incident happens: filtering production log statements to match a session/transaction/user ID. It works by extending Log4j's RollingFileAppender with Lucene indexing routines. Then, with a LuceneLogSearchServlet, you get access to your logs through a web front end.
    
10. Chainsaw - log viewer and analysis tool
Chainsaw is a companion application to Log4j written by members of the Log4j development community. Chainsaw can read log files formatted in Log4j's XMLLayout, receive events from remote locations, read events from a DB, and it can even work with JDK 1.4 logging events.
11. Logsandra - log management using Cassandra
Logsandra is a log management application written in Python, using Cassandra as its back-end. It was written as a demo for Cassandra, but it is worth a look. It also provides support for creating your own parser.
    
12. Clarity - Web interface for the grep
Clarity is a Splunk-like web interface for your server log files. It supports searching (using grep) as well as tailing log files in realtime. It has been written using an event based architecture on top of EventMachine and so allows real-time search of very large log files.
    
13. Webalizer - fast web server log file analysis
The Webalizer is a fast web server log file analysis program. It produces highly detailed, easily configurable usage reports in HTML format, for viewing with a standard web browser. It handles standard Common logfile format (CLF) server logs, several variations of the NCSA Combined logfile format, wu-ftpd/proftpd xferlog (FTP) format logs, Squid proxy server native format, and W3C Extended log formats.
    
14. Zenoss - Open Source IT Management
Zenoss Core is an open source IT monitoring product that delivers the functionality to effectively manage the configuration, health and performance of networks, servers and applications through a single, integrated software package.
    
15. OtrosLogViewer - Log parser and Viewer
OtrosLogViewer can read log files formatted in Log4j (pattern and XMLLayout) and java.util.logging. The source of events can be a local or remote file (ftp, sftp, samba, http) or sockets. It has many powerful features like filtering, marking, formatting, adding notes, etc. It can also format SOAP messages in logs.
     
16. Kafka - A high-throughput distributed messaging system
Kafka provides a publish-subscribe solution that can handle all the activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches and other user actions) is a key ingredient in many of the social features on the modern web. Such data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements, and Kafka is a viable solution for feeding logging data into Hadoop.
    
17. Kibana - Web Interface for Logstash and ElasticSearch
Kibana is a highly scalable interface for Logstash and ElasticSearch that allows you to efficiently search, graph, analyze and otherwise make sense of a mountain of logs. Kibana will load balance against your Elasticsearch cluster. Logstash's daily rolling indices let you scale to huge datasets, while Kibana's sequential querying gets you the most relevant data quickly, with more as it becomes available.
    
18. Pylogdb - A Python-powered, column-oriented database suitable for web log analysis
pylogdb is a database suitable for web log analysis.
19. Epylog - a Syslog parser
Epylog is a syslog parser which runs periodically, looks at your logs, processes some of the entries in order to present them in a more comprehensible format, and then mails you the output. It is written specifically for large network clusters where a lot of machines (around 50 and upwards) log to the same loghost using syslog or syslog-ng.
  
20. Indihiang - IIS and Apache log analyzing tool

Indihiang Project is a web log analyzing tool. It analyzes IIS and Apache web logs and generates real time reports. It has a web log viewer and analyzer and is capable of analyzing trends from the logs. The tool also integrates with Windows Explorer, so you can open a log file in Indihiang via the context menu.

Friday, May 2, 2014

Compiling and executing Java Native Interface code

In order to create a Java native interface we need to write a class which contains native methods. Suppose SystemCheck.java is the Java file containing the native methods.
Keep SystemCheck.java in the package com/tp/pc/schedule/system

For compiling:
 javac com/tp/pc/schedule/system/SystemCheck.java

This will create a class file com/tp/pc/schedule/system/SystemCheck.class

Create a header file for the class:
javah -d inc com.tp.pc.schedule.system.SystemCheck

Execute 
java -Djava.library.path=/path/to/native/library -jar system.jar

Sunday, April 27, 2014

GitHub: An Open Source Developer's Tool

GitHub is a code sharing and publishing service. It is a social networking site for programmers. What is so special about GitHub? At the heart of GitHub is Git, an open source project started by Linus Torvalds. Git, like other version control systems, manages and stores revisions of projects. Git can version Word documents and project files as well.

The difference from other version control systems like CVS and Subversion is that they are centralized while Git is distributed. In a distributed version control system, if you want to make changes you copy the whole repository to your own system. After making changes on the local copy you can check the changes in to the central system, so you don't have to connect to the server every time you make a change.

GitHub is a Git repository hosting service. Git is a command line tool, but GitHub provides a web based graphical user interface on top of it. In addition, it provides access control and other features, such as wikis and basic task management tools. Following are three key features of GitHub:
Fork: Forking is the most important feature of GitHub. It means the repository of one user can be copied to another account, so you can modify a repository under your own account even if you don't have write access to the original.
Pull Request: If you would like to share the changes you made, you can send a notification called a "pull request" to the original owner.
Merge: Once the pull request is made, the original owner can then, with a click of a button, merge the changes found in your repo into the original repo.

I think this is the best way an open source project can be run.
If you want to contribute to an open source project, GitHub provides the best and easiest approach. Earlier you had to manually download the project's source code, make your changes locally, create a list of changes called a "patch" and then e-mail the patch to the project's maintainer. The maintainer would then have to evaluate this patch, possibly sent by a total stranger, and decide whether to merge the changes.

GitHub is growing: each day many repositories are forked and many more are merged. On 23 December 2013, GitHub announced that it had reached 10 million repositories. There is no hard limit on the size of a repository, but the guideline says it should not exceed one gigabyte. There is also a check for files larger than 100MB in a push; if any such files exist, the push will be rejected.