
K-Nearest Neighbors Algorithm Using Python


What is KNN Algorithm?

K-nearest neighbors, or the KNN algorithm, is a simple algorithm that uses the entire dataset as its training phase. Whenever a prediction is required for an unseen data instance, it searches the training dataset for the k most similar instances, and the most common outcome among those instances is returned as the prediction.

kNN is often used in search applications where you are looking for similar items, such as "find items similar to this one".

The algorithm suggests that if you are similar to your neighbours, then you are one of them. For example, if an apple looks more similar to a peach, pear, and cherry (fruits) than to a monkey, cat, or rat (animals), then most likely the apple is a fruit.

How does a KNN Algorithm work?

The k-nearest neighbors algorithm uses a very simple approach to perform classification. When tested with a new example, it looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label (among those k training examples) to the test example.


KNN Algorithm - k=3 - edureka

What does ‘k’ in kNN Algorithm represent?

k in kNN algorithm represents the number of nearest neighbor points which are voting for the new test data’s class.

If k=1, then test examples are given the same label as the closest example in the training set.

If k=3, the labels of the three closest training examples are checked and the most common label (i.e., the one occurring at least twice) is assigned, and so on for larger values of k.
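For instance, a quick way to see this majority vote in action (a purely illustrative snippet with made-up labels):

from collections import Counter

# Hypothetical labels of the 3 nearest neighbours
neighbour_labels = ['M', 'M', 'L']

# The most common label among the neighbours becomes the prediction
print(Counter(neighbour_labels).most_common(1)[0][0])   # -> 'M'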

 

kNN Algorithm Manual Implementation

Let’s consider this example,

Suppose we have the height, weight, and corresponding T-shirt size of several customers. Your task is to predict the T-shirt size of Anna, whose height is 161 cm and whose weight is 61 kg.

KNN Algorithm - 1 - edureka

Step1: Calculate the Euclidean distance between the new point and the existing points

For example, Euclidean distance between point P1(1,1) and P2(5,4) is:
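Distance = √((5 − 1)² + (4 − 1)²) = √(16 + 9) = √25 = 5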

euclidean distance - KNN Algorithm - edureka

KNN Algorithm - 2 - edureka

Step 2: Choose the value of K and select the K neighbors closest to the new point.

In this case (K = 5), select the top 5 neighbors having the least Euclidean distance.

Step 3: Count the votes of all the K neighbors / Predicting Values

Since, for K = 5, four of the five nearest neighbours have T-shirts of size M, according to the kNN algorithm Anna, with a height of 161 cm and a weight of 61 kg, will fit into a T-shirt of size M.

Implementation of kNN Algorithm using Python

  • Handling the data

  • Calculate the distance

  • Find k nearest point

  • Predict the class

  • Check the accuracy


Step 1: Handling the data

The very first step will be handling the iris dataset. Open the dataset using the open function and read the data lines with the reader function available under the csv module.

import csv
with open(r'C:\Users\Atul Harsha\Documents\iris.data.txt') as csvfile:
	lines = csv.reader(csvfile)
	for row in lines:
		print (', '.join(row))

Now you need to split the data into a training dataset (for making the prediction) and a testing dataset (for evaluating the accuracy of the model). 

Before you continue, convert the flower measures, which were loaded as strings, to numbers. Next, randomly split the dataset into a training and a test dataset. Generally, a standard ratio of 67/33 is used for the train/test split.

Putting it all together, let’s define a function handleDataset which loads the CSV for a given filename and splits it randomly into train and test datasets using the provided split ratio.

import csv
import random
def handleDataset(filename, split, trainingSet=[], testSet=[]):
    # Load the CSV and randomly assign each row to the training or the test set
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

Let’s check the above function and see if it is working fine,

Testing handleDataset function

trainingSet=[]
testSet=[]
handleDataset('iris.data', 0.66, trainingSet, testSet)
print ('Train: ' + repr(len(trainingSet)))
print ('Test: ' + repr(len(testSet)))

Step 2: Calculate the distance

In order to make any predictions, you have to calculate the distance between the new point and the existing points, as you will be needing k closest points.

In this case, for calculating the distance, we will use the Euclidean distance. This is defined as the square root of the sum of the squared differences between the two arrays of numbers.
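In other words, for two points p and q described by n numeric attributes:

distance(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)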

Specifically, we only need the first 4 attributes (features) for the distance calculation, as the last attribute is a class label. So one approach is to limit the Euclidean distance calculation to a fixed length, thereby ignoring the final dimension.

Summing it up let’s define euclideanDistance function as follows:

 
import math
def euclideanDistance(instance1, instance2, length):
	distance = 0
	for x in range(length):
		distance += pow((instance1[x] - instance2[x]), 2)
	return math.sqrt(distance)

Testing the euclideanDistance function,

data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print ('Distance: ' + repr(distance))

Step 3: Find k nearest point

Now that you have calculated the distance from each point, we can use it to collect the k most similar points/instances for the given test data/instance.

This is a straightforward process: calculate the distance of the test instance with respect to every training instance and select the k instances with the smallest Euclidean distance.

Let’s create a getKNeighbors function that returns the k most similar neighbors from the training set for a given test instance.

import operator 
def getKNeighbors(trainingSet, testInstance, k):
	distances = []
	length = len(testInstance)-1
	for x in range(len(trainingSet)):
		dist = euclideanDistance(testInstance, trainingSet[x], length)
		distances.append((trainingSet[x], dist))
	distances.sort(key=operator.itemgetter(1))
	neighbors = []
	for x in range(k):
		neighbors.append(distances[x][0])
	return neighbors

Testing getKNeighbors function

trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getKNeighbors(trainSet, testInstance, k)
print(neighbors)

Step 4: Predict the class

Now that you have the k nearest points/neighbors for the given test instance, the next task is to predict a response based on those neighbors.

You can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

Let’s create a getResponse function for getting the majority voted response from a number of neighbors.

import operator
def getResponse(neighbors):
	classVotes = {}
	for x in range(len(neighbors)):
		response = neighbors[x][-1]
		if response in classVotes:
			classVotes[response] += 1
		else:
			classVotes[response] = 1
	sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
	return sortedVotes[0][0]

Testing getResponse function

neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
print(getResponse(neighbors))

Step 5: Check the accuracy

Now that we have all of the pieces of the kNN algorithm in place, let’s check how accurate our prediction is!

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made.

Let’s create a getAccuracy function which sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

def getAccuracy(testSet, predictions):
	correct = 0
	for x in range(len(testSet)):
		if testSet[x][-1] == predictions[x]:
			correct += 1
	return (correct/float(len(testSet))) * 100.0

Testing getAccuracy function

testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)

Since we have created all the pieces of the KNN algorithm, let’s tie them up using the main function.

# Example of kNN implemented from Scratch in Python

import csv
import random
import math
import operator

def handleDataset(filename, split, trainingSet=[], testSet=[]):
    # Load the CSV file and randomly split it into train and test sets
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    # Square root of the sum of squared differences over the first 'length' attributes
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getKNeighbors(trainingSet, testInstance, k):
    # Return the k training instances closest to the test instance
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # Majority vote over the class labels of the neighbors
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    # Percentage of test instances whose label was predicted correctly
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    handleDataset('iris.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getKNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
	
main()
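For comparison, the same workflow takes only a few lines with scikit-learn’s built-in KNeighborsClassifier. This is a minimal sketch, assuming scikit-learn is installed; the variable names are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset and split it roughly 67/33 into train and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=1)

# k = 3 nearest neighbours, with Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print('Accuracy: ' + repr(knn.score(X_test, y_test) * 100.0) + '%')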

This was all about the kNN algorithm using Python. In case you are still left with a query, don’t hesitate to add your doubt in the blog’s comment section.



AI in Wimbledon: Power Highlights, Analytics and Insights


Tennis is a wonderful and unique sport, no doubt about it. What makes tennis unique is not just the interesting gameplay it offers or the massive following it commands; this racket sport is one of a kind because of the speed at which it adopts new-age technologies, such as AI in Wimbledon this year.

To ensure that the sport is the best it can be, Wimbledon, the world’s oldest and most followed tennis tournament, has IBM onboard as its technology partner. Each year, IBM introduces several new utilities that redefine the way tennis is enjoyed. Last year, we saw tools that were focused on big data analytics to track player performance and predict match results. This year, however, has moved into a whole new realm. We recently saw AI in Wimbledon make headlines all over the web.

IBM used automated data analytics using their famous AI, IBM Watson. Let’s look at what artificial intelligence and analytics did to make the Wimbledon tournament better for both players and spectators this year.

AI in Wimbledon | Edureka Blog | Infographic

AI in Wimbledon: Making Tennis Smarter

As evident from the image above, there were two trending technologies that IBM adopted this year to make the tournament a success. The first one, big data analytics, was basically an upgrade to last year’s initiative. While Wimbledon 2017’s approach was to analyze player stats and match highlights to determine the best players in each form of the game, this year IBM Watson, the AI that constitutes the second technology IBM used, took this to a whole new level.

The insights that IBM Watson gained over a period of 13 days were not only used to determine player performances, they also became the building blocks of 2 major innovations in the field of tennis: The Messenger Bot Ask Fred and Power Highlights. The image above clearly depicts the way these two technological marvels made tennis a better experience for both spectators and players.

It is quite evident that tennis is one of the few sports that is very open to adopting new technologies. We’re sure that IBM will bring in even better utilities in the coming years from its post as Wimbledon’s technology partner. Be sure to keep an eye out for them.

If you like our coverage on the usage of technology in different domains like sports, why not subscribe to our blogs and be the first to get these updates?


Continuous Deployment – A Comprehensive Guide With An Example


Releasing software isn’t an art, but it is an engineering discipline. Continuous Deployment can be thought of as an extension to Continuous Integration which makes us catch defects earlier.

In this blog on Continuous Deployment, you will go through the following topics: what Continuous Deployment is, how it differs from Continuous Delivery, LinkedIn’s case study, its benefits, and a hands-on example using Jenkins.

So, let’s deep dive into Continuous Deployment!

What is Continuous Deployment?

It is an approach of releasing software onto the production servers continuously, in an automated fashion. Once the code passes through all the stages of compiling the source code, validating it, reviewing it, performing unit and integration testing, and packaging the application, it is deployed onto the test servers to perform User Acceptance Tests. Once that is done, the software is deployed onto the production servers for release, and this is what we call Continuous Deployment.


 

Now, often people get confused between the terms, Continuous Delivery & Continuous Deployment. So let me clarify the confusion for you!

Continuous Delivery vs Continuous Deployment

Continuous Delivery vs Continuous Deployment - Continuous Deployment - Edureka

Continuous Delivery does not involve deployment to production on every change that occurs. You just need to ensure that the code is always in a deployable state, so you can deploy it easily whenever you want.

On the other hand, Continuous Deployment requires every change to be deployed automatically, without human intervention.

So, as you can see in the diagram, if the newly built application is automatically deployed to production once the Continuous Integration stages are completed, then it is Continuous Deployment. On the other hand, if we manage to automate everything but decide to require a human approval in order to proceed with the deployment of the new version, then we are talking about Continuous Delivery. The difference is very subtle, but it has enormous implications, making each technique appropriate for different situations.


So, now that you have an understanding of Continuous Deployment, let’s see a case study on Continuous Deployment.

LinkedIn’s Case Study Of Continuous Deployment

LinkedIn is an employment-oriented service that is mainly used for professional networking. LinkedIn’s prior system before implementing Continuous Deployment was more traditional.

LinkedIn Traditional System - Continuous Deployment - Edureka

The system included various branches diverging from a single trunk developed in a parallel manner.  So, a developer would write big batches of code with respect to various features and then wait for this feature branch to be merged into the trunk i.e. the master branch.

Once the feature was merged into the master branch, it had to be tested again to make sure that it did not break the code of any other feature at the same time.

Since this system included several batches of code written in isolation by various teams and then merged into a single branch, it was known as a feature branch system. This kind of system limited the scope and number of features, thus slowing down the company’s development life cycle.

Looking at the above conditions, LinkedIn decided to move from its traditional feature-based development lifecycle to Continuous Deployment.

This required migrating the old code and building out the automated tools to make the new system work, thus halting LinkedIn’s development for months.

LinkedIn Use Case For Continuous Deployment - Continuous Deployment - Edureka

LinkedIn’s framework after using continuous deployment included developers writing code in tidy, distinct chunks, and checking each chunk into the trunk shared amongst all LinkedIn developers. The newly-added code is then subjected to a series of automated tests to remove bugs.

Once the code passes the tests it is merged into trunk and listed out in a system that shows managers what features are ready to go live on the site or in newer versions of LinkedIn’s apps.

So, that was LinkedIn’s success story!

Now, let me continue this discussion by telling you the basic benefits of Continuous Deployment.

Benefits of Continuous Deployment

The benefits that Continuous Deployment offers are as follows:

  • Speed – Development does not pause for releases, so features reach production really fast.
  • Secure – Releases are less risky, because testing is performed before every release and the bugs are fixed early.
  • Continuous Improvements – Continuous Deployment supports continuous improvements that are directly visible to customers.


Hands-On

Problem Statement: Deploy an application in headless mode through Jenkins server, using selenium test files.

Solution: Follow the steps below to deploy the application in a headless mode.

Step 1: Open your Eclipse IDE and create a Maven Project. To create a maven project go to  File -> New -> Maven Project. In the dialog box that opens up, mention the Group Id and the Artifact Id and then click on Finish.

Maven Project - Continuous Deployment - Edureka

Step 2: Once you create your maven project, include the code of the Selenium app in the main Java file and make sure you have inserted the argument to run it in headless mode.

App Java File - Continuous Deployment - Edureka
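The screenshot is not reproduced here, but for illustration, this is the kind of headless setup the step refers to. The sketch below uses Selenium’s Python bindings rather than the article’s Java project, and assumes Chrome and a matching chromedriver are available:

from selenium import webdriver

# Configure Chrome to run without opening a visible browser window
options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get('https://www.edureka.co')
print(driver.title)   # confirms the page loaded even though no window appeared
driver.quit()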

Step 3: After that include the required dependencies in the pom.xml file.

Step 4: After this, your project is ready to run. Since we want to run it in headless mode, we have to deploy this application in the Jenkins Server.

Step 5: So, you have to export your project as a JAR file. To do that, go to File -> Export -> choose Runnable JAR file. After that click on Next.

Step 6: In the next dialog box, choose the launch configuration of the app you want, choose the directory where you want to export it, and then click on Finish.

Choose Launch Configuration - Continuous Deployment - Edureka

Step 7: After you export your project as a JAR file, you have to push it to a GitHub repository. To push it to the GitHub repository, first, create a new repository in your GitHub account.

Step 7.1: To do that, go to the Repositories tab and choose the option New.

Create New Repository In GitHub - Continuous Deployment - Edureka

Step 7.2: After that mention the repository name, and choose if you wish your project to be private or public and then finally click on Create Repository.

Give Repository Details In GitHub - Continuous Deployment - Edureka

Step 8: To push your project to this repository, follow the below steps:

Step 8.1: Go to the directory where your jar file is present and initialize git using the command git init.

Step 8.2: After that, perform the git add operation using the command git add.

Step 8.3: Once you are done with that, commit the operation using the command git commit -m ‘Type in your message here’.

Step 8.4: Now connect your GitHub repository to local repository by using the command git remote add origin ‘Link of your repository’(Don’t include quotations)

Step 8.5: Now push your repository by using the command git push -u origin master

Step 9: Once the JAR file has been pushed to the local repository, you have to create a new Job in the Jenkins server. To do that, open your Jenkins Dashboard, and then go to  New Item -> Type in the item name -> Click on OK.

Create New Job In Jenkins Server - Continuous Deployment - Edureka

Step 10: Once your job is created, click on the job and go to configure option.

Step 10.1: Go to Source Code Management tab -> Choose Git -> Mention the Repository URL.

Give Repository URL - Continuous Deployment - Edureka

Step 10.2: After that, go to the Build tab and choose the option Execute shell. In this mention the path of the jar file in your Jenkins workspace.

Give Jenkins Workspace Directory - Continuous Deployment - Edureka

Step 10.3: Once you’re done with the above two steps, save the changes.

Step 11: Click on Build Now, to build the project and see the output.

Output of Hands-On - Continuous Deployment - Edureka

Since this is Continuous Deployment, the program can be deployed by any person working in the team, and the others can only see from the output that something has changed. They will not know who deployed it directly onto the production servers.

But if you run the same project in Eclipse, it will run in a real browser, which we don’t want in our case at present, as we cannot open browsers on the Jenkins server!

If you want the source code of the example shown, please ask for it in the comments section.

If you found this Continuous Deployment blog relevant, check out the DevOps training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka DevOps Certification Training course helps learners gain expertise in various DevOps processes and tools such as Puppet, Jenkins, Nagios, Ansible, Docker, Kubernetes and GIT for automating multiple steps in SDLC.


Got a question for me? Please mention it in the comments section and I will get back to you.


Continuous Delivery Tutorial – Building A Continuous Delivery Pipeline Using Jenkins


Continuous Delivery:

Continuous Delivery is a process where code changes are automatically built, tested, and prepared for release to production. I hope you have enjoyed my previous blogs on Jenkins. Here, I will talk about the following topics:

  • What is Continuous Delivery?
  • Types of Software Testing
  • Difference Between Continuous Integration, Delivery, and Deployment
  • What is the need for Continuous Delivery?
  • Hands-on Using Jenkins and Tomcat

Let us quickly understand how Continuous Delivery works.

What Is Continuous Delivery?

It is a process where you build software in such a way that it can be released to production at any time. Consider the diagram below:

Continuous Delivery - Continuous Delivery - Edureka

Let me explain the above diagram:

  • Automated build scripts will detect changes in Source Code Management (SCM) like Git.
  • Once the change is detected, the source code is deployed to a dedicated build server to make sure the build is not failing and all test classes and integration tests are running fine.
  • Then, the build application is deployed on the test servers (pre-production servers) for User Acceptance Test (UAT).
  • Finally, the application is manually deployed on the production servers for release.

Before I proceed, it will only be fair I explain to you the different types of testing.

Types of Software Testing:

Broadly speaking there are two types of testing:

  • Blackbox Testing: It is a testing technique that ignores the internal mechanism of the system and focuses on the output generated against any input and execution of the system. It is also called functional testing. It is basically used for validating the software.
  • Whitebox Testing: It is a testing technique that takes into account the internal mechanism of a system. It is also called structural testing or glass box testing. It is basically used for verifying the software.

Whitebox testing:

There are two types of testing that fall under this category.

  • Unit Testing: It is the testing of an individual unit or group of related units. It is often done by the programmer to test that the unit he/she has implemented is producing expected output against given input.
  • Integration Testing: It is a type of testing in which a group of components are combined to produce the output. Also, the interaction between software and hardware is tested if software and hardware components have any relation. It may fall under both white box testing and black box testing. 

Blackbox Testing:

There are multiple tests that fall under this category. I will focus on a few, which are important for you to know, in order to understand this blog:

  • Functional/Acceptance Testing: It ensures that the specified functionality required in the system requirements works. It is done to make sure that the delivered product meets the requirements and works as the customer expected.
  • System Testing: It ensures that by putting the software in different environments (e.g., Operating Systems) it still works.
  • Stress Testing: It evaluates how the system behaves under unfavorable conditions.
  • Beta Testing: It is done by end users, a team outside development, or by publicly releasing a full pre-release version of the product, known as the beta version. The aim of beta testing is to cover unexpected errors.

Now is the correct time for me to explain the difference between Continuous Integration, Delivery and Deployment.

Differences Between Continuous Integration, Delivery And Deployment: 

Visual content reaches an individual’s brain in a faster and more understandable way than textual information. So I am going to start with a diagram which clearly explains the difference:

 

Continuous Integration vs Continuous Delivery vs Continuous Deployment - Continuous Delivery - Edureka

In Continuous Integration, every code commit is built and tested, but is not in a condition to be released. That is, the built application is not automatically deployed on the test servers in order to validate it using different types of Blackbox testing, like User Acceptance Testing (UAT).

In Continuous Delivery, the application is continuously deployed on the test servers for UAT. Or, you can say the application is ready to be released to production anytime. So, obviously Continuous Integration is necessary for Continuous Delivery.

Continuous Deployment is the next step past Continuous Delivery, where you are not just creating a deployable package, but you are actually deploying it in an automated fashion.

Let me summarize the differences using a table:

Continuous Integration | Continuous Delivery | Continuous Deployment
Automated build for every commit | Automated build and UAT for every commit | Automated build, UAT and release to production for every commit
Independent of Continuous Delivery and Continuous Deployment | It is the next step after Continuous Integration | It is one step further than Continuous Delivery
By the end, the application is not in a condition to be released to production | By the end, the application is in a condition to be released to production | The application is continuously deployed
Includes Whitebox testing | Includes Blackbox and Whitebox testing | Includes the entire process required to deploy the application

 

In simple terms, Continuous Integration is a part of both Continuous Delivery and Continuous Deployment. And Continuous Deployment is like Continuous Delivery, except that releases happen automatically.


But the question is, whether Continuous Integration is enough.

Why We Need Continuous Delivery?

Let us understand this with an example.

Imagine there are 80 developers working on a large project. They are using Continuous Integration pipelines in order to facilitate automated builds. We know build includes Unit Testing as well. One day they decided to deploy the latest build that had passed the unit tests into a test environment.

This was a lengthy but controlled approach to deployment, carried out by their environment specialists. However, the system didn’t seem to work.

What Might Be The Obvious Cause Of The Failure?

Well, the first reason most people will think of is that there is some problem with the configuration. That is what they thought too. They spent a lot of time trying to find what was wrong with the configuration of the environment, but they couldn’t find the problem.

Why We Need Continuous Delivery - Continuous Delivery - Edureka

One Perceptive Developer Took A Smart Approach:

Then one of the senior developers tried the application on his development machine. It didn’t work there either.

He stepped back through earlier and earlier versions until he found that the system had stopped working three weeks earlier. A tiny, obscure bug had prevented the system from starting correctly. Although the project had good unit test coverage, the 80 developers, who usually only ran the tests rather than the application itself, did not see the problem for three weeks.

Problem Statement:

Without running Acceptance Tests in a production-like environment, they know nothing about whether the application meets the customer’s specifications, nor whether it can be deployed and survive in the real world. If they want timely feedback on these topics, they must extend the range of their continuous integration process.

Let me summarize the lessons learned by looking at the above problems:

  • Unit Tests only test a developer’s perspective of the solution to a problem. They have only a limited ability to prove that the application does what it is supposed to from a user’s perspective. They are not enough to identify the real functional problems.
  • Deploying the application to the test environment was a complex, manually intensive process that was quite prone to error. This meant that every attempt at deployment was a new experiment, a manual, error-prone process.

Solution – Continuous Delivery Pipeline (Automated Acceptance Test):

They took Continuous Integration to the next step (Continuous Delivery) and introduced a couple of simple, automated Acceptance Tests that proved that the application ran and could perform its most fundamental function. The majority of the tests running during the Acceptance Test stage are Functional Acceptance Tests.

Continuous Delivery Pipeline - Continuous Delivery - Edureka

Basically, they built a Continuous Delivery pipeline, in order to make sure that the application is seamlessly deployed on the production environment, by making sure that the application works fine when deployed on the test server which is a replica of the production server.

Enough of the theory, I will now show you how to create a Continuous Delivery pipeline using Jenkins. 

Continuous Delivery Pipeline Using Jenkins:

Here I will be using Jenkins to create a Continuous Delivery Pipeline, which will include the following tasks:

Steps involved in the Demo:

  • Fetching the code from GitHub
  • Compiling the source code
  • Unit testing and generating the JUnit test reports
  • Packaging the application into a WAR file and deploying it on the Tomcat server

Continuous Delivery Use Case - Continuous Delivery - Edureka

Pre-requisites:

  • CentOS 7 Machine
  • Jenkins 2.121.1
  • Docker
  • Tomcat 7

Step – 1 Compiling The Source Code:

Let’s begin by first creating a Freestyle project in Jenkins. Consider the below screenshot:

Jenkins New Project - Continuous Delivery - Edureka

Give a name to your project and select Freestyle Project:

Freestyle Project Jenkins - Continuous Delivery - Edureka

When you scroll down, you will find an option to add a source code repository. Select Git and add the repository URL; in that repository there is a pom.xml file which we will use to build our project. Consider the below screenshot:

Git Jenkins Integration - Continuous Delivery - Edureka

Now we will add a Build Trigger. Pick the Poll SCM option; basically, we will configure Jenkins to poll the GitHub repository every 5 minutes for changes in the code. Consider the below screenshot:

Build Trigger In Jenkins - Continuous Delivery - Edureka

Before I proceed, let me give you a small introduction to the Maven Build Cycle.

Each of the build lifecycles is defined by a different list of build phases, wherein a build phase represents a stage in the lifecycle.

Following is the list of build phases:

  • validate – validate the project is correct and all necessary information is available
  • compile – compile the source code of the project
  • test – test the compiled source code using a suitable unit testing framework. These tests should not require the code be packaged or deployed
  • package – take the compiled code and package it in its distributable format, such as a JAR.
  • verify – run any checks on results of integration tests to ensure quality criteria are met
  • install – install the package into the local repository, for use as a dependency in other projects locally
  • deploy – done in the build environment, copies the final package to the remote repository for sharing with other developers and projects.

I can run the below command, for compiling the source code, unit testing and even packaging the application in a war file:

mvn clean package

You can also break down your build job into a number of build steps. This makes it easier to organize builds in clean, separate stages. 

So we will begin by compiling the source code. In the build tab, click on invoke top level maven targets and type the below command:

compile

Consider the below screenshot:

Compile Source Code - Continuous Delivery - Edureka

This will pull the source code from the GitHub repository and will also compile it (Maven Compile Phase).

Click on Save and run the project.

Build A Freestyle Project - Continuous Delivery - Edureka

Now, click on the console output to see the result.

Compile Output - Continuous Delivery - Edureka

Step – 2 Unit Testing:

Now we will create one more Freestyle Project for unit testing.

Add the same repository URL in the source code management tab, like we did in the previous job.

Now, in the “Build Trigger” tab, click on “Build after other projects are built”. There, type the name of the previous project where we are compiling the source code, and select any of the below options:

  • Trigger only if the build is stable
  • Trigger even if the build is unstable
  • Trigger even if the build fails

I think the above options are pretty self-explanatory so, select any one. Consider the below screenshot:

Build Trigger, After Other Projects - Continuous Delivery - Edureka

In the Build tab, click on invoke top level maven targets and use the below command:

test

Jenkins also does a great job of helping you display your test results and test result trends. 

The de facto standard for test reporting in the Java world is an XML format used by JUnit. This format is also used by many other Java testing tools, such as TestNG, Spock, and Easyb. Jenkins understands this format, so if your build produces JUnit XML test results, Jenkins can generate nice graphical test reports and statistics on test results over time, and also let you view the details of any test failures. Jenkins also keeps track of how long your tests take to run, both globally, and per test—this can come in handy if you need to track down performance issues.

So the next thing we need to do is to get Jenkins to keep tabs on our unit tests.

Go to the Post-build Actions section and tick “Publish JUnit test result report” checkbox. When Maven runs unit tests in a project, it automatically generates the XML test reports in a directory called surefire-reports. So enter “**/target/surefire-reports/*.xml” in the “Test report XMLs” field. The two asterisks at the start of the path (“**”) are a best practice to make the configuration a bit more robust: they allow Jenkins to find the target directory no matter how we have configured Jenkins to check out the source code.

**/target/surefire-reports/*.xml

XML Reports - Continuous Delivery - Jenkins

Again save it and click on Build Now.

Test Results - Continuous Delivery - Jenkins

Now, the JUnit report is written to /var/lib/jenkins/workspace/test/gameoflife-core/target/surefire-reports/TEST-behavior.

XML Reports Output - Continuous Delivery - Jenkins

In the Jenkins dashboard you can also notice the test results:

Test Results In UI - Continuous Delivery - Jenkins

Further Test Results - Continuous Delivery - Edureka

Step – 3 Creating A WAR File And Deploying On The Tomcat Server:

Now, the next step is to package our application in a WAR file and deploy that on the Tomcat server for User Acceptance test.

Create one more freestyle project and add the source code repository URL.

Then in the build trigger tab, select build when other projects are built, consider the below screenshot:

 

Build When Other Projects Are Built - Continuous Delivery - Edureka

Basically, after the test job, the deployment phase will start automatically. 

In the build tab, select shell script. Type the below command to package the application in a WAR file:

mvn package

Package Application - Continuous Delivery - Edureka

The next step is to deploy this WAR file to the Tomcat server. In the “Post-Build Actions” tab, select “Deploy war/ear to a container”. Here, give the path to the WAR file and the context path. Consider the below screenshot:

Deploy The WAR File To The Tomcat Server - Continuous Delivery - Edureka

Select the Tomcat credentials, as shown in the above screenshot. Also, you need to give the URL of your Tomcat server.

In order to add credentials in Jenkins, click on credentials option on the Jenkins dashboard.

Adding Jenkins Credentials - Continuous Delivery - Edureka

Click on System and select global credentials.

Global Credentials - Continuous Delivery - Edureka

Then you will find an option to add the credentials. Click on it and add credentials.

Adding Credentials - Continuous Delivery - Edureka

Add the Tomcat credentials, consider the below screenshot.

Tomcat Credentials In Jenkins - Continuous Delivery - Edureka

Click on OK.

Now in your Project Configuration, add the tomcat credentials which you have inserted in the previous step.

Deploy The WAR File To The Tomcat Server - Continuous Delivery - Edureka

Click on Save and then select Build Now.

Deploy Output - Continuous Delivery - Jenkins

Go to your Tomcat URL with the context path; in my case it is http://localhost:8081. Now add the context path at the end, consider the below screenshot:

Deployed Application - Continuous Delivery - Edureka

Link - http://localhost:8081/gof

I hope you have understood the meaning of the context path.

Now create a pipeline view, consider the below screenshot:

Pipeline-View-Continuous-Delivery-Edureka

Click on the plus icon, to create a new view.

Configure the pipeline the way you want, consider the below screenshot:

Build Pipeline Configuration - Continuous Delivery - Edureka

I did not change anything apart from selecting the initial job, so my pipeline will start from compile. Based on the way I have configured the other jobs, testing and deployment will happen after compile.

Finally, you can test the pipeline by clicking on RUN. Every five minutes, if there is a change in the source code, the entire pipeline will be executed.

Continuous Delivery Pipeline - Continuous Delivery - Edureka

So we are able to continuously deploy our application on the test server for user acceptance test (UAT).

I hope you have enjoyed reading this post on Continuous Delivery. If you have any doubts, feel free to put them in the comment section below and I will get back with an answer at the earliest.


 


Naive Bayes Classifier: Learning Naive Bayes with Python


In a world where Machine Learning and Artificial Intelligence surround almost everything around us, classification and prediction are among the most important aspects of Machine Learning, and Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling. So guys, in this Naive Bayes Tutorial, I’ll be covering the following topics:

 

What is Naive Bayes? 

Naive Bayes is one of the simplest and most powerful algorithms for classification, based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes model is easy to build and particularly useful for very large data sets. There are two parts to this algorithm:

  • Naive
  • Bayes

The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that a particular fruit is an apple or an orange or a banana and that is why it is known as “Naive”. 

Let’s move forward with our Naive Bayes Tutorial Blog and understand Bayes Theorem.

 

What is Bayes Theorem?

In Statistics and probability theory, Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It serves as a way to figure out conditional probability.

Given a hypothesis H and evidence E, Bayes’ Theorem states that the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E), is:

P(H|E) = P(E|H) × P(H) / P(E)

This relates the probability of the hypothesis before getting the evidence, P(H), to the probability of the hypothesis after getting the evidence, P(H|E). For this reason, P(H) is called the prior probability, while P(H|E) is called the posterior probability. The factor that relates the two, P(E|H) / P(E), is called the likelihood ratio. Using these terms, Bayes’ theorem can be rephrased as:

 

“The posterior probability equals the prior probability times the likelihood ratio.”
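In symbols, that reads P(H|E) = P(H) × P(E|H) / P(E), where P(E|H) / P(E) is the likelihood ratio.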

 

Got a little confused? Don’t worry.
Let’s continue our Naive Bayes Tutorial blog and understand this concept with a simple example.


Bayes’ Theorem Example

Let’s suppose we have a deck of cards and we wish to find out the probability of the card we picked at random being a King, given that it is a Face Card. According to Bayes’ Theorem, we can solve this problem. First, we need to find out the following probabilities:

  • P(King) which is 4/52 as there are 4 Kings in a Deck of Cards.
  • P(Face|King) is equal to 1 as all the Kings are face Cards.
  • P(Face) is equal to 12/52 as there are 3 Face Cards in a Suit of 13 cards and there are 4 Suits in total.

Naive-Bayes-Explanation-Naive-Bayes-Tutorial

Now, putting all the values into Bayes’ equation, we get P(King|Face) = P(Face|King) × P(King) / P(Face) = 1 × (4/52) / (12/52) = 4/12 = 1/3.

 

Game Prediction using Bayes’ Theorem

Let’s continue our Naive Bayes Tutorial blog and predict whether a game will be played, using the weather data we have.

So here we have our data, which comprises the Day, Outlook, Humidity and Wind conditions, with the final column being Play, which we have to predict.

Table-Naive-Bayes-Tutorial

 

  • First, we will create a frequency table using each attribute of the dataset.

Frequency-Table-Naive-Bayes-Tutorial

 

  • For each frequency table, we will generate a likelihood table.

Likelihood-Table-Naive-Bayes-Tutorial

 

  • Likelihood of ‘Yes’ given ‘Sunny‘ is:

P(c|x) = P(Yes|Sunny) = P(Sunny|Yes)* P(Yes) / P(Sunny) = (0.3 x 0.71) /0.36  = 0.591

 

  • Similarly Likelihood of ‘No’ given ‘Sunny‘ is:

P(c|x) = P(No|Sunny) = P(Sunny|No)* P(No) / P(Sunny) = (0.4 x 0.36) /0.36  = 0.40

 


  • Now, in the same way, we need to create the Likelihood Table for other attributes as well.

 

Likeihood-Naive-Bayes-Tutorial

 

Suppose we have a Day with the following values :

  • Outlook   =  Rain
  • Humidity   =  High
  • Wind  =  Weak
  • Play =?

 

  • So, with the data, we have to predict whether “we can play on that day or not”.

 

Likelihood of ‘Yes’ on that Day = P(Outlook = Rain|Yes)*P(Humidity= High|Yes)* P(Wind= Weak|Yes)*P(Yes)

=  2/9 * 3/9 * 6/9 * 9/14 =  0.0199

 

Likelihood of ‘No’ on that Day = P(Outlook = Rain|No)*P(Humidity= High|No)* P(Wind= Weak|No)*P(No)

  =  2/5 * 4/5 * 2/5 * 5/14 =  0.0166

 

  • Now we normalize the values, then

P(Yes) =  0.0199 / (0.0199+ 0.0166) = 0.55

P(No) = 0.0166 / (0.0199+ 0.0166)  = 0.45

 

  • Our model predicts that there is a 55% chance there will be a Game tomorrow.

 

Naive Bayes in the Industry

Now that you have an idea of what exactly Naive Bayes is and how it works, let’s see where it is used in the industry.

News Categorization:

News-Categorization-Naive-Bayes-Tutorial

Starting with our first industrial use case, news categorization, or we can use the term text classification to broaden the spectrum of this algorithm. News on the web is growing rapidly, and each news site has its own layout and categorization for grouping news. Companies use a web crawler to extract useful text from the HTML pages of news articles to construct a full-text RSS feed. The contents of each news article are tokenized. In order to achieve a better classification result, we remove the less significant words, i.e. the stop-words, from the document. We then apply the Naive Bayes classifier to classify the news contents based on the news code.
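To make this concrete, here is a purely illustrative sketch (assuming scikit-learn is installed, and using a tiny made-up corpus with hypothetical category labels) of a bag-of-words Naive Bayes text classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus of news snippets and their (made-up) category codes
train_texts = ['stocks rally as the markets open', 'the team wins the championship final']
train_labels = ['business', 'sports']

# Tokenize, drop English stop-words, then fit a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(['final match of the season']))   # most likely ['sports'] on this toy data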

 

Spam Filtering:

Spam-Filtering-Naive-Bayes-Tutorial

Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically use bag-of-words features to identify spam e-mail, an approach commonly used in text classification. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayes’ theorem to calculate the probability that an email is or is not spam.

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the words “Lottery” and “Lucky Draw” in spam email but will seldom see them in other emails. Each word in the email contributes to the email’s spam probability (or only the most interesting words do). This contribution is called the posterior probability and is computed using Bayes’ theorem. Then, the email’s spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam.
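As a toy sketch of that combination step (the word probabilities and priors below are entirely hypothetical; a real filter would estimate them from a labelled e-mail corpus):

# Hypothetical per-word likelihoods learned from a labelled corpus
p_word_given_spam = {'lottery': 0.20, 'luck': 0.15, 'meeting': 0.01}
p_word_given_ham = {'lottery': 0.001, 'luck': 0.01, 'meeting': 0.10}
p_spam, p_ham = 0.4, 0.6   # hypothetical prior class probabilities

def spam_probability(words):
    spam_score, ham_score = p_spam, p_ham
    for w in words:
        # Multiply in each word's likelihood (the naive independence assumption)
        spam_score *= p_word_given_spam.get(w, 0.05)   # small default for unseen words
        ham_score *= p_word_given_ham.get(w, 0.05)
    return spam_score / (spam_score + ham_score)       # Bayes' theorem (normalisation)

print(spam_probability(['lottery', 'luck']))   # close to 1, so the email looks like spam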

 

 

Medical Diagnosis:

Medical-Diagnosis-Naive-Bayes-Tutorial

Nowadays modern hospitals are well equipped with monitoring and other data collection devices resulting in enormous data which are collected continuously through health examination and medical treatment. One of the main advantages of the Naive Bayes approach which is appealing to physicians is that “all the available information is used to explain the decision”. This explanation seems to be “natural” for medical diagnosis and prognosis i.e. is close to the way how physicians diagnose patients.

When dealing with medical data, Naïve Bayes classifier takes into account evidence from many attributes to make the final prediction and provides transparent explanations of its decisions and therefore it is considered as one of the most useful classifiers to support physicians’ decisions.

 

Weather Prediction:

Weather-Prediction-Naive-Bayes-Tutorial

Weather is one of the most influential factors in our daily life, to an extent that it may affect the economy of a country that depends on occupation like agriculture.  Weather prediction has been a challenging problem in the meteorological department for years. Even after the technological and scientific advancement, the accuracy in prediction of weather has never been sufficient.

A Bayesian approach based model for weather prediction is used, where posterior probabilities are used to calculate the likelihood of each class label for an input data instance, and the one with the maximum likelihood is considered the resulting output.


Step By Step Implementation of Naive Bayes

 

Diabetic-Test-Naive-Bayes-Tutorial

 

Here we have a dataset comprising 768 observations of women aged 21 and older. The dataset describes instantaneous measurements taken from patients, such as age, blood workup, and the number of times pregnant. Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years. The values are 1 for Diabetic and 0 for Non-Diabetic.

 

Now, let’s continue our Naive Bayes blog and understand all the steps one by one. I’ve broken the whole process down into the following steps:

  • Handle Data
  • Summarize Data
  • Make Predictions
  • Evaluate Accuracy

 

Step 1: Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the CSV module.


import csv
import math
import random


def loadCsv(filename):
    # Open the CSV file that is passed in (e.g. pima-indians-diabetes.data.csv)
    lines = csv.reader(open(filename))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

 

Now we need to split the data into training and testing dataset.


def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

 

Step 2: Summarize the Data

The summary of the training data collected involves the mean and the standard deviation for each attribute, by class value. These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.

We can break the preparation of this summary data down into the following sub-tasks:

  • Separate Data By Class

def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

  • Calculate Mean

def mean(numbers):
    return sum(numbers)/float(len(numbers))

  • Calculate Standard Deviation

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

  • Summarize Dataset

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

  • Summarize Attributes By Class

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

Step 3: Making Predictions

We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction. We need to perform the following tasks:

  • Calculate Gaussian Probability Density Function

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent

  • Calculate Class Probabilities

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

  • Make a Prediction

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

  • Make Predictions

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

  • Get Accuracy

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet)))*100.0

Finally, we define our main function where we call all these methods we have defined, one by one to get the accuracy of the model we have created.


def main():
    filename = 'pima-indians-diabetes.data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train = {1} and test = {2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()

 

Output:

So here, as you can see, the accuracy of our model is about 66%. Note that this value differs from run to run and also with the split ratio.

Now that we have seen the steps involved in the Naive Bayes classifier, note that Python has a library, scikit-learn (sklearn), which makes all the above-mentioned steps easy to implement and use. Let’s continue our Naive Bayes Tutorial and see how this can be implemented.

 

Naive Bayes with SKLEARN

SKLEARN-Naive-Bayes-Tutorial

For our research, we are going to use the IRIS dataset, which comes with the sklearn library. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Here we are going to use the GaussianNB model, which is already available in the sklearn library.

 

 

Importing Libraries and Loading Datasets


from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

dataset = datasets.load_iris()

 

Creating our Naive Bayes Model using Sklearn

Here we use the GaussianNB() model, which performs exactly the same function as the code explained above.


model = GaussianNB()
model.fit(dataset.data, dataset.target)

 

Making Predictions


expected = dataset.target
predicted = model.predict(dataset.data)

 

Getting Accuracy and Statistics

Here we will create a classification report that contains the various statistics required to judge a model. After that, we will create a confusion matrix which will give us a clear idea of the Accuracy and the fitting of the model.


print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Classification Report:

 

Confusion Matrix:

As you can see, hundreds of lines of code can be summarized into just a few lines with this powerful library.
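If you also want a single accuracy figure, like the one we computed in the from-scratch version, the same metrics module provides it directly (a one-line addition, reusing the expected and predicted arrays from above):

print(metrics.accuracy_score(expected, predicted))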

 

So, with this, we come to the end of this Naive Bayes Tutorial Blog. I hope you enjoyed this blog. If you are reading this, Congratulations! You are no longer a newbie to Naive Bayes. Try out this simple example on your systems now.

Now that you have understood the basics of Naive Bayes, check out the Python Certification Training for Data Science by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. Edureka’s Python course helps you gain expertise in Quantitative Analysis, data mining, and the presentation of data to see beyond the numbers by transforming your career into Data Scientist role. You will use libraries like Pandas, Numpy, Matplotlib, Scikit and master the concepts like Python Machine Learning Algorithms such as Regression, Clustering, Decision Trees, Random Forest, Naïve Bayes and Q-Learning and Time Series. Throughout the Course, you’ll be solving real-life case studies on Media, Healthcare, Social Media, Aviation, HR and so on

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Naive Bayes Classifier: Learning Naive Bayes with Python appeared first on Edureka Blog.

Cloudera Hadoop: Getting started with CDH Distribution


With the increasing demand for Big Data, and with Apache Hadoop at the heart of the revolution, the way we organize and compute data has changed. The need for organizations to align Hadoop with their business needs has fueled the emergence of commercial distributions. Commercial Hadoop distributions are usually packaged with features designed to streamline the deployment of Hadoop. Cloudera Hadoop Distribution provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data in your enterprise.

In this blog on Cloudera Hadoop Distribution, we will be covering the following topics:

Cloudera Hadoop: Introduction to Hadoop

Hadoop is an open-source Apache framework that stores and processes Big Data in a distributed environment across a cluster using simple programming models. Hadoop provides parallel computation on top of distributed storage. To learn about Hadoop in detail, you can refer to this Hadoop tutorial blog.

After this short introduction to Hadoop, let me now explain the different types of Hadoop Distribution.

Cloudera Hadoop: Hadoop Distributions

Since Apache Hadoop is open source, many companies have developed distributions that go beyond the original open-source code. This is very akin to Linux distributions such as Red Hat, Fedora, and Ubuntu. Each Linux distribution supports its own functionalities and features, like the user-friendly GUI in Ubuntu. Similarly, Red Hat is popular within enterprises because it offers support and also the freedom to make changes to any part of the system at will. Red Hat relieves you from software compatibility problems, which is usually a big issue for users who are transitioning from Windows.

Likewise, there are 3 main Hadoop distributions, each with its own set of functionalities and features, all built on top of the base Apache Hadoop (HDFS).

Cloudera vs MapR vs Hortonworks

                                     Fig: MapR vs Hortonworks vs Cloudera

Cloudera Hadoop Distribution

Cloudera is the market leader in the Hadoop space and was the first to release a commercial Hadoop distribution. It also offers consulting services to bridge the gap between what Apache Hadoop provides and what organizations need.

Cloudera Distribution is:

  • Fast for business: From analytics to data science and everything in between, Cloudera delivers the performance you need to unlock the potential of unlimited data.
  • Makes Hadoop easy to manage: With Cloudera Manager, automated wizards let you quickly deploy your cluster, irrespective of the scale or deployment environment.
  • Secure without compromise: Meets stringent data security and compliance needs without sacrificing business agility. Cloudera provides an integrated approach to data security and governance.

Hortonworks Distribution

The Hortonworks Data Platform (HDP) is an entirely open-source platform designed to handle data from many sources and formats. The platform includes various Hadoop tools such as the Hadoop Distributed File System (HDFS), MapReduce, ZooKeeper, HBase, Pig, Hive, and additional components.

It also supports features like:

  • HDP makes Hive faster through its new Stinger project.
  • HDP avoids vendor lock-in by committing to the open-source Apache Hadoop code base rather than a forked version of it.
  • HDP is focused on enhancing the usability of the Hadoop platform.

MapR Distribution

MapR is a platform-focused Hadoop solutions provider, just like Hortonworks and Cloudera. MapR integrates its own database system, known as MapR-DB, while offering Hadoop distribution services. MapR-DB is claimed to be four to seven times faster than the stock Hadoop database, i.e. HBase, running on other distributions.

It has its intriguing features like:

  • It is the only Hadoop distribution that includes Pig, Hive, and Sqoop without any Java dependencies, since it relies on the MapR File System.
  • MapR is regarded as one of the most production-ready Hadoop distributions, with many enhancements that make it more user-friendly, faster, and dependable.

Now let’s discuss the Cloudera Hadoop Distribution in depth.

Subscribe to our YouTube channel to get new updates...

Cloudera Hadoop: Cloudera Distribution 

Cloudera is the best-known player in the Hadoop space and was the first to release a commercial Hadoop distribution.

                                           Fig: Cloudera Hadoop Distribution

Cloudera Hadoop Distribution supports the following set of features:

  1. Cloudera’s CDH comprises all the open source components, targets enterprise-class deployments, and is one of the most popular commercial Hadoop distributions.
  2. Known for its innovations, Cloudera was the first to offer SQL-for-Hadoop with its Impala query engine. 
  3. The management console – Cloudera Manager, is easy to use and implement with the rich user interface displaying all the cluster information in an organized and clean way.
  4. In CDH you can add services to the up and running cluster without any disruption.
  5. Other additions from Cloudera include security features, a user interface, and interfaces for integration with third-party applications. 
  6. CDH provides Node Templates, i.e. it allows the creation of groups of nodes in a Hadoop cluster with varying configurations, eliminating the need to use the same configuration throughout the cluster.
  7. It also supports:
    • Reliability
      Hadoop vendors promptly act in response whenever a bug is detected. With the intent to make commercial solutions more stable, patches and fixes are deployed immediately.
    • Support
      Cloudera Hadoop vendors provide technical guidance and assistance that makes it easy for customers to adopt Hadoop for enterprise level tasks and mission-critical applications.

    • Completeness
      Hadoop vendors couple their distributions with various other add-on tools which help customers customize the Hadoop application to address their specific tasks.

Cloudera's distribution comes in 2 different editions:

  1. Cloudera Express Edition
  2. Cloudera Enterprise Edition

Now let’s look at the differences between them.

Features                                                            Cloudera Express    Cloudera Enterprise

Cluster Management
  1. Multi-Cluster Management                                       Yes                 Yes
  2. Resource Management                                            Yes                 Yes

Deployment
  1. Support for CDH 4 and 5                                        Yes                 Yes
  2. Rolling upgrade of CDH                                         No                  Yes

Service and Configuration Management
  1. Manage HDFS, MapReduce, YARN, Impala, HBase, Hive, Hue,
     Oozie, ZooKeeper, Solr, Spark, and Accumulo services           Yes                 Yes
  2. Rolling restart of services                                    No                  Yes

Security
  1. LDAP Authentication                                            No                  Yes
  2. SAML Authentication                                            No                  Yes

Monitoring and Diagnostics
  1. Health History                                                 Yes                 Yes

Alert Management
  1. Alert via email                                                Yes                 Yes
  2. Alert via SNMP                                                 No                  Yes

Advanced Management Features
  1. Automated backup and recovery                                  No                  Yes
  2. File browsing and searching                                    No                  Yes
  3. MapReduce, Impala, HBase, YARN usage reports                   No                  Yes

Cloudera Hadoop: Cloudera Manager

According to Cloudera, Cloudera Manager is the best way to install, configure, manage, and monitor the Hadoop stack.

It provides:

  1. Automated deployment and configuration
  2. Customizable monitoring and reporting
  3. Effortless robust troubleshooting
  4. Zero-downtime maintenance

Get in-depth Knowledge about Cloudera Hadoop and its various tools

Demonstration of Cloudera Manager

Let’s explore the Cloudera Manager.

1. Below figure shows the number of services that are currently running in the Cloudera Manager. You can also view the charts about cluster CPU usage, Disk IO usage, etc.

                                              Fig: Homepage of Cloudera Manager

2. Below image demonstrates the HBase cluster. It gives you charts and graphs about the health conditions of the currently running HBase REST server.

                                       Fig: Health Conditions of the HBase server

3. Now, let’s have a look at the Instances tab of HBase cluster where you can check the status and the IP configuration.

                              Fig: Status and IP address of the Host Server of the HBase cluster

4. Next, you have Configuration tab. Here you can see all the configuration parameters and change their values.

                                               Fig: Configuration of the HBase cluster

Now, let’s understand what are Parcels in Cloudera.

Cloudera Hadoop: Parcels

A parcel is a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager. 

Parcels are self-contained and installed in a versioned directory, which means that multiple versions of a given service can be installed side-by-side. 

Below are the benefits of using Parcel:

  • It provides distribution of CDH as a single object i.e. instead of having a separate package for each part of CDH, parcels just have a single object to install. 

  • It offers internal consistency (as the complete CDH is distributed as a single parcel, all the CDH components are matched and there will be no risk of different parts coming from different versions of CDH).

  • You can install, upgrade, downgrade, distribute, and activate parcels in CDH with just a few clicks.

Now, let’s see how to install and activate Kafka service in CDH using Parcels.

  1. Go to the Cloudera Manager homepage >> Hosts >> Parcels, as shown below.

                                           Fig: Selecting parcels from the hosts

2. If you do not see Kafka in the list of parcels, you can add its parcel repository to the list.

  1. Find the parcel of the Kafka version you want to use. If you do not see it, you can add the parcel repository to the list.
  2. Find the parcel for the version of Kafka you want to install – Cloudera Distribution of Apache Kafka Versions.
    Below figure demonstrates the same.

                                           Fig: Repository path for the parcel.

3. Copy the link as shown in the above figure and add it to the Remote Parcel Repository as shown below.

                                           Fig: Addition of the Kafka path from the repository

4. After adding the path, Kafka will be ready for download. You can just click on the download button and download the Kafka.

                                                         Fig: Downloading the Kafka

5. Once Kafka is downloaded, all you need to do is to distribute and activate it. 

                                                    Fig: Activating the Kafka

Once it is activated, you can go ahead and view the Kafka in the services tab in Cloudera manager.

                                      Fig: Kafka service

Cloudera Hadoop: Creating an Oozie Workflow

Creating a workflow by manually writing the XML code and then executing it is complicated. You can refer to this Scheduling the Oozie Job blog to learn about the traditional approach.

You can see the below image, where we have written an XML file to create a simple Oozie workflow.

                                                     Fig: Creating an Oozie workflow using a Traditional approach

As you can see, even to create a simple Oozie scheduler we had to write a huge amount of XML, which is time-consuming, and debugging every single line becomes cumbersome. To overcome this, Cloudera introduced Hue, which provides a GUI and simple drag-and-drop features to create and execute Oozie workflows. 

Now let’s see how Hue performs the same task in a simplified way.

Before creating a workflow, let's first create the input files, i.e. clickstream.txt and user.txt.
In the user.txt file, we have User Id, Name, Age, Country, and Gender, as shown below. We need this user file to count the users and their clicks on the URLs (mentioned in the clickstream file) based on the User Id. 

                                                    Fig: Creating a text file

In order to know the number of clicks by each user on each URL, we have a clickstream file containing the User Id and URL.

Fig: Clickstream file

Let’s write the queries in the script file.

                                                           Fig: Script file

After creating the user file, clickstream file, and script file next, we can go ahead and create the Oozie workflow.

1. You can simply drag and drop the Oozie workflow as shown in the image.

                                 Fig: Drag and drop feature of creating the Oozie workflow

2. Soon after dropping your action, you have to specify the path to the script file and add the parameters mentioned in it. Here you need to add the OUTPUT, CLICKSTREAM, and USER parameters and specify the path for each of them.

                    Fig: Adding a script file and the required Parameters to execute the action

3. Once you have specified the paths and added the parameters, now simply save and submit the workflow as shown in the below image.

                                    Fig: Saving and submitting the Oozie action

4. Once you submit the task, your job is done. Execution and the other steps are taken care of by Hue. 

                                              Fig: Execution status of the Oozie job

5. Now that we have executed the Oozie job, let's take a look at the Action tab. It contains the user ID and the status of the workflow. It also shows error codes, if there are any, along with the start and end times of the action item.

                                Fig: Elements present in the action tab of the Oozie workflow

6. Next to the action tab is the details tab. In this, we can see the start time and the last modified time of the job.

                                                  Fig: Details of the Oozie workflow.

7. Next to Details tab, we have the Configuration tab of the workflow.

                                    Fig: Configuration settings of the Oozie workflow

8. While executing the action item, if there are any errors, they will be listed in the Log tab. You can refer to the error statements and debug accordingly.

                                 Fig: Log file that contains error codes and error statements

9. Here is the XML code of the workflow that is automatically generated by Hue.

                                            Fig: XML code of the Oozie workflow

10.1. As you have already specified the path for the output directory in step 2, here you have the output directory in the HDFS Browser as shown below.

                                       Fig: Output directory of the HDFS Browser

10.2 Once you click on the output directory, you will find a text file named output.txt, which contains the actual output, as shown in the below figure.

                    Fig: Final output text

This is how Hue makes our work simple by providing the drag and drop options to create an Oozie workflow.

I hope this blog was useful for understanding the Cloudera Distribution and the different Cloudera Components.

Want to take part in the Big Data revolution?

Now that you have understood Cloudera Hadoop Distribution, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, YARN, MapReduce, Pig, Hive, HBase, Oozie, Flume, and Sqoop using real-time use cases in the Retail, Social Media, Aviation, Tourism, and Finance domains.

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Cloudera Hadoop: Getting started with CDH Distribution appeared first on Edureka Blog.

Docker Networking – Explore How Containers Communicate With Each Other


In today's world, enterprises have become keen on containerization, which requires strong networking skills to properly configure a container architecture. This introduces the concept of Docker Networking.

In this blog on Docker Networking, you will go through the following topics:

What Is Docker?

To understand Docker, you need to know about the history of how applications were deployed before and then how applications are being deployed using containers now.

Deployment Of Applications In Old Way And New Way - Docker Networking - Edureka

As you can see in the above diagram, in the old way applications ran directly on the host, so any number of applications shared the libraries present in that operating system. But with containerization, the operating system kernel is the only thing that is common between all the applications, so applications cannot access each other's libraries.

So, Docker in simple terms is an open platform for developing, shipping, and running applications, enabling the user to separate applications from infrastructures with the help of containers to deliver software quickly.

So, how do these containers communicate with each other in various situations?

Well, that comes through Docker Networking.

Docker Networking

Before I deep dive into Docker Networking let me show you the workflow of Docker.

Docker Workflow - Docker Networking - Edureka

As you can see in the above diagram, a developer writes code that stipulates the application requirements or dependencies in an easy-to-write Dockerfile, and this Dockerfile produces Docker Images. So, whatever dependencies are required for a particular application are present in this image.

Now, Docker Containers are nothing but runtime instances of Docker Images. These images are uploaded onto Docker Hub (a registry for Docker Images, much like a Git repository for code), which contains public and private repositories.

So, you can pull images from public repositories and you can also upload your own images onto Docker Hub. Then, from Docker Hub, various teams such as the Quality Assurance or Production teams will pull that image and prepare their own containers. These individual containers communicate with each other through a network to perform the required actions, and this is nothing but Docker Networking.

So, you can define Docker Networking as a communication passage through which all the isolated containers communicate with each other in various situations to perform the required actions.

What do you think are the goals of Docker Networking?

Goals of Docker Networking

Goals Of Docker Networking - Docker Networking - Edureka

Flexibility – Docker provides flexibility by enabling any number of applications on various platforms to communicate with each other.

Cross-Platform – Docker can easily be used across platforms, working across various servers with the help of Docker Swarm clusters.

Scalability – Docker is a fully distributed network, which enables applications to grow and scale individually while ensuring performance.

Decentralized – Docker uses a decentralized network, which enables applications to be spread out and highly available. In the event that a container or a host is suddenly missing from your pool of resources, you can either bring up an additional resource or pass over to the services that are still available.

User-Friendly – Docker makes it easy to automate the deployment of services, making them easy to use in day-to-day life.

Support – Docker offers out-of-the-box support. The ability to use Docker Enterprise Edition and get all of its functionality in an easy and straightforward way makes the Docker platform very easy to use.

To enable the above goals, you need something known as the Container Network Model.

Want To Explore Various DevOps Stages?

Container Network Model(CNM)

Before I tell you what exactly a Container Network Model is, let me brief you on Libnetwork, which you need to know about before you can understand the CNM.

Libnetwork is an open source Docker library which implements all of the key concepts that make up the CNM. 

Architecture of Container Networking Model - Docker Networking - Edureka

So, the Container Network Model (CNM) standardizes the steps required to provide networking for containers using multiple network drivers. The CNM requires a distributed key-value store, such as Consul, to store the network configuration.

The CNM has interfaces for IPAM plugins and network plugins.

The IPAM plugin APIs are used to create/delete address pools and allocate/deallocate container IP addresses, whereas the network plugin APIs are used to create/delete networks and add/remove containers from networks.

The CNM is mainly built on 5 objects: Network Controller, Driver, Network, Endpoint, and Sandbox.

Container Network Model Objects

Network Controller: Provides the entry-point into Libnetwork that exposes simple APIs for Docker Engine to allocate and manage networks. Since Libnetwork supports multiple inbuilt and remote drivers, Network Controller enables users to attach a particular driver to a given network.

Driver: Owns the network and is responsible for managing the network by having multiple drivers participating to satisfy various use-cases and deployment scenarios.

Network:  Provides connectivity between a group of endpoints that belong to the same network and isolate from the rest. So, whenever a network is created or updated, the corresponding Driver will be notified of the event.

Endpoint: Provides the connectivity for services exposed by a container in a network with other services provided by other containers in the network. An endpoint represents a service and not necessarily a particular container; an endpoint also has a global scope within a cluster.

Sandbox: Created when users request the creation of an endpoint on a network. A Sandbox can have multiple endpoints attached to different networks, representing a container's network configuration such as IP address, MAC address, routes, and DNS.

So, those were the 5 main objects of CNM. 

Now, let me tell you the various network drivers involved in Docker networking.

Want To Take DevOps Learning To A Next Level?

Network Drivers

There are mainly 5 network drivers: Bridge, Host, None, Overlay, and Macvlan.

Bridge: The bridge network is a private default internal network created by docker on the host. So, all containers get an internal IP address and these containers can access each other, using this internal IP. The Bridge networks are usually used when your applications run in standalone containers that need to communicate.

Bridge Network - Docker Networking - Edureka

Host: This driver removes the network isolation between the docker host and the docker containers to use the host’s networking directly. So with this, you will not be able to run multiple web containers on the same host, on the same port as the port is now common to all containers in the host network.

Host Network - Docker Networking - Edureka

None: In this kind of network, containers are not attached to any network and do not have any access to the external network or other containers. So, this network is used when you want to completely disable the networking stack on a container and, only create a loopback device.

None Network - Docker Networking - Edureka

 

Overlay: Creates an internal private network that spans across all the nodes participating in the swarm cluster. So, Overlay networks facilitate communication between a swarm service and a standalone container, or between two standalone containers on different Docker Daemons.

Overlay Network - Docker Networking - Edureka

Macvlan: Allows you to assign a MAC address to a container, making it appear as a physical device on your network. Then, the Docker daemon routes traffic to containers by their MAC addresses. Macvlan driver is the best choice when you are expected to be directly connected to the physical network, rather than routed through the Docker host’s network stack.

Macvlan Network - Docker Networking - Edureka
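If you prefer to experiment with these drivers programmatically, here is a minimal sketch using the Docker SDK for Python (docker-py). It assumes the SDK is installed (pip install docker) and that a local Docker daemon is running; the network and container names are made up for this illustration.

import docker

# Connect to the local Docker daemon
client = docker.from_env()

# Create a user-defined bridge network (bridge is the default driver for standalone containers)
bridge_net = client.networks.create("demo_bridge", driver="bridge")

# Run two containers attached to that network; they can reach each other by container name
c1 = client.containers.run("alpine", "sleep 300", name="demo_c1", detach=True, network="demo_bridge")
c2 = client.containers.run("alpine", "sleep 300", name="demo_c2", detach=True, network="demo_bridge")

# Ping demo_c1 from demo_c2 over the bridge network
exit_code, output = c2.exec_run("ping -c 1 demo_c1")
print(exit_code, output.decode())

# Clean up the demo containers and network
for c in (c1, c2):
    c.remove(force=True)
bridge_net.remove()

Swapping the driver argument lets you try the other drivers described above, subject to the prerequisites each driver has (for example, overlay networks require an initialized swarm).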

Alright, so that was all the theory required to understand Docker Networking. Now, let me move on and show you practically how the networks are created and containers communicate with each other.

Hands-On 

So, with an assumption that all of you have installed Docker on your systems, I have a scenario to showcase.

Suppose you want to store course names and course IDs, for which you will need a web application. Basically, you need one container for the web application and one more container running MySQL for the backend, and that MySQL container should be linked to the web application container. 

How about I execute the above-stated example practically?

Steps involved:

  • Initialize Docker Swarm to form a Swarm cluster.
  • Create an Overlay Network
  • Create services for both web application and MySQL
  • Connect the applications through the network

Let’s get started!

Step 1: Initialize Docker Swarm on the machine.

docker swarm init --advertise-addr 192.168.56.101

Snapshot Of Hands On - Docker Networking - Edureka

The --advertise-addr flag configures the manager node to publish its address as 192.168.56.101. The other nodes in the swarm must be able to access the manager at this IP address.

Step 2: Now, if you want to join worker nodes to this manager, copy the docker swarm join command that you get when you initialize the swarm and run it on each worker node.
Snapshot Of Hands On - Docker Networking - Edureka
Step 3: Create an overlay network.

docker network create -d overlay myoverlaynetwork

Snapshot Of Hands On - Docker Networking - Edureka

Where myoverlaynetwork is the network name and -d specifies the network driver to use, which here is overlay.

Step 4.1: Create a service webapp1 and use the network you have created to deploy this service over the swarm cluster.

docker service create --name webapp1 -d --network myoverlaynetwork -p 8001:80 hshar/webapp

Snapshot Of Hands On - Docker Networking - Edureka

Where -p is for port forwarding, hshar is the account name on Docker Hub, and webapp is the name of the web application already present on Docker Hub.

Step 4.2: Now, check if the service is created or not.

docker service ls

Snapshot Of Hands On - Docker Networking - Edureka

Step 5.1: Now, create a service MySQL and use the network you have created to deploy the service over the swarm cluster.

docker service create --name mysql -d --network myoverlaynetwork -p 3306:3306 hshar/mysql:5.5

Snapshot Of Hands On - Docker Networking - Edureka
Step 5.2: Now, check if the service is created or not.

docker service ls

Snapshot Of Hands On - Docker Networking - Edureka

Step 6.1: After that, check which container is running on your master node and go into the hshar/webapp container.

docker ps

Snapshot Of Hands On - Docker Networking - Edureka

Step 6.2: So, you can see that only the webapp service is on the manager node. So, get into the webapp container.

docker exec -it container_id bash
nano var/www/html/index.php

Snapshot Of Hands On - Docker Networking - Edureka

The docker ps command lists your running containers with their respective container IDs. The docker exec command gets you into the container in interactive mode, and nano opens the index.php file for editing.

Step 7: Now, change $servername from localhost to mysql and $password from “” to “edureka”, fill in the required database details, and save your index.php file using the keyboard shortcut Ctrl+X, then Y to confirm saving, and press Enter.

Snapshot Of Hands On - Docker Networking - Edureka

Step 8: Now, go into the mysql container which is running on another node.

docker exec -it container_id bash

Snapshot Of Hands On - Docker Networking - Edureka

Step 9: Once you go inside the mysql container, enter the below commands to use the database in MySQL.

Step 9.1: Get an access to use the mysql container.

mysql -u root -pedureka

Where -u specifies the user and -p is immediately followed by the password (here, edureka).

Step 9.2: Create a database in mysql which will be used to get data from webapp1.

CREATE DATABASE HandsOn;

Snapshot Of Hands On - Docker Networking - Edureka

Step 9.3: Use the created database.

USE HandsOn;

Snapshot Of Hands On - Docker Networking - Edureka

Step 9.4: Create a table in this database which will be used to get data from webapp1.

CREATE TABLE course_details (course_name VARCHAR(10), course_id VARCHAR(11));

Snapshot Of Hands On - Docker Networking - Edureka

Step 9.5: Now, exit MySQL and then the container as well using the exit command.

Step 10: Go to your browser and enter the address as localhost:8001/index.php. This will open up your web application. Now, enter the details of courses and click on Submit Query.

Snapshot Of Hands On - Docker Networking - Edureka

Step 11: Once you click on Submit Query, go to the node in which your MySQL service is running and then go inside the container.

docker exec -it container_id bash
mysql -u root -pedureka
USE HandsOn;
SHOW tables;
select * from course_details;

This will show you the output for all the courses whose details you have filled in.

Snapshot Of Hands On - Docker Networking - Edureka

Here, I end my Docker Networking blog. I hope you have enjoyed this post. You can check other blogs in the series too, which deal with the basics of Docker.

If you found this Docker Container blog relevant, check out the DevOps training by Edureka, a trusted online learning company with a network of more than 450,000 satisfied learners spread across the globe. The Edureka DevOps Certification Training course helps learners gain expertise in various DevOps processes and tools such as Puppet, Jenkins, Docker, Nagios, Ansible, and GIT for automating multiple steps in SDLC.

Looking For Certification in DevOps?

Got a question for me? Please mention it in the comments section and I will get back to you.

 

 

The post Docker Networking – Explore How Containers Communicate With Each Other appeared first on Edureka Blog.

What is Cryptography? – An Introduction to Cryptographic Algorithms


Encryption is essential because it secures data and information from unauthorized access and thus maintains confidentiality. Here's a blog post to help you understand “what is cryptography” and how it can be used to protect corporate secrets, secure classified information, and guard personal information against things like identity theft. 

Here’s what I have covered in this blog:

What is Cryptography? 

 

Now, I'm going to take the help of an example, or a scenario, to explain what cryptography is.

Let's say there's a person named Andy. Now suppose Andy sends a message to his friend Sam, who is on the other side of the world. Obviously, he wants this message to be private and nobody else should have access to it. He uses a public forum, for example WhatsApp, for sending this message. The main goal is to secure this communication.

sending message over network-what is cryptography-edureka

Let's say there is a smart guy called Eaves who has secretly got access to your communication channel. Since this guy has access to your communication, he can do much more than just eavesdrop; for example, he can try to change the message. Now, this is just a small example. What if Eaves gets access to your private information? The result could be catastrophic.

So how can Andy be sure that nobody in the middle could access the message sent to Sam? That's where encryption, or cryptography, comes in. Let me tell you what cryptography is.


Cybersecurity Is Interesting & Exciting

What Is Cryptography?

Cryptography is the practice and study of techniques for securing communication and data in the presence of adversaries.

encryption meaning-what is cryptography-edureka

Alright, now that you know ” what is cryptography ” let’s see how cryptography can help secure the connection between Andy and Sam.

So, to protect his message, Andy first converts his readable message into an unreadable form. Here, he converts the message to some random numbers. After that, he uses a key to encrypt his message; in cryptography, we call this encrypted message the ciphertext. 

Once Andy sends this ciphertext or encrypted message over the communication channel, he won't have to worry about somebody in the middle discovering his private messages. But suppose Eaves discovers the message and somehow manages to alter it before it reaches Sam.

encryption-what is cryptography-edureka

Now, Sam would need a key to decrypt the message to recover the original plaintext. In order to convert the ciphertext into plain text, Sam would need to use the decryption key. Using the key he would convert the ciphertext or the numerical value to the corresponding plain text.

After using the key for decryption, what comes out is either the original plaintext message or an error. This error is very important: it is how Sam knows that the message he received is not the same as the message Andy sent, i.e. that it was tampered with. Thus, we can say that encryption is important for communicating or sharing information over the network securely.
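To make this round trip concrete, here is a small, hedged sketch using the Fernet recipe from the third-party Python cryptography package (the package must be installed separately, and the message below is purely illustrative):

from cryptography.fernet import Fernet

# Andy and Sam must share this secret key over a safe channel beforehand
key = Fernet.generate_key()
cipher = Fernet(key)

# Andy encrypts his readable message into ciphertext
ciphertext = cipher.encrypt(b"Meet me at the usual place at 6 pm")
print(ciphertext)           # unreadable bytes for anyone without the key

# Sam decrypts the ciphertext back into the original plaintext
plaintext = cipher.decrypt(ciphertext)
print(plaintext.decode())

# If Eaves tampers with the ciphertext in transit, cipher.decrypt() raises an
# InvalidToken error, which is how Sam knows the message was altered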

Now, based on the type of keys and encryption algorithms, cryptography is classified under the following categories:

Encryption Algorithms

Cryptography is broadly classified into two categories: Symmetric key Cryptography and Asymmetric key Cryptography (popularly known as public key cryptography).

encryption algorithms-what is cryptography-edureka

Now Symmetric key Cryptography is further categorized as Classical Cryptography and Modern Cryptography.

Further drilling down, Classical Cryptography is divided into Transposition Cipher and Substitution Cipher. On the other hand, Modern Cryptography is divided into Stream Cipher and Block Cipher.

So, let’s understand these algorithms with examples.

How Do the Various Cryptographic Algorithms Work?

Let's start with symmetric key encryption.

Symmetric Key Cryptography

An encryption system in which the sender and receiver of a message share a single, common key that is used to encrypt and decrypt the message. The most popular symmetric-key system is the Data Encryption Standard (DES).

symmetric key-what is cryptography-edureka

Transposition Ciphers

In Cryptography, a transposition cipher is a method of encryption by which the positions held by units of plaintext (which are commonly characters or groups of characters) are shifted according to a regular system, so that the ciphertext constitutes a permutation of the plaintext.

That is, the order of the units is changed (the plaintext is reordered). Mathematically, a bijective function is used on the characters’ positions to encrypt and an inverse function to decrypt.

Example: 

transposition cipher-what is cryptography-edureka
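To make the idea concrete, here is a minimal columnar transposition sketch in Python; the keyword and message are illustrative (and assume a keyword without repeated letters), not taken from the figure above:

def transposition_encrypt(plaintext, key):
    # Write the message row by row into len(key) columns,
    # then read the columns in the alphabetical order of the key letters
    columns = {letter: [] for letter in key}
    message = plaintext.replace(" ", "")
    for i, ch in enumerate(message):
        columns[key[i % len(key)]].append(ch)
    return "".join("".join(columns[letter]) for letter in sorted(key))

print(transposition_encrypt("defend the east wall", "ZEBRA"))   # a permutation of the plaintext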

Substitution Cipher

Method of encryption by which units of plaintext are replaced with ciphertext, according to a fixed system; the “units” may be single letters (the most common), pairs of letters, triplets of letters, mixtures of the above, and so forth.

Example:

Consider the example shown below: using the system just discussed, the keyword “zebras” gives us the following cipher alphabet:

substitution cipher example-what is cryptography-edureka
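A minimal sketch of such a keyword substitution cipher follows; it builds the cipher alphabet from the keyword "zebras" (keyword letters first, then the remaining letters of the alphabet) and substitutes letter for letter:

import string

def keyword_alphabet(keyword):
    # Keyword letters first (duplicates removed), then the rest of the alphabet in order
    seen = []
    for ch in keyword + string.ascii_lowercase:
        if ch not in seen:
            seen.append(ch)
    return "".join(seen)

def substitute(text, keyword):
    cipher_alphabet = keyword_alphabet(keyword)
    table = str.maketrans(string.ascii_lowercase, cipher_alphabet)
    return text.lower().translate(table)

print(keyword_alphabet("zebras"))        # zebrascdfghijklmnopqtuvwxy
print(substitute("flee at once", "zebras"))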

Stream Cipher

A symmetric or secret-key encryption algorithm that encrypts a single bit (or byte) at a time. With a stream cipher, the same plaintext bit or byte will encrypt to a different bit or byte every time it is encrypted.

Stream cipher-what is cryptography-edureka
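To see the byte-at-a-time behaviour, here is a toy sketch that XORs each plaintext byte with a keystream byte; the seeded pseudo-random keystream only stands in for a real keystream generator and is not a secure cipher:

import random

def xor_stream(data, seed):
    # Derive a reproducible keystream from the seed and XOR it with the data byte by byte
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)

secret = b"stream ciphers work byte by byte"
ciphertext = xor_stream(secret, seed=42)
recovered = xor_stream(ciphertext, seed=42)   # XORing with the same keystream again decrypts
print(ciphertext)
print(recovered)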

Block Cipher

An encryption method that applies a deterministic algorithm along with a symmetric key to encrypt a block of text, rather than encrypting one bit at a time as in stream ciphers.

Block cipher-what is cryptography-edureka

Example: A common block cipher, AES, encrypts 128-bit blocks with a key of predetermined length: 128, 192, or 256 bits. Block ciphers are pseudorandom permutation (PRP) families that operate on fixed-size blocks of bits. PRPs are functions that cannot be distinguished from completely random permutations and are thus considered reliable until proven unreliable.

Asymmetric Key Encryption (or Public Key Cryptography)

The encryption process where different keys are used for encrypting and decrypting the information. The keys are different but are mathematically related, such that retrieving the plaintext by decrypting the ciphertext is feasible.

public-key encryption-what is cryptography-edureka

RSA is the most widely used form of public key encryption.

RSA Algorithm

  • RSA stands for Rivest, Shamir, and Adleman, the inventors of this technique
  • Both the public and private keys are interchangeable
  • Variable key size (512, 1024, or 2048 bits)

Here's how keys are generated in the RSA algorithm:

RSA encryption-what is cryptography-edureka
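Below is a toy sketch of RSA key generation and one encrypt/decrypt round trip, using the classic textbook primes 61 and 53; real RSA uses primes hundreds of digits long, the helper name is my own, and pow with a negative exponent requires Python 3.8+:

from math import gcd

def generate_toy_rsa_keys(p=61, q=53, e=17):
    n = p * q                       # modulus shared by both keys
    phi = (p - 1) * (q - 1)         # Euler's totient of n
    assert gcd(e, phi) == 1         # e must be coprime with phi
    d = pow(e, -1, phi)             # private exponent: modular inverse of e
    return (e, n), (d, n)           # (public key, private key)

public, private = generate_toy_rsa_keys()
message = 65                                          # a small number standing in for the plaintext
ciphertext = pow(message, public[0], public[1])       # c = m^e mod n
decrypted = pow(ciphertext, private[0], private[1])   # m = c^d mod n
print(ciphertext, decrypted)                          # decrypted equals 65 again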

 

Alright, that was it for the “What is Cryptography” blog. To safeguard your information and data shared over the internet, it is important to use strong encryption algorithms and avoid any catastrophic situations.

If you wish to learn cybersecurity and build a colorful career in the field, then check out our Cybersecurity Certification Training, which comes with instructor-led live training and real-life case studies. This training will help you become a cybersecurity expert.

Cybersecurity Is An Important Defense

Got a question for us? Please mention it in the comments section and we will get back to you.

The post What is Cryptography? – An Introduction to Cryptographic Algorithms appeared first on Edureka Blog.


CI CD Pipeline – Learn how to Setup a CI CD Pipeline from Scratch


A CI CD pipeline, or Continuous Integration/Continuous Deployment pipeline, is the backbone of the modern DevOps environment. It bridges the gap between development and operations teams by automating the build, test, and deployment of applications. In this blog, we will learn what a CI CD pipeline is and how it works.

Before moving onto the CI CD pipeline’s working, let’s start by understanding DevOps.

What is DevOps?

What is Devops - CI CD Pipeline - Edureka

DevOps is a software development approach which involves continuous development, continuous testing, continuous integration, continuous deployment, and continuous monitoring of the software throughout its development life cycle. This is exactly the process adopted by all the top companies to develop high-quality software with shorter development life cycles, resulting in greater customer satisfaction, something that every company wants.

DevOps Stages

Your understanding of what DevOps is would be incomplete without learning about its life cycle. Let us now look at the DevOps lifecycle and explore how its stages are related to the software development stages.

DevOps-Stages - CI CD Pipeline - Edureka

CI CD Pipeline Using Jenkins | DevOps Tutorial | Edureka

What is CI CD Pipeline?

CI stands for Continuous Integration and CD stands for Continuous Delivery and Continuous Deployment. You can think of it as a process similar to a software development lifecycle.
Now let us see how it works. 

CI CD Pipeline - CI CD Pipeline - Edureka

The above pipeline is a logical demonstration of how software moves through the various phases or stages of this lifecycle before it is delivered to the customer or goes live in production. 

Let's take a scenario of a CI CD pipeline. Imagine you're going to build a web application which is going to be deployed on live web servers. You will have a set of developers responsible for writing the code that will go on to build the web application. This code is committed into a version control system (such as Git or SVN) by the team of developers, with a proper version tag, and then it goes through the build phase, which is the first phase of the pipeline.

CI CD Pipeline - CI CD Pipeline - Edureka

Suppose we have Java code and it needs to be compiled before execution. So, from the version control phase, the code goes to the build phase, where it gets compiled. You get all the features of that code from the various branches of the repository, merge them, and finally use a compiler to compile it. This whole process is called the build phase.

Testing Phase:

CI CD Pipeline - CI CD Pipeline - Edureka

Once the build phase is over, you move on to the testing phase. In this phase, we have various kinds of testing; one of them is the unit test, where you test a chunk/unit of the software for its sanity.

Deploy Phase:

CI CD Pipeline - CI CD Pipeline - Edureka

When the tests are completed, you move on to the deploy phase, where you deploy the build onto a staging or test server. Here, you can view the code or view the app in a simulator. 

Auto Test Phase:

CI CD Pipeline - CI CD Pipeline - Edureka

Once the code is deployed successfully, you can run another set of a sanity test. If everything is accepted, then it can be deployed to production.

Deploy to Production:

CI CD Pipeline - CI CD Pipeline - Edureka

Meanwhile, at every step, if there is some error, you can shoot a mail back to the development team so that they can fix it. They then push the fix into the version control system and it goes back into the pipeline.

Once again, if any error is reported during testing, the feedback goes back to the dev team, where they fix it, and the process re-iterates if required.

Measure+Validate:

CI CD Pipeline - CI CD Pipeline - Edureka

So, this lifecycle continues until we get code or a product that can be deployed to the production server, where we measure and validate it.

Now that we have understood the CI CD pipeline and how it works, we will move on to understand what Jenkins is and how we can deploy the demonstrated code using Jenkins and automate the entire process.

Learn To Build, Test, Deliver and Deploy Your Applications

Jenkins – The Ultimate CI Tool and Its Importance in CI CD Pipeline

Our task is to automate the entire process, from the time the development team gives us the code and commits it, to the time we get it into production.

In other words, our task is to automate the pipeline so that the entire software development lifecycle runs in DevOps/automated mode. For this, we need automation tools.

Importance of Jenkins in CI CD Pipeline - CI CD Pipeline - Edureka

Jenkins provides us with various interfaces and tools in order to automate the entire process.

So what happens is, we have a Git repository where the development team commits the code. Jenkins takes over from there; it is a front-end tool where you can define your entire job or task. Our job is to ensure the continuous integration and delivery process for that particular application. 

From Git, Jenkins pulls the code and then moves it to the commit phase, where the code is committed from every branch. Then Jenkins moves it into the build phase, where we compile the code. If it is Java code, we use tools like Maven in Jenkins to compile it, and the compiled build can then be deployed to run a series of tests. These test cases are overseen by Jenkins again.

Then it moves on to the staging server, where it is deployed using Docker. After a series of unit tests or sanity tests, it moves on to production.

This is how the delivery phase is taken care of by a tool called Jenkins, which automates everything. Now, in order to deploy the application, we need an environment that replicates the production environment, i.e., Docker.

Docker

Importance of Docker in CI CD Pipeline - CI CD Pipeline - Edureka

Docker is just like a virtual environment in which we can create a server. It takes a few seconds to create an entire server and deploy the artifacts which we want to test. But here the question arises:

Why do we use docker? 

As said earlier, you can bring up the entire cluster in a few seconds. We have a storage registry for images, where you build your image and store it for as long as you need. You can use it anytime, on any environment, and it can replicate itself. 

Subscribe to our youtube channel to get new updates...

Hands-On: Building CI CD Pipeline Using Docker and Jenkins

Step 1: Open the terminal in your VM. Start Jenkins and Docker using the commands “systemctl start jenkins”, “systemctl enable jenkins”, and “systemctl start docker”. 

Note: Use sudo before the commands if they display a privileges error.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 2: Open your Jenkins on your specified port. Click on New Item to create a Job.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 3: Select freestyle project and provide the item name (here I have given Job1) and click OK.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 4: Select Source Code Management and provide the Git repository. Click on Apply and Save button.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 5: Then click on Build->Select Execute Shell

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 6: Provide the shell commands. Here they will build the archive to get a WAR file. After that, they take the code which has already been pulled and use Maven to install the package. So, they simply install the dependencies and compile the application.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 7: Create the new Job by clicking on New Item.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 8: Select freestyle project and provide the item name (here I have given Job2) and click on OK.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 9: Select Source Code Management and provide the Git repository. Click on Apply and Save button.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 10: Then click on Build->Select Execute Shell

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 11: Provide the shell commands. Here it will start the integration phase and build the Docker Container.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 12: Create the new Job by clicking on New Item.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 13: Select freestyle project and provide the item name (here I have given Job3) and click on OK.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 14: Select Source Code Management and provide the Git repository. Click on Apply and Save button.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 15: Then click on Build->Select Execute Shell

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 16: Provide the shell commands. Here it will check for the Docker Container file and then deploy it on port number 8180. Click on Save button.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 17: Now click on Job1 -> Configure.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 18: Click on Post-build Actions -> Build other projects.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 19: Provide the project name to build after Job1 (here is Job2) and then click on Save.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 20: Now click on Job2 -> Configure.

CI CD Pipeline Hands-on - CI CD Pipeline - Edureka

Step 21: Click on Post-build Actions -> Build other projects.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 22: Provide the project name to build after Job2 (here is Job3) and then click on Save.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 23: Now we will be creating a Pipeline view. Click on ‘+’ sign.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

Step 24: Select Build Pipeline View and provide the view name (here I have provided CI CD Pipeline).

CI CD Pipeline Hands-on - CI CD Pipeline - Edureka

Step 25: Select the initial Job (here I have provided Job1) and click on OK.

CI CD Pipeline Hands-on - CI CD Pipeline - Edureka

Step 26: Click on the Run button to start the CI CD process.

CI CD Pipeline Hands-on - CI CD Pipeline - Edureka

Step 27: After a successful build, open localhost:8180/sample.text. It will run the application.

CI CD Pipeline Hands-on - CI CD Pipeline - edureka

So far, we have learned how to create a CI CD pipeline using Docker and Jenkins. The intention of DevOps is to create better-quality software more quickly and with more reliability while inviting greater communication and collaboration between teams. Check out the DevOps training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka DevOps Certification Training course helps learners understand what DevOps is and gain expertise in various DevOps processes and tools such as Puppet, Jenkins, Nagios, Ansible, Chef, Saltstack and GIT for automating multiple steps in the SDLC.

Get In-depth Knowledge of DevOps along with its Industry Level Use Cases

Got a question for us? Please mention it in the comments section of ” CI CD Pipeline” blog and we will get back to you ASAP.

The post CI CD Pipeline – Learn how to Setup a CI CD Pipeline from Scratch appeared first on Edureka Blog.

RDDs in PySpark – Building Blocks Of PySpark


Apache Spark is one of the best frameworks when it comes to Big Data analytics. When this powerful technology is integrated with a simple yet efficient language like Python, it gives us an extremely handy and easy-to-use API called PySpark. In this article, I am going to throw some light on one of the building blocks of PySpark called the Resilient Distributed Dataset, more popularly known as the PySpark RDD. 

By the end of this PySpark RDD tutorial, you would have an understanding of the below topics:

Why RDDs?

Iterative distributed computing, i.e., processing data over multiple jobs, requires reusing and sharing data among them. Before RDDs came into the picture, frameworks like Hadoop faced difficulty in processing multiple operations/jobs, and a stable, distributed intermediate data store such as HDFS or Amazon S3 was needed. These media for data sharing made various computations possible, like logistic regression, K-means clustering, page rank algorithms, and ad-hoc queries. But nothing comes for free: this kind of data sharing leads to slow processing because of multiple I/O operations like replication and serialization. This scenario is depicted below:

Shared Memory - PySpark RDDs - Edureka

Thus, there was a need for something which can overcome the issue of multiple I/O operations through data sharing and reduce its number. This is where RDDs exactly fit into the picture.

You may go through the webinar recording of PySpark RDDs, where our instructor has explained the topics in a detailed manner with various examples.

PySpark RDD Tutorial | PySpark Online Training | Edureka

What are PySpark RDDs?

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets a programmer perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are considered to be the backbone of PySpark. They are one of the pioneering, fundamentally schema-less data structures that can handle both structured and unstructured data. In-memory data sharing makes RDDs 10-100x faster than network and disk sharing.

In-built Memory - PySpark RDDs - Edureka

Now you might be wondering how it works. Well, the data in an RDD is split into chunks based on a key. RDDs are highly resilient, i.e., they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes. Thus, even if one executor node fails, another will still process the data. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes. 

Partitions - PySpark RDDs - Edureka

Moreover, once you create an RDD, it becomes immutable. By immutable I mean an object whose state cannot be modified after it is created; it can, however, be transformed into new RDDs.

Before I move ahead with this PySpark RDD tutorial, let me lay down a few more intriguing features of PySpark RDDs.

Features Of RDDs

PySpark RDD Features - PySpark RDDs - Edureka

  1. In-Memory Computations: They improve performance by orders of magnitude.
  2. Lazy Evaluation: All transformations in RDDs are lazy, i.e., they don't compute their results right away.
  3. Fault Tolerance: RDDs track data lineage information to rebuild lost data automatically.
  4. Immutability: Data can be created or retrieved anytime and, once defined, its value can't be changed.
  5. Partitioning: This is the fundamental unit of parallelism in PySpark RDDs.
  6. Persistence: Users can reuse PySpark RDDs and choose a storage strategy for them.
  7. Coarse-Grained Operations: These operations are applied to all elements in a dataset through map, filter, or group-by operations.

In the next section of PySpark RDD Tutorial, I will introduce you to the various operations offered by PySpark RDDs.

RDD Operations in PySpark

RDD supports two types of operations namely:

  1. Transformations: These are the operations which are applied to an RDD to create a new RDD. Transformations follow the principle of Lazy Evaluations (which means that the execution will not start until an action is triggered). This allows you to execute the operations at any time by just calling an action on the data. Few of the transformations provided by RDDs are:
    • map
    • flatMap
    • filter
    • distinct
    • reduceByKey
    • mapPartitions
    • sortBy
  2. Actions: Actions are the operations which are applied on an RDD to instruct Apache Spark to apply computation and pass the result back to the driver. Few of the actions include:
    • collect
    • collectAsMap
    • reduce
    • countByKey/countByValue
    • take
    • first

Let me help you create an RDD in PySpark and apply a few operations on it.

Creating and displaying an RDD

myRDD = sc.parallelize([('JK', 22), ('V', 24), ('Jimin',24), ('RM', 25), ('J-Hope', 25), ('Suga', 26), ('Jin', 27)])
myRDD.take(7)

output 1 - PySpark RDDs - Edureka

Reading data from a text file and displaying the first 4 elements

New_RDD = sc.textFile("file:///home/edureka/Desktop/Sample")
New_RDD.take(4)

output 2 - PySpark RDDs - Edureka

Changing the minimum number of partitions and mapping the data from a list of strings to a list of lists

CSV_RDD = (sc.textFile("file:///home/edureka/Downloads/fifa_players.csv", minPartitions= 4).map(lambda element: element.split("\t")))
CSV_RDD.take(3)

output 3 - PySpark RDDs - Edureka

Counting the total number of rows in RDD

CSV_RDD.count()

count output - PySpark RDDs - Edureka

Creating a function to convert the data into lower case and splitting it

def Func(lines):
    lines = lines.lower()
    lines = lines.split()
    return lines

Split_rdd = New_RDD.map(Func)
Split_rdd.take(5)

output 4 - PySpark RDDs - Edureka

Creating a new RDD with flattened data and filtering out the ‘stopwords’ from the entire RDD

stopwords = ['a','all','the','as','is','am','an','and','be','been','from','had','I','I’d','why','with']
RDD = New_RDD.flatMap(Func)
RDD1 = RDD.filter(lambda x: x not in stopwords)
RDD1.take(4)

output 5 - PySpark RDDs - Edureka

Filtering the words starting with ‘c’

import re
filteredRDD = RDD.filter(lambda x: x.startswith('c'))
filteredRDD.distinct().take(50)

output 6 - PySpark RDDs - Edureka

Grouping the data by key and then sorting it

rdd_mapped = RDD.map(lambda x: (x,1))
rdd_grouped = rdd_mapped.groupByKey()
rdd_frequency = rdd_grouped.mapValues(sum).map(lambda x: (x[1],x[0])).sortByKey(False)
rdd_frequency.take(10)

output 8 - PySpark RDDs - Edureka

Creating RDDs with key-value pair

a = sc.parallelize([('a',2),('b',3)])
b = sc.parallelize([('a',9),('b',7),('c',10)])

Performing Join operation on the RDDs

c = a.join(b)
c.collect()

output 9 - PySpark RDDs - Edureka

Creating an RDD and performing a lambda function to get the sum of elements in the RDD

num_rdd = sc.parallelize(range(1,5000))
num_rdd.reduce(lambda x,y: x+y)

output 10 - PySpark RDDs - Edureka

Using the ReduceByKey transformation to reduce the data

data_keydata_key = sc.parallelize([('a', 4),('b', 3),('c', 2),('a', 8),('d', 2),('b', 1),('d', 3)],4)
data_keydata_key.reduceByKey(lambda x, y: x + y).collect()

output 11 - PySpark RDDs - Edureka

Saving the data in a text file

RDD3.saveAsTextFile("file:///home/edureka/Desktop/newoutput.txt")

Sorting the data based on a key

test = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
sc.parallelize(test).sortByKey(True, 1).collect()

Sort By Key Output - PySpark RDD - Edureka

Performing Set Operations

##Creating two new RDDs

rdd_a = sc.parallelize([1,2,3,4])
rdd_b = sc.parallelize([3,4,5,6])

  • Intersection
rdd_a.intersection(rdd_b).collect()

output 12 - PySpark RDDs - Edureka

  • Subtraction
rdd_a.subtract(rdd_b).collect()

output 13 - PySpark RDDs - Edureka

  • Cartesian
rdd_a.cartesian(rdd_b).collect()

output 14 - PySpark RDDs - Edureka

  • Union
rdd_a.union(rdd_b).collect()

output 8 - PySpark RDDs - Edureka

Subscribe to our YouTube channel to learn more..!

I hope you are familiar with PySpark RDDs by now. So let’s dive deeper and see how you can use these RDDs to solve a real-life use case.

PySpark RDD Use Case

WebPage Ranking - PySpark RDDs - Edureka

Problem Statement

You have to calculate the page rank of a set of web pages based on the illustrated webpage system. Below is a diagram representing four web pages, Amazon, Google, Wikipedia, and Youtube, in our system. For the ease of access, let’s name them a,b,c, and d respectively. Here, the web page ‘a’ has outbound links to pages b, c, and d. Similarly, page ‘b’ has an outbound link to pages d and c. Web page ‘c’ has an outbound link to page b, and page ‘d’ has an outbound link to pages a and c.

Web Page System - PySpark RDDs - Edureka

Solution

To solve this, we will be implementing the page-rank algorithm that was developed by Sergey Brin and Larry Page. This algorithm helps in determining the rank of a particular web page within a group of web pages. The higher the page rank, the higher the page will appear in a search result list, and thus the more relevant it will be.

The contribution to page rank is given by the following formula:

Page Contribution Formula - PySpark RDDs - Edureka

Let me break it down for you:

PRt+1(Pi) = page rank of page Pi after iteration t+1

PRt(Pj) = page rank of the inbound page Pj at iteration t

C(Pj) = number of outbound links on page Pj

In our problem statement, it is shown that the web page ‘a’ has three outbound links. So, according to the algorithm, the contribution to page rank of page d by page a is PR(a) / 3. Now we have to calculate the contribution of page b to page d. Page b has two outbound links: the first to page c, and the second to page d. Hence, the contribution by page b is PR(b) / 2.

So the page rank of page d will be updated as follows, where s is known as the damping factor :

PR(d) = 1 – s + s × (PR(a)/3 + PR(b)/2)
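
To make the update concrete, here is a small hand calculation written as a plain Python sketch (assuming every page starts with a rank of 1 and s = 0.85, the same values used in the code that follows):

# One page-rank update for page d, starting from a rank of 1 for every page
s = 0.85                               # damping factor
PR_a, PR_b = 1.0, 1.0                  # current ranks of the inbound pages a and b
contribution = PR_a / 3 + PR_b / 2     # a has 3 outbound links, b has 2
PR_d = (1 - s) + s * contribution
print(PR_d)                            # 0.15 + 0.85 * 0.8333... = about 0.858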

Let’s now execute this using PySpark RDDs.

##Creating Nested Lists of Web Pages with Outbound Links
pageLinks = [['a', ['b','c','d']],
['c', ['b']],['b', ['d','c']],['d', ['a','c']]]

##Initializing Rank #1 to all the webpages
pageRanks = [['a',1],['c',1],['b',1],['d',1]]

##Defining the function that returns the contribution to the page rank for a list of URIs
def rankContribution(uris, rank):
    numberOfUris = len(uris)
    rankContribution = float(rank) / numberOfUris
    newrank = []
    for uri in uris:
        newrank.append((uri, rankContribution))
    return newrank

##Creating paired RDDs of link data
pageLinksRDD = sc.parallelize(pageLinks, 2)
pageLinksRDD.collect()

webrank output 1 - PySpark RDDs - Edureka

##Creating the paired RDD of our rank data 
pageRanksRDD = sc.parallelize(pageRanks, 2)
pageRanksRDD.collect()

webrank output 2 - PySpark RDDs - Edureka

##Defining the number of iterations and the damping factor, s
numIter = 20
s = 0.85

##Creating a loop for updating the page rank
for i in range(numIter):
    linksRank = pageLinksRDD.join(pageRanksRDD)
    contributedRDD = linksRank.flatMap(lambda x: rankContribution(x[1][0], x[1][1]))
    sumRanks = contributedRDD.reduceByKey(lambda v1, v2: v1 + v2)
    pageRanksRDD = sumRanks.map(lambda x: (x[0], (1 - s) + s * x[1]))

pageRanksRDD.collect()

output final - PySpark RDDs - Edureka

This gives us the result that ‘c’ has the highest page rank followed by ‘a’, ‘d’ and ‘b’.

With this, we come to an end of this PySpark RDD. Hope it helped in adding some value to your knowledge.

Get In-depth Knowledge of PySpark & its Diverse Applications

If you found this PySpark RDD blog relevant, check out the PySpark Certification Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.
Got a question for us? Please put those on our edureka community and our experts will revert at the earliest!

The post RDDs in PySpark – Building Blocks Of PySpark appeared first on Edureka Blog.

Why Edureka’s Pedagogy results in a steep learning curve


When you set out to learn a new skill, the biggest hurdle is finding the relevant content. After finding it, a bigger hurdle is, learning it thoroughly.

Without a dedicated instructor, doubts that arise midway go unanswered and your takeaway is a set of half-learned concepts. Even if you enrol in a physical classroom (which is old-school), the instructor guiding you may not be the best in the world; maybe just the best in that area.

Let's be honest here: half-good is never good enough, especially when it's career-related.

This haphazard learning is a global trend. The majority of people experience unstructured learning amidst chaos, which benefits neither themselves nor the company they work for.

This prompted us to establish law and order (pun intended), and the result is Edureka's training model. In this blog, I'll tell you how we came up with Edureka's pedagogy and the reason why this is what you need.

Key factors influencing Edureka’s Pedagogy

Lack of discipline

We all love our comfort zone, don’t we? Staying there will render no benefits and that is the ugly truth. A famous quote goes like this,

A comfort zone is a beautiful place. But nothing ever grows there.

Similarly, when you self-learn, you’ll procrastinate by putting off your learning schedule. However, with an online course driven by an instructor, you will be motivated to learn because of reminders, batchmates and the investment made.

Alignment with Industry needs

When you’re self-learning (recorded videos/ others), you’ll be learning anything and everything. And you may miss out on skills the industry expects. Self-learning in a way is like ‘one blind man leading the other’.

At Edureka, the course curriculum is tailor-made to fit industry needs. You may ask how?
Well, after analyzing the JDs & requirements of 15+ companies, followed by multiple iterations and improvements suggested by industry experts, we finalize our Course Curriculum. This way, you get the best of both.

Course Completion Rate

The prime reason Edureka promotes instructor-led online training is the complacency that creeps into learners otherwise. The proof is our learner report, which shows that self-paced (recorded) videos are rarely watched to completion: multiple distractions turn them into passive learning, and with no interaction the videos become boring, so learners eventually drop out.

With an instructor-led session, not only will the sessions be more interesting, but you’ll also be able to clear doubts immediately, interact with other learners in the class and make friends for life.

These factors lead to a higher course completion rate, which means you learn the most through this training model. Edureka's Live Classes boast an impressive 80% completion rate.

Complete your learning with Edureka

Lack of exposure to Practicals

When you're learning on your own, you can run through the theory alright, but what about the practicals? You will not have the infrastructure to perform hands-on exercises. Theoretical knowledge will only take you so far. To get a strong hold on the concepts, exposure to practicals is a must!

At Edureka, we give utmost importance to practicals. The same can be noticed from our course schedule. 10 hrs for completing the assignments and 10 hrs for working on the practicals is a mandate for learning well.

As far as infrastructure for performing practicals is concerned, we’ll be providing access to a platform called ‘Cloud Labs’ for 3 months (2160 hrs). This is a replica of the actual work environment, and you can take this as a good experience before starting work.

Real-life use cases

Unless you’re told about success stories, you’ll never realize the magnitude of impact a technology has had on mankind. While self-learning, real-life use cases will not even feature in your list of topics.

We on the other hand will be sharing a number of case studies about the technology, companies using it and the level of success they attained. Reading it will make you more equipped and prepared for practicing them on real projects.

For perfect learning, one rule must be diligently followed, Learn One, Learn Well.

For more details on why an Edureka course will be your life-changer, read this blog on 7 Important Characteristics Of An Effective Online IT Training.

The post Why Edureka’s Pedagogy results in a steep learning curve appeared first on Edureka Blog.

PySpark Programming – Integrating Speed With Simplicity


Python and Apache Spark are the hottest buzzwords in the analytics industry. Apache Spark is a popular open source framework that ensures data processing with lightning speed and supports various languages like Scala, Python, Java, and R. It then boils down to your language preference and scope of work. Through this PySpark programming article, I would be talking about Spark with Python to demonstrate how Python leverages the functionalities of Apache Spark.

Before we embark on our journey of PySpark Programming, let me list down the topics that I will be covering in this article:


So, let's get started with the first topic on our list, i.e., PySpark Programming.

PySpark Programming

PySpark is the collaboration of Apache Spark and Python.

Apache Spark is an open-source cluster-computing framework built around speed, ease of use, and streaming analytics, whereas Python is a general-purpose, high-level programming language that provides a wide range of libraries and is majorly used for Machine Learning and Real-Time Streaming Analytics.

In other words, it is a Python API for Spark that lets you harness the simplicity of Python and the power of Apache Spark in order to tame Big Data. 

PySpark - PySpark Programming - Edureka

You might be wondering why I chose Python to work with Spark when other languages are available. To answer this, I have listed a few of the advantages that you will enjoy with Python:

  • Python is very easy to learn and implement.
  • It provides a simple and comprehensive API.
  • With Python, code readability, maintainability, and familiarity are far better.
  • It provides various options for data visualization, which is difficult using Scala or Java.
  • Python comes with a wide range of libraries like numpy, pandas, scikit-learn, seaborn, matplotlib etc.
  • It is backed by a huge and active community.

Now that you know the advantages of PySpark programming, let’s simply dive into the fundamentals of PySpark.

PySpark Programming | PySpark Training | Edureka

Resilient Distributed Datasets (RDDs)

RDDs are the building blocks of any Spark application. RDD stands for:

  • Resilient: It is fault tolerant and is capable of rebuilding data on failure.
  • Distributed: Data is distributed among the multiple nodes in a cluster.
  • Dataset: Collection of partitioned data with values.

It is a layer of abstracted data over the distributed collection. It is immutable in nature and follows lazy evaluation.

With RDDs, you can perform two types of operations:

  1. Transformations: These operations are applied to create a new RDD.
  2. Actions: These operations are applied on an RDD to instruct Apache Spark to apply computation and pass the result back to the driver.
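
As a quick illustration of both (a minimal sketch, assuming an active SparkContext named sc as in the rest of this article), a transformation such as map only describes a new RDD lazily, and nothing executes until an action such as collect is called:

numbers = sc.parallelize([1, 2, 3, 4])      # create an RDD
squares = numbers.map(lambda x: x * x)      # transformation: lazily builds a new RDD
print(squares.collect())                    # action: triggers execution -> [1, 4, 9, 16]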

DataFrame

A DataFrame in PySpark is a distributed collection of structured or semi-structured data. The data in a DataFrame is stored in rows under named columns, similar to relational database tables or Excel sheets.

It also shares some common attributes with RDDs: it is immutable in nature, follows lazy evaluation and is distributed. It supports a wide range of formats like JSON, CSV, TXT and many more. You can also create one from existing RDDs or by programmatically specifying the schema, as shown in the sketch below.
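
For instance, a DataFrame can be created directly from a list of rows (a minimal sketch, assuming an active SparkSession named spark; the names and values here are purely illustrative):

from pyspark.sql import Row

# Build a DataFrame from a list of Row objects
people = [Row(name='JK', age=22), Row(name='V', age=24)]
people_df = spark.createDataFrame(people)
people_df.printSchema()     # shows the inferred schema: name (string), age (long)
people_df.show()            # displays the rows under the named columns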

PySpark SQL

PySpark SQL is a higher-level abstraction module over PySpark Core. It is majorly used for processing structured and semi-structured datasets. It also provides an optimized API that can read data from various data sources containing different file formats. Thus, with PySpark you can process the data by making use of SQL as well as HiveQL. Because of this feature, PySpark SQL is slowly gaining popularity among database programmers and Apache Hive users.
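
As a small illustration (reusing the hypothetical people_df DataFrame from the sketch above), a DataFrame can be registered as a temporary view and then queried with plain SQL:

# Register the DataFrame as a temporary SQL view and query it
people_df.createOrReplaceTempView('people')
adults = spark.sql("SELECT name, age FROM people WHERE age > 22")
adults.show()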

Subscribe to our YouTube channel to get new updates...

PySpark Streaming

PySpark Streaming is a scalable, fault-tolerant system that follows the RDD batch paradigm. It basically operates on mini-batches, with batch intervals that can range from 500 ms to larger windows.

In this, Spark Streaming receives a continuous input data stream from sources like Apache Flume, Kinesis, Kafka, TCP sockets etc. These streamed data are then internally broken down into multiple smaller batches based on the batch interval and forwarded to the Spark Engine. Spark Engine processes these data batches using complex algorithms expressed with high-level functions like map, reduce, join and window. Once the processing is done, the processed batches are then pushed out to databases, filesystems, and live dashboards.

Pyspark Streaming - PySpark Programming - Edureka

The key abstraction for Spark Streaming is the Discretized Stream (DStream). DStreams are built on RDDs, allowing Spark developers to work within the same context of RDDs and batches while solving streaming problems. Moreover, Spark Streaming also integrates with MLlib, SQL, DataFrames, and GraphX, which widens your horizon of functionalities. Being a high-level API, Spark Streaming provides fault-tolerant "exactly-once" semantics for stateful operations.

NOTE: “exactly-once” semantics means events will be processed “exactly once” by all operators in the stream application, even if any failure occurs.

The below diagram represents the basic components of Spark Streaming.

Spark Streaming Components - PySpark Programming - Edureka

As you can see, Data is ingested into the Spark Stream from various sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and many more. Further, this data is processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, this processed data is pushed out to various file systems, databases, and live dashboards for further utilization.
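
A classic illustration of this flow is the streaming word count below, a minimal sketch assuming an active SparkContext sc and a text source listening on localhost:9999 (for example, started with nc -lk 9999):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                         # 5-second batch interval
lines = ssc.socketTextStream('localhost', 9999)       # ingest a stream from a TCP socket
counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # count words within each batch
counts.pprint()                                       # print the first results of every batch
ssc.start()                                           # start receiving and processing data
ssc.awaitTermination()                                # keep running until stopped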

I hope this gave you a clear picture of how PySpark Streaming works. Let’s now move on to the last but most enticing topic of this PySpark Programming article, i.e. Machine Learning.

Machine Learning

As you already know, Python is a mature language that has been heavily used for data science and machine learning for ages. In PySpark, machine learning is facilitated by a Python library called MLlib (Machine Learning Library). It is nothing but a wrapper over PySpark Core that performs data analysis using machine-learning algorithms like classification, clustering, linear regression and a few more.

One of the enticing features of machine learning with PySpark is that it works on distributed systems and is highly scalable.

MLlib exposes three core machine learning functionalities with PySpark:

  1. Data Preparation: It provides various features like extraction, transformation, selection, hashing etc.
  2. Machine Learning Algorithms: It avails some popular and advanced regression, classification, and clustering algorithms for machine learning.
  3. Utilities: It has statistical methods such as chi-square testing, descriptive statistics, linear algebra and model evaluation methods.

Let me show you how to implement machine learning using classification through logistic regression.

Here, I will be performing a simple predictive analysis on a food inspection data of Chicago City.

##Importing the required libraries
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *

##creating a RDD by importing and parsing the input data
def csvParse(s):
    import csv
    from StringIO import StringIO
    sio = StringIO(s)
    value = csv.reader(sio).next()
    sio.close()
    return value

food_inspections = sc.textFile('file:////home/edureka/Downloads/Food_Inspections_Chicago_data.csv')\
.map(csvParse)

##Display data format
food_inspections.take(1)

output 1 - PySpark Programming - Edureka

#Structuring the data
schema = StructType([
StructField("id", IntegerType(), False),
StructField("name", StringType(), False),
StructField("results", StringType(), False),
StructField("violations", StringType(), True)])
#creating a dataframe and a temporary table (Results) required for the predictive analysis. 
##sqlContext is used to perform transformations on structured data
ins_df = spark.createDataFrame(food_inspections.map(lambda l: (int(l[0]), l[1], l[12], l[13])) , schema)
ins_df.registerTempTable('Count_Results')
ins_df.show()

output 2 - PySpark Programming - Edureka

##Let's now understand our dataset
#show the distinct values in the results column
result_data = ins_df.select('results').distinct().show()

output 3 - PySpark Programming - Edureka

##converting the existing dataframe into a new dataframe 
###each inspection is represented as a label-violations pair. 
####Here 0.0 represents a failure, 1.0 represents a success, and -1.0 represents some results besides those two
def label_Results(s):
    if s == 'Fail':
        return 0.0
    elif s == 'Pass with Conditions' or s == 'Pass':
        return 1.0
    else:
        return -1.0
ins_label = UserDefinedFunction(label_Results, DoubleType())
labeled_Data = ins_df.select(ins_label(ins_df.results).alias('label'), ins_df.violations).where('label >= 0')
labeled_Data.take(1)

output 4 - PySpark Programming - Edureka

##Creating a logistic regression model from the input dataframe
tokenizer = Tokenizer(inputCol="violations", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(labeled_Data)
## Evaluating with Test Data

test_Data = sc.textFile('file:////home/edureka/Downloads/Food_Inspections_test.csv')\
.map(csvParse) \
.map(lambda l: (int(l[0]), l[1], l[12], l[13]))
test_df = spark.createDataFrame(test_Data, schema).where("results = 'Fail' OR results = 'Pass' OR results = 'Pass with Conditions'")
predict_Df = model.transform(test_df)
predict_Df.registerTempTable('Predictions')
predict_Df.columns

output 5 - PySpark Programming - Edureka

## Printing 1st row
predict_Df.take(1)

output 6 - PySpark Programming - Edureka

## Predicting the final result
numOfSuccess = predict_Df.where("""(prediction = 0 AND results = 'Fail') OR
(prediction = 1 AND (results = 'Pass' OR
results = 'Pass with Conditions'))""").count()
numOfInspections = predict_Df.count()
print "There were", numOfInspections, "inspections and there were", numOfSuccess, "successful predictions"
print "This is a", str((float(numOfSuccess) / float(numOfInspections)) * 100) + "%", "success rate"

output 7 - PySpark Programming - Edureka

With this, we come to the end of this blog on PySpark Programming. Hope it helped in adding some value to your knowledge.

Learn to analyse the data using Python with Spark!

If you found this PySpark Programming blog relevant, check out the PySpark Certification Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.
Got a question for us? Please mention it in the comments section and we will get back to you.

The post PySpark Programming – Integrating Speed With Simplicity appeared first on Edureka Blog.

Object Detection Tutorial in TensorFlow: Real-Time Object Detection


Creating accurate Machine Learning models capable of identifying and localizing multiple objects in a single image has remained a core challenge in computer vision. But, with recent advancements in Deep Learning, Object Detection applications are easier to develop than ever before. TensorFlow's Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. So guys, in this Object Detection Tutorial, I'll be covering the following topics:

Real-Time Object Detection with TensorFlow | Edureka 

What is Object Detection?

Object Detection is the process of finding real-world object instances like car, bike, TV, flowers, and humans in still images or Videos. It allows for the recognition, localization, and detection of multiple objects within an image which provides us with a much better understanding of an image as a whole. It is commonly used in applications such as image retrieval, security, surveillance, and advanced driver assistance systems (ADAS).

Object Detection can be done via multiple ways:

  • Feature-Based Object Detection
  • Viola Jones Object Detection
  • SVM Classifications with HOG Features
  • Deep Learning Object Detection

In this Object Detection Tutorial, we’ll focus on Deep Learning Object Detection as Tensorflow uses Deep Learning for computation.

Detection-Object Detection Tutorial

Subscribe to our youtube channel to get new updates..!

Let's move forward with our Object Detection Tutorial and understand its various applications in the industry.

Applications Of Object Detection

Facial Recognition:

Face-Recognition-Object Detection Tutorial

A deep learning facial recognition system called "DeepFace" was developed by a group of researchers at Facebook; it identifies human faces in digital images very effectively. Google uses its own facial recognition system in Google Photos, which automatically segregates all the photos based on the person in the image. There are various components involved in Facial Recognition like the eyes, nose, mouth and the eyebrows.

 

People Counting:

People-Count-Object Detection Tutorial

Object detection can also be used for people counting; it is used for analyzing store performance or crowd statistics during festivals. These use cases tend to be more difficult as people move out of the frame quickly.

 

It is a very important application, as during crowd gathering this feature can be used for multiple purposes.

 

Industrial Quality Check:

Quality-checks-Object Detection Tutorial

Object detection is also used in industrial processes to identify products. Finding a specific object through visual inspection is a basic task that is involved in multiple industrial processes like sorting, inventory management, machining, quality management, packaging etc. 

Inventory management can be very tricky as items are hard to track in real time. Automatic object counting and localization allows improving inventory accuracy.

 

Self Driving Cars:

Self-Driving-Car-Object Detection Tutorial

Self-driving cars are the future; there's no doubt about that. But the working behind them is very tricky, as they combine a variety of techniques to perceive their surroundings, including radar, laser light, GPS, odometry, and computer vision.

Advanced control systems interpret sensory information to identify appropriate navigation paths as well as obstacles, and once the image sensor detects any sign of a living being in its path, the car automatically stops. This happens at a very fast rate and is a big step towards driverless cars.

 

Security:

security-Object Detection Tutorial

Object Detection plays a very important role in Security. Be it face ID of Apple or the retina scan used in all the sci-fi movies.

It is also used by the government to access the security feed and match it with their existing database to find any criminals or to detect the robbers’ vehicle.

The applications are limitless.

 

 

Object Detection Workflow

Every Object Detection Algorithm has a different way of working, but they all work on the same principle.

Feature Extraction: They extract features from the input images at hand and use these features to determine the class of the image, be it through MATLAB, OpenCV, Viola-Jones or Deep Learning.

Now that you have understood the basic workflow of Object Detection, let's move ahead in this Object Detection Tutorial and understand what TensorFlow is and what its components are.

 

What is TensorFlow?

Tensorflow is Google’s Open Source Machine Learning Framework for dataflow programming across a range of tasks. Nodes in the graph represent mathematical operations, while the graph edges represent the multi-dimensional data arrays (tensors) communicated between them.

TensorFlow-Object Detection Tutorial

Tensors are just multidimensional arrays, an extension of 2-dimensional tables to data with a higher dimension. There are many features of TensorFlow which make it appropriate for Deep Learning.
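
To make the graph idea concrete, here is a minimal TensorFlow 1.x sketch (the same graph-and-session style used in the demo below, and purely illustrative): the constants and the matmul node are operations in the graph, and the values flowing between them are tensors.

import tensorflow as tf

# Two constant nodes and a multiplication node in the default graph
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 tensor (rank-2 array)
b = tf.constant([[2.0], [1.0]])             # a 2x1 tensor
product = tf.matmul(a, b)                   # the edge between nodes is the tensor produced by matmul

with tf.Session() as sess:                  # nothing is computed until a session runs the graph
    print(sess.run(product))                # [[ 4.], [10.]]

So, without wasting any time, let's see how we can implement Object Detection using TensorFlow.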

Object Detection Tutorial 

Getting Prerequisites

  • Before working on the Demo, let’s have a look at the prerequisites. We will be needing:

 

Setting up the Environment

  • Now, to download TensorFlow and TensorFlow GPU, you can use pip or conda commands:
# For CPU
pip install tensorflow
# For GPU
pip install tensorflow-gpu

 

  • For all the other libraries we can use pip or conda to install them. The code is provided below:
pip install --user Cython
pip install --user contextlib2
pip install --user pillow
pip install --user lxml
pip install --user jupyter
pip install --user matplotlib

 

  • Next, we have Protobuf: Protocol Buffers (Protobuf) are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data; think of it like XML, but smaller, faster, and simpler. You need to download Protobuf version 3.4 or above for this demo and extract it.
  • Now you need to clone or download TensorFlow's Models repository from GitHub. Once downloaded and extracted, rename the "models-master" folder to just "models".
  • Now for simplicity, we are going to keep “models” and “protobuf” under one folder “Tensorflow“.
  • Next, we need to go inside the Tensorflow folder and then inside research folder and run protobuf from there using this command:
"path_of_protobuf's bin"./bin/protoc object_detection/protos/
  • To check whether this worked or not, you can go to the protos folder inside models>object_detection>protos and there you can see that for every proto file there’s one python file created.

 

Main Code

Coding-Object Detection Tutorial

After the environment is set up, you need to go to the “object_detection” directory and then create a new python file. You can use Spyder or Jupyter to write your code.

  • First of all, we need to import all the libraries
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

sys.path.append("..")
from object_detection.utils import ops as utils_ops

from utils import label_map_util

from utils import visualization_utils as vis_util

 

Next, we will download the model, which is trained on the COCO dataset. COCO stands for Common Objects in Context; this dataset contains around 330K labeled images. Model selection is important, as you need to make a tradeoff between speed and accuracy. Depending upon your requirement and the system memory, the correct model must be selected.

The "models > research > object_detection > g3doc > detection_model_zoo" file lists all the models with their different speeds and accuracies (mAP).

Object Detection Tutorial

 

  • Next, we provide the required model and the frozen inference graph generated by Tensorflow to use.
MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

NUM_CLASSES = 90

 

  • This code will download that model from the internet and extract the frozen inference graph of that model.
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
  file_name = os.path.basename(file.name)
  if 'frozen_inference_graph.pb' in file_name:
    tar_file.extract(file, os.getcwd())

detection_graph = tf.Graph()
with detection_graph.as_default():
  od_graph_def = tf.GraphDef()
  with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
    serialized_graph = fid.read()
    od_graph_def.ParseFromString(serialized_graph)
    tf.import_graph_def(od_graph_def, name='')

 

  • Next, we are going to load all the labels

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

  • Now we will convert the images data into a numPy array for processing.
def load_image_into_numpy_array(image):
  (im_width, im_height) = image.size
  return np.array(image.getdata()).reshape(
      (im_height, im_width, 3)).astype(np.uint8)

 

  • The path to the images for the testing purpose is defined here. Here we have a naming convention “image[i]” for i in (1 to n+1), n being the number of images provided.
PATH_TO_TEST_IMAGES_DIR = 'test_images'
TEST_IMAGE_PATHS = [ os.path.join(PATH_TO_TEST_IMAGES_DIR, 'image{}.jpg'.format(i)) for i in range(1, 8) ]

 

  • This code runs the inference for a single image: it detects the objects, draws boxes, and provides the class and the class score of each detected object.

def run_inference_for_single_image(image, graph):
  with graph.as_default():
    with tf.Session() as sess:
    # Get handles to input and output tensors
      ops = tf.get_default_graph().get_operations()
      all_tensor_names = {output.name for op in ops for output in op.outputs}
      tensor_dict = {}
      for key in [
          'num_detections', 'detection_boxes', 'detection_scores',
          'detection_classes', 'detection_masks'
     ]:
        tensor_name = key + ':0'
        if tensor_name in all_tensor_names:
          tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(
            tensor_name)
      if 'detection_masks' in tensor_dict:
        # The following processing is only for single image
        detection_boxes = tf.squeeze(tensor_dict['detection_boxes'], [0])
        detection_masks = tf.squeeze(tensor_dict['detection_masks'], [0])
        # Reframe is required to translate mask from box coordinates to image coordinates and fit the image size.
        real_num_detection = tf.cast(tensor_dict['num_detections'][0], tf.int32)
        detection_boxes = tf.slice(detection_boxes, [0, 0], [real_num_detection, -1])
        detection_masks = tf.slice(detection_masks, [0, 0, 0], [real_num_detection, -1, -1])
        detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(
            detection_masks, detection_boxes, image.shape[0], image.shape[1])
        detection_masks_reframed = tf.cast(
            tf.greater(detection_masks_reframed, 0.5), tf.uint8)
        # Follow the convention by adding back the batch dimension
        tensor_dict['detection_masks'] = tf.expand_dims(
            detection_masks_reframed, 0)
      image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')

      # Run inference
      output_dict = sess.run(tensor_dict,
          feed_dict={image_tensor: np.expand_dims(image, 0)})

      # all outputs are float32 numpy arrays, so convert types as appropriate
      output_dict['num_detections'] = int(output_dict['num_detections'][0])
      output_dict['detection_classes'] = output_dict[
          'detection_classes'][0].astype(np.uint8)
      output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
      output_dict['detection_scores'] = output_dict['detection_scores'][0]
      if 'detection_masks' in output_dict:
        output_dict['detection_masks'] = output_dict['detection_masks'][0]
  return output_dict

 

  • Our final loop calls all the functions defined above and runs the inference on all the input images one by one. It provides us with output images in which the objects are detected, along with labels and the percentage/score of how similar each object is to the training data.

for image_path in TEST_IMAGE_PATHS:
  image = Image.open(image_path)
  # the array based representation of the image will be used later in order to prepare the
  # result image with boxes and labels on it.
  image_np = load_image_into_numpy_array(image)
  # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
  image_np_expanded = np.expand_dims(image_np, axis=0)
  # Actual detection.
  output_dict = run_inference_for_single_image(image_np, detection_graph)
  # Visualization of the results of a detection.
  vis_util.visualize_boxes_and_labels_on_image_array(
      image_np,
      output_dict['detection_boxes'],
      output_dict['detection_classes'],
      output_dict['detection_scores'],
      category_index,
      instance_masks=output_dict.get('detection_masks'),
      use_normalized_coordinates=True,
      line_thickness=8)
  IMAGE_SIZE = (12, 8)   # display size of the output figure; IMAGE_SIZE is not defined earlier in this snippet
  plt.figure(figsize=IMAGE_SIZE)
  plt.imshow(image_np)


Detected-Pics-Object Detection Tutorial

 

Now, let’s move ahead in our Object Detection Tutorial and see how we can detect objects in Live Video Feed.

Discover the Hype about AI and Deep Learning

Live Object Detection Using Tensorflow

For this demo, we will use the same code, but we'll make a few tweaks. Here we are going to use OpenCV and the camera module to use the live feed of the webcam to detect objects.

  • Add the OpenCV library and the camera being used to capture images. Just add the following lines to the import library section.
import cv2
cap = cv2.VideoCapture(0)

 

  • Next, we don't need to load the images from the directory and convert them to a NumPy array, as OpenCV will take care of that for us.

Remove This

for image_path in TEST_IMAGE_PATHS:
image = Image.open(image_path)
# the array based representation of the image will be used later in order to prepare the
# result image with boxes and labels on it.
image_np = load_image_into_numpy_array(image)

With

while True:
    ret, image_np = cap.read()

 

  • We will not use matplotlib for final image show instead, we will use OpenCV for that as well. Now, for that,

Remove This


plt.figure(figsize=IMAGE_SIZE)
plt.imshow(image_np)

With

cv2.imshow('object detection', cv2.resize(image_np, (800,600)))
if cv2.waitKey(25) & 0xFF == ord('q'):
  cv2.destroyAllWindows()
  break

This code uses OpenCV and, in turn, the camera object initialized earlier to show the detections in a new window named "object detection", resized to 800×600. cv2.waitKey(25) waits 25 milliseconds for a key press between frames; when the 'q' key is pressed, the window is closed and the loop breaks.

 

Final Code with all the changes:


import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

import cv2
cap = cv2.VideoCapture(0)

sys.path.append("..")

from utils import label_map_util

from utils import visualization_utils as vis_util

MODEL_NAME = 'ssd_mobilenet_v1_coco_11_06_2017'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Path to frozen detection graph. This is the actual model that is used for the object detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

NUM_CLASSES = 90

opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
  file_name = os.path.basename(file.name)
  if 'frozen_inference_graph.pb' in file_name:
    tar_file.extract(file, os.getcwd())

detection_graph = tf.Graph()
with detection_graph.as_default():
  od_graph_def = tf.GraphDef()
  with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
    serialized_graph = fid.read()
    od_graph_def.ParseFromString(serialized_graph)
    tf.import_graph_def(od_graph_def, name='')

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    while True:
      ret, image_np = cap.read()
      # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
      image_np_expanded = np.expand_dims(image_np, axis=0)
      image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
      # Each box represents a part of the image where a particular object was detected.
      boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
      # Each score represents the level of confidence for each of the objects.
      # Score is shown on the result image, together with the class label.
      scores = detection_graph.get_tensor_by_name('detection_scores:0')
      classes = detection_graph.get_tensor_by_name('detection_classes:0')
      num_detections = detection_graph.get_tensor_by_name('num_detections:0')
      # Actual detection.
      (boxes, scores, classes, num_detections) = sess.run(
          [boxes, scores, classes, num_detections],
          feed_dict={image_tensor: image_np_expanded})
      # Visualization of the results of a detection.
      vis_util.visualize_boxes_and_labels_on_image_array(
          image_np,
          np.squeeze(boxes),
          np.squeeze(classes).astype(np.int32),
          np.squeeze(scores),
          category_index,
          use_normalized_coordinates=True,
          line_thickness=8)

      cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))
      if cv2.waitKey(25) & 0xFF == ord('q'):
        cv2.destroyAllWindows()
        break

 

Live-Object-Detection-Object Detection Tutorial

 

Now with this, we come to an end to this Object Detection Tutorial. I Hope you guys enjoyed this article and understood the power of Tensorflow, and how easy it is to detect objects in images and video feed. So, if you have read this,  you are no longer a newbie to Object Detection and TensorFlow. Try out these examples and let me know if there are any challenges you are facing while deploying the code.

Now that you have understood the basics of Object Detection, check out the AI and Deep Learning With Tensorflow by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. This Certification Training is curated by industry professionals as per the industry requirements & demands. You will master the concepts such as SoftMax function, Autoencoder Neural Networks, Restricted Boltzmann Machine (RBM) and work with libraries like Keras & TFLearn.

Got a question for us? Please mention it in the comments section of “Object Detection Tutorial” and we will get back to you.

Get in-depth Knowledge of Tensorflow and it's Applications

The post Object Detection Tutorial in TensorFlow: Real-Time Object Detection appeared first on Edureka Blog.

Amazon CloudWatch – A monitoring tool by Amazon


Currently, more and more organisations are trying to digitally transform their business by migrating to the cloud. This isn't surprising, but it does pose a challenge to the IT teams who are responsible for the "effective" delivery of cloud services, as well as for the impact on the business if these services are impaired. So, how do they make sure that the services are not impaired? The simple answer is by employing a cloud monitoring tool.

In this blog we will be discussing about one such versatile monitoring tool called Amazon CloudWatch. The topics which I will be covering in this blog are as follows:

  1. Why Do We Need Cloud Based Monitoring?
  2. Introducing Amazon CloudWatch
  3. Amazon CloudWatch In Action
  4. What Is CloudWatch Events?
  5. What Is CloudWatch Logs?
  6. Benefits Of Amazon CloudWatch.

Why Do We Need Cloud Based Monitoring?

Cloud monitoring is a broad category that includes monitoring of web and cloud applications, infrastructure, networks, platform, application, and micro-services. Monitoring is crucial to make sure that all the services which you are using on cloud are running smoothly and efficiently.

Have a look at image below. What are your thoughts regarding this image?

Cloudwatch-Edureka

The image depicts two scenarios.

Scenario 1: You have deployed a messenger application on the cloud and you have a set of questions, which are as follows:

  • How much bandwidth does my application use on a day to day basis?
  • What does my website's traffic look like?
  • How is the performance of my app on cloud?
  • Are customers satisfied with the current features of my app or do I have to make any improvements?

But you don't have answers to any of the above questions, since you are not using any kind of monitoring platform. So you have no idea whether your app needs any improvement, because of which the sales and revenue of your product have decreased rapidly.

Scenario 2: You have deployed an application on Cloud and you have same set of questions again. You are using a  monitoring tool to keep a track of your application’s health.

So you know how well your application is performing on Cloud. You are now better-equipped to track down and address bottlenecks, improving your application’s up-time and performance. This in-turn encourages your users to engage more with your business!

A little bit of monitoring can go a long way to helping your business grow!

While it is still possible to build high-level tools to track and monitor the overall state of the AWS environment, as the system grows larger it becomes complex to carry out manual monitoring. So Amazon provides a versatile monitoring tool called Amazon CloudWatch that enables robust monitoring of AWS infrastructure for us. Now let us explore CloudWatch in detail.

Want To Be A Certified AWS Architect?

What Is Amazon CloudWatch?

Amazon CloudWatch is the component of Amazon Web Services that provides real time monitoring of AWS resources and customer applications running on Amazon infrastructure.

The following image shows the different AWS resources monitored by Amazon CloudWatch.

Amazon CloudWatch -Edureka

Amazon CloudWatch allows administrators to easily monitor multiple instances and resources from one console by performing the below tasks:

  • Enables robust monitoring of resources like :
    1. Virtual instances hosted in Amazon EC2
    2. Databases located in Amazon RDS
    3. Data stored in Amazon S3
    4. Elastic Load Balancer
    5. Auto-Scaling Groups
    6. Other resources
  • Monitors, stores and provides access to system and application log files
  • Provides a catalog of standard reports that you can use to analyze trends and monitor system performance
  • Provides various alert capabilities, including rules and triggers, raises high-resolution alarms and sends notifications
  • Collects and provides a real-time presentation of operational data in the form of key metrics like CPU utilization, disk storage, etc.

Now we know why users choose CloudWatch: for its automatic integration with AWS services, its flexibility, and its ability to scale quickly. But how does Amazon CloudWatch achieve this?

Amazon CloudWatch In Action

Before learning how Amazon CloudWatch operates there are certain primary concepts that you need to know. Lets have a look at those concepts.

Metrics

  • Metrics represents a time-ordered set of data points that are published to CloudWatch
  • You can relate metric to a variable that is being monitored and data points to the value of that variable over time
  • Metrics are uniquely defined by a name, a namespace, and zero or more dimensions
  • Each data point has a time-stamp.

Dimensions

  • A dimension is a name/value pair that uniquely identifies a metric
  • Dimensions can be considered as categories of characteristics that describe a metric
  • Because dimensions are unique identifiers for a metric, whenever you add a unique name/value pair to one of your metrics, you are creating a new variation of that metric.

Statistics

  • Statistics are metric data aggregations over specified periods of time
  • Aggregations are made using the namespace, metric name, dimensions within the time period you specify
  • Few available statistics are maximum, minimum, sum, average and sample count.

Alarm

  • An alarm can be used to automatically initiate actions on your behalf
  • It watches a single metric over a specified time period, and performs one or more specified actions
  • The action is simply a notification that is sent to an Amazon SNS topic.
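
To see how the metric, dimension and statistic concepts map onto code, here is a minimal boto3 sketch (an illustrative assumption on my part, not part of the original walkthrough; it assumes boto3 is installed, AWS credentials are configured, and uses a hypothetical instance ID):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Average CPUUtilization of one EC2 instance over the last hour, in 5-minute periods
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',                     # namespace of the metric
    MetricName='CPUUtilization',             # metric name
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # hypothetical instance ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                              # aggregation period in seconds
    Statistics=['Average'],                  # the statistic to compute
)
for point in stats['Datapoints']:
    print(point['Timestamp'], point['Average'])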

Now lets have a look at how Amazon CloudWatch works. The following diagram shows the conceptual view of how CloudWatch provides robust monitoring.

AWS-CloudWatch-Edureka

Amazon CloudWatch has system wide visibility into your AWS resources and applications. It will monitor your resource files and generate key metrics based on your application’s log files. Key metrics include CPU usage, CPU latency, Network traffic, Disk storage etc. Based on these metrics it provides a real-time summary of system activity and individual resources.

CloudWatch also provides a comprehensive at-a-glance view of AWS infrastructure to keep track of application performance, spot trends and troubleshoot operational issues. In addition Amazon CloudWatch configures high resolution alarms and sends real time notifications in case of sudden operational changes in AWS environment.

Now that you are familiar with Amazon CloudWatch concepts and its operation let’s have a look at how you can use Amazon CloudWatch to monitor your Amazon EC2 instance.

Use Case: Configure Amazon CloudWatch to send notification when CPU Utilization of an instance is lower than 15%.

Lets go through various steps involved.

Step 1 : Creating a CPU utilization metric

  • Go to Amazon CloudWatch Management Console and select metrics from navigation pane.

CloudWatch-Metrics-Edureka

  • On the metrics page type CPU Utilization in the search bar.
  • From the displayed list of instances choose the instance for which you want to create a metric.

CloudWatch-Metrics-Edureka

Step 2 : Creating an alarm to notify when CPU Utilization metric of the instance is lower than 15%

  • Now select the Graphed Metrics option on same page. Then set the time period according to your need. And choose alarm icon located beside the selected instance.

CloudWatch-Metrics-Edureka

  • Configure the alarm in the displayed dialog box. Give your alarm a name and description. Set the Threshold condition.

You want AWS to send you an email notification whenever the alarm condition is satisfied. The notification is sent through an Amazon SNS topic.

  • Select the New List option if you want to add a new email recipient, or, if you want to use an existing one, choose Enter List and enter the name of the SNS topic.

CloudWatch-Alarm-Edureka

  • Click Create Alarm.

Congratulations, you have successfully configured an Amazon CloudWatch alarm to monitor your instance. You will receive a notification through an e-mail at the address you specified when the alarm condition is met.
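
The same alarm can also be created programmatically. Here is a minimal boto3 sketch of the idea (an illustrative assumption, not part of the original console walkthrough; the instance ID and SNS topic ARN are hypothetical placeholders):

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Alarm when the average CPUUtilization drops below 15% for one 5-minute period
cloudwatch.put_metric_alarm(
    AlarmName='low-cpu-utilization',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],   # hypothetical instance ID
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=15.0,
    ComparisonOperator='LessThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:my-topic'],          # hypothetical SNS topic ARN
)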

Want To Take Your 'Cloud' Knowledge To Next Level?

Now we will talk about the two most important segments of Amazon CloudWatch, which are :

  • Amazon CloudWatch Events
  • Amazon CloudWatch Logs

Amazon CloudWatch Events

Amazon CloudWatch Events deliver a real-time stream of system events from AWS resources to AWS Lambda functions, Amazon SNS Topics, Amazon SQS queues and other target types.

CloudWatch Events enable you to create a set of rules that you can match certain events with. Then you can route these events to one or more targets like Lambda Function, SNS Topic etc. Whenever there are operational changes in your AWS environment, CloudWatch Events capture these changes and perform remedial actions by sending notifications, activating Lambda functions etc.

Let’s talk about certain topics that you need to understand before using CloudWatch Events.

Events 

An event indicates a change in the AWS environment. AWS resources generate events when their state changes. Amazon allows you to generate your own custom application-level events and publish them to CloudWatch Events.

Rules

Rules are nothing but constraints. They evaluate every incoming event to determine whether it matches the rule's pattern; if it does, the event is routed to one or more targets for processing. A single rule can route to multiple targets, all of which are processed in parallel.

Targets

A target processes events. Targets can include Amazon EC2 instances, AWS Lambda functions, Kinesis streams, Amazon ECS tasks, Amazon SNS topics, Amazon SQS queues, and built-in targets. A target receives events in JSON format.   

Now lets have a look at  situations where we can use Amazon CloudWatch Events.

Use Case 1: You can log the changes in the state of a Amazon EC2 instance by using CloudWatch Events with assistance of AWS Lambda function.

UseCase1-Edureka

Use Case 2: You can log the object-level API operations on your S3 buckets using CloudWatch Events. But prior to that you should use AWS CloudTrail to set up a trail configured to receive these operations.                                        

CloudWatch-Events-Edureka

Well, these are just two use cases that I have specified here so that you have an idea of the capability of Amazon CloudWatch Events. To describe Amazon CloudWatch Events in one sentence: it is a service that allows you to track changes to your AWS resources with less overhead and more efficiency.
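
For the first use case, the rule and target can be wired up with a few boto3 calls. The sketch below is only an illustration under assumptions (boto3 installed, credentials configured, and a hypothetical Lambda function ARN), not the exact setup from the use-case diagrams:

import boto3

events = boto3.client('events', region_name='us-east-1')

# Rule that matches EC2 instance state-change events
events.put_rule(
    Name='ec2-state-change',
    EventPattern='{"source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"]}',
    State='ENABLED',
)

# Route matching events to a Lambda function (hypothetical ARN);
# the Lambda function must also grant CloudWatch Events permission to invoke it.
events.put_targets(
    Rule='ec2-state-change',
    Targets=[{'Id': 'log-state-change',
              'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:logEc2StateChange'}],
)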

Amazon CloudWatch Logs

Amazon CloudWatch Logs is used to monitor, store and access log files from AWS resources like Amazon EC2 instances, Amazon CloudTrail, Route53 and others.

Let's take a look at a few basic concepts of Amazon CloudWatch Logs. The below list gives an overview of those concepts.

  • Log Events: A log event is a record of some activity recorded by the application or resource being monitored.
  • Log Streams: A log stream is a sequence of log events that share the same source. It represents the sequence of events coming from an application instance.
  • Log Groups: Log groups represent groups of log streams that share the same retention, monitoring, and access control settings. Each log stream has to belong to one log group.

 

With Amazon CloudWatch Logs you can troubleshoot your system errors and maintain and store the respective log files automatically. You can configure an alarm so that a notification is sent when some error occurs in your system log. You can then troubleshoot the errors within minutes by accessing the original log data stored by CloudWatch Logs. Moreover, you can use Amazon CloudWatch Logs to (see the short sketch after this list):

  • Store your log data in highly durable storage
  • Monitor your application log files in real-time for specific phrases, values or patterns
  • Log information about the DNS queries that Route 53 receives
  • Adjust the retention policy for each log group by choosing a retention period between one day and 10 years.
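
The basic Logs workflow of creating a log group and a stream, pushing events and setting a retention policy looks roughly like the following boto3 sketch (an illustrative assumption with hypothetical names, not part of the original blog):

import boto3, time

logs = boto3.client('logs', region_name='us-east-1')

# Create a log group and a log stream, then push a single log event
logs.create_log_group(logGroupName='/myapp/web')                       # hypothetical group name
logs.create_log_stream(logGroupName='/myapp/web', logStreamName='instance-1')
logs.put_log_events(
    logGroupName='/myapp/web',
    logStreamName='instance-1',
    logEvents=[{'timestamp': int(time.time() * 1000), 'message': 'User login failed'}],
)

# Keep the log data for 30 days
logs.put_retention_policy(logGroupName='/myapp/web', retentionInDays=30)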

Now that we have a foundation in Amazon CloudWatch, let's go ahead and look at a few reasons why it is such a popular cloud monitoring tool.

Benefits of Amazon CloudWatch

  • Amazon CloudWatch allows you to access all your data from single platform. It is natively integrated with more than 70 AWS services. Vodafone company uses Amazon CloudWatch with Auto Scaling groups to monitor CPU usage and to scale from three Amazon EC2 instances to nine during peak periods automatically.
  • Provides real time insights so that you can optimize operational costs and AWS resources. Kellogg company uses Amazon CloudWatch for monitoring, which helps the company make better decisions around the capacity they need, so that they can avoid wastage.
  • Provides complete visibility across your applications, infrastructure stack and AWS services. Atlassian uses Amazon CloudWatch  to monitor RAM usage and bandwidth, so they can more easily optimize their application.

These are just few benefits of using CloudWatch. If you want to know more about it then take a look at below video on Amazon CloudWatch by Edureka. 

So this is it! I hope this blog was informative and added value to your knowledge. Now you know what Amazon CloudWatch is and how you can employ it to monitor your applications and resources that are currently active on the cloud. If you are interested in taking your knowledge of Amazon Web Services to the next level, then enrol for the AWS Architect Certification Training course by Edureka.

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Amazon CloudWatch – A monitoring tool by Amazon appeared first on Edureka Blog.

Cloud Computing Services: A Deeper Dive Into Cloud Computing


"Cloud Computing is not only the future of computing but the present and the entire past of computing," says Ellison, Co-founder and former CEO of Oracle. 'Cloud Computing' has become quite a buzzword these days. It has evolved from personal cloud storage to organizations moving their entire data to the cloud. We can see an acceleration in the adoption of Cloud Computing Services every year, a trend that won't cease anytime soon.

Through this blog, I will help you understand what Cloud Computing is and different types of Cloud Computing Services available to us.

Before that, let me give you quick insight as to what you will be learning in this blog. I will be covering the below  mentioned topics in detail.

Let’s get started!

What Is Cloud Computing?

Cloud Computing, in layman's terms, refers to computing over the internet. In other words, it provides a means for you to store and access your data and applications over the internet.

Cloud Computing can be defined as a model that delivers on-demand, self-sufficient Cloud Computing Services like:

  • Virtualization
  • Storage
  • Network
  • Operating System
  • Middleware
  • Databases
  • Security
  • Applications                                                                                                                                    

through a Wide Area Network (WAN) or a dedicated network. Users can utilize these services with little or no interaction with service providers.

We have some major companies delivering the Cloud Computing Services. Some notable examples include the following:

  1. Amazon: Amazon Web Services (AWS) is one of the leading Cloud Computing Service providers, offering a wide set of infrastructure services like database storage, computing power, networking, etc.
  2. Google: Google Cloud Platform allows clients to build, test, and deploy applications on Google’s highly-scalable and reliable infrastructure.
  3. Microsoft: Microsoft Azure is used for deploying, designing and managing the applications through a worldwide network.
Want To Be A Certified AWS Architect?

Now, the way these different Cloud Computing Services are delivered to users differs based on the users' requirements. Cloud Computing provides users with three distinct types of services via the internet.

Types Of Cloud Computing Services

First let us go through the definition of each Cloud Computing Service type:

SaaS(Software-as-a-Service): 

SaaS provides clients with the ability to use software applications over the internet on a subscription basis. Clients can access the applications from anywhere via the web.

Examples: Google Applications and Salesforce.

PaaS(Platform-as-a-Service): 

PaaS provides a platform where the clients can deploy their own applications and host them. The client is free from hassles of setting up infrastructure, managing storage, servers, network etc.

Examples: Amazon Web Services and Rackspace.

IaaS(Infrastructure-as-a-Service): 

IaaS provides just the hardware and network; the clients install and manage the software and applications themselves.

Examples: IBM, Google and Amazon Web Services.

Now that we have gone through the definitions, let us go ahead and understand each of these Cloud Computing Services in detail with the help of a use case. Consider a scenario where you have made travel plans and decided on a car as your mode of transport. Based on your requirements, you have three options to choose from:

  1. Take a taxi (SaaS)
  2. Hire a car (PaaS)
  3. Lease a car (IaaS)

SaaS(Software-as-a-Service)

Use case: Suppose you choose to take a taxi. The car agency is responsible for financing and servicing the car. Besides that, they take care of insurance and road tax. The driver and fuel requirements are taken care of as well. You just need to pay for your ride.

Similarly, a Software-as-a-Service provider delivers software applications over the Internet, on demand and typically on a subscription basis. You just need to pay for the service you are utilizing. The entire software and hardware stack is hosted by the provider and made available to users over a Wide Area Network (WAN) like the Internet or other dedicated networks.

SaaS eliminates the need for hardware acquisition, provisioning and maintenance, as well as software licensing, installation and support. It also provides scalability, flexible payments and automatic updates.

Examples: Google Applications like Gmail, Google Docs.

Cloud-Computing-Services-SAAS-Edureka

PaaS(Platform-as-a-Service)

Use case: Suppose you plan to travel to a nearby place and decide to rent a car. You then have to take care of fuel, road tolls and hiring a driver. The rest of the work, like financing the car, servicing, insurance, road tax and garaging, is the responsibility of the car rental agency.

Likewise, a Platform-as-a-Service provider offers core computing services like storage, virtualization and networking. In addition, it hosts the OS, middleware frameworks and other development services such as web services, database management systems and SDKs compatible with various programming languages. The service provider builds and renders a secure and optimized environment on which users can install their applications and data sets.

The prime benefits of this type of service are its simplicity and convenience for users: Platform-as-a-Service users can focus on creating and running applications rather than constructing and maintaining the underlying infrastructure stack and services.

Examples: Google app Engine, Microsoft Azure, Salesforce.

Cloud-Computing-Services-PAAS-Edureka

IaaS(Infrastructure-as-a-Service)

Use case: Suppose you make travel plans to a faraway place and choose to lease a car. Here you have to take care of servicing the car, road tax, insurance and garage requirements, pay for fuel and road tolls, and hire a driver. Most of the work is done by you; the car agency takes care of just the finance related to leasing the car.

Similarly, an Infrastructure-as-a-Service provider offers end users bare computing resources like storage capacity, virtualization, networking, security and maintenance on a pay-as-you-use basis. The users are no longer concerned with the location and purchase cost of the hardware. Furthermore, the IaaS provider supplies additional services that complement the above features, like load balancing, billing details, data backup, recovery and storage.

Users of the IaaS model handle most of the workload, such as installing, maintaining and managing the software layers.

Example: Amazon AWS, Rackspace, Flexiscale and Google Cloud Platform are some well known IaaS providers.

Cloud-Computing-Services-IaaS-Edureka

The below picture summarises what we have learnt about cloud computing services.

Cloud-Computing-Services-Edureka

There are certain features that all these three Cloud Computing Service models have in common. Some of them are listed below.

Cloud Computing Service Features And Benefits 

Provider’s Responsibility   

The cloud service provider purchases, hosts and maintains either a part or complete infrastructure stack, necessary software and hardware in their own facility. As a result service users are spared from the complexity of dealing with the hardware and software on-premise.                                                                                

Pay-for-Use                                                                                                                                       

Service users pay only for the resources and services they use. By doing so, they can maximise cost savings, unlike the traditional approach where the user has to pay the complete cost irrespective of usage.

Limitless Scalability                                                 

Cloud computing service providers usually provide the infrastructure in such a way as to meet increasing demand. Resources can be scaled up and scaled down according to enterprise requirements.

Migration Facility and Workload Resilience 

Cloud computing makes it easy to move data. Moreover, cloud computing service users need not worry about losing data, since the cloud provides multiple data backups.

Self Service Provisioning                                                                                                             

End users can scale resources up and down depending on their business needs, update the services they are currently using, manage billing details, etc., with little or no interaction with the cloud provider.

Want To Take Your 'Cloud' Knowledge To Next Level?

We have learnt about the different Cloud Computing Services and their features. Usually, these Cloud Computing Services are made available to users via various deployment models. Each deployment model is identified by specific features that support the user’s service requirements. Let us learn about the types of Cloud Deployment Models in detail.

Cloud Deployment Models

There are 3 fundamental Deployment Models of cloud computing: Public Cloud, Private Cloud and Hybrid Cloud.

Public cloud

In the Public Cloud model, services and infrastructure are hosted on the premises of the cloud provider and are provisioned for open use by the general public. End users can access the services via a public network like the internet. Public Cloud services are mostly delivered on demand and are popular for hosting everyday apps like email, CRM and other business support apps.

The Public Cloud model offers high scalability and automated maintenance, but it is more vulnerable to attacks due to its high level of accessibility.

Common Public Cloud providers include Amazon Web Services and Microsoft Azure.

Private Cloud

The Private Cloud model provides cloud services and infrastructure exclusively to a single tenant, who can control and customize it to their needs. The cloud infrastructure can be managed either by the cloud provider or by the tenant. Many companies are migrating their data centers to a Private Cloud to run core business functions like research, manufacturing, human resources, etc.

The Private Cloud model offers great levels of security and control, though cost benefits ought to be sacrificed to some extent.

Common Private Cloud providers include VMware and OpenStack.

Hybrid Cloud

As the name suggests, a Hybrid Cloud is a composition of both Public Cloud and Private Cloud infrastructure. A company can use the Private Cloud to run mission-critical operations and the Public Cloud to run non-sensitive, high-demand operations.

Companies using the Hybrid Cloud model benefit from the security and control of the Private Cloud along with the off-hand management and cost benefits of the Public Cloud.

Want To Learn Azure From Industry Experts?

I hope you have enjoyed reading this blog. Now you know what Cloud Computing is and what its different services are. To get in-depth knowledge of Cloud Computing and take your skills to the next level, you can enrol for the Cloud Masters Program by Edureka. If you have any questions, please mention them in the comments section of this Cloud Computing Services blog and we will get back to you as soon as possible.

The post Cloud Computing Services: A Deeper Dive Into Cloud Computing appeared first on Edureka Blog.


Linear Regression Algorithm from Scratch

$
0
0

What is Linear Regression?

Linear regression is one of the simplest statistical models in machine learning. It is used to model the linear relationship between a dependent variable and one or more independent variables.

Linear Regression - AI vs Machine Learning vs Deep Learning - Edureka

Before we drill down into linear regression in depth, let me give you a quick overview of what regression is, since Linear Regression is one type of Regression algorithm.

What is Regression?

Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent variable and one or more independent variables.

Types of Regression

  • Linear Regression

  • Logistic Regression

  • Polynomial Regression

  • Stepwise Regression

Linear Regression vs Logistic Regression

 

Basis of comparison between Linear Regression and Logistic Regression:

  • Core Concept: Linear Regression models the data using a straight line, whereas Logistic Regression models the data using a sigmoid function.
  • Used with: Linear Regression is used with continuous variables, whereas Logistic Regression is used with categorical variables.
  • Output/Prediction: Linear Regression predicts the value of the variable, whereas Logistic Regression predicts the probability of occurrence of an event.
  • Accuracy and Goodness of Fit: Linear Regression is measured by loss, R-squared, Adjusted R-squared, etc., whereas Logistic Regression is measured by Accuracy, Precision, Recall, F1 score, ROC curve, Confusion Matrix, etc.

A short code sketch contrasting the two models follows below.
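To make the comparison concrete, here is a minimal sketch using scikit-learn on a tiny made-up dataset; the feature values and labels below are purely illustrative and are not taken from this blog’s dataset.

# Contrasting LinearRegression (continuous target) with LogisticRegression (categorical target)
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # independent variable
y_continuous = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])    # continuous dependent variable
y_categorical = np.array([0, 0, 0, 1, 1, 1])               # categorical (binary) dependent variable

linear = LinearRegression().fit(X, y_continuous)        # fits a straight line
logistic = LogisticRegression().fit(X, y_categorical)   # fits a sigmoid

print(linear.predict([[3.5]]))           # predicts a value of the variable
print(logistic.predict_proba([[3.5]]))   # predicts the probability of each class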

Machine Learning Certification Training Using Python

Where is Linear Regression Used?

1. Evaluating Trends and Sales Estimates 

Impact of Price Change - Linear Regression from scratch using Python - edureka

Linear regressions can be used in business to evaluate trends and make estimates or forecasts.
For example, if a company’s sales have increased steadily every month for the past few years, conducting a linear analysis on the sales data with monthly sales on the y-axis and time on the x-axis would produce a line that depicts the upward trend in sales. After creating the trend line, the company could use the slope of the line to forecast sales in future months.

2. Analyzing the Impact of Price Changes

sales forecast- Linear Regression from scratch using Python - edureka

Linear regression can also be used to analyze the effect of pricing on consumer behaviour.

For example, if a company changes the price of a certain product several times, it can record the quantity it sells at each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable. The result would be a line that depicts the extent to which consumers reduce their consumption of the product as prices increase, which could help guide future pricing decisions.

 

3. Assessing Risk

Risk Analysis - Linear Regression from scratch using Python - edureka

Linear regression can be used to analyze risk.

For example, a health insurance company might conduct a linear regression plotting the number of claims per customer against age and discover that older customers tend to make more health insurance claims. The results of such an analysis might guide important business decisions made to account for risk.

How does the Linear Regression Algorithm work?


Least Square Method – Finding the best fit line

Least squares is a statistical method used to determine the best-fit line, or regression line, by minimizing the sum of squared errors. The “square” here refers to squaring the distance between a data point and the regression line. The line with the minimum value of the sum of squares is the best-fit regression line.

 

Regression Line: y = mx + c, where

y = Dependent Variable

x = Independent Variable

m = Slope of the line

c = y-Intercept
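For completeness, the least-squares estimates of m and c have a standard closed form, which is exactly what the Python implementation below computes (here \bar{x} and \bar{y} denote the means of x and y):

m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x}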

Least Square Method – Implementation using Python

For the implementation part, I will be using a dataset consisting of head size and brain weight of different people.

# Importing Necessary Libraries

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)

# Reading Data
data = pd.read_csv('headbrain.csv')
print(data.shape)
data.head()


# Collecting X and Y
X = data['Head Size(cm^3)'].values
Y = data['Brain Weight(grams)'].values

In order to find the value of m and c, you first need to calculate the mean of X and Y

# Mean X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)

# Total number of values
n = len(X)

# Using the formula to calculate m and c
numer = 0
denom = 0
for i in range(n):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = numer / denom
c = mean_y - (m * mean_x)

# Print coefficients
print(m, c)

The values of m and c calculated above are then substituted into the regression equation:

Brain Weight = c + m × Head Size

Plotting Linear Regression Line

Now that we have the equation of the line, for each actual value of x we can find the predicted value of y. Once we get these points, we can plot them and draw the Linear Regression Line.

# Plotting Values and Regression Line
max_x = np.max(X) + 100
min_x = np.min(X) - 100
# Calculating line values x and y
x = np.linspace(min_x, max_x, 1000)
y = c + m * x 

# Plotting Line
plt.plot(x, y, color='#52b920', label='Regression Line')
# Plotting Scatter Points
plt.scatter(X, Y, c='#ef4423', label='Scatter Plot')

plt.xlabel('Head Size in cm3')
plt.ylabel('Brain Weight in grams')
plt.legend()
plt.show()

R Square Method – Goodness of Fit

The R-squared value is a statistical measure of how close the data are to the fitted regression line.

Calculation of R-square - Linear Regression Algorithm - Edureka

where:

y = actual value

ȳ = mean value of y

yp = predicted value of y
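For reference, the quantity the figure above describes is the standard coefficient of determination:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - y_{p,i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

The numerator is the sum of squared residuals (ss_r in the code below) and the denominator is the total sum of squares (ss_t).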

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

R square – Implementation using Python

# ss_t is the total sum of squares and ss_r is the sum of squares of residuals (see the formula above)
ss_t = 0
ss_r = 0
for i in range(n):
    y_pred = c + m * X[i]
    ss_t += (Y[i] - mean_y) ** 2
    ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r/ss_t)
print(r2)

Linear Regression – Implementation using scikit learn

If you have reached this far, I assume you now have a good understanding of the Linear Regression Algorithm using the Least Square Method. Now it’s time to show you how you can simplify things and implement the same model using a machine learning library called scikit-learn.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# scikit-learn expects a 2D feature array, not a rank-1 array
X = X.reshape((n, 1))
# Creating Model
reg = LinearRegression()
# Fitting training data
reg = reg.fit(X, Y)
# Y Prediction
Y_pred = reg.predict(X)

# Calculating R2 Score
r2_score = reg.score(X, Y)

print(r2_score)
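As a quick follow-up, the fitted model can also be used to predict the brain weight for a new observation. The head size value below is purely hypothetical, and the snippet assumes the reg object and the imports from the previous block are still available.

# Predicting the brain weight for a hypothetical head size of 4000 cm^3
new_head_size = np.array([[4000]])   # must be a 2D array, just like the training data
predicted_weight = reg.predict(new_head_size)
print(predicted_weight)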
Machine Learning Certification Training Using Python

This was all about the Linear Regression Algorithm using Python. In case you are still left with a query, don’t hesitate to add your doubt in the blog’s comment section.

The post Linear Regression Algorithm from Scratch appeared first on Edureka Blog.

Top 10 Reasons To Learn Cybersecurity

$
0
0

Top 10 Reasons To Learn Cybersecurity

Cybersecurity has become a key area of job growth in the last few years, which has resulted in an influx of people opting for a Cybersecurity career. Even so, there are a number of people who are still having second thoughts as to whether they should jump into the unknown waters of Cybersecurity for their professional life. This blog, addressing the “Top 10 Reasons To Learn Cybersecurity”, should definitely help the confused folks among you make up your minds.

Below are the key factors that, in my opinion, have made Cybersecurity such a brilliant career choice for many:


Now let’s discuss these factors in detail

10. Cybersecurity – An Evergreen Industry

Evergreen Industry - Top 10 Reasons to Learn Cybersecurity - Edureka

Cybersecurity has slowly transformed into an evergreen industry. Like air pollution was a by-product of the industrial revolution, cyber attacks are a similar by-product of the digital revolution. Keeping this situation in mind, and looking at the advances we have made as a community since the invention of the internet, I think it’s an obvious conclusion that Cybersecurity as a viable career option is here to stay. With the advent of topics like Big Data, Internet of Things and Cloud Computing the permanent stature of Cybersecurity and the magnitude of its importance has been very well set in stone. So if you wish to learn cybersecurity in today’s age, it’s definitely a good idea.

9. Travel the World with Cybersecurity

Travel the World - Top 10 Reasons to Learn Cybersecurity - Edureka


For those of you who aspire to travel the globe, cybersecurity might just be the perfect career path. Thousands of home-grown cybersecurity experts are working to protect businesses, government agencies and general consumers. On a global scale, the rise in cyber-attacks is outpacing the supply of cyber-defenders. This results in plenty of opportunities for cybersecurity professionals and experts to travel overseas and offer their skills, which are in high demand. Hence, if you have ever wanted to work in a different country, then a career in cybersecurity might just be your perfect passport to success!

 

8. A Career that Serves the Greater Good 

Cybersecurity companies have defended us time and time again against a variety of cyber attacks that try to compromise our confidentiality, availability and integrity. Even so, the number of cyber crimes is only increasing day by day. Millions are falling prey to phishing scams, ransomware, spyware and DDoS attacks. The online threat to companies, big or small, and to individuals is large and growing. Around the world, national crime agencies, police forces and company security teams are all fighting this menace, but they need more help. They need people like you. If you want the satisfaction of doing a rewarding job and if you want to make a real difference, learn cybersecurity and join the industry!

Greater Good - Top 10 Reasons to Learn Cybersecurity - Edureka

 

7. A Chance to Work with Secret Agencies

Secret Agencies - Top 10 Reasons to Learn Cybersecurity - Edureka


It’s certain that Cybersecurity Professionals have a clear shot at working with prestigious Fortune 500 companies like Dell, Accenture, InfoTech, etc., but the potential doesn’t end there. Experts who prove worthy of their skills may earn the chance to work with top-secret government and intelligence agencies, e.g. MI6, Mossad or the NSA. So if you learn cybersecurity, you might just become a top-secret agent!

6. No Math!

No Math - Top 10 Reasons to Learn Cybersecurity - Edureka

It’s a known fact that not everyone shares the same love and affection for maths that some people seem to have. If you recognize yourself as a person who has always had an aversion to mathematics, then a career in cybersecurity should be right up your alley. Most cybersecurity courses involve little to no mathematics. Instead, you spend your time honing skills like programming and networking, which help you build a career-specific skill set!

A Cybersecurity career seems interesting, right? Check out our exhaustive live-online course!

 

5. Unlimited Potential for Personal Growth

Growth Potential - Top 10 Reasons to Learn Cybersecurity - Edureka

Cyber attacks are getting smarter by the day. Cybersecurity professionals are always busy outsmarting black hat hackers, patching vulnerabilities and analyzing the risk of an organization. Tackling such attacks in an ever-advancing industry only comes with continuous study and thorough research. This means that after you learn cybersecurity and start working, your knowledge is continuously enriched, your wisdom gets honed with experience, and thus the sky is the limit when we are talking about personal growth in the cybersecurity industry.

4.  Plenty of Opportunities

We Want You - Top 10 Reasons to Learn Cybersecurity - Edureka

There are over a million companies in this world, spread across a variety of sectors and industries, and a large proportion of them share one thing in common today, i.e. the need for an internet connection. More than 400,000 people already work in the information security industry, and demand for cyber skills is growing fast in every type of company and government department. So, whether you dream of working in sports or fashion, media or the emergency services, finance or retail, cyber skills could be your gateway, as everyone needs someone to defend their sensitive data.

3. A Variety of Industries to Choose From

As a cybersecurity professional, you are not confined to a single industry, unlike much of the professional world. Digitalisation is taking place across many industries. With advancements in the fields of IoT, Big Data, Automation and Cloud Computing, we could say we are going through a digital revolution. So being a cybersecurity professional doesn’t stop you from working in a hospital, a school, a government agency or even a top-secret military agency. The gates are wide open, as almost everybody wants to be secure on the digital front.

Digital Revolution - Top 10 Reasons to Learn Cybersecurity - Edureka

2. A Job that Never Gets Boring

Challenge - Top 10 Reasons to Learn Cybersecurity - Edureka!


Due to the unpredictable nature of the future, a career in cybersecurity is not and cannot be static and stale. You will be challenged on a regular basis. There will be new and unexpected failures as well as amazing and surprising discoveries. One certainty is that attackers will continue to develop new exploits on a constant basis, and it is your job to find creative and optimized solutions to the problems that arise. As a cybersecurity professional, you will be solving new puzzles, fighting off new demons and supporting new activities on a regular basis. So if you tend to get bored easily when things become monotonous, fret not: Cybersecurity never gets boring!

1. Fat Pay Cheques

Money Makes the World Go Round - Top 10 Reasons to Learn Cybersecurity - Edureka

I think we all can agree that ‘money makes the world go round’. The world has realized the sheer importance of cybersecurity, with stories of new cyber attacks in the news almost every week. Faced with online attacks, businesses and government agencies are looking for experts who can protect their systems from cybercriminals, and they are willing to pay high salaries and provide training and development. There are great opportunities for anyone starting a career in cybersecurity:

  • Salaries in cybersecurity have a greater growth potential than 90% of other industries
  • For senior security professionals, earnings can surpass the average median by a vast amount
  • Earnings are based on merit, not your sex, age or ethnicity.

 

Dive Deep into the World of Cybersecurity Today

I hope my blog on “Top 10 Reasons to Learn Cybersecurity” was relevant for you and helped you make up your mind. To get in-depth knowledge of Cybersecurity down to its intricacies, check out our interactive, live-online Cybersecurity Certification Course, which comes with 24*7 support to guide you throughout your learning period.

The post Top 10 Reasons To Learn Cybersecurity appeared first on Edureka Blog.

Big Data Analytics – Turning Insights Into Action

$
0
0

Just like the entire universe and our galaxy is said to have formed due to the Big Bang explosion, due to so many technological advancements, data has also been growing exponentially, leading to the Big Data explosion. This data comes in from various sources, has different formats, is generated at a variable rate and may also contain inconsistencies. Thus, we can simply term the explosion of such data Big Data. I will be explaining the following topics in this blog to give you insights into Big Data Analytics:

Why Big Data Analytics?

Before I jump on to tell you about what is Big Data Analytics, let me tell you guys about why it is needed. Let me also reveal to you guys that we create about 2.5 quintillion bytes of data every day! So now that we have accumulated Big Data, neither can we ignore it nor can we let it stay idle and make it go to waste.

Various organizations and sectors all across the globe started adopting Big Data Analytics in order to gain numerous benefits. Big Data Analytics gives insights which many companies are turning into actions and making huge profits as well as discoveries. I am going to list down four such reasons along with interesting examples.

The first reason is,

  1. Making Smarter and More Efficient Organisations  Smarter Organisations - Big Data Analytics - Edureka
    Let me tell you about one such organisation, the New York Police Department (NYPD). The NYPD brilliantly uses Big Data and analytics to detect and identify crimes before they occur. They analyse historical arrest patterns and then map them against events such as federal holidays, paydays, traffic flows, rainfall, etc. This aids them in analyzing the information immediately by utilizing these data patterns. The Big Data and analytics strategy helps them identify crime locations, to which they deploy their officers. Thus, by reaching these locations before the crimes are committed, they prevent the occurrence of crime.

  2. Optimize Business Operations by Analysing Customer Behaviour Optimizing Business - Big Data Analytics - Edureka  Most organisations use behavioural analytics of customers in order to provide customer satisfaction and, hence, increase their customer base. The best example of this is Amazon. Amazon is one of the best and most widely used e-commerce websites, with a customer base of about 300 million. They use customer click-stream data and historical purchase data to provide them with customized results on customized web pages. Analysing the clicks of every visitor on their website aids them in understanding their site-navigation behaviour, the paths users took to buy the product, the paths that led them to leave the site, and more. All this information helps Amazon improve their user experience, thereby improving their sales and marketing.
  3. Cost Reduction Cost Reduction - Big Data Analytics - Edureka  Big Data technologies and technological advancements like cloud computing bring significant cost advantages when it comes to storing and processing Big Data. Let me tell you how healthcare utilizes Big Data Analytics to reduce its costs. Patients nowadays are using new sensor devices at home and outside, which send constant streams of data that can be monitored and analysed in real time to help patients avoid hospitalization by self-managing their conditions. For hospitalized patients, physicians can use predictive analytics to optimize outcomes and reduce readmissions. Parkland Hospital uses analytics and predictive modelling to identify high-risk patients and predict likely outcomes once patients are sent home. As a result, Parkland reduced 30-day readmissions for patients with heart failure by 31%, saving $500,000 annually.

  4. New Generation Products

With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. I have found three such interesting products to cite here. New Generation Products - Big Data Analytics - Edureka  First, Google’s self-driving car, which makes millions of calculations on every trip that help the car decide when and where to turn, whether to slow down or speed up, and when to change lanes: the same decisions a human driver makes behind the wheel.

The second one is Netflix, which committed to two seasons of its extremely popular show House of Cards by completely trusting Big Data Analytics! Last year, Netflix grew its US subscriber base by 10% and added nearly 20 million subscribers from around the globe.

The third example, one of the really cool new things I have come across, is a smart yoga mat. The first time you use your Smart Mat, it will take you through a series of movements to calibrate your body shape, size and personal limitations. This personal profile information is stored in your Smart Mat App and will help Smart Mat detect when you’re out of alignment or balance. Over time, it will automatically evolve with updated data as you improve your Yoga practice.

Master Big Data with Edureka

What is Big Data Analytics?

Now let us formally define “What is Big Data Analytics?”  Big data analytics examines large and different types of data to uncover hidden patterns, correlations and other insights. Basically, Big Data Analytics is largely used by companies to facilitate their growth and development. This majorly involves applying various data mining algorithms on the given set of data, which will then aid them in better decision making.

Stages in Big Data Analytics

The following stages are involved in the Big Data Analytics process:

Stages in Big Data Analytics - Big Data Analytics - Edureka

 

Types of Big Data Analytics

There are four types:

  1. Descriptive Analytics: It uses data aggregation and data mining to provide insight into the past and answer: “What has happened?” Descriptive analytics does exactly what the name implies: it “describes”, or summarizes, raw data and makes it interpretable by humans. Descriptive Analytics - Big Data Analytics - Edureka
  2. Predictive Analytics: It uses statistical models and forecasting techniques to understand the future and answer: “What could happen?” Predictive analytics provides companies with actionable insights based on data. It provides estimates about the likelihood of a future outcome. Predictive Analytics - Big Data Analytics - Edureka
  3. Prescriptive Analytics: It uses optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?” It allows users to “prescribe” a number of different possible actions and guides them towards a solution. In a nutshell, this analytics is all about providing advice. Prescriptive Analytics - Big Data Analytics - Edureka
  4. Diagnostic Analytics: It is used to determine why something happened in the past. It is characterized by techniques such as drill-down, data discovery, data mining and correlations. Diagnostic analytics takes a deeper look at data to understand the root causes of events.

Diagnostic Analytics - Big Data Analytics - Edureka

Learn Big Data From Experts

Big Data Tools

These are some of the tools used for Big Data Analytics: Hadoop, Pig, Apache HBase, Apache Spark, Talend, Splunk, Apache Hive and Kafka.

Tools of Big Data Analytics - Big Data Analytics - Edureka

Big Data Domains

 

Domains of Big Data Analytics - Big Data Analytics - Edureka

  • Healthcare: Healthcare is using big data analytics to reduce costs, predict epidemics, avoid preventable diseases and improve the quality of life in general. One of the most widespread applications of Big Data in healthcare is Electronic Health Record(EHRs). 
  • Telecom: The telecom industry is one of the most significant contributors to Big Data. It improves the quality of service and routes traffic more effectively. By analysing call data records in real time, these companies can identify fraudulent behaviour and act on it immediately. The marketing division can modify its campaigns to better target its customers and use the insights gained to develop new products and services.
  • Insurance: These companies use big data analytics for risk assessment, fraud detection, marketing, customer insights, customer experience and more. 
  • Government: The Indian government used big data analytics to get an estimate of trade in the country. They used Central sales tax invoices to analyse the extent to which states trade with each other. 
  • Finance: Banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions. The analytics systems suggest immediate actions, such as blocking irregular transactions, which stops fraud before it occurs and improves profitability. 
  • Automobile: Rolls Royce has embraced Big Data by fitting hundreds of sensors into its engines and propulsion systems, which record every tiny detail about their operation. Changes in the data are reported to engineers in real time, who then decide the best course of action, such as scheduling maintenance or dispatching engineering teams.
  • Education: This is one field where Big Data Analytics is being absorbed slowly and gradually. Opting for big-data-powered technology as a learning tool, instead of traditional lecture methods, enhances the learning of students and helps teachers track their performance better.
  • Retail: Retail including e-commerce and in-stores are widely using Big Data Analytics to optimize their business. For example, Amazon, Walmart etc.

Check Out Our Big Data Course

Big Data Use Cases

The first use case that I have taken here is of Starbucks.

Starbucks Use Case - Big Data Analytics - Edureka

The second use case I want to share with you guys is of Procter&Gamble.

P&G Use Case - Big Data Analytics - Edureka

Trends in Big Data Analytics

The image below depicts the market revenue of Big Data in billion U.S. dollars from the year 2011 to 2027.

Big Data Market Revenue - What is Big Data - Edureka

Here are some Facts and Statistics by Forbes:

Facts & Statistics of Big Data Analytics - Big Data Analytics - Edureka

Career prospects in Big Data Analytics:

Career Prospects in Big Data Analytics - Big Data Analytics - Edureka

  • Salary Aspects: The average salary of analytics jobs is around $94,167. Data Scientist has been named the best job in America for three years running, with a median base salary of $110,000 and 4,524 job openings. In India, the percentage of analytics professionals commanding salaries of less than INR 10 Lakhs has gone down, while the percentage earning more than INR 15 Lakhs has increased from 17% in 2016 to 21% in 2017 and 22.3% in 2018.
  • Huge Job Opportunities: Companies like Google, Apple, IBM, Adobe, Qualcomm and many more hire Big Data Analytics Professionals.

Skillset 

These are some of the skills which are required depending upon the role in the field of Big Data Analytics :

Skill Set Required for Big Data Analytics - Big Data Analytics - Edureka

  • Basic programming: One should have knowledge about at least some general purpose programming language such as Java and Python.
  • Statistical and Quantitative Analysis: Having an idea about statistics and quantitative analysis is ideal.
  • Data Warehousing: Knowledge of SQL and NoSQL databases is required.
  • Data Visualization: It is very important to know how to visualize the data in order to be able to understand the insights and apply it in action.
  • Specific Business Knowledge: One must necessarily be aware of the business where they are applying analytics in order to optimize their operations.
  • Computational Frameworks: Preferably one should know about at least one or two tools which are required for Big Data Analytics.

Know More About Big Data

Now that you know Big Data Analytics, check out the Big Data Course by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become expert in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases on Retail, Social Media, Aviation, Tourism, Finance domain.

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Big Data Analytics – Turning Insights Into Action appeared first on Edureka Blog.

Understanding Kubernetes Architecture

$
0
0

Containers have become the definitive way to develop applications because they provide packages that contain everything you need to run your applications. In this blog, we will discuss the Kubernetes architecture and the moving parts of Kubernetes, as well as its key elements and their roles and responsibilities in the Kubernetes architecture.

Kubernetes: An Overview

Kubernetes is an open-source Container Management tool which automates container deployment, container (de)scaling & container load balancing. 

  • What is Kubernetes - Kubernetes Architecture - Edureka Written in Golang, it has a huge community, since it was first developed by Google and later donated to the CNCF
  • It can group any number (‘n’) of containers into one logical unit for managing and deploying them easily

Note: Kubernetes is not a containerization platform. It is a multi-container management solution.

Going by the definition, you might feel Kubernetes is very ordinary and unimportant. But trust me, this world needs Kubernetes for managing containers as much as it needs Docker for creating them. Let me tell you why! If you would prefer a video explanation, you can go through the video below.

Want To Explore More About Kubernetes
 

Features Of Kubernetes

For a detailed explanation, check this blog

Feature of Kubernetes - Kubernetes Architecture - Edureka 1
Kubernetes Architecture/Kubernetes Components

Kubernetes Architecture - Kubernetes Architecture - Edureka

Kubernetes Architecture has the following main components:

  • Master nodes
  • Worker/Slave nodes
  • Distributed key-value store (etcd)

Master Node

Master Node - Kubernetes Architecture - Edureka

The master node is the entry point for all administrative tasks and is responsible for managing the Kubernetes cluster. There can be more than one master node in the cluster for fault tolerance. Having more than one master node puts the system in High Availability mode, in which one of them acts as the main node on which we perform all the tasks.

For managing the cluster state, it uses etcd, to which all the master nodes connect.

Let us discuss the components of a master node. As you can see in the diagram it consists of 4 components:

API server: 

  • Performs all the administrative tasks through the API server within the master node.
  • REST commands are sent to the API server, which validates and processes the requests.
  • After a request is processed, the resulting state of the cluster is stored in the distributed key-value store.

Scheduler: 

  • The scheduler schedules the tasks to slave nodes. It stores the resource usage information for each slave node.
  • It schedules the work in the form of Pods and Services.
  • Before scheduling the task, the scheduler also takes into account the quality of the service requirements, data locality, affinity, anti-affinity, etc. 

Controller manager: 

  • Also known as controllers.
  • It is a daemon that regulates the Kubernetes cluster and manages the different non-terminating control loops.
  • It also performs lifecycle functions such as namespace creation and lifecycle, event garbage collection, terminated-pod garbage collection, cascading-deletion garbage collection, node garbage collection, etc.
  • Basically, a controller watches the desired state of the objects it manages and watches their current state through the API server. If the current state of the objects it manages does not meet the desired state, then the control loop takes corrective steps to make sure that the current state is the same as the desired state.

What is the ETCD?

  • etcd is a distributed key-value store which stores the cluster state (a short usage sketch follows this list).
  • It can be part of the Kubernetes Master, or, it can be configured externally.
  • etcd is written in the Go programming language. In Kubernetes, besides storing the cluster state (based on the Raft Consensus Algorithm) it is also used to store configuration details such as subnets, ConfigMaps, Secrets, etc.
  • Raft is a consensus algorithm designed as an alternative to Paxos. The consensus problem involves multiple servers agreeing on values, a common problem that arises in the context of replicated state machines. Raft defines three different roles (Leader, Follower and Candidate) and achieves consensus via an elected leader.
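To get a feel for etcd as a plain key-value store, here is a minimal sketch using the third-party python-etcd3 client. The host, port and key names are assumptions made purely for illustration, and the snippet needs an etcd endpoint you can actually reach.

# Minimal sketch: writing and reading a key in etcd
# (assumes the 'etcd3' package is installed and an etcd server is reachable at the given host/port)
import etcd3

etcd = etcd3.client(host='127.0.0.1', port=2379)

# Store a piece of configuration, much like Kubernetes stores cluster state in etcd
etcd.put('/demo/cluster-name', 'my-cluster')

# Read it back; get() returns the value bytes and their metadata
value, metadata = etcd.get('/demo/cluster-name')
print(value.decode())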

Now that you have understood the functioning of the master node, let’s look at the worker (minion) node and its components.

Worker Node (formerly minions)

Worker Node - Kubernetes Architecture - Edureka

The worker node is a physical server, or a VM, which runs the applications using Pods (a Pod is the scheduling unit) and is controlled by the master node. Pods are scheduled onto the worker/slave nodes. To access the applications from the external world, we connect to the worker nodes.

Let’s look at its components:

Container runtime: 

  • To run and manage a container’s lifecycle, we need a container runtime on the worker node. Some examples of container runtimes are containerd, CRI-O and rkt.
  • Sometimes, Docker is also referred to as a container runtime, but to be precise, Docker is a platform which uses containerd as its container runtime.

Kubelet: 

  • It is an agent which communicates with the master node and runs on the worker nodes. It gets the Pod specifications through the API server, runs the containers associated with the Pod, and ensures that the containers described in those Pods are running and healthy.

Kube-proxy: 

  • Kube-proxy runs on each node to deal with individual host sub-netting and ensure that the services are available to external parties.
  • It serves as a network proxy and a load balancer for a service on a single worker node and manages the network routing for TCP and UDP packets.
  • It is the network proxy which runs on each worker node and listens to the API server for each Service endpoint creation/deletion.
  • For each Service endpoint, kube-proxy sets up the routes so that the endpoint can be reached.

Pods

A pod is one or more containers that logically go together. Pods run on nodes and operate as a single logical unit, so they share the same storage and the same IP address, and the containers within a pod can reach each other via localhost. The containers of a single pod always run on the same machine, but different pods can be spread across machines, and one node can run multiple pods.
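As a small illustration of how everything flows through the API server, here is a sketch using the official Kubernetes Python client to list pods and the nodes they were scheduled on. It assumes the kubernetes Python package is installed and that a valid kubeconfig for a running cluster is available on your machine.

# Sketch: listing pods through the API server with the official Kubernetes Python client
from kubernetes import client, config

config.load_kube_config()   # reuse the credentials in ~/.kube/config
v1 = client.CoreV1Api()     # CoreV1Api talks to the API server

pods = v1.list_pod_for_all_namespaces(watch=False)
for pod in pods.items:
    # Each pod reports the node it was scheduled on and its own IP address
    print(pod.metadata.namespace, pod.metadata.name, pod.spec.node_name, pod.status.pod_ip)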

Use Case: How Luminis Technologies used Kubernetes in production

Problem: Luminis is a software technology company which used AWS for deploying its applications. Deploying the applications required custom scripts and tools for automation, which was not easy for teams other than operations, and small teams didn’t have the resources to learn all of the details of those scripts and tools.

Main Issue: There was no unit-of-deployment which created a gap between the development and the operations teams. 

Solution: 

How did they Deploy in Kubernetes:

Use case - Kubernetes Architecture - Edureka

They used a blue-green deployment mechanism to reduce the complexity of handling multiple concurrent versions (since only one version of the application receives live traffic at a time).

For this, their team created a component called “Deployer” to orchestrate the deployments, and open-sourced its implementation under the Apache License as part of the Amdatu umbrella project. This mechanism performed health checks on the pods before re-configuring the load balancer, because they wanted each deployed component to provide a health check.

How did they Automate Deployments?

With the Deployer in place, they were able to hook deployments into a build pipeline. After a successful build, their build server pushed a new Docker image to a registry on Docker Hub. The build server then invoked the Deployer to automatically deploy the new version to a test environment. The same image was promoted to production by triggering the Deployer on the production environment.

Subscribe to our youtube channel to get new updates..!

So that’s the Kubernetes architecture in a nutshell, and that brings us to the end of this blog on Kubernetes Architecture. Do look out for other blogs in this series, which will explain the various other aspects of Kubernetes.

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Understanding Kubernetes Architecture appeared first on Edureka Blog.

Kubernetes Dashboard Installation and Views

$
0
0

Kubernetes Dashboard is a general purpose, web-based UI for Kubernetes clusters. It allows users to manage applications running in the cluster and troubleshoot them, as well as manage the cluster itself.

So before moving on let us see what are the topics, we will be covering in this blog: 

What is Kubernetes Dashboard?

A Kubernetes dashboard is a web-based Kubernetes user interface which is used to deploy containerized applications to a Kubernetes cluster, troubleshoot the applications, and manage the cluster itself along with its attendant resources.

Uses of Kubernetes Dashboard

  • To get an overview of applications running on your cluster.
  • To create or modify the individual Kubernetes resources for example Deployments, Jobs, etc.
  • It provides the information on the state of Kubernetes resources in your cluster, and on any errors that may have occurred.

Implement Kubernetes Dashboard On Your Cluster

Installing the Kubernetes Dashboard

How to Deploy Kubernetes Dashboard?

Run the following command to deploy the dashboard:

kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml

Accessing the Dashboard using kubectl

kubectl proxy

This starts a proxy server between your machine and the Kubernetes API server.

Now, to view the dashboard in the browser, navigate to the following address in the browser of your Master VM:

http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/

You will then be prompted with this page, to enter the credentials:

In this step, we will create the service account for the dashboard and get its credentials.
Note: Run all these commands in a new terminal, otherwise your kubectl proxy command will stop. 

Run the following commands:

This command will create a service account for a dashboard in the default namespace

kubectl create serviceaccount dashboard -n default

Add the cluster binding rules to your dashboard account

kubectl create clusterrolebinding dashboard-admin -n default --clusterrole=cluster-admin --serviceaccount=default:dashboard

Copy the secret token required for your dashboard login using the below command:

kubectl get secret $(kubectl get serviceaccount dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode

token - Kubernetes dashboard - Edureka 4

Copy the secret token and paste it in Dashboard Login Page, by selecting a token option

After signing in, you will land on the Kubernetes Dashboard home page.

Home Page

You’ll see the home/welcome page, in which you can view which system applications are running by default in the kube-system namespace of your cluster, for example, the Dashboard itself.

homepage - Kubernetes dashboard - Edureka

Views of the Kubernetes Dashboard UI

Kubernetes Dashboard consists of following dashboard views:

  • Admin View
  • Workloads View
  • Services View
  • Storage and Config View 

Let’s start with the admin view.

Admin View

It lists Nodes, Namespaces and Persistent Volumes and provides a detailed view of each. The node list view contains CPU and memory usage metrics aggregated across all Nodes, and the node detail view shows the metrics for a Node, its specification, status, allocated resources, events and the pods running on it.

Node detail view

Workloads View

It is the entry point view that shows all applications running in the selected namespace. It summarizes the actionable information about the workloads, for example, the number of ready pods for a Replica Set or current memory usage for a Pod.

Workloads view

Services View

It shows the Kubernetes resources that allow for exposing services to the external world and discovering them within a cluster.

Service list partial view

Storage and Config View 

The Storage view shows the Persistent Volume Claim resources which are used by applications for storing data, whereas the Config view shows all the Kubernetes resources that are used for the live configuration of applications running in clusters.

Secret detail view

Subscribe to our youtube channel to get new updates..!

Got a question for us? Please mention it in the comments section and we will get back to you.

The post Kubernetes Dashboard Installation and Views appeared first on Edureka Blog.
