Big Data: Hadoop ecosystem and HP Haven

Hadoop was born nearly ten years ago. This cute little yellow elephant was a game-changing technology with almost everything we need to store and distribute data too large to fit into a single computer. It was developed to support the Nutch web crawler project by distributing the data across commodity hardware instead of large servers.

Hadoop is the technology that changed the way we store and analyse both real-time and batch-processed data. Yet, for all we hear about big data and Hadoop, it is surprising that some people still confuse Hadoop with big data itself.

Data on the move (Mobile)

Today, human beings and machines are amazingly interconnected. For instance, 28 years ago almost nobody had a mobile phone; today, everybody has one or even several. All these mobile phones produce data that is important for businesses. Hadoop is one of the few technologies that can efficiently and effectively collect, store and analyse these large quantities of data from both individuals and machines. Moreover, the hardware economics of Hadoop are compelling because it uses multiple computers and disks simultaneously. One research firm predicted that Hadoop alone would hit revenues of more than $24 billion by 2016.

More precisely, Hadoop is an open-source framework built around a distributed file system (HDFS) that can serve as a great storage ground (or "data space lot"). No matter the format of a file, whether the data is structured, semi-structured or unstructured, Hadoop can collect, store, process and manage large datasets across commodity hardware. Its ecosystem tools work natively with Hadoop; we refer to these as "Hadoop connectors", much like IDOL connectors, because they help store data into HDFS in real time.
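As a minimal sketch of landing a file of any format into HDFS (assuming the Python `hdfs` client library and a hypothetical namenode host exposing WebHDFS; adjust host, port and paths for your own cluster):

```python
# Minimal sketch: upload a file of any format into HDFS over WebHDFS.
# Assumes the `hdfs` Python package and a hypothetical namenode host/port.
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint; adjust host, port and user for your cluster.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Create a target directory and upload a local file; HDFS does not care
# whether the content is structured, semi-structured or unstructured.
client.makedirs("/data/raw/clickstream")
client.upload("/data/raw/clickstream/events.json", "events.json", overwrite=True)

# List what landed, to confirm the write.
print(client.list("/data/raw/clickstream"))
```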

Real-time analytics with Hadoop frameworks

Hadoop was designed for processing batch files across many computers. As time passes, however, Hadoop keeps improving its capabilities, especially around real-time data processing connectors. Using Hadoop ecosystem connectors, it can ingest and analyse real-time data at the same time: live, real-time data, not only batch-copied data.
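A minimal sketch of this ingest-and-analyse-at-once idea, assuming PySpark with the Kafka integration package and a hypothetical broker and topic name:

```python
# Minimal sketch: read a live Kafka topic with Spark Structured Streaming and
# analyse it as it arrives. Assumes PySpark plus the spark-sql-kafka package,
# and a hypothetical broker/topic; adjust names for your cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("realtime-ingest").getOrCreate()

# Ingest: subscribe to a live topic instead of reading batch files.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker.example.com:9092")
          .option("subscribe", "mobile-events")
          .load())

# Analyse at the same time: a continuously updated count of events per key.
counts = (events
          .select(col("key").cast("string").alias("key"))
          .groupBy("key")
          .count())

# Print the running result; in production this could be written to HDFS instead.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```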

Data distribution

Open-source technologies offer many ways of transferring data into Hadoop HDFS, where HDFS replicates the data across multiple computers and disks and YARN manages the processing. Hadoop connectors include Apache Storm, Apache Flume, Apache Kafka, Apache Nutch, Apache Sqoop and many more, collecting data from sources such as social network sites, enterprise data warehouses (EDW) and other BI technologies. After the data is collected into HDFS, it can be analysed using machine learning libraries built for Hadoop or simply using IDOL.
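To illustrate the first hop of such a pipeline, here is a minimal sketch (assuming the `kafka-python` client and a hypothetical broker and topic) of a small connector-style script pushing events towards Kafka, from where tools such as Flume or Spark can land them in HDFS:

```python
# Minimal sketch: a tiny connector-style producer that pushes JSON events into
# Kafka, from where Flume, Spark or similar tools can land them in HDFS.
# Assumes the kafka-python package and a hypothetical broker/topic name.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Pretend these events came from a social feed or an EDW change stream.
for i in range(10):
    event = {"user": f"user-{i}", "action": "click", "ts": time.time()}
    producer.send("social-events", value=event)

producer.flush()  # make sure everything is delivered before exiting
```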

Hadoop Security

Many people have said that Hadoop's security is weak and that data stored in HDFS is not safe. In truth, Hadoop can be secured, because it has security components such as Apache Ranger and Apache Knox. The Apache Knox API Gateway is designed as a reverse proxy, while Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. Together, Ranger and Knox help secure data as it is distributed across multiple computers and disks, whether in real time or as secured batch files. For governance, Hadoop uses Apache Atlas, Apache Falcon and the Apache proxy plug-in for the Tomcat application server. These components are the real tigers as far as security measures are concerned.
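As a minimal sketch of how Knox fronts cluster services (assuming a hypothetical Knox host, the "default" topology, placeholder credentials and a placeholder CA certificate path, using the `requests` library), a client talks to the gateway over HTTPS instead of hitting HDFS directly:

```python
# Minimal sketch: list an HDFS directory through the Apache Knox gateway
# instead of talking to the namenode directly. Host, topology, credentials
# and certificate path below are hypothetical placeholders.
import requests

KNOX_URL = "https://knox.example.com:8443/gateway/default/webhdfs/v1"

response = requests.get(
    f"{KNOX_URL}/data/raw/clickstream",
    params={"op": "LISTSTATUS"},
    auth=("analyst", "secret"),            # credentials checked at the gateway
    verify="/etc/ssl/certs/knox-ca.pem",   # gateway TLS certificate
)
response.raise_for_status()

# WebHDFS returns a FileStatuses listing for the directory.
for item in response.json()["FileStatuses"]["FileStatus"]:
    print(item["pathSuffix"], item["type"])
```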

Visualization tools for Big Data Hadoop

There are many tools out there that can help you present graphs, charts, maps, tables and shapes for your big data. The most popular visualization tools are TIBCO Spotfire and Tableau. However, there are free tools that are equally powerful, such as Lucidworks' Banana (part of its SiLK stack) and many more, and you can use D3.js or HTML5 to build one from scratch in a few days. Hadoop also supports the Apache NiFi dataflow tool, which can help you visualize data flows between two or more Hadoop clusters, with governance of the flow.

Web spider history

Web crawling technology has been operational in the market for many years. Apache Nutch, dating back to 2002, was designed to collect billions of web pages and to distribute and store all of that data for later analysis. Its main purpose is to fetch the web content, metadata and hyperlinks from websites, and to follow site pages with a robot to gather more content from subdomain links and other websites.

Nutch is not alone in this space. For instance, HP Autonomy developed Verity K2 with Autonomy Ultraseek technology back in 2000, a capability now called the HTTPConnector and counted among IDOL's 400 connectors. It was designed differently from Nutch: it spiders the page contents rather than downloading the whole site as Nutch does. It can index large volumes of unstructured information from different websites and then store the data in IDOL for further analysis.

Data connectors

Big data technologies like Hadoop and Haven require you first to collect data using connectors, then to store and analyse it to get business benefits. Storing data in Hadoop HDFS is not the biggest challenge today, because there is a handful of open-source connectors out there that can handle data collection in real time or in batch.

Similarly, data analysts can use Hive to query the data while data scientists use MapReduce, Spark, R and Mahout to mine it. The results can then be surfaced on a dashboard or in a search presentation layer.
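A minimal sketch of the analyst side (assuming PySpark with Hive support enabled and a hypothetical `sales` table already registered in the Hive metastore):

```python
# Minimal sketch: query a Hive table from Spark and hand the result to a
# dashboard or search layer. Assumes Hive support is enabled in Spark and a
# hypothetical `sales` table exists in the metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-analytics")
         .enableHiveSupport()
         .getOrCreate())

# Analysts write SQL; data scientists could continue with Spark ML on the same data.
top_items = spark.sql("""
    SELECT item, SUM(quantity) AS total_sold
    FROM sales
    GROUP BY item
    ORDER BY total_sold DESC
    LIMIT 10
""")

# Convert to a plain structure that a dashboard or search index can consume.
for row in top_items.collect():
    print(row["item"], row["total_sold"])
```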

HP IDOL technologies

HP Haven is a very powerful solution for the big data and Hadoop world because it combines some of the most sophisticated technologies, such as Hadoop, Autonomy IDOL, Vertica and more. In this blog we are looking at the IDOL and Hadoop solution, so let us examine the Hadoop ecosystem, the power it offers and its connectors in detail.

HP IDOL is the core technology of the Haven big data platform for deep data analysis in real time. IDOL's 400 connectors and support for 1,000 file formats enable you to collect large volumes of data from different sources and store them in the IDOL engine. This offers a powerful complement to Hadoop's connectors such as Nutch, Storm, Flume, Kafka and Sqoop.

These open-source connectors collect large volumes of data, while IDOL supports around 1,000 file formats and has almost no limit on how much data it can collect, in any format, from anywhere. Security on the source repositories is not a limitation either, because IDOL connectors inherit the security from the original sources rather than creating their own.
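As a hedged illustration only (assuming a hypothetical IDOL content server exposing its HTTP ACI interface on the usual port; exact parameters can differ between IDOL versions), querying indexed data can look like this:

```python
# Illustration only: query an IDOL server over its HTTP ACI interface.
# The host, port and parameters here are assumptions and may differ between
# IDOL versions; treat this as a sketch rather than a reference call.
from urllib.parse import quote

import requests

IDOL_HOST = "http://idol.example.com:9000"   # hypothetical ACI endpoint
text = quote("customer sentiment on new product")

# ACI requests take the form /action=<Name>&<Parameter>=<Value>...
url = f"{IDOL_HOST}/action=Query&text={text}&maxresults=10&responseformat=json"

response = requests.get(url)
response.raise_for_status()
print(response.json())
```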

  • Data analysis: Hadoop's machine learning library, Mahout, offers powerful libraries for deep data analysis that can help with sentiment analysis and data classification (see the sketch after this list). IDOL's out-of-the-box technologies likewise support sentiment analysis, similarity matching and more.
  • Velocity: For gigabytes of data, IDOL indexes and analyses in real time faster than Hadoop, but Spark on Hadoop is faster still for large data-lake analysis because of its in-memory processing. Spark is everywhere these days and is steadily replacing MapReduce; it is a safe technology, with some weaknesses that are still acceptable for production environments.
  • What is real-time data processing? In the past, real-time data processing was used mainly by point-of-sale (POS) systems in commercial settings to update inventory in real time, show whether items had been sold or were still in the store, and track sales of a particular item. This allowed an organization to process payments in real time. Today, real-time data processing is essential for organizational decision making and much more.
  • Hadoop and IDOL benefits: Hadoop is the best option for storing and processing very large datasets from different sources on commodity hardware, while IDOL is one of the strongest analytics and enterprise search products in the world. There are huge business benefits in using both technologies to analyse medium-sized and large data lakes. If you are looking for deeply relevant data analysis without developing sophisticated machine learning on your own, consider HP IDOL for its real-time data governance and analysis.
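As a minimal sketch of the classification idea mentioned in the first bullet (using Spark ML rather than Mahout, with a tiny hypothetical labelled dataset), a sentiment-style classifier can be assembled as a pipeline:

```python
# Minimal sketch: a sentiment-style text classifier with Spark ML (used here
# in place of Mahout). The training rows are hypothetical; in practice they
# would come from labelled data stored in HDFS.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sentiment-sketch").getOrCreate()

train = spark.createDataFrame(
    [
        ("the product is great and works well", 1.0),
        ("terrible support, very disappointed", 0.0),
        ("love the new features", 1.0),
        ("it keeps crashing, waste of money", 0.0),
    ],
    ["text", "label"],
)

# Tokenize, hash terms into feature vectors, then fit a logistic regression.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 12),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

test = spark.createDataFrame([("great features but support is terrible",)], ["text"])
model.transform(test).select("text", "prediction").show(truncate=False)
```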

On-demand skills

Finding an expert who can help you with both Hadoop and HP Haven is hard, and getting one on short notice is almost impossible, yet these are the people who can solve big data analytics problems and turn data into something meaningful. Hadoop and HP Haven experts typically have 15+ years' experience in information technology and deep technical knowledge of many different tools. They are not necessarily data scientists or data analysts, so don't confuse a data scientist with a subject matter expert (SME); they are very different in both speciality and expertise.

Who is a data scientist?

Data scientist has been called the sexiest job in the IT world, but technical skill and expertise are essential in this field. Most data scientists need excellent programming experience: proficiency in Java, Pig, Python, HiveQL, Scala for Spark and R, along with machine learning, relational databases, ETL tools, data management and other programming languages.

Who is a Haven SME?

This niche Haven skill set is still in demand. An HP Haven SME is a person with excellent experience in enterprise search and search services who can support the Hadoop, Autonomy and Vertica technology components. This person has design, development and deployment experience across the Haven technologies and can integrate the Haven big data platform with third-party technologies and frameworks such as the Hadoop ecosystem. They also have good knowledge of IDOL's 400 connectors, applications, programming languages, knowledge management systems and records management systems.

Because the technologies are different, you still need to carefully select people with the right skills for the right job at the right time.