Thursday, February 13, 2014

Some Important Definitions Used in Hadoop

Distributed Processing :
Distributed processing is a phrase used to refer to a variety of computer systems that use more than one computer (or processor) to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs.
More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various sites. Most distributed processing systems contain sophisticated software that detects idle CPUs on the network and parcels out programs to utilize them.
Another form of distributed processing involves distributed databases. These are databases in which the data is stored across two or more computer systems. The database system keeps track of where the data is, so that the distributed nature of the database is not apparent to users.
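To make the distributed database idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name DistributedStore, the node names, and the hash-based placement rule are all hypothetical, and each "computer system" is modelled as a plain in-memory dictionary. The point it shows is that the caller uses put and get without ever knowing which node holds the data.

```python
import hashlib


class DistributedStore:
    """Toy key-value store whose data is spread across several 'nodes'.

    Hypothetical illustration: each node is just a dict in this process.
    """

    def __init__(self, node_names):
        self._names = sorted(node_names)
        self.nodes = {name: {} for name in self._names}

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # The caller never needs to know where the value lives.
        return self.nodes[self._node_for(key)].get(key)


store = DistributedStore(["node-a", "node-b", "node-c"])
store.put("customer:42", {"name": "Asha"})
print(store.get("customer:42"))
```

The store keeps track of where each key lives, so from the user's side it behaves like a single database.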

Parallel Processing :
The simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more engines (CPUs) running it. In practice, it is often difficult to divide a program in such a way that separate CPUs can execute different portions without interfering with each other.
Most computers have just one CPU, but some models have several. There are even computers with thousands of CPUs. With single-CPU computers, it is possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software called distributed processing software.
Note that parallel processing differs from multitasking, in which a single CPU executes several programs at once.
Parallel processing is also called parallel computing.
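The idea of dividing a program into portions that do not interfere with each other can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and a thread pool stands in for the separate CPUs to keep the example portable. In CPython, CPU-bound work would normally need separate processes to actually run faster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, independent pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """Work done on one chunk; it touches no data outside that chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, workers=4):
    """Compute partial results per chunk, then combine them."""
    chunks = split_into_chunks(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(list(range(10))))
```

Because each chunk is processed without reading or writing any other chunk, the workers cannot interfere with one another, and the only shared step is the final sum.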

MPP :
Short for massively parallel processing, a type of computing that uses many separate CPUs running in parallel to execute a single program. MPP is similar to symmetric multiprocessing (SMP), with the main difference being that in SMP systems all the CPUs share the same memory, whereas in MPP systems each CPU has its own memory. MPP systems are therefore more difficult to program, because the application must be divided in such a way that all the executing segments can communicate with each other. On the other hand, MPP systems don't suffer from the bottleneck problems inherent in SMP systems when all the CPUs attempt to access the same memory at once.
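The MPP style, where each CPU has its own memory and segments communicate explicitly, can be imitated in a small Python sketch. Everything here is an assumption for illustration: threads stand in for the separate CPUs, each worker receives a private copy of its partition, and a queue plays the role of the interconnect that carries messages. The names worker and mpp_style_total are hypothetical.

```python
import queue
import threading


def worker(private_data, outbox):
    """One MPP-style segment: it sees only its own data.

    The result leaves the worker as an explicit message on the outbox.
    """
    outbox.put(sum(private_data))


def mpp_style_total(partitions):
    """Give each worker a private partition and combine the messages."""
    outbox = queue.Queue()
    threads = [
        # list(part) makes a private copy, so no memory is shared.
        threading.Thread(target=worker, args=(list(part), outbox))
        for part in partitions
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(outbox.get() for _ in threads)


print(mpp_style_total([[1, 2], [3, 4], [5]]))
```

The extra programming effort mentioned above shows up in the partitioning and in the message passing. In an SMP-style version, all workers would simply read from one shared list.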

Scalable Parallel Processor :
Abbreviated as SPP, a computer that uses parallel processing and can be upgraded by adding more CPUs to it, effectively increasing its computing power.

Scalability :
Scalability is the ability to add to (or subtract from) an environment without causing adverse effects, mainly on performance.

One more definition of Distributed Processing :
The distribution of applications and business logic across multiple processing platforms. Distributed processing implies that processing will occur on more than one processor in order for a transaction to be completed. In other words, processing is distributed across two or more machines, and the processes are most likely not running at the same time; that is, each process performs part of an application in a sequence. Often the data used in a distributed processing environment is also distributed across platforms.


Data Warehouse Definition

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
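The integrated, time-variant and non-volatile properties above can be sketched with a few lines of Python. This is a hedged illustration only: the product codes, the mapping tables, and the AddressHistory class are all hypothetical, and a real warehouse would use database tables rather than in-memory structures.

```python
from datetime import date

# Integrated: hypothetical mappings from each source's product code
# to the single key used inside the warehouse.
SOURCE_A_KEYS = {"SKU-001": "P1"}
SOURCE_B_KEYS = {"1001": "P1"}


def warehouse_product_key(source, code):
    """Translate a source-specific product code to the warehouse key."""
    mapping = SOURCE_A_KEYS if source == "A" else SOURCE_B_KEYS
    return mapping[code]


class AddressHistory:
    """Time-variant and non-volatile: rows are only ever appended."""

    def __init__(self):
        self._rows = []  # (customer_id, address, valid_from)

    def record(self, customer_id, address, valid_from):
        self._rows.append((customer_id, address, valid_from))

    def all_addresses(self, customer_id):
        """What a warehouse can answer: every address on record."""
        return [a for c, a, _ in self._rows if c == customer_id]

    def latest_address(self, customer_id):
        """What a transaction system typically keeps: the newest one."""
        rows = [(d, a) for c, a, d in self._rows if c == customer_id]
        return max(rows)[1] if rows else None


history = AddressHistory()
history.record(7, "12 Old Road", date(2012, 3, 1))
history.record(7, "7 New Street", date(2013, 9, 15))
print(history.all_addresses(7))
print(history.latest_address(7))
```

Note that there is no update or delete method on the history: a change of address adds a new row instead of altering an old one.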

More concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.

In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, which is the data warehouse. Data warehouses store current and historical data and are used for creating trend reports for senior management, such as annual and quarterly comparisons.
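A quarterly comparison of the kind mentioned above might look like the following sketch, which uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and the sample rows are made up for illustration and do not come from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2013-01-15", 100.0),
        ("2013-02-20", 150.0),
        ("2013-05-03", 200.0),
        ("2013-11-30", 50.0),
    ],
)


def quarterly_totals(connection):
    """Return (year, quarter, total) rows for a quarterly report."""
    return connection.execute(
        """
        SELECT strftime('%Y', sale_date) AS year,
               (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS quarter,
               SUM(amount) AS total
        FROM sales
        GROUP BY year, quarter
        ORDER BY year, quarter
        """
    ).fetchall()


for row in quarterly_totals(conn):
    print(row)
```

The query works only because historical rows are kept. A system that stored just the latest figure could not produce the comparison.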

One more definition:
Abbreviated DW, a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.

Development of a data warehouse includes development of systems to extract data from operational systems, plus installation of a warehouse database system that provides managers flexible access to the data.

The term data warehousing generally refers to the combination of many different databases across an entire enterprise. Contrast with data mart.
 
