Big data glossary pdf

Thursday, March 21, 2019

We have compiled a Big Data glossary to serve as a guide for beginners. The list covers an extensive range of terminology.

Automatic Identification and Data Capture (AIDC): AIDC refers to a broad set of technologies used to collect data from an object, image, or sound without manual intervention.

Algorithm: An algorithm is a mathematical formula placed in software that performs an analysis on a set of data.

Artificial Intelligence: Artificial Intelligence refers to the process of developing intelligent machines and software that can perceive the environment, take the corresponding action as and when required, and even learn from those actions.

ACID Test: The ACID test checks a data transaction for atomicity, consistency, isolation, and durability. These four attributes are the benchmarks for ensuring the validity of a data transaction.
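As an illustration of the atomicity part of the ACID test, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for this example. It shows one way a failed transaction can be rolled back so that no partial update survives.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails, so the debit is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The second transfer debits the source account before the check fails, yet the stored balances are unchanged afterwards, which is the behavior atomicity asks for.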

Apache Avro: Apache Avro is a row-oriented object container storage format for Hadoop, as well as a remote procedure call and data serialization framework. Avro is optimized for write operations and includes a wire format for communication between nodes. Because the data definition travels with the serialized, persistent data, translation between different nodes is simpler. Avro uses JSON (JavaScript Object Notation) to define protocols and data types, and serializes data into a compact binary format.
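To make the point about JSON-defined data types concrete, here is a minimal sketch of a record schema written in Avro's JSON schema syntax, built with Python's standard json module only. The record name `User`, the namespace, and the fields are hypothetical. No Avro library is used, so this shows the schema definition and not the binary encoding.

```python
import json

# Hypothetical record schema in Avro's JSON schema syntax.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.glossary",
    "fields": [
        {"name": "name", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
}

# The schema is plain JSON text that can be stored alongside the data.
schema_json = json.dumps(user_schema, indent=2)
field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
print(field_names)
```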

Batch Processing: Batch processing is a standard computing strategy that involves processing data in large sets. This practice suits non-time-sensitive work that operates on very large datasets. The job is scheduled, and the system retrieves the results at a later time.

Big Data: Big Data is an umbrella term for huge, heterogeneous datasets that traditional computers or tools cannot process because of their volume, velocity, and variety.

Biometrics: Biometrics means using analytics and technology to identify people by one or more of their physical characteristics, such as fingerprint recognition, facial recognition, or iris recognition. It is commonly used in modern smartphones.

Business Intelligence: Business Intelligence is the general term used for the identification, extraction, and analysis of data.

Call Detail Records (CDRs): CDRs include data that a telecommunications company collects about phone calls, such as the call duration and the time when the call was made or received. This data is used in many analytical applications.

Cascading: Cascading is a higher level of abstraction for Hadoop that allows developers to create complex jobs quickly and easily in several different languages that run on the JVM, including Ruby, Scala, and more.

Cassandra: Cassandra is a popular open source database management system managed by The Apache Software Foundation. It was designed to handle large volumes of data across distributed servers.

Clojure: Clojure is particularly suitable for parallel data processing.

Cluster computing: Clustered computing is the practice of pooling the resources of multiple machines and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer, which handles communication between the individual nodes and coordinates work assignment.

Big data means big business, and every industry stands to benefit from it. Some users prefer to access HDFS remotely rather than through the heavy client-side native libraries. For example, some applications need to load data into and out of the cluster, or to interact with the HDFS data externally.

Weather data: Real-time weather data is now widely available for organizations to use in a variety of ways. For example, a logistics company can monitor local weather conditions to optimize the transport of goods, and a utility company can adjust energy distribution in real time.

XML databases: XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported, and serialized into any format needed.

ZooKeeper: ZooKeeper is a software project of the Apache Software Foundation. It is an open source service that provides centralized configuration and name registration for large distributed systems. ZooKeeper began as a subproject of Hadoop.

ACID test: A test applied to data for atomicity, consistency, isolation, and durability.

Aggregation: A process of searching, gathering, and presenting data.

Algorithm: A mathematical formula placed in software that performs an analysis on a set of data.

Anonymization: The severing of links between people in a database and their records to prevent the discovery of the source of the records.

Artificial Intelligence: Developing intelligent machines and software that are capable of perceiving the environment, taking corresponding action when required, and even learning from those actions.

Automatic identification and capture (AIDC): Any method of automatically identifying and collecting data on items, and then storing the data in a computer system.

Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files.

Big Data Scientist: Someone who is able to develop the algorithms to make sense out of big data.

Business Intelligence (BI): The general term used for the identification, extraction, and analysis of data.

Cassandra: Cassandra is a distributed, open source database.

Cell phone data: Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.

Classification analysis: A systematic process for obtaining important and relevant information about data, also called metadata (data about data).

Cloud computing: A distributed computing system over a network, used for storing data off-premises.

Clustering analysis: The process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data.

Cold data storage: Storing old, rarely used data on low-power servers. Retrieving the data will take longer.

Comparative analysis: A step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.

Chukwa: Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis.

Cloud: A broad term that refers to any Internet-based application or service that is hosted remotely.

Columnar database or column-oriented database: A database that stores data by column rather than by row.
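The difference between row and column storage can be sketched in a few lines of plain Python. The `rows` data and the `to_columns` helper are hypothetical and only illustrate the layout. Real columnar databases add compression and on-disk formats that are not shown here.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"city": "Oslo", "sales": 120},
    {"city": "Lima", "sales": 80},
    {"city": "Pune", "sales": 200},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Column-oriented layout: all values of one field sit next to each other,
# so an aggregate over a single field reads only that list.
columns = to_columns(rows)
total_sales = sum(columns["sales"])
print(columns, total_sales)
```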

Comparators: You can compare keys in two ways: by implementing the interface or by implementing the RawComparator interface.

Confabulation: The act of making an intuition-based decision appear to be data-based.

Cross-channel analytics: Analysis that can attribute sales, show average order value, or show the lifetime value.

Data access: The act or method of viewing or retrieving stored data.

Dashboard: A graphical representation of the analyses performed by the algorithms.

Data aggregation: The act of collecting data from multiple sources for the purpose of reporting or analysis.

Data architecture and design: How enterprise data is structured.

Database: A digital collection of data and the structure around which the data is organized.

Database administrator (DBA): A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.

Database as a service (DaaS): A database hosted in the cloud and sold on a metered basis.

Database management system (DBMS): Software that collects and provides access to data in a structured format.

Data center: A physical facility that houses a large number of servers and data storage devices.

Data cleansing: The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.

Data collection: Any process that captures any type of data.
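A small sketch of data cleansing in plain Python follows. The record layout, the `cleanse` helper, and the `"unknown"` default are assumptions made for illustration. It covers duplicate removal, filling in missing values, and consistent formatting, but not spelling correction.

```python
def cleanse(records, default_country="unknown"):
    """Normalize emails, drop duplicate emails, and fill in missing countries."""
    seen = set()
    cleaned = []
    for record in records:
        email = record["email"].strip().lower()  # consistent formatting
        if email in seen:                        # remove duplicate entries
            continue
        seen.add(email)
        country = (record.get("country") or default_country).strip().title()
        cleaned.append({"email": email, "country": country})
    return cleaned

raw = [
    {"email": " Ann@Example.com ", "country": "norway"},
    {"email": "ann@example.com", "country": "Norway"},  # duplicate of the first
    {"email": "bo@example.com", "country": None},       # missing value
]
clean = cleanse(raw)
print(clean)
```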

Data custodian: A person responsible for the database structure and the technical environment, including the storage of data.

Data-directed decision making: Using data to support making crucial decisions.

Data exhaust: The data that a person creates as a byproduct of a common activity, for example a cell call log or web search history.

Data feed: A means for a person to receive a stream of data.

Data governance: A set of processes or rules that ensure the integrity of the data and that data management best practices are met.

Data integration: The process of combining data from different sources and presenting it in a single view.

Data integrity: The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.

Data mart: The access layer of a data warehouse used to provide data to users.

Data migration: The process of moving data between different storage types or formats, or between different computer systems.

Data mining: The process of deriving patterns or knowledge from large data sets.

Data model, data modeling: A data model defines the structure of the data. It is used to communicate between functional and technical people about the data needed for business processes, or to communicate a plan for how data is stored and accessed among application development team members.

Data point: An individual item on a graph or a chart.

Data profiling: The process of collecting statistics and information about data in an existing source.
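As a rough illustration of data profiling, the sketch below collects a few basic statistics about one column of values. The `profile` helper and the sample `ages` column are hypothetical. Real profiling tools typically report many more measures.

```python
def profile(values):
    """Collect basic statistics about one column of an existing data source."""
    present = [value for value in values if value is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

ages = [34, 29, None, 34, 51]
stats = profile(ages)
print(stats)
```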

Data quality: The measure of data to determine its worthiness for decision making, planning, or operations.

Data replication: The process of sharing information to ensure consistency between redundant sources.

Data repository: The location of permanently stored data.

Data science: A recent term that has multiple definitions, but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.

Data scientist: A practitioner of data science.

Data security: The practice of protecting data from destruction or unauthorized access.

Data set: A collection of data, typically in tabular form.

Data source: Any provider of data, for example a database or a data stream.

Data steward: A person responsible for data stored in a data field.

Data structure: A specific way of storing and organizing data.

Data visualization: A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

Data warehouse: A place to store data for the purpose of reporting and analysis.

De-identification: The act of removing all data that links a person to a particular piece of information.

Demographic data: Data relating to the characteristics of a human population.

Distributed cache: A data cache that is spread across multiple systems but works as one.

Distributed object: A software module designed to work with other distributed objects stored on other computers.

Distributed processing: The execution of a process across multiple computers connected by a computer network.

Distributed File System: Systems that offer simplified, highly available access for storing, analysing, and processing data.

Document Store Databases: Document-oriented databases that are especially designed to store, manage, and retrieve documents, also known as semi-structured data.

Document management: The practice of tracking and storing electronic documents and scanned images of paper documents.

Drill: An open source distributed system for performing interactive analysis on large-scale datasets.

Elasticsearch: An open source search engine built on Apache Lucene.

Event analytics: Shows the series of steps that led to an action.

Exabyte: One million terabytes, or 1 billion gigabytes, of information.

External data: Data that exists outside of a system.

Extract, transform, and load (ETL): A process used in data warehousing to prepare data for use in reporting or analytics.
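The three ETL steps can be sketched with Python's standard csv and sqlite3 modules. The CSV content, the `sales` table, and the helper names are assumptions made for this example. An in-memory database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

RAW_CSV = "name,amount\n alice ,10\nBOB,20\n"  # hypothetical source extract

def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy the names and convert amounts to integers."""
    return [(row["name"].strip().lower(), int(row["amount"])) for row in rows]

def load(conn, rows):
    """Load: write the prepared rows into the reporting table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_CSV)))

loaded = warehouse.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
(total,) = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()
print(loaded, total)
```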

Exploratory analysis: Finding patterns within data without standard procedures or methods.

Failover: The automatic switching to another computer or node should one fail.

Flume: Flume is a framework for populating Hadoop with data.
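Failover can be illustrated with a short sketch in which a request is retried on the next node when one fails. The `primary` and `replica` functions stand in for real servers and are purely hypothetical, as is the `call_with_failover` helper.

```python
def call_with_failover(nodes, request):
    """Try each node in order and return the first successful response."""
    last_error = None
    for node in nodes:
        try:
            return node(request)
        except ConnectionError as error:
            last_error = error  # remember the failure and switch to the next node
    raise RuntimeError("all nodes failed") from last_error

def primary(request):
    # Simulated outage on the primary node.
    raise ConnectionError("primary is down")

def replica(request):
    return f"replica handled {request}"

result = call_with_failover([primary, replica], "read:42")
print(result)
```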

Grid computing: The performing of computing functions using resources from multiple distributed systems.

Graph databases: They use graph structures (a finite set of ordered pairs of certain entities) with edges, properties, and nodes for data storage.

Hadoop: An open source software library project administered by the Apache Software Foundation.

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop.

Hive: Hive is a Hadoop-based, data-warehouse-like framework originally developed by Facebook.

In-database analytics: The integration of data analytics into the data warehouse.

In-memory database: Any database system that relies on memory for data storage.

In-memory data grid (IMDG): The storage of data in memory across multiple servers for the purpose of greater scalability and faster access or analytics.

Internet of Things: Ordinary devices that are connected to the internet at any time and anywhere via sensors.

Kafka: Kafka, developed by LinkedIn, is a distributed publish-subscribe messaging system capable of handling all data flow activity on a consumer website and processing that data.

Key Value Stores: Key value stores allow the application to store its data in a schema-less way.

Key-Value Databases: They store data with a primary key, a uniquely identifiable record, which makes lookups easy and fast.

Latency: Any delay in a response or delivery of data from one point to another.

Location analytics: Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses.
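The key-value idea can be sketched as a tiny in-memory store built on a Python dict. The `KeyValueStore` class and the key names are hypothetical. Production key-value databases add persistence and distribution, which this sketch leaves out.

```python
class KeyValueStore:
    """A minimal in-memory key-value store: one unique key per record."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # schema-less: any value shape is accepted

    def get(self, key, default=None):
        return self._data.get(key, default)  # direct lookup by primary key

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1", {"name": "Ann", "plan": "free"})
store.put("user:2", "a plain string is fine too")
store.delete("user:2")
print(store.get("user:1"), store.get("user:2"))
```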

Location data: Data that describes a geographic location.

Log file: A file that a computer, network, or application creates automatically to record events that occur during operation, for example the time a file is accessed.

Machine-generated data: Any data that is automatically created from a computer process, application, or other non-human source.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop.

Mashup: The process of combining different datasets within a single application to enhance output, for example combining demographic data with real estate listings.

Mahout: Mahout is a data mining library.

Metadata: Data about data; it gives information about what the data is about.
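The MapReduce programming model can be sketched in plain Python with the classic word count. This runs on a single machine and does not use Hadoop. The function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for illustration only.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big business", "data about data"]
pairs = [pair for document in documents for pair in map_phase(document)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map and reduce steps would run on many nodes at once, with the framework handling the shuffle between them.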

MPP database: A database optimized to work in a massively parallel processing environment.

MultiValue databases: They are primarily giant strings that are perfect for manipulating HTML and XML strings directly.

Network analysis: Viewing relationships among the nodes in terms of network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.

Object Databases: They store data in the form of objects, as used by object-oriented programming.

Object-based Image Analysis: Digital image analysis can be performed with data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.

Online analytical processing (OLAP): The process of analyzing multidimensional data using three operations: consolidation, drill-down, and slicing and dicing.

Online transactional processing (OLTP): The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.

Operational data store (ODS): A location to gather and store data from multiple sources so that more operations can be performed on it before it is sent to the data warehouse for reporting.

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as MapReduce, Pig, and Hive) and then intelligently link them to one another.

Parallel data analysis: Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time.

Parallel method invocation (PMI): Allows programming code to call multiple functions in parallel.

Parallel processing: The ability to execute multiple tasks at the same time.

Parallel query: A query that is executed over multiple system threads for faster performance.
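Parallel data analysis can be sketched by splitting a dataset into chunks, processing the chunks concurrently, and combining the partial results. The `chunked` and `parallel_sum` helpers are hypothetical. A thread pool keeps the sketch short, although for CPU-bound work in Python a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, size):
    """Break the problem into smaller components of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, chunk_size=250, workers=4):
    """Sum each chunk in a worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunked(values, chunk_size)))
    return sum(partial_sums)

total = parallel_sum(list(range(1000)))
print(total)
```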

Pattern recognition: The classification or labeling of an identified pattern in the machine learning process.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo.

Predictive analytics: Using statistical functions on one or more datasets to predict trends or future events.

Predictive modeling: The process of developing a model that will most likely predict a trend or outcome.

Public data: Public information or data sets that were created with public funding.

Query: Asking for information to answer a certain question.

Query analysis: The process of analyzing a search query for the purpose of optimizing it for the best possible result.

R: R is a language and environment for statistical computing and graphics.

Reference data: Data that describes an object and its properties. The object may be physical or virtual.

Risk analysis: The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.

Root-cause analysis: The process of determining the main cause of an event or problem.

Routing analysis: Finding the optimized routing using many different variables for a certain means of transport in order to decrease fuel costs and increase efficiency.

Scalability: The ability of a system or process to maintain acceptable performance levels as workload or scope increases.

Schema: The structure that defines the organization of data in a database system.

Search data: Aggregated data about search terms used over time.

Semi-structured data: Data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.

Sentiment analysis: The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.

Server: A physical or virtual computer that serves requests for a software application and delivers those requests over a network.

Spatial analysis: Analysing spatial data, such as geographic data or topological data, to identify and understand patterns and regularities within data distributed in geographic space.

SQL: A programming language for retrieving data from a relational database.

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores (such as relational databases and data warehouses) into Hadoop.

Storm: Storm is a free, open source system for real-time distributed computing that was open-sourced by Twitter.

Software as a service (SaaS): Application software that is used over the web by a thin client or web browser.

Storage: Any means of storing data persistently.

Storm: An open-source distributed computation system designed for processing multiple data streams in real time.

Structured data: Data that is organized by a predetermined structure.

Structured Query Language (SQL): A programming language designed specifically to manage and retrieve data from a relational database system.

Text analytics: The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.

Transactional data: Data that changes unpredictably.

Value: All that available data will create a lot of value for organizations, societies, and consumers.

Volume: The amount of data, ranging from megabytes to brontobytes.

Visualization: A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

Weather data: Real-time weather data is now widely available for organizations to use in a variety of ways.

ZooKeeper: ZooKeeper is a software project of the Apache Software Foundation. It is an open source service that provides centralized configuration and name registration for large distributed systems.

Baiju NT