Course content Introduction. Common basics of data analysis in corporate systems Chapter 1. Data warehouse and OLAP technologies Chapter 2. Data mining technologies
Chapter 1. Data warehouse and OLAP technologies Topic 1. Introduction to the analytical data processing. Topic 2. Basic technologies of analytical data processing. Topic 3. Multidimensional data presentation. Topic 4. Online analytical processing (OLAP). Topic 5. Multidimensional data presentation approaches for final user. "Drill-down" and "slice-and- dice" technologies. Topic 6. Design basics of multidimensional data model. Topic 7. Data warehouse technology. Topic 8. Data marts. Topic 9. Metadata for data warehouses. Topic 10. Data warehouses building stages.
Chapter 2. Data mining technologies Topic 1. Knowledge discovery process. Topic 2. Data mining categories Topic 3. Tasks and technologies of data mining. Topic 4. Classes description. Topic 5. Mining association rules development. Topic 6. Classification. Topic 7. Regression and time series analysis. Topic 8. Clusterization. Topic 9. Interaction of OLAP, data warehouses and data mining.
Т.Конноли, К.Бегг. А.Страчан. Базы данных. К.: Вильяме, М.Альперович. Введение в OLAP. // Базы данных, 3, Мамаев Е. MS SQL Server 2000 в подлиннике. ВНУ-СПб, С.Архипенков. Аналитические системы на базе Огасlе Ехрress ОLАР. Диалог МИФИ С.Архипенков, Д.Голубев, О.Максименко. Хранилища данных. От концепции до внедрения. М.: Диалог МИФИ, Спирли 3. Корпоративные Хранилища данных. Планирование, разработка, реализация, (том 1), Вильяме, Дюк В. Data mining: учебный курс. Питер, 2002.
ТЕХНОЛОГИИ АНАЛИТИЧЕСКОЙ ОБРАБОТКИ ДАННЫХ ( analytical data processing) многомерное представление данных (dimensional data model) онлайновая аналитическая обработка данных (OLAP) технология хранилищ данных (data warehousing) технология витрин данных (data marting) технология разработки данных (data mining)
Analytical Data Processing Dimensional Data Model OLAP (On-line Analytical Processing) Data Warehousing Data Marting Data Mining
Analytical Data Processing Analytical data processing is understood as data processing for support of strategic or administrative decision-making (where data are analytical by the nature), allowing users to analyze laws on the basis of the big volume of the data collected during the significant period of time.
The basic technologies of analytical data processing OLAP (On-line Analytical Processing - set of technologies for fast and convenient access of users to data in the special format focused on analytical use (dimensional data) Data warehousing - разработка и реализация хранилищ данных data warehouses - the integrated subject-focused analytical databases of scale of the enterprise data marting - построение витрин данных (data marts) – data marts - the databases of smaller scale constructed with use of the same technologies, as data warehouses which are usually filled with a subset of data from the big data warehouse Data mining - search of the important laws in great volumes of data with use of special technologies and tools (statistical methods, neural networks, etc.)
Dimensional Data Model Commonly used in data warehousing systems, dimensional modeling is a design technique that puts the data in a standard framework and provides easy access. A dimension model consists of one fact table and multiple dimension tables which are smaller. (Kimball). Two kinds of schemas are used when designing data models, either a star schema or a snowflake schema. star schemasnowflake schema
Многомерное (пространственное) представление данных The basic purposes of Dimensional Data Model: to organize data in the same way as they are perceived by users (the people working in business) to provide as much as possible of convenient and flexible ways of representation of the information, focused on the client
характеристики фактов: основными характеристиками фактов являются метрики. Они характеризуют способы вычисления фактов; в качестве метрик могут выступать такие понятия как сумма, процентное отношение, среднее и т.д. Метрики могут включать в себя операции над несколькими измеримыми величинами. Программное обеспечение может включать средство отображения измеримых величин в соответствии с различными метриками. факты обычно обладают свойством аддитивности, т.е. измеримая величина для определенной категории (например, за год) должна быть равна сумме величин для всех категорий нижнего уровня, составляющих данную, (например, за кварталы данного года). В последнее время появляются продукты, которые позволяют работать с неаддитивными фактами (текстом, ценами и т.п.)
Concepts of Dimensional Data Modeling To build a dimensional database, you start with a dimensional data model. The dimensional data model provides a method for making databases simple and understandable. You can conceive of a dimensional database as a database cube of three or four dimensions where users can access a slice of the database along any of its dimensions. To create a dimensional database, you need a model that lets you visualize the data.
Table 9. A Simple Relational Table ProductRegionUnit Sales FootballEast2300 FootballWest4000 FootballCentral5600 Tennis racketEast5500 Tennis racketWest8000 Tennis racketCentral2300 BaseballEast10000 BaseballWest22000 BaseballCentral34000 Table 10. A Simple Two-Dimensional Example RegionCentralEastWest Product Football Tennis Racket Baseball
Dimensions Dimension: A category of information. For example, the time dimension. Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension. Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year Quarter Month Day.
Attributes That Correspond to the Dimension Elements
Покажите общий объем продаж всех товаров для Харькова по месяцам. объем продаж (Sales) - это факт товар - размерность (Product) месяц - уровень размерности "время" (Time) Харьков - элемент размерности "место" (Location)
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.data marts
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column. Fact Table
The Fact Table The fact table stores the measures of the business and points to the key value at the lowest level of each dimension table. The measures are quantitative or factual data about the subject. The measures are generally numeric and correspond to the how much or how many aspects of a question. Examples of measures are price, product sales, product inventory, revenue, and so forth. A measure can be based on a column in a table or it can be calculated. Table shows a fact table whose measures are sums of the units sold, the revenue, and the profit for the sales of that product to that account on that day. Table
Table 11. A Fact Table with Sample Records Product Code Account Code Day CodeUnits SoldRevenueProfit
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").data warehouse
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.disk storage
OLAP - On-Line Analytical Processing (Оперативный анализ данных) - многомерный оперативный анализ данных для поддержки принятия важных решений. Short for Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views. OLAP often is used in data mining. The chief component of OLAP is the OLAP server, which sits between a client and a database management systems (DBMS). The OLAP server understands how data is organized in the database and has special functions for analyzing the data. There are OLAP servers available for nearly all the major database systems.datadatabasedata miningserverclientdatabase management systems (DBMS) OLAP
OLAP cube In database theory, an OLAP cube is an abstract representation of a projection of an RDBMS relation. Given a relation of order N, consider a projection that subtends X, Y, and Z as the key and W as the residual attribute. Characterizing this as a function,database theoryprojectionRDBMSrelation residualattributefunction W : (X,Y,Z) W, the attributes X, Y, and Z correspond to the axes of the cube, while the W value into which each ( X, Y, Z ) triple maps corresponds to the data element that populates each cell of the cube.
OLTP (online transaction processing) is a class of program that facilitates and manages transaction-oriented applications, typically for data entry and retrieval transactions in a number of industries, including banking, airlines, mailorder, supermarkets, and manufacturers. Probably the most widely installed OLTP product is IBM's CICS (Customer Information Control System). Today's online transaction processing increasingly requires support for transactions that span a network and may include more than one company. For this reason, new OLTP software uses client/server processing and brokering software that allows transactions to run on different computer platforms in a network.transactionCICSclient/server
OLTP OLTP stand for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each item must be inserted into a database. Since the performance of the database is critical, we want to maximize the speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in the database as possible.
Data warehouse In the broadest sense of the term, a data warehouse has been used to refer to a database that contains very large stores of historical data. The data is stored as a series of snapshots, in which each record represents data at a specific time. This data snapshot allows a user to reconstruct history and to make accurate comparisons between different time periods. A data warehouse integrates and transforms the data that it retrieves before it is loaded into the warehouse. A primary advantage of a data warehouse is that it provides easy access to and analysis of vast stores of information.
The term was coined by W. H. Inmon A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.
Ralph Kimball. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition Building the Data Warehouse - by W. H. Inmon Exploration Warehousing: Turning Business Information into Business Opportunity - by William H. Inmon, R. H. Terdeman, and Claudia Imhoff Claudia Imhoff
Data warehouse A database that is optimized for data retrieval. The data is not stored at the transaction level; some level of data is summarized. Unlike traditional OLTP databases, which automate day-to-day operations, a data warehouse provides a decision-support environment in which you can evaluate the performance of an entire enterprise over time. Typically, you use a relational data model to build a data warehouse. Data mart A subset of data warehouse that is stored in a smaller database and that is oriented toward a specific purpose or data subject rather than for enterprise-wide strategic planning. A data mart can contain operational data, summarized data, spatial data, or metadata. Typically, you use a dimensional data model to build a data mart.
Operational data store A subject-oriented system that is optimized for looking up one or two records at a time for decision making. An operational data store is a hybrid form of data warehouse that contains timely, current, integrated information. The data typically is of a higher level granularity than the transaction. You can use an operational data store for clerical, day-to-day decision making. This data can serve as the common source of data for data warehouses. Repository A repository combines multiple data sources into one normalized database. The records in a repository are updated frequently. Data is operational, not historical. You might use the repository for specific decision-support queries, depending on the specific system requirements. A repository fits the needs of a corporation that requires an integrated, enterprise-wide data source for operational processing.
multidimensional database (MDB) A multidimensional database (MDB) is a type of database that is optimized for data warehouse and online analytical processing (OLAP) applications. Multidimensional databases are frequently created using input from existing relational databases. Whereas a relational database is typically accessed using a Structured Query Language (SQL) query, a multidimensional database allows a user to ask questions like "How many Aptivas have been sold in Nebraska so far this year?" and similar questions related to summarizing business operations and trends. An OLAP application that accesses data from a multidimensional database is known as a MOLAP (multidimensional OLAP) application. A multidimensional database - or a multidimensional database management system (MDDBMS) - implies the ability to rapidly process the data in the database so that answers can be generated quickly. A number of vendors provide products that use multidimensional databases. Approaches to how data is stored and the user interface vary.databasedata warehouseOLAPrelational databaseSQLqueryMOLAP Conceptually, a multidimensional database uses the idea of a data cube to represent the dimensions of data available to a user. For example, "sales" could be viewed in the dimensions of product model, geography, time, or some additional dimension. In this case, "sales" is known as the measure attribute of the data cube and the other dimensions are seen as feature attributes. Additionally, a database creator can define hierarchies and levels within a dimension (for example, state and city levels within a regional hierarchy).