Sunday, February 16, 2020




A data warehouse is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components that allows the strategic use of data. A data warehouse is the electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. Data warehousing is the process of transforming data into information and making it available to users for analysis.

This makes the overall IT architecture of the enterprise more resilient to changing requirements. To Support the Reengineering of Decisional Processes: At the end of each BPR initiative come the projects required to establish the technological and organizational systems to support the newly reengineered business process.

Although reengineering projects have traditionally focused on operational processes, data warehousing technologies make it possible to reengineer decisional business processes as well.

Data warehouses, with their focus on meeting decisional business requirements, are the ideal systems for supporting reengineered decisional business processes. The concept of the data mart is causing a lot of excitement and attracts much attention in the data warehouse industry.


Mostly, data marts are presented as an inexpensive alternative to a data warehouse that takes significantly less time and money to build. However, the term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data.

The data mart is directed at a partition of data often called a subject area that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than a physically separate store of data.

In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network. It may use relational OLAP technology, which creates highly denormalized star schema relational designs or hypercubes of data for analysis by groups of users with a common interest in a limited portion of the database.

All these types of data marts, called dependent data marts because their data content is sourced from the data warehouse, have a high value: no matter how many are deployed and no matter how many different enabling technologies are used, the different users are all accessing information views derived from the same single integrated version of the data.

Unfortunately, misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts that in fact represent fragmented point solutions to a range of business problems in the enterprise. This type of implementation should rarely be deployed in the context of an overall technology or applications architecture.

Indeed, it is missing the ingredient that is at the heart of the data warehousing concept: data integration. Each independent data mart makes its own assumptions about how to consolidate the data, and the data across several data marts may not be consistent.

As a result, an environment is created in which multiple operational systems feed multiple non-integrated data marts that are often overlapping in data content, job scheduling, connectivity, and management.

In other words, the complex many-to-one problem of building a data warehouse from operational and external data sources is transformed into a many-to-many sourcing and management nightmare.
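The difference between the many-to-one and many-to-many cases can be put in rough numbers. The sketch below is only illustrative: it assumes the worst case in which every independent data mart draws from every operational source, and the function names and counts are hypothetical rather than taken from the text.

```python
def extract_paths_independent(num_sources: int, num_marts: int) -> int:
    """Worst case: every independent mart builds its own feed from every source."""
    return num_sources * num_marts


def extract_paths_with_warehouse(num_sources: int, num_marts: int) -> int:
    """Each source feeds the warehouse once; each dependent mart is fed by the warehouse."""
    return num_sources + num_marts


if __name__ == "__main__":
    sources, marts = 5, 4  # hypothetical counts
    print(extract_paths_independent(sources, marts))     # 20
    print(extract_paths_with_warehouse(sources, marts))  # 9
```

Under these assumptions the number of feeds to build and schedule grows multiplicatively with independent marts, but only additively when the marts are fed from a single warehouse.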

But, as usage begets usage, the initial small data mart needs to grow. It is clear that the point-solution independent data mart is not necessarily a bad thing; it is often a necessary and valid solution to a pressing business problem, achieving the goal of rapid delivery of enhanced decision support functionality to end users.

To address the data integration issues associated with data marts, the approach recommended by Ralph Kimball is as follows. For any two data marts in an enterprise, the common dimensions must conform to the equality and roll-up rule, which states that these dimensions are either the same or that one is a strict roll-up of another.

The time dimensions from both data marts might be at the individual day level, or, conversely, one time dimension is at the day level but the other is at the week level. Because days roll up to weeks, the two time dimensions are conformed. The time dimensions would not be conformed if one time dimension were weeks and the other time dimension, a fiscal quarter.

The resulting data marts could not usefully coexist in the same application. In summary, data marts present two problems: (1) scalability in situations where an initial small data mart grows quickly in multiple dimensions and (2) data integration.
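The equality and roll-up rule for time dimensions can be sketched as a small check. This is a minimal illustration, not Kimball's own formulation: the `DIRECT_ROLLUPS` table is a hypothetical grain hierarchy that assumes days roll up to weeks and months, and months to fiscal quarters, while weeks roll up to nothing because they straddle month and quarter boundaries.

```python
# Hypothetical grain hierarchy: each grain maps to the coarser grains
# it aggregates into exactly. Weeks do not nest inside quarters.
DIRECT_ROLLUPS = {
    "day": {"week", "month"},
    "week": set(),
    "month": {"fiscal_quarter"},
    "fiscal_quarter": set(),
}


def rolls_up_to(fine: str, coarse: str) -> bool:
    """True if `fine` aggregates into `coarse`, directly or transitively."""
    stack = list(DIRECT_ROLLUPS.get(fine, ()))
    seen = set()
    while stack:
        grain = stack.pop()
        if grain == coarse:
            return True
        if grain not in seen:
            seen.add(grain)
            stack.extend(DIRECT_ROLLUPS.get(grain, ()))
    return False


def is_conformed(grain_a: str, grain_b: str) -> bool:
    """Equality and roll-up rule: same grain, or one rolls up to the other."""
    return (
        grain_a == grain_b
        or rolls_up_to(grain_a, grain_b)
        or rolls_up_to(grain_b, grain_a)
    )
```

With this table, `is_conformed("day", "week")` holds, whereas `is_conformed("week", "fiscal_quarter")` does not, matching the examples in the text.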

Therefore, when designing data marts, organizations should pay close attention to system scalability, data consistency, and manageability issues. The key to a successful data mart strategy is the development of an overall scalable data warehouse architecture, and a key step in that architecture is identifying and implementing the common dimensions.

A number of misconceptions exist about data marts and their relationships to data warehouses. Two of those misconceptions are discussed below.


Each data mart is free to exist within its own isolated world, and all the users are happy. Unfortunately, what enterprises fail to realize until much later is that by deploying one isolated data mart after another, the enterprise has actually created new islands of automation.

While at the outset those data marts are certainly easier to develop, the task of maintaining many unrelated data marts is exceedingly complex and will create data management, synchronization, and consistency issues. Multiple data marts are definitely appropriate within an organization, but these should be implemented only under the integrating framework of an enterprise-wide data warehouse. Each data mart is developed as an extension of the data warehouse and is fed by the data warehouse.

The data warehouse enforces a consistent set of business rules and ensures the consistent use of terms and definitions. Although both technologies support the decisional information needs of enterprise decision-makers, the two are distinctly different and are deployed to meet different types of decisional information needs. Inmon, C. Imhoff, and G. Unlike the databases of OLTP applications (that are operational or function oriented), the Operational Data Store contains subject-oriented, enterprise-wide data.

However, unlike data warehouses, the data in Operational Data Stores are volatile, current and detailed. However, some significant challenges of the ODS still remain.

Table 2. The ODS provides an integrated view of the data in the operational systems. Data are transformed and integrated into a consistent, unified whole as they are obtained from legacy and other operational systems to provide business users with an integrated and current view of operations.

Flash Monitoring and Reporting Tools As mentioned in Chapter 1, flash monitoring and reporting tools are like a dashboard that provides meaningful online information on the operational status of the enterprise. Operational Monitoring Relationship of Operational Data Stores to Data Warehouse Enterprises with Operational Data Stores find themselves in the enviable position of being able to deploy data warehouses with considerable ease.

Since operational data stores are integrated, many of the issues related to extracting, transforming, and transporting data from legacy systems have been addressed by the ODS, as illustrated in Figure 2.

The ODS is free to focus only on the current state of operations and is constantly updated in real time. Although the task of calculating ROI for data warehousing initiatives is unique to each enterprise, it is possible to classify the type of benefits and costs that are associated with data warehousing.

Benefits Data warehousing benefits can be expected from the following areas: The quantification of such costs in terms of staff hours and erroneous data may yield surprising results. Benefits of this nature, however, are typically minimal, since warehouse maintenance and enhancements require staff as well. At best, staff will be redeployed to more productive tasks.

Analysts go through several steps in their day-to-day work. Unfortunately, much of the time (sometimes up to 40 percent) spent by enterprise analysts on a typical day is devoted to locating and retrieving data.

The availability of integrated, readily accessible data in the data warehouse should significantly reduce the time that analysts spend on data collection tasks and increase the time available to actually analyze the data they have collected. This leads either to shorter decision cycle times or to improvements in the quality of the analysis. The most significant business improvements in warehousing result from the analysis of warehouse data, especially if the easy availability of information yields insights heretofore unknown to the enterprise.

The goal of the data warehouse is to meet decisional information needs. It follows naturally that the greatest benefits of warehousing are obtained when decisional information needs are actually met and sound business decisions are made at both the tactical and strategic levels.

Understandably, such benefits are more significant and, therefore, more difficult to project and quantify. Costs: Data warehousing costs typically fall into one of four categories.

Hardware: This item refers to the costs associated with setting up the hardware and operating environment required by the data warehouse. In many instances, this setup may require the acquisition of new equipment or the upgrade of existing equipment. Larger warehouse implementations naturally imply higher hardware costs.

Software: This item refers to the costs of purchasing the licenses to use software products that automate the extraction, cleansing, loading, retrieval, and presentation of warehouse data.

Services: This item refers to services provided by systems integrators, consultants, and trainers during the course of a data warehouse project.

Internal staff: This item refers to costs incurred by assigning internal staff to the data warehousing effort, as well as to costs associated with training internal staff on new technologies and techniques.

ROI Considerations: The costs and benefits associated with data warehousing vary significantly from one enterprise to another.

The effect of data warehousing on the tactical and strategic management of an enterprise is often likened to cleaning the muddy windshield of a car. It is difficult to quantify the value of driving a car with a cleaner windshield. Similarly, it is difficult to quantify the value of managing an organization with better information and insight. Lastly, it is important to note that data warehouse justification is often complicated by the fact that much of the benefit may take some time to realize and is therefore difficult to quantify in advance.
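Where a benefit estimate is available at all, the ROI arithmetic itself is simple; the hard part, as noted above, is the benefit figure. The sketch below uses the standard definition ROI = (benefit - cost) / cost over the four cost categories. All figures and names are hypothetical and serve only to show the calculation.

```python
def total_cost(costs: dict[str, float]) -> float:
    """Sum the cost categories (hardware, software, services, internal staff)."""
    return sum(costs.values())


def roi(estimated_benefit: float, costs: dict[str, float]) -> float:
    """Return on investment as a fraction: (benefit - cost) / cost."""
    cost = total_cost(costs)
    if cost <= 0:
        raise ValueError("total cost must be positive")
    return (estimated_benefit - cost) / cost


if __name__ == "__main__":
    # Hypothetical pilot-project figures, not taken from the text.
    pilot_costs = {
        "hardware": 500_000,
        "software": 200_000,
        "services": 300_000,
        "internal_staff": 200_000,
    }
    print(roi(1_500_000, pilot_costs))  # 0.25, i.e. a 25 percent return
```

Because the benefit estimate is uncertain, it may be more honest to run such a calculation over a range of benefit values than to report a single number.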

In Summary Data warehousing technologies have evolved as a result of the unsatisfied decisional information needs of enterprises. With the increased stability of operational systems, information technology professionals have increasingly turned their attention to meeting the decisional requirements of the enterprise.

A data warehouse, according to Bill Inmon, is a collection of integrated, subject-oriented databases designed to supply the information required for decision-making. Each data item in the data warehouse is relevant to some moment in time. A data mart has traditionally been defined as a subset of the enterprise-wide data warehouse.

Many enterprises, upon realizing the complexity involved in deploying a data warehouse, will opt to deploy data marts instead. Although data marts are able to meet the immediate needs of a targeted group of users, the enterprise should shy away from deploying multiple, unrelated data marts.

The presence of such islands of information will only result in data management and synchronization problems. Like data warehouses, Operational Data Stores are integrated and subject-oriented. However, an ODS is always current and is constantly updated ideally in real time. The Operational Data Store is the ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time.

Although data warehouses have proven to have significant returns on investment, particularly when they are meeting a specific, targeted business need, it is extremely difficult to quantify the expected benefits of a data warehouse. The costs are easier to calculate, as these break down simply into hardware, software, services, and in-house staffing costs. PEOPLE Although a number of people are involved in a single data warehousing project, there are three key roles that carry enormous responsibilities.

Negligence in carrying out any of these three roles can easily derail a well-planned data warehousing initiative. This section of the book therefore focuses on the Project Sponsor, the Chief Information Officer, and the Project Manager and seeks to answer the questions frequently asked by individuals who have accepted the responsibilities that come with these roles.

Every data warehouse initiative has a Project Sponsor, a high-level executive who provides strategic guidance, support, and direction to the data warehousing project. The Project Sponsor ensures that project objectives are aligned with enterprise objectives, resolves organizational issues, and usually obtains funding for the project.

The CIO is responsible for the effective deployment of information technology resources and staff to meet the strategic, decisional, and operational information requirements of the enterprise. Data warehousing, with its accompanying array of new technology and its dependence on operational systems, naturally makes strong demands on the physical and human resources under the jurisdiction of the CIO, not only during design and development but also during maintenance and subsequent evolution.

The warehouse Project Manager is responsible for all technical activities related to implementing a data warehouse.

Ideally, an IT professional from the enterprise fulfills this critical role. It is not unusual, however, for this role to be outsourced for early or pilot projects, because of the newness of warehousing technologies and techniques. This chapter attempts to provide answers to questions frequently asked by Project Sponsors. It is naive to expect an immediate change to the decision-making processes in an organization when a data warehouse first goes into production.

End users will initially be occupied more with learning how to use the data warehouse than with changing the way they obtain information and make decisions. It is also likely that the first set of predefined reports and queries supported by the data warehouse will differ little from existing reports. Decision-makers will experience varying levels of initial difficulty with the use of the data warehouse; proper usage assumes a level of desktop computing skills, data knowledge, and business knowledge.

Desktop Computing Skills: Not all business users are familiar and comfortable with desktop computers, and it is unrealistic to expect all the business users in an organization to make direct, personal use of the front-end warehouse tools. On the other hand, there are power users within the organization who enjoy using computers, love spreadsheets, and will quickly push the tools to the limit with their queries and reporting requirements.

Data Knowledge It is critical that business users be familiar with the contents of the data warehouse before they make use of it.

In many cases, this requirement entails extensive communication on two levels. First, the scope of the warehouse must be clearly communicated to properly manage user expectations about the type of information they can retrieve, particularly in the earlier rollouts of the warehouse.

Second, business users who will have direct access to the data warehouse must be trained on the use of the selected front-end tools and on the meaning of the warehouse contents. The answers that the warehouse will provide are only as good as the questions that are directed to it. As end users gain confidence both in their own skills and in the veracity of the warehouse contents, data warehouse usage and overall support of the warehousing initiative will increase.

As the data scope of the warehouse increases and additional standard reports are produced from the warehouse data, decision-makers will start feeling overwhelmed by the number of standard reports that they receive. Decision-makers either gradually want to lessen their dependence on the regular reports or want to start relying on exception reporting or highlighting, and alert systems. For example, instead of receiving sales reports per region for all regions within the company, a sales executive may instead prefer to receive sales reports for areas where actual sales figures are either 10 percent more or less than the budgeted figures.

Alert Systems Alert systems also follow the same principle, that of highlighting or bringing to the fore areas or items that require managerial attention and action. However, instead of reports, decision-makers will receive notification of exceptions through other means, for example, an e-mail message.
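Exception reporting and alerting of the kind described above can be sketched in a few lines. This is an illustration only: the `RegionSales` record, the 10 percent default threshold taken from the sales example, and the message wording are assumptions, and the alert is merely formatted as text rather than sent by e-mail.

```python
from dataclasses import dataclass


@dataclass
class RegionSales:
    region: str
    actual: float
    budget: float


def variance(row: RegionSales) -> float:
    """Relative deviation of actual sales from budget (0.10 means 10 percent over)."""
    return (row.actual - row.budget) / row.budget


def exception_report(rows: list[RegionSales], threshold: float = 0.10) -> list[RegionSales]:
    """Keep only regions whose actual sales deviate from budget by at least `threshold`."""
    return [row for row in rows if abs(variance(row)) >= threshold]


def format_alert(row: RegionSales) -> str:
    """Build the text of a notification; delivery (e.g. e-mail) is left out of this sketch."""
    direction = "above" if row.actual > row.budget else "below"
    return f"{row.region}: sales {abs(variance(row)):.0%} {direction} budget"


if __name__ == "__main__":
    sales = [
        RegionSales("North", 120.0, 100.0),
        RegionSales("South", 98.0, 100.0),
        RegionSales("East", 85.0, 100.0),
    ]
    for row in exception_report(sales):
        print(format_alert(row))
```

Only the North and East regions appear in the output here; the South region, within 10 percent of budget, is suppressed, which is the point of exception reporting.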

As the warehouse gains acceptance, decision-making styles will evolve from the current practice of waiting for regular reports from IT or MIS to using the data warehouse to understand the current status of operations and, further, to using the data warehouse as the basis for strategic decision-making. At the most sophisticated level of usage, a data warehouse will allow senior management to understand and drive the business changes needed by the enterprise.

A successful enterprise-wide data warehouse effort will improve financial, marketing and operational processes through the simple availability of integrated data views.

Previously unavailable perspectives of the enterprise will increase understanding of cross-functional operations. The integration of enterprise data results in standardized terms across organizational units. A common set of metrics for measuring performance will emerge from the data warehousing effort. Communication among these different groups will also improve. The very process of consolidation requires the use of a common vocabulary and increased understanding of operations across different groups in the organization.

While financial processes will improve because of the newly available information, it is important to note that the warehouse can provide information based only on available data. For example, one of the most popular banking applications for data warehousing is profitability analysis. Unfortunately, enterprises may encounter a rude shock when it becomes apparent that revenues and costs are not tracked at the same level of detail within the organization.

Banks frequently track their expenses at the level of branches or organization units but wish to compute profitability on a per-customer basis. With revenue figures at the customer level and costs at the branch level, there is no direct way to compute profit. As a result, enterprises may resort to formulas that allow them to compute or derive cost and revenue figures at the same level for comparison purposes. Marketing: Data warehousing supports marketing organizations by providing a comprehensive view of each customer and his many relationships with the enterprise.
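To make the bank profitability example above concrete, the sketch below derives customer-level costs from a branch-level cost figure. The allocation formula (each customer bears branch cost in proportion to the revenue they generate) is an assumption chosen for illustration; the text does not say which formula an enterprise would use, and other drivers such as transaction counts are equally plausible.

```python
def allocate_branch_cost(branch_cost: float, customer_revenue: dict[str, float]) -> dict[str, float]:
    """Split a branch-level cost across customers in proportion to their revenue (assumed driver)."""
    total_revenue = sum(customer_revenue.values())
    if total_revenue <= 0:
        raise ValueError("total revenue must be positive to allocate costs")
    return {
        customer: branch_cost * revenue / total_revenue
        for customer, revenue in customer_revenue.items()
    }


def customer_profit(branch_cost: float, customer_revenue: dict[str, float]) -> dict[str, float]:
    """Customer-level profit = customer revenue minus allocated share of branch cost."""
    allocated = allocate_branch_cost(branch_cost, customer_revenue)
    return {
        customer: revenue - allocated[customer]
        for customer, revenue in customer_revenue.items()
    }


if __name__ == "__main__":
    # Hypothetical figures: one branch with cost 600 and two customers.
    print(customer_profit(600, {"A": 750, "B": 250}))  # {'A': 300.0, 'B': 100.0}
```

Any such derived figure reflects the allocation rule as much as the underlying data, so the rule should be agreed upon and documented along with the warehouse contents.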

Over the years, marketing efforts have shifted in focus. Customers are no longer viewed as individual accounts but instead are viewed as individuals with multiple accounts.

This change in perspective provides the enterprise with cross-selling opportunities. The notion of customers as individuals also makes possible the segmentation and profiling of customers to improve target-marketing efforts.

The availability of historical data makes it possible to identify trends in customer behavior, hopefully with positive results in revenue. Operations: By providing enterprise management with decisional information, data warehouses have the potential of greatly affecting the operations of an enterprise by highlighting both problems and opportunities that heretofore went undetected.

Strategic or tactical decisions based on warehouse data will naturally affect the operations of the enterprise. It is in this area that the greatest return on investment and, therefore, greatest improvement can be found. As mentioned in Chapter 2, return on investment (ROI) from data warehousing projects varies from organization to organization and is quite difficult to quantify prior to a warehousing initiative. However, a common list of problems encountered by enterprises can be identified as a result of unintegrated customer data and lack of historical data.

A properly deployed data warehouse can solve the problems, as discussed below. Customers are annoyed by requests for the same information by different units within the same enterprise. The inconsistent use of terms results in different business rules for the same item. Decision-makers have to rely on conflicting data and may lose credibility with customers, suppliers, or partners. Data gathering is ad hoc, inconsistent, and manually performed. There are no formal rules to govern the creation of these reports.

Business analysts within the organization spend more time collecting data instead of analyzing data. Competitors with more sophisticated means of producing similar reports have a considerable advantage. Analysis for trends and causal relationships are possible.

The enterprise is unable to anticipate events and behave proactively or aggressively. Customer demands come as a surprise, and the enterprise must scramble to react. Marketing campaigns can be predictive in nature, based on historical data. Enterprise Emphasis on Customer and Product Profitability Increase the focus and efficiency of the enterprise by gaining a better understanding of its customers and products.

Perceived Need Outside the IT Group: Data warehousing is sought and supported by business users who demand integrated data for decision-making. A true business need, not technological experimentation, drives the initiative. Integrated Data: The enterprise lacks a repository of integrated and historical data that are required for decision-making.

Cost of Current Efforts: The current cost of producing standard, regular managerial reports is typically hidden within an organization. A study of these costs can yield unexpected results that help justify the data warehouse initiative. The Competition Does It: Just because competitors are going into data warehousing does not mean that an enterprise should plunge headlong into it. However, the fact that the competition is applying data warehousing technology should make any manager stop and see whether data warehousing is something that his own organization needs.

The costs associated with developing and implementing a data warehouse typically fall into the categories described below: Hardware Warehousing hardware can easily account for up to 50 percent of the costs in a data warehouse pilot project.

A separate machine or server is often recommended for data warehousing so as not to burden operational IT environments. Hardware costs are generally higher at the start of the data warehousing initiative due to the purchase of new hardware. However, data warehouses grow quickly, and subsequent extensions to the warehouse may quickly require hardware upgrades. As the warehouse grows both in volume and in scope, subsequent investments in hardware can be made as needed. Software: Software refers to all the tools required to create, set up, configure, populate, manage, use, and maintain the data warehouse.

The data warehousing tools currently available from a variety of vendors are staggering in their features and price range (Chapter 11 provides an overview of these tools). Each enterprise will be best served by a combination of tools, the choice of which is determined or influenced not only by the features of the software but also by the current computing environment of the operational system, as well as the intended computing environment of the warehouse.

Services Services from consultants or system integrators are often required to manage and integrate the disparate components of the data warehouse. The use of consultants is also popular, particularly with early warehousing implementations, when the enterprise is just learning about data warehousing technologies and techniques.

Service-related costs can account for roughly 30 percent to 35 percent of the overall cost of a pilot project but may drop as the enterprise decreases its dependence on external resources. Internal Staff Internal staff costs refer to costs incurred as a result of assigning enterprise staff to the warehousing project. The staff could otherwise have been assigned to other activities. The heaviest demands are on the time of the IT staff who have the task of planning, designing, building, populating, and managing the warehouse.

The participation of end users, typically analysts and managers, is also crucial to a successful warehousing effort. The Project Sponsor, the CIO, and the Project manager will also be heavily involved because of the nature of their roles in the warehousing initiative. Table 3. The typical risks encountered on data warehousing projects fall into the following categories: These risks relate either to the project team structure and composition or to the culture of the enterprise.

These risks relate to the planning, selection, and use of warehousing technologies. Technological risks also arise from the existing computing environment, as well as the manner by which warehousing technologies are integrated into the existing enterprise IT architecture. These risks are true of most technology projects but are particularly dangerous in data warehousing because of the scale and scope of warehousing projects. Data warehousing requires a new set of design techniques that differ significantly from the well-accepted practices in OLTP system development.

Considering its scope and scale, the warehousing initiative should be business driven; otherwise, the organization will view the entire effort as a technology experiment. A strong Project Sponsor is required to address and resolve organizational issues before these have a chance to derail the project. The Project Sponsor must be someone who will be a user of the warehouse, someone who can publicly assume responsibility for the warehousing initiative, and someone with sufficient clout.

This role cannot be delegated to a committee. Unfortunately, many an organization chooses to establish a data warehouse steering committee to take on the collective responsibility of this role.

If such a committee is established, the head of the committee may by default become the Project Sponsor. Unlike OLTP business requirements, which tend to be exact and transaction based, data warehousing requirements are moving targets and are subject to constant change. Despite this, the intended warehouse end users should be interviewed to provide an understanding of the types of queries and reports (query profiles) they require.

By talking to the different users, the warehousing team also gains a better understanding of the IT literacy of the users (user profiles) they will be serving and of the types of data access and retrieval tools that each user is more likely to use. The end-user community also provides the team with the security requirements (access profiles) of the warehouse. These business requirements are critical inputs to the design of the data warehouse.

Senior Management Expectations Not Managed Because of the costs, data warehousing always requires a go-signal from senior management, often obtained after a long, protracted ROI presentation.

In their bid to obtain senior management support, warehousing supporters must be careful not to overstate the benefits of the data warehouse, particularly during requests for budgets and business case presentations. Raising senior management expectations beyond manageable levels is one sure way to court extremely embarrassing and highly visible disasters.

End-User Community Expectations not Managed Apart from managing senior management expectations, the warehousing team must, in the same manner, manage the expectations of their end users. Warehouse analysts must bear in mind that the expectations of end users are immediately raised when their requirements are first discussed. The warehousing team must constantly manage these expectations by emphasizing the phased nature of data warehouse implementation projects and by clearly identifying the intended scope of each data warehouse rollout.

Political Issues Attempts to produce integrated views of enterprise data are likely to raise political issues. In other enterprises, the various units want to have as little to do with warehousing as possible, for fear of having warehousing costs allocated to their units. Understandably, the unique combination of culture and politics within each enterprise will exert its own positive and negative influences on the warehousing effort.

A number of factors increase the logistical overhead in data warehousing.


A few among them are as follows. Highly formal organizations generally have higher logistical overhead because of the need to comply with pre-established methods for getting things done. Elaborate chains of command likewise may cause delays or may require greater coordination efforts to achieve a given result.

Logistical delays also arise from geographical distribution, as in the case of multi-branch banks, nationwide operations or international corporations. Multiple, stand-alone applications with no centralized data store have the same effect.

Moving data from one location to another without the benefit of a network or a transparent connection is difficult and will add to logistical overhead. Technological Inappropriate Use of Warehousing Technology A data warehouse is an inappropriate solution for enterprises that need operational integration on a real-time, online basis.

An operational data store (ODS) is the ideal solution to needs of this nature. Multiple unrelated data marts are likewise not the appropriate architecture for meeting enterprise decisional information needs. All data warehouse and data mart projects should remain under a single architectural framework.

Poor Data Quality of Operational Systems

When the data quality of the operational systems is suspect, the team will, by necessity, devote much of its time and effort to data scrubbing and data quality checking.

Poor data quality also adds to the difficulties of extracting, transforming, and loading data into the warehouse. The importance of data quality cannot be overstated. Warehouse end users will not make use of the warehouse if the information they retrieve is wrong or of dubious quality. The perception of lack of data quality, whether such a perception is true or not, is all that is required to derail a data warehousing initiative.

Inappropriate End-user Tools

The wide range of end-user tools provides data warehouse users with different levels of functionality and requires different levels of IT sophistication from the user community. Providing senior management users with inappropriate tools is one of the quickest ways to kill enthusiasm for the data warehouse effort. Likewise, power users will quickly become disenchanted with simple data access and retrieval tools.

Over Dependence on Tools to Solve Data Warehousing Problems

The data warehouse solution should not be built around tools or sets of tools. Most of the warehousing tools e. What enterprises soon realize in their first warehousing project is that much of the effort in a warehousing project still cannot be automated.

Manual Data Capture and Conversion Requirements

The extraction process is highly dependent on the extent to which data are available in the appropriate electronic format. In cases where the required data simply do not exist in any of the operational systems, a warehousing team may find itself resorting to the strongly discouraged practice of using data capture screens to obtain data through manual encoding operations. Unfortunately, a data warehouse quite simply cannot be filled up through manual data encoding, which limits the data that will be available in the warehouse.

Technical Architecture and Networking

Study and monitor the impact of the data warehouse development and usage on the network infrastructure. Verify assumptions about batch windows, middleware, extract mechanisms, and the like.

Project Management

Defining Project Scope Inappropriately

The mantra for data warehousing should be "start small and build incrementally." Organizations that prefer the big-bang approach quickly find themselves on the path to certain failure. Monolithic projects are unwieldy and difficult to manage, especially when the warehousing team is new to the technology and techniques.

In contrast, the phased, iterative approach has consistently proven itself to be effective, not only in data warehousing but also in most information technology initiatives. Each phase has a manageable scope, requires a smaller team, and lends itself well to a coaching and learning environment. The lessons learned by the team on each phase are a form of direct feedback into subsequent phases.

Underestimating Project Time Frame

Estimates in data warehousing projects often fail to devote sufficient time to the extraction, integration, and transformation tasks. Figure 3. The project team should therefore work on stabilizing the back-end of the warehouse as quickly as possible.

The front-end tools are useless if the warehouse itself is not yet ready for use.

Underestimating Project Overhead

Time estimates in data warehousing projects often fail to consider delays due to logistics. Keep an eye on the lead time for hardware delivery, especially if the machine is yet to be imported into the city or country. Quickly determine the acquisition time for middleware or warehousing tools. Watch out for logistical overhead. Allocate sufficient time for team orientation and training prior to and during the course of the project to ensure that everyone remains aligned.

Devote sufficient time and effort to creating and promoting effective communication within the team.

[Figure: Typical Effort Distribution on a Warehousing Project]

Losing Focus

The data warehousing effort should be focused entirely on delivering the essential minimal characteristics (EMCs) of each phase of the implementation.

It is easy for the team to be distracted by requests for nonessential or low-priority features i. These should be ruthlessly deferred to a later phase; otherwise, valuable project time and effort will be frittered away on nonessential features, to the detriment of the warehouse scope or schedule.

End users will need continuous training and support, especially as new users are gradually granted access to the warehouse. Collect warehouse usage and query statistics to get an idea of warehouse acceptance and to obtain inputs for database optimization and tuning.

Plan subsequent phases or rollouts of the warehouse, taking into account the lessons learned from the first rollout. Allocate, acquire, or train the appropriate resources for support activities. Unfortunately, data warehousing requires design strategies that are very different from the design strategies for transactional, operational systems.

For example, OLTP databases are fully normalized and are designed to consistently store operational data, one transaction at a time. In direct contrast, a data warehouse requires database designs that even business users find directly usable. Dimensional or star schemas with highly denormalized dimension tables on relational technology require different design techniques and different indexing strategies. Data warehousing may also require the use of hypercubes or multidimensional database technology for certain functions and users.
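To make the contrast with a normalized OLTP design concrete, here is a minimal sketch of a star schema using Python's built-in `sqlite3` module. The table names, columns, and sample rows are all hypothetical and chosen only for illustration; a real warehouse would use its own subject areas and a production database engine. Note how `dim_product` stores the category and department directly on each product row (denormalized), and how the fact table is indexed on its dimension keys.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by two denormalized dimension tables."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key  INTEGER PRIMARY KEY,
            product_name TEXT,
            category     TEXT,   -- denormalized: kept on the product row itself
            department   TEXT
        );
        CREATE TABLE dim_date (
            date_key      INTEGER PRIMARY KEY,  -- e.g. 20200115
            calendar_date TEXT,
            month         TEXT,
            year          INTEGER
        );
        CREATE TABLE fact_sales (
            date_key     INTEGER REFERENCES dim_date(date_key),
            product_key  INTEGER REFERENCES dim_product(product_key),
            units_sold   INTEGER,
            sales_amount REAL
        );
        -- Index the foreign keys that analytical queries join and filter on.
        CREATE INDEX idx_fact_sales_date ON fact_sales(date_key);
        CREATE INDEX idx_fact_sales_product ON fact_sales(product_key);
    """)


def load_sample(conn):
    """Insert a handful of made-up rows so the query below has data."""
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?, ?)",
        [(1, "Espresso", "Coffee", "Beverages"),
         (2, "Green Tea", "Tea", "Beverages")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
        [(20200115, "2020-01-15", "2020-01", 2020),
         (20200116, "2020-01-16", "2020-01", 2020),
         (20200201, "2020-02-01", "2020-02", 2020)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(20200115, 1, 10, 30.0),
         (20200116, 1, 5, 15.0),
         (20200201, 1, 2, 6.0),
         (20200115, 2, 4, 8.0)],
    )


def sales_by_category_and_month(conn):
    """A typical decisional query: one join per dimension, then aggregate."""
    return conn.execute("""
        SELECT p.category, d.month, SUM(f.sales_amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_sample(connection)
    for row in sales_by_category_and_month(connection):
        print(row)
```

Because every dimension attribute a user might group by sits one join away from the fact table, the query stays short and readable, which is the usability property the dimensional design is after.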

To get the most value out of the system, the most detailed data required by users should be loaded into the data warehouse. The degree to which users can slice and dice through the data warehouse is entirely dependent on the granularity of the facts. Too high a grain makes detailed reports or queries impossible to produce.

Too low a grain unnecessarily increases the space requirements and the cost of the data warehouse.

Not Defining Strategies for Key Database Design Issues

The suitability of the warehouse design significantly impacts the size, performance, integrity, future scalability, and adaptability of the warehouse. Outline or high-level warehouse designs may overlook the demands of slowly changing dimensions, large dimensions, and key generation requirements among others.

Having the appropriate leaders setting the tone, scope, and direction of a data warehousing initiative can spell the difference between failure and success. The enterprise must verify that a data warehouse is the appropriate solution to its needs.

If the need is for operational integration, then an Operational Data Store is more appropriate. The entire data warehousing effort must be phased so that the warehouse can be iteratively extended in a cost-justified and prioritized manner. A number of prioritized areas should be delivered first; subsequent areas are implemented in incremental steps. Work on nonurgent components is deferred. Obtain feedback from users as each rollout or phase is completed, and as users make use of the data warehouse and the front-end tools.

Any feedback should serve as inputs to subsequent rollouts. With each new rollout, users are expected to specify additional requirements and gain a better understanding of the types of queries that are now available to them. Each phase of the project should be conducted in a manner that promotes evolution, adaptability, and scalability. An overall data warehouse architecture should be defined when a high-level understanding of user needs has been obtained and the phased implementation path has been studied.

The data warehouse design must address slowly changing dimensions, aggregation, key generalization, heterogeneous facts and dimensions, and minidimensions.

These dimensional modeling concerns are addressed in Chapter Although there are no hard-and-fast rules for determining when your organization is ready to launch a data warehouse initiative, the following positive signs are good clues: The performance measures and reward mechanisms are likely to change, and they bring about corresponding changes to the processes and the culture of the organization.

Individuals who have an interest in preserving the status quo are likely to resist the data warehousing initiative, once it becomes apparent that such technologies enable organizational change.

Users Clamor for Integrated Decisional Data

A data warehouse is likely to get strong support from both the IT and user community if there is a strong and unsatisfied demand for integrated decisional data (as opposed to integrated operational data). It would be foolish to try using data warehousing technologies to meet operational information needs.

The Operational Systems are Fairly Stable

An IT department, division, or unit that continuously fights fires on unstable operational systems will quickly deprioritize the data warehousing effort.

Organizations will almost always defer the warehousing effort in favor of operational concerns. After all, the enterprise has survived without a data warehouse for years; another few months will not hurt. When the operational systems are up in production and are fairly stable, there are internal data sources for the warehouse and a data warehouse initiative will be given higher priority.

There is Adequate Funding

A data warehouse project cannot afford to fizzle out in the middle of the effort due to a shortage of funds. Be aware of long-term funding requirements beyond the first data warehouse rollout before starting on the pilot project.

Data warehousing results come in different forms and can, therefore, be measured in one or more of the following ways: The extent to which these reports and queries actually contribute to more informed decisions and the translation of these informed decisions to bottom line benefits may not be easy to trace, however.

Senior managers can also get the information they need directly, thus improving the security and confidentiality of such information. Turnaround time for decision-making is dramatically reduced. In the past, decisionmakers in meetings either had to make an uninformed decision or table a discussion item because they lacked information. The ability of the data warehouse to quickly provide needed information speeds up the decision-making process.

Timely Alerts and Exception Reporting

The data warehouse proves its worth each time it sounds an alert or highlights an exception in enterprise operations. Early detection makes it possible to avert or correct potentially major problems and allows decision-makers to exploit business situations with small or immediate windows of opportunity.

Number of Active Users

The number of active users provides a concrete measure for the usage and acceptance of the warehouse.

Frequency of Use

The number of times a user actually logs on to the data warehouse within a given time period e. Frequent usage is a strong indication of warehouse acceptance and usability. An increase in usage indicates that users are asking questions more frequently. Tracking the time of day when the data warehouse is frequently used will also indicate peak usage hours.

Session Times

The length of time a user spends each time he logs on to the data warehouse shows how much the data warehouse contributes to his job.

Query Profiles

The number and types of queries users make provide an idea of how sophisticated the users have become. As the queries become more sophisticated, users will most likely request additional functionality or increased data scope.

This metric also provides the Warehouse Database Administrator DBA with valuable insight as to the types of stored aggregates or summaries that can further optimize query performance. It also indicates which tables in the warehouse are frequently accessed.

Conversely, it also allows the warehouse DBA to identify tables that are hardly used and therefore are candidates for purging.

Change Requests

An analysis of users' change requests can provide insight into how well users are applying the data warehouse technology. Unlike most IT projects, a high number of data warehouse change requests is a good sign; it implies that users are discovering more and more how warehousing can contribute to their jobs.
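The usage measures described above (number of active users, frequency of use, session times, and table access counts from query profiles) can be derived from a simple session log. The following is only a sketch under stated assumptions: the `Session` record layout, the user names, and the table names are hypothetical, and a real warehouse would read these figures from its database or front-end tool logs rather than from an in-memory list.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Session(NamedTuple):
    """One logon to the warehouse (hypothetical log layout)."""
    user: str
    minutes: float   # length of the session
    tables: tuple    # tables queried during the session


def usage_metrics(sessions, all_tables):
    """Summarize warehouse acceptance from a list of Session records."""
    logons = Counter(s.user for s in sessions)

    total_minutes = defaultdict(float)
    for s in sessions:
        total_minutes[s.user] += s.minutes

    # Count how often each table is touched across all sessions.
    table_hits = Counter(t for s in sessions for t in s.tables)

    return {
        "active_users": len(logons),
        "logons_per_user": dict(logons),
        "avg_session_minutes": {u: total_minutes[u] / logons[u] for u in logons},
        "table_hits": dict(table_hits),
        # Tables never queried are candidates for review or purging.
        "unused_tables": sorted(set(all_tables) - set(table_hits)),
    }


if __name__ == "__main__":
    log = [
        Session("ana", 30, ("fact_sales", "dim_date")),
        Session("ana", 10, ("fact_sales",)),
        Session("ben", 20, ("dim_product",)),
    ]
    tables = ["fact_sales", "dim_date", "dim_product", "fact_returns"]
    print(usage_metrics(log, tables))
```

Frequently hit tables point the warehouse DBA toward useful indexes or stored aggregates, while the unused list gives a starting point for purge discussions.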

However, true warehousing ROI comes from business changes and decisions that have been made possible by information obtained from the warehouse.

These, unfortunately, are not as easy to quantify and measure.

In Summary

The importance of the Project Sponsor in a data warehousing initiative cannot be overstated. The project sponsor is the highest-level business representative of the warehousing team and therefore must be a visionary, respected, and decisive leader.

At the end of the day, the Project Sponsor is responsible for the success of the data warehousing initiative within the enterprise. Data warehousing, with its accompanying array of new technologies and its dependence on operational systems, naturally makes strong demands on the technical and human resources under the jurisdiction of the CIO. For this reason, it is natural for the CIO to be strongly involved in any data warehousing effort.

This chapter attempts to answer the typical questions of CIOs who participate in data warehousing initiatives. After the data warehouse goes into production, different support services are required to ensure that the implementation is not derailed. These support services fall into the categories described below.

Regular Warehouse Load

The data warehouse needs to be constantly loaded with additional data. The amount of work required to load data into the warehouse on a regular basis depends on the extent to which the extraction, transformation, and loading processes have been automated, as well as the load frequency required by the warehouse.

The frequency of the load depends on the user requirements, as determined during the data warehouse design activity. The most frequent load possible with a data warehouse is once a day, although it is not unusual to find organizations that load their warehouses once a week, or even once a month. The regular loading activities fall under the responsibilities of the warehouse support team, who almost invariably report directly or indirectly to the CIO.
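As an illustration of what an automated regular load can look like, the sketch below appends only the source rows that arrived since the previous load, tracked by a "high-water mark" date. This is one common approach and is an assumption here, not something the text prescribes; the function name `incremental_load`, the `txn_date` field, and the sample rows are all hypothetical.

```python
from datetime import date


def incremental_load(source_rows, warehouse_rows, last_loaded):
    """Append source rows dated after `last_loaded` to the warehouse.

    Returns the new high-water mark and the number of rows loaded.
    """
    new_rows = [r for r in source_rows if r["txn_date"] > last_loaded]
    warehouse_rows.extend(new_rows)
    if new_rows:
        # Advance the mark to the latest transaction date just loaded.
        last_loaded = max(r["txn_date"] for r in new_rows)
    return last_loaded, len(new_rows)


if __name__ == "__main__":
    source = [
        {"txn_date": date(2020, 1, 1), "amount": 10.0},
        {"txn_date": date(2020, 1, 2), "amount": 12.5},
        {"txn_date": date(2020, 1, 3), "amount": 7.0},
    ]
    warehouse = []
    mark = date(2020, 1, 1)  # rows up to this date were loaded earlier

    mark, loaded = incremental_load(source, warehouse, mark)
    print(mark, loaded)

    # Running the load again with no new source data loads nothing.
    mark, loaded = incremental_load(source, warehouse, mark)
    print(mark, loaded)
```

The same routine can be scheduled daily, weekly, or monthly; only the interval between runs changes, which is why the degree of automation matters more than the frequency itself.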

Applications

After the data warehouse and related data marts have been deployed, the IT department or division may turn its attention to the development and deployment of executive systems or custom applications that run directly against the data warehouse or the data marts. These applications are developed or targeted to meet the needs of specific user groups. Any in-house application development will likely be handled by internal IT staff; otherwise, such projects should be outsourced under the watchful eye of the CIO.

Warehouse DB Optimization

Apart from the day-to-day database administration support of production systems, the warehouse DBA must also collect and monitor new sets of query statistics with each rollout or phase of the data warehouse.

The data structure of the warehouse is then refined or optimized on the basis of these usage statistics, particularly in the area of stored aggregates and table indexing strategies.

User Assistance or Help Desk

As with any information system in the enterprise, a user assistance desk or help desk can provide users with general information, assistance, and support.

An analysis of the help requests received by the help desk provides insight on possible subjects for follow-on training with end users. In addition, the help desk is an ideal site for publicizing the status of the system after every successful load.

Training

Provide more training as more end users gain access to the data warehouse. Apart from covering the standard capabilities, applications, and tools that are available to the users, the warehouse training should also clearly convey what data are available in the warehouse.

Advanced training topics may be appropriate for more advanced users.

Specialized work groups or one-on-one training may be appropriate as follow-on training, depending on the type of questions and help requests that the help desk receives.

Preparation for Subsequent Rollouts

All internal preparatory work for subsequent rollouts must be performed while support activities for prior rollouts are underway. This activity may create resource contention and therefore should be carefully managed. One of the toughest decisions any data warehouse planner has to make is to decide when to evolve the system with new data and when to wait for the user base, IT organization, and business to catch up with the latest release of the warehouse.

Warehouse evolution is not only a technical and management issue; it is also a political issue. The IT organization must continually either: In addition, each rollout of the warehouse is likely to be at a different stage and therefore to have different support needs. For example, an enterprise may find itself busy with the planning and design of the third phase of the warehouse, while deployment and training activities are underway for the second phase, and help desk support is available for the first phase.

The CIO will undoubtedly face the unwelcome task of making critical decisions regarding resource assignments. In general, data warehouse evolution takes place in one or more of the following areas:

Data

Evolution in this area typically results in an increase in scope (although a decrease is not impossible).

The extraction subsystem will require modification in cases where the source systems are modified or new operational systems are deployed.

Users

New users will be given access to the data warehouse, or existing users will be trained on advanced features. This implies new or additional training requirements, the definition of new users and access profiles, and the collection of new usage statistics.

New security measures may also be required.

IT Organization

New skill sets are required to build, manage, and support the data warehouse. New types of support activities will be needed.

Business

Changes in the business result in changes in the operations, monitoring needs, and performance measures used by the organization. The business requirements that drive the data warehouse change as the business changes.

Application Functionality

New functionality can be added to existing OLAP tools, or new tools can be deployed to meet end-user needs.

Every data warehouse project has a team of people with diverse skills and roles. The involvement of internal staff during the warehouse development is critical to the warehouse maintenance and support tasks once the data warehouse is in production.

Not all the roles in a data warehouse project can be outsourced to third parties; of the typical roles, internal enterprise staff should fulfill the following: Below is a list of typical roles in a data warehouse project. Note that the same person may play more than one role.

Steering Committee

The steering committee is composed of high-level executives representing each major type of user requiring access to the data warehouse. The project sponsor is a member of the committee.

In most cases, the sponsor heads the committee. The steering committee should already be formed by the time data warehouse implementation starts; however, the existence of a steering committee is not a prerequisite for data warehouse planning.

During implementation, the steering committee receives regular status reports from the project team and intervenes to redirect project efforts whenever appropriate.

User Reference Group

Representatives from the user community (typically middle-level managers and analysts) provide critical inputs to data warehousing projects by specifying detailed data requirements, business rules, predefined queries, and report layouts.

User representatives also test the outputs of the data warehousing effort. It is not unusual for end-user representatives to spend up to 80 percent of their time on the project, particularly during the requirements analysis and data warehouse design activities. Toward the end of a rollout, up to 80 percent of the representatives' time may be required again for testing the usability and correctness of warehouse data.

End users also participate in regular meetings or interviews with the warehousing team throughout the life of each rollout (up to 50 percent involvement).

Warehouse Driver

The warehouse driver reports to the steering committee, ensures that the project is moving in the right direction, and is responsible for meeting project deadlines. The warehouse driver is a business manager responsible for defining the data warehouse strategy (with the assistance of the warehouse project manager) and for planning and managing the data warehouse implementation from the business side of operations.

The warehouse driver also communicates the warehouse objectives to other areas of the enterprise; this individual normally serves as the coordinator in cases where the implementation team has cross-functional team members. It is therefore not unusual for the warehouse driver to be the liaison to the user reference group.

The project manager normally reports to the warehouse driver and jointly defines the data warehouse strategy. It is not unusual, however, to find organizations where the warehouse driver and project manager jointly manage the project. In such cases, the project manager is actually a technical manager.

The project manager is responsible for implementing the project plans and acts as coordinator on the technology side of the project, particularly when the project involves several vendors. The warehouse project manager keeps the warehouse driver updated on the technical aspects of the project but isolates the warehouse driver from the technical details.

Business Analyst(s)

The analysts act as liaisons between the user reference group and the more technical members of the project team.

Through interviews with members of the user reference group, the analysts identify, document, and model the current business requirements and usage scenarios. Analysts play a critical role in managing end-user expectations, since most of the contact between the user reference group and the warehousing team takes place through the analysts. Analysts represent the interests of the end users in the project and therefore have the responsibility of ensuring that the resulting implementation will meet end-user needs.

The warehouse data architect analyzes the information requirements specified by the user community and designs the data structures of the data warehouse accordingly.

The workload of the architect is heavier at the start of each rollout, when most of the design decisions are made. The workload tapers off as the rollout gets underway. The warehouse data architect has an increasingly critical role as the warehouse evolves.

Each successive rollout that extends the warehouse must respect an overall integrating architecture, and the responsibility for the integrating architecture falls squarely on the warehouse data architect. Data mart deployments that are fed by the warehouse should likewise be considered part of the architecture to avoid the data administration problems created by multiple, unrelated data marts.

Metadata Administrator

The metadata administrator defines metadata standards and manages the metadata repository of the warehouse.

The workload of the metadata administrator is quite high both at the start and toward the end of each warehouse rollout. Workload is high at the start primarily due to metadata definition and setup work. Workload toward the end of a rollout increases as the schema, the aggregate strategy, and the metadata repository contents are finalized.

Warehouse DBA

The warehouse database administrator works closely with the warehouse data architect.

The workload of the warehouse DBA is typically heavy throughout a data warehouse project. Much of this individual's time will be devoted to setting up the warehouse schema at the start of each rollout. As the rollout gets underway, the warehouse DBA takes on the responsibility of loading the data, monitoring the performance of the warehouse, refining the initial schema, and creating dummy data for testing the decision support front-end tools.

Toward the end of the rollout, the warehouse DBA will be busy with database optimization tasks as well as aggregate table creation and population. As expected, the warehouse DBA and the metadata administrator work closely together. The warehouse DBA is responsible for creating and populating metadata tables within the warehouse in compliance with the standards that have been defined by the metadata administrator.

Among their typical responsibilities are: Given their familiarity with the current computing environment, source system DBAs and SAs are often asked to identify the data transfer and extraction mechanisms best suited for their respective operational systems. These individuals are familiar with the data structures of the operational systems and are therefore the most qualified to contribute to or finalize the mapping of source system fields to warehouse fields.

In the course of their day-to-day operations, the DBAs and SAs encounter data quality problems and are therefore in a position to highlight areas that require special attention during data cleansing and transformation.

Depending on the status of the operational systems, these individuals may spend the majority of their time on the above activities during the course of a rollout.

Conversion and Extraction Programmer(s)

The programmers write the extraction and conversion programs that pull data from the operational databases. They also write programs that integrate, convert, and summarize the data into the format required by the data warehouse.
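The kind of program described here can be sketched in a few lines. The example below is illustrative only: the source record layout (`region_cd`, `txn_date`, `amt`), the `REGION_CODES` mapping, and the two source systems are hypothetical. It integrates rows from two operational sources, converts inconsistent codes and string amounts into one standard format, and summarizes the result by region and month.

```python
from collections import defaultdict

# Hypothetical mapping from inconsistent source-system codes to one
# standard warehouse code.
REGION_CODES = {"N": "NORTH", "NO": "NORTH", "S": "SOUTH", "SO": "SOUTH"}


def convert(record):
    """Convert one source record into the warehouse format."""
    return {
        "region": REGION_CODES[record["region_cd"].strip().upper()],
        "month": record["txn_date"][:7],   # 'YYYY-MM-DD' -> 'YYYY-MM'
        "amount": float(record["amt"]),    # source systems store text
    }


def summarize(records):
    """Total the converted amounts per (region, month)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["region"], r["month"])] += r["amount"]
    return dict(totals)


def extract_convert_summarize(source_systems):
    """Integrate rows from several sources, then convert and summarize."""
    converted = [convert(rec) for rows in source_systems for rec in rows]
    return summarize(converted)


if __name__ == "__main__":
    billing = [
        {"region_cd": "n ", "txn_date": "2020-01-15", "amt": "100.50"},
        {"region_cd": "S", "txn_date": "2020-01-20", "amt": "20"},
    ]
    orders = [
        {"region_cd": "NO", "txn_date": "2020-01-31", "amt": "9.50"},
    ]
    print(extract_convert_summarize([billing, orders]))
```

An unknown region code raises a `KeyError` in this sketch, which surfaces data quality problems early instead of silently loading bad rows; a production program would more likely log and route such rows for review.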