Data Mining - Mehmed Kantardzic [17]
A model of the data-mining process should help the team plan, carry out, and reduce the cost of a project by detailing the procedures to be performed in each phase. It should describe every phase completely, from problem specification to deployment of the results. At the outset the team has to answer the key question: what is the ultimate purpose of mining these data, and, more specifically, what are the business goals? The key to success in data mining is a precise formulation of the problem the team is trying to solve; a focused statement usually yields the best payoff. Knowledge of the organization’s needs or scientific research objectives guides the team in formulating the goal of the data-mining process. Understanding the data and the business is a prerequisite to knowledge discovery. Without this deep understanding, no algorithm, regardless of its sophistication, will produce results in which the end user should have confidence, and a data miner will be unable to identify the problems to be solved or even to interpret the results correctly. To make the best use of data mining, the team must state the project objectives clearly. An effective problem statement includes a way of measuring the results of the knowledge-discovery project, and it may also include a cost justification. Preparatory steps may also include analyzing and specifying the type of data-mining task and selecting an appropriate methodology with the corresponding algorithms and tools. When selecting a data-mining product, be aware that products generally implement a given algorithm differently even when they identify it by the same name. Implementation differences can affect operational characteristics such as memory usage and data storage, as well as performance characteristics such as speed and accuracy.
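As a rough illustration of a problem statement that carries its own measure of success and a cost justification, the sketch below records the objective as a small Python structure. The class name, field names, and all values are hypothetical and chosen only for illustration; it also assumes a metric for which higher values are better.

```python
from dataclasses import dataclass


@dataclass
class ProjectObjective:
    """Hypothetical record of a data-mining problem statement."""
    business_goal: str        # what the organization wants to achieve
    success_metric: str       # how the result will be measured
    target_value: float       # value the metric must reach
    estimated_cost: float     # projected cost of the project
    expected_benefit: float   # projected value if the target is reached

    def is_met(self, measured_value: float) -> bool:
        # Assumption: higher metric values are better.
        return measured_value >= self.target_value

    def is_cost_justified(self) -> bool:
        # Simplest possible justification: benefit exceeds cost.
        return self.expected_benefit > self.estimated_cost


# Illustrative values only; none of them come from the text.
objective = ProjectObjective(
    business_goal="Reduce customer churn",
    success_metric="recall on churned customers",
    target_value=0.70,
    estimated_cost=50_000.0,
    expected_benefit=120_000.0,
)
```

Writing the objective down in this form makes the later evaluation mechanical: once a model has been built, `objective.is_met(measured_value)` states whether the agreed target was reached, independently of which algorithm or product produced the result.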
The data-understanding phase starts early in the project, and it includes important and time-consuming activities that can have an enormous influence on the final success of the project. “Getting familiar with the data” requires serious analysis of the data, including its source, owner, the organization responsible for maintaining it, cost (if purchased), storage organization, size in records and attributes, size in bytes, security requirements, restrictions on use, and privacy requirements. The data miner should also identify data-quality problems and gain first insights into the data, such as data types, definitions of attributes, units of measure, list or range of values, collection information, time and space characteristics, and missing and invalid data. Finally, these preliminary analyses should be used to detect interesting subsets of the data and to form hypotheses about hidden information. An important characteristic of a data-mining process is the relative time spent on each of its steps, and the figures are counterintuitive, as presented in Figure 1.6. Some authors estimate that about 20% of the effort is spent on business objective determination, about 60% on data preparation and understanding, and only about 10% for