Data generalization involves broadening the categories into which data is grouped in a database, in other words, "zooming out" from the data to produce a more comprehensive picture of the patterns or insights it offers. Before going through data generalization and summarization in data mining, we need to explore data generalization itself in detail. If your data set contains information on the ages of a variety of individuals, the data generalization procedure can look like this:
Original Data:
Ages: 26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59
Generalized Data:
Ages:
20-29 (2)
30-39 (3)
40-49 (5)
50-59 (5)
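As a concrete illustration, here is a minimal Python sketch of the decade binning shown above, using only the standard library (the helper name decade_bin is our own):

```python
from collections import Counter

ages = [26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59]

# Map each age to the decade it falls in, e.g. 37 -> "30-39".
def decade_bin(age):
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

generalized = Counter(decade_bin(a) for a in ages)
for bucket, count in sorted(generalized.items()):
    print(f"{bucket} ({count})")
# 20-29 (2), 30-39 (3), 40-49 (5), 50-59 (5)
```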
Data generalization, a form of data masking, substitutes a precise data value with a less exact one. Although it may appear counterproductive, this is a frequently used strategy for secure storage, analysis, and data mining.
Importance of Data Generalization
One of the main uses for data generalization arises when you need to evaluate data you've gathered but also need to protect the privacy of the people included in that data. It's an effective technique for removing personally identifiable information while keeping the value of the data points. In the age example above, aggregating ages by decade provides a broad overview of where the individuals in the data set fall while still enabling you to use that information for segmentation or analysis.
When you have several identifiable data points but only one or a few of them are pertinent to your needs, you can frequently apply stronger generalization to the unnecessary data points while leaving the relevant points largely intact.
Compliance is another crucial factor when it comes to data generalization; rules are in place that specify how much personally identifiable information about individuals may be preserved. To prevent any data leaks or unauthorized exposure, make sure you are informed of the regulatory requirements for your particular business.
Types of Data Generalization
Which of the two main types of data generalization you employ in a particular situation depends on a number of variables, including the type of data, your particular needs and goals for it, and the privacy and security standards established by your company, your industry, and/or governmental regulatory bodies.
Declarative generalization and automated generalization are the two basic methods of data generalization. Let's examine what each of them means and how they look in practice.
- Automated Data Generalization
Automated generalization uses algorithms to determine the least degree of generalization or distortion necessary to maintain accuracy and adequate privacy. One of the most popular generalization strategies is k-anonymization, which uses a specified generalization value known as k.
If k=2, the data are referred to as 2-anonymous. This means the data points have been sufficiently generalized that every possible combination of values occurs at least twice; each 'category' of data, in this case an age range, has at least two occurrences. If a data set held the age and location of a number of different people, it would have to be generalized so that every age/location pair occurred at least twice.
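Below is a minimal sketch of a k-anonymity check over (age range, location) pairs; the record layout and values are our own illustration, not a specific library's API:

```python
from collections import Counter

k = 2
records = [
    ("20-29", "London"), ("20-29", "London"),
    ("30-39", "Paris"),  ("30-39", "Paris"), ("30-39", "Paris"),
]

# Count how often each quasi-identifier combination occurs.
group_sizes = Counter(records)

# The data is k-anonymous only if every combination appears >= k times.
is_k_anonymous = all(count >= k for count in group_sizes.values())
print(is_k_anonymous)  # True: every (age range, location) pair occurs at least twice
```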
- Declarative Data Generalization
Declarative generalization entails manually choosing what size data bins to use in each situation. For our age example, we decided that the bucket size should be a decade. If this were an actual data set, we might have concluded that this bin size offered the best level of security and privacy for every individual in the data without sacrificing the utility of the data.
Declarative generalization has some inherent limitations, chief among them the potential for data distortion or bias, because outliers are sometimes completely removed. Still, it can be a good starting point for making sure that whoever receives the protected data doesn't get any more detail than is necessary for the intended result.
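A sketch of the declarative approach, assuming pandas is available: the analyst declares the bin edges by hand rather than letting an algorithm choose them.

```python
import pandas as pd

ages = pd.Series([26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59])

# Manually declared decade boundaries; right=False makes bins like [20, 30).
bins = [20, 30, 40, 50, 60]
labels = ["20-29", "30-39", "40-49", "50-59"]
buckets = pd.cut(ages, bins=bins, right=False, labels=labels)

print(buckets.value_counts().sort_index())
# 20-29    2
# 30-39    3
# 40-49    5
# 50-59    5
```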
Data Generalization Approaches
Data generalization is the process of summarizing data by replacing relatively low-level values with higher-level concepts. It is a form of descriptive data mining. Data generalization can be done using two fundamental approaches:
- Use of Data Cube:
This is often referred to as the OLAP approach. It is an efficient approach because precomputed summaries, such as a graph of past sales, can be served quickly. In this method, the data cube stores calculations and results, and roll-up and drill-down procedures are performed on it. Aggregate functions such as count(), sum(), average(), and max() are frequently used in these procedures. These materialized views can then be leveraged by a variety of applications, including knowledge discovery and decision support.
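A rough sketch of the data-cube idea using pandas; the sales figures and column names are made up for illustration. Here pivot_table plays the role of the materialized cube, and margins=True adds roll-up totals.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":   [2022, 2022, 2023, 2023],
    "region": ["East", "West", "East", "West"],
    "amount": [100, 150, 120, 180],
})

# "Cube" aggregated by year and region, with roll-up totals (the "All" row/column).
cube = sales.pivot_table(values="amount", index="year", columns="region",
                         aggfunc="sum", margins=True, margins_name="All")
print(cube)

# Drill-down is the reverse operation: returning to finer-grained rows,
# e.g. per-region figures within a single year.
```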
- Attribute-Oriented Induction:
This is a query-oriented, generalization-based method for online data analysis. With this strategy, we generalize based on the distinct values of each attribute found in the pertinent data set; after identical tuples are merged, the counts associated with them are added together to perform aggregation. In contrast to the data cube approach, which performs aggregation offline before an OLAP or data mining request is processed, attribute-oriented induction was originally proposed as a query-oriented, generalization-based online data analysis technique for relational databases. It handles both numeric measures and categorical data. The attribute-oriented induction methodology employs two techniques: attribute removal and attribute generalization.
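A toy sketch of those two techniques in Python: generalize an attribute through a hand-written concept hierarchy, remove an attribute with too many distinct values, then merge identical tuples and accumulate their counts. The hierarchy and data are illustrative assumptions.

```python
from collections import Counter

# Concept hierarchy: city -> country.
city_to_country = {"Paris": "France", "Lyon": "France", "Osaka": "Japan"}

# (name, city) tuples; "name" has too many distinct values, so we remove it.
records = [("Alice", "Paris"), ("Bob", "Lyon"), ("Carol", "Osaka"), ("Dave", "Paris")]

# Generalize city to country, drop name, and accumulate counts of merged tuples.
generalized = Counter(city_to_country[city] for _name, city in records)
for country, count in generalized.items():
    print(country, count)
# France 3
# Japan 1
```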
Data Mining
Data mining is the process of finding anomalies, trends, and correlations within huge data sets in order to forecast outcomes. Using a variety of techniques, you can employ this information to lower risks, improve customer relationships, raise profits, and more.
- History of Data Mining
There is a long history of sifting through data to find hidden relationships and forecast upcoming trends. The phrase "data mining," sometimes called "knowledge discovery in databases," wasn't coined until the 1990s. But its core is made up of three interconnected scientific disciplines: machine learning (algorithms that can learn from data to make predictions), artificial intelligence, and statistics (the numerical study of relationships in data).
As data mining technology continues to advance to keep up with the endless possibilities of big data and accessible computing power, what was once old is now new.
We have been able to replace manual, laborious, and time-consuming processes with quick, simple, and automated data analysis over the past ten years because of improvements in processing power and speed. The more complex the data sets that are gathered, the greater the opportunity to find pertinent insights. Retailers, banks, manufacturers, telecommunications companies, and insurers, among others, are using data mining to find connections between factors such as price optimization, promotions, and demographics, as well as how the economy, risk, competition, and social media are affecting their business models, revenues, operations, and customer relationships.
Data Generalization and Summarization in Data Mining
So here is the heart of data generalization and summarization in data mining.
Data summarization is a crucial idea in data mining: it provides a brief overview of a dataset so that anomalies can be identified, and a well-chosen summary makes many of the underlying trends and patterns of the raw data easily accessible. Extracting helpful information from raw data is exactly what the phrase "data mining" refers to. Additionally, data generalization and summarization try to display the patterns and information that have been retrieved in tabular or graphic form.
Data can be summarized numerically, using a table (a tabular summary), or graphically, using a graph (visualization). Data mining therefore offers two broad methods of data summarization:
- Tabular summarization: this technique quickly communicates patterns such as frequency, percentage, and cumulative frequency (see the sketch after this list).
- Data visualization: visualizations in a chosen graph style, such as a histogram, time-series line graph, or column or bar graph, can aid in quickly identifying trends in a visually appealing manner.
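To make the tabular option concrete, here is a small sketch that builds a frequency, percentage, and cumulative-frequency table for the age buckets used earlier (standard library only):

```python
counts = {"20-29": 2, "30-39": 3, "40-49": 5, "50-59": 5}
total = sum(counts.values())

cumulative = 0
print(f"{'Bucket':<8}{'Freq':>6}{'Pct':>8}{'CumFreq':>9}")
for bucket, freq in counts.items():
    cumulative += freq  # running total of observations so far
    print(f"{bucket:<8}{freq:>6}{freq / total:>8.1%}{cumulative:>9}")
```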
There are three places in data mining where data summarization can be used. These are listed below:
- Data Generalization and Summarization in Data Mining (Centrality)
The centrality principle refers to the center or middle value of the data. The most popular measures for demonstrating centrality are the mean (average), median, and mode. Together, the three of them summarize the distribution of the sample data.
Mean: This is used to determine the numerical average of a collection of values.
Mode: This displays the value that is repeated most frequently in the dataset.
Median: When values are sorted in order, the median indicates the value that falls in the center of all the other values in the dataset. The shape of the dataset will have a significant impact on which measure is most appropriate to use.
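As a quick illustration, all three measures can be computed with Python's standard library statistics module, reusing the age data from earlier in the article:

```python
import statistics

ages = [26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59]

print(statistics.mean(ages))    # arithmetic average: 44.2
print(statistics.median(ages))  # middle value of the sorted data: 46
print(statistics.mode(ages))    # most frequently repeated value: 48
```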
- Data Generalization and Summarization in Data Mining (Dispersion)
The dispersion of a sample refers to how spread out the data are around the center (the mean). Examining the spread of the distribution shows the degree of variability or diversity within the data: dispersion is low when the values are near the center and high when the values are widely scattered away from it.
Depending on your dataset and the specifics of your analysis, you can use a variety of measures of dispersion, including the following:
Standard deviation: This gives you a consistent way to determine what is typical, illustrating what is excessively large or excessively small, and helping you understand how far the values of the variable stray from the mean.
Variance: Similar to the standard deviation, variance assesses how closely or widely values deviate from the mean.
Range: The range demonstrates the variation between the biggest and smallest values, indicating how far apart the extremes are from one another.
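The same dispersion measures, computed with the standard library for the age sample used throughout:

```python
import statistics

ages = [26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59]

print(statistics.stdev(ages))     # sample standard deviation
print(statistics.variance(ages))  # sample variance (the standard deviation squared)
print(max(ages) - min(ages))      # range: 59 - 26 = 33
```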
- Distribution of a Sample of Data for Data Summarization
The distribution of sample data values has to do with their shape, or how the values are dispersed across the value range in the sample. In layman's terms, it indicates whether the values cluster symmetrically around the average or whether there are more values on one side than the other. The spread of a data sample can be examined graphically and with shape statistics. Frequency histograms and tally plots can be employed to summarize the data and visually depict its distribution.
Histograms: Histograms are similar to bar charts in that a bar reflects the frequency of data values falling in different size classes. The distinction is that bars in histograms are drawn without gaps, because the x-axis represents a continuous variable.
Tally plots: A tally plot is a type of data frequency distribution graph that can be used to represent the values from a dataset. In shape statistics, skewness and kurtosis can be used to determine how central the average is and how clustered the data are around it.
Skewness: This describes the degree to which the average is centered within the distribution. A sample's skewness indicates how dominant the average is relative to the range of values as a whole.
Kurtosis: This gauges the distribution's degree of pointiness. A sample's kurtosis serves as a gauge for how peaked the distribution is and how closely the values are grouped around the center.
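Both shape statistics can be computed for the age sample, assuming SciPy is installed:

```python
from scipy import stats

ages = [26, 29, 30, 32, 37, 41, 42, 46, 48, 48, 54, 56, 57, 58, 59]

print(stats.skew(ages))      # negative here: the tail extends toward the lower ages
print(stats.kurtosis(ages))  # Fisher definition: 0 corresponds to normal-like peakedness
```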
When performing data summarization and subsequent analysis through data mining, knowing the shape of your data's distribution can be quite helpful in selecting the appropriate statistical techniques.