
MIS 329 Decision Support Systems

Assignment

Nationwide Insurance Used BI to Enhance Customer Service

Nationwide Mutual Insurance Company, headquartered in Columbus, Ohio, is one of the largest insurance and financial services companies, with $23 billion in revenues and more than $160 billion in statutory assets. It offers a comprehensive range of products through its family of 100-plus companies with insurance products for auto, motorcycle, boat, life, homeowners, and farms. It also offers financial products and services including annuities, mortgages, mutual funds, pensions, and investment management.

Nationwide strives to achieve greater efficiency in all operations by managing its expenses along with its ability to grow its revenue. It recognizes the use of its strategic asset of information combined with analytics to outpace competitors in strategic and operational decision making even in complex and unpredictable environments.

Historically, Nationwide's business units worked independently and with a lot of autonomy. This led to duplication of efforts, widely dissimilar data processing environments, and extreme data redundancy, resulting in higher expenses. The situation got complicated when Nationwide pursued any mergers or acquisitions.

Nationwide, using enterprise data warehouse technology from Teradata, set out to create, from the ground up, a single, authoritative environment for clean, consistent, and complete data that can be effectively used for best-practice analytics to make strategic and tactical business decisions in the areas of customer growth, retention, product profitability, cost containment, and productivity improvements. Nationwide transformed its siloed business units, which were supported by stove-piped data environments, into integrated units by using cutting-edge analytics that work with clear, consolidated data from all of its business units. The Teradata data warehouse at Nationwide has grown from 400 gigabytes to more than 100 terabytes and supports 85 percent of Nationwide's business with more than 2,500 users.

Integrated Customer Knowledge

Nationwide's Customer Knowledge Store (CKS) initiative developed a customer-centric database that integrated customer, product, and externally acquired data from more than 48 sources into a single customer data mart to deliver a holistic view of customers. This data mart was coupled with Teradata's customer relationship management application to create and manage effective customer marketing campaigns that use behavioral analysis of customer interactions to drive customer management actions (CMAs) for target segments. Nationwide added more sophisticated customer analytics that looked at customer portfolios and the effectiveness of various marketing campaigns. This data analysis helped Nationwide to initiate proactive customer communications around customer lifetime events like marriage, birth of a child, or home purchase and had significant impact on improving customer satisfaction. Also, by integrating customer contact history, product ownership, and payment information, Nationwide's behavioral analytics teams further created prioritized models that could identify which specific customer interaction was important for a customer at any given time. This resulted in a one percentage point improvement in customer retention rates and significant improvement in customer enthusiasm scores. Nationwide also achieved 3 percent annual growth in incremental sales by using CKS. There are other uses of the customer database. In one of the initiatives, by integrating customer telephone data from multiple systems into CKS, the relationship managers at Nationwide try to be proactive in contacting customers in advance of a possible weather catastrophe, such as a hurricane or flood, to provide the primary policyholder information and explain the claims processes. These and other analytic insights now drive Nationwide to provide extremely personal customer service.

Financial Operations

A similar performance payoff from integrated information was also noted in financial operations. Nationwide's decentralized management style resulted in a fragmented financial reporting environment that included more than 14 general ledgers, 20 charts of accounts, 17 separate data repositories, 12 different reporting tools, and hundreds of thousands of spreadsheets. There was no common central view of the business, which resulted in labor-intensive, slow, and inaccurate reporting. About 75 percent of the effort was spent on acquiring, cleaning, consolidating, and validating the data, and very little time was spent on meaningful analysis of the data.

The Financial Performance Management initiative implemented a new operating approach that worked on a single data and technology architecture with a common set of systems standardizing the process of reporting. It enabled Nationwide to operate analytical centers of excellence with world-class planning, capital management, risk assessment, and other decision support capabilities that delivered timely, accurate, and efficient accounting, reporting, and analytical services.

The data from more than 200 operational systems was sent to the enterprise-wide data warehouse and then distributed to various applications and analytics. This resulted in a 50 percent improvement in the monthly closing process with closing intervals reduced from 14 days to 7 days.

Postmerger Data Integration

Nationwide's Goal State Rate Management initiative enabled the company to merge Allied Insurance's automobile policy system into its existing system. Both Nationwide and Allied source systems were custom-built applications that did not share any common values or process data in the same manner. Nationwide's IT department decided to bring all the data from source systems into a centralized data warehouse, organized in an integrated fashion that resulted in standard dimensional reporting and helped Nationwide in performing what-if analyses. The data analysis team could identify previously unknown potential differences in the data environment where premium rates were calculated differently between the Nationwide and Allied sides. Correcting all of these benefited Nationwide's policyholders because they were safeguarded from experiencing wide premium rate swings.

Enhanced Reporting

Nationwide's legacy reporting system, which catered to the needs of property and casualty business units, took weeks to compile and deliver the needed reports to the agents. Nationwide determined that it needed better access to sales and policy information to reach its sales targets. It chose a single data warehouse approach and, after careful assessment of the needs of sales management and individual agents, selected a business intelligence platform that would integrate dynamic enterprise dashboards into its reporting systems, making it easy for the agents and associates to view policy information at a glance. The new reporting system, dubbed Revenue Connection, also enabled users to analyze the information with a lot of interactive and drill-down-to-details capabilities at various levels that eliminated the need to generate custom ad hoc reports. Revenue Connection virtually eliminated requests for manual policy audits, resulting in huge savings in time and money for the business and technology teams. The reports were produced in 4 to 45 seconds, rather than days or weeks, and productivity in some units improved by 20 to 30 percent.

Answer the following questions:

1. Why did Nationwide need an enterprise-wide data warehouse?

2. How did integrated data drive the business value?

3. What forms of analytics are employed at Nationwide?

4. With integrated data available in an enterprise data warehouse, what other applications could Nationwide potentially develop?

Assignment Purpose:

The main purpose of the assignment is to enable students to describe the relationship between DW, BI, and DSS.

Assignment Guideline:

1. Use Times New Roman.

2. Use Font Size 12.

3. Use 1.15 Line Spacing.

4. Individual Assignment.

Substantive Ethics

I found this article by Mark S. Blodgett to be quite refreshing and informative in terms of the new perspectives it presents. The issue raised in the article is the difference between ethics and law within corporate programs. It is an interesting issue that not many seem to think about when discussing business rules and regulations. Moreover, ethics and law are typically viewed as two completely separate things, but the author disagrees. Blodgett believes that in order to better integrate ethical codes and legal terms into a corporation, both should be viewed as one and the same. This is a fair point because, as mentioned in the article, ethical codes are used by more than 90% of companies today, yet law has not really sunk into businesses as much as it should. Also, as mentioned in the text, legal obligations can be easily ignored by business executives simply because they are unaware of the relevant laws. This is another major reason why law and ethics should be treated as two sides of the same coin rather than as separate matters.

It must be mentioned that I do agree with the author and with what the article's findings suggest. Both legal and ethical approaches should be taken when considering corporations and businesses in order to create a more cohesive and accommodating environment. Additionally, I can imagine this study was a long and difficult one, as over twenty different compliance areas were assessed in order to compile an accurate study. Not only that, the term frequencies needed to be operationally defined correctly, which is no easy feat.

Overall, I feel this study is very helpful not only for corporate business, but for anyone in the workplace. Legal obligations must be enforced, but at the same time ethical codes must be in place so that businesses may prosper in a healthy way.

Justice at the Millennium

This article by Colquitt et al. was very interesting and insightful on the topic of justice and fairness. Before reading this study, I did not even consider what defines justice or how fairness is accounted for. The authors are correct when stating that we only judge something as just based on past research and experiences, and I found this quite interesting. Furthermore, the authors gathered research studies dating from 1975 up to the date of publication in order to see just how much things have changed in the workplace. Given this, it was a great choice to conduct the study as a meta-analysis to see the key differences between older definitions of justice and a modern take on the concept. Over this time, many rules and regulations have shaped what defines justice, and more specifically, justice in the workplace. This study mainly focuses on the role justice plays today from an organizational point of view rather than in a courtroom, which a lot more people can relate to. Rightfully so, the researchers proposed three important questions to take into consideration when analyzing all the different types of articles over the years.

Personally, I believe this meta-analysis is very important for anyone in the workplace because the questions posed by the researchers are prominent issues in today's society. For instance, an employee may have more than one boss, and those bosses may define fairness in differing ways. A study like this may help both bosses come to a happy medium and decide whether what the employee has done is fair or not. Even more so, thousands of new individuals are entering the workforce every month, and with increasing demand for jobs come new accommodations for what is defined as just. Again, I cannot stress enough how important this study is to those already in the workplace or to those who are looking to move into any work environment.

Journal of Intelligent & Fuzzy Systems 38 (2020) 6159–6173, DOI:10.3233/JIFS-179698, IOS Press

The impact of big data market segmentation using data mining and clustering techniques

Fahed Yoseph (a,b,∗), Nurul Hashimah Ahamed Hassain Malim (b), Markku Heikkilä (c), Adrian Brezulianu (d), Oana Geman (e) and Nur Aqilah Paskhal Rostam (b)

(a) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(b) Department of School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
(c) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(d) Faculty of Electronics, Telecommunications and Information Technology, Gheorghe Asachi Technical University, Iaşi, Romania
(e) Department of Health and Human Development, Stefan cel Mare University, Suceava, Romania

Abstract. Targeted marketing strategy is a prominent topic that has received substantial attention from both industry and academia. Market segmentation is a widely used approach in investigating the heterogeneity of customer buying behavior and profitability. It is important to note that conventional market segmentation models in the retail industry are predominantly descriptive methods, lack sufficient market insights, and often fail to identify sufficiently small segments. This study also takes advantage of the Hadoop distributed file system for its ability to process vast datasets. Three different market segmentation experiments using modified best fit regression, i.e., Expectation-Maximization (EM) and K-Means++ clustering algorithms, were conducted and subsequently assessed using cluster quality assessment. The results of this research are twofold: i) the insight on customer purchase behavior revealed for each Customer Lifetime Value (CLTV) segment; ii) the performance of the clustering algorithm in producing accurate market segments. The analysis indicated that the average lifetime of the customer was only two years, and the churn rate was 52%. Consequently, a marketing strategy was devised based on these results and implemented on the department store sales. The marketing record showed that the sales growth rate increased from 5% to 9%.

Keywords: Market segmentation, data mining, customer lifetime value (CLTV), RFM model (recency frequency monetary)

∗Corresponding author. Fahed Yoseph, Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland, and Department of School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia. E-mail: [email protected].

1064-1246/20/$35.00

1. Introduction

The retail industry collects enormous volumes of POS data. However, this raw POS data has minimal use if it is not properly processed to generate retail insights, optimize marketing efforts, and drive decisions. The retailer's relationship with customers is the key to brand loyalty, repeat store visits, and ultimately, sales conversions. This relationship has been affected by recent economic and social changes. The retail industry is prompted to be more strategic in its planning and to develop a deep understanding of its consumers as well as its competitors. Understanding customers' behavior as well as establishing a loyal relationship with customers has become the central concern and strategic goal for most retailers [1] interested in tracking and managing their customer lifetime value on a systematic basis [44]. Market segmentation is the process of dividing the market base of potential customers into similar or homogeneous groups or segments that possess mutual characteristics, which helps marketers to gather individuals with similar choices and interests [2]. This enables retailers to avoid selling unprofitable and irrelevant products with regard to their marketing purpose, which results in better management of the available resources through the selection of suitable market segments and a primary focus on specific promising segments [3–15].

Furthermore, as far as the research scope is concerned, there have been a number of studies that examine customer purchase behavior and lifetime value among different products based on a variety of market segmentation approaches with demographic variables and characteristics. Instead of addressing individual consumers based on their purchasing behavior, most market segmentation studies merely considered an overview of consumers' historical data to produce assumptions about what makes consumers similar to one another. It is important to highlight that this method hides critical facts about individual consumers.

Among those customer lifetime value models, a highly regarded model cited by many experts is the Pareto/NBD "Counting Your Customers" model proposed by Schmittlein, Morrison, and Colombo (1987). The model investigates customer purchase behavior in settings where customer purchase dropout is unobserved. Although the model is powerful for analyzing customer purchase behavior, it has proven empirically complex to implement due to the computational challenges, and only a handful of researchers claim to have implemented it [44].

Based on previous studies of market segmentation in the retail domain, Recency, Frequency, and Monetary (RFM) has been extensively employed, as this model can divide customers into groups, which enables retailers to decide how to fully utilize their limited resources in providing effective customer service through the categorization of customers. Nonetheless, RFM also has its own limitations [4]: it focuses only on customers' best scores and provides less meaningful scoring on recency, frequency, and monetary value for most consumers (Wei, Lin, and Wu, 2010). Moreover, RFM analysis is not able to prospect for new customers, as it mainly concerns the organization's current customers [6], and it is not considered a precise quantitative analysis model, as the importance of each RFM measure differs across industries [16–20]. The current research envisions an enhanced, user-friendly market segmentation modeling method that is more advanced and effective than the conventional RFM method. The integration of Customer Lifetime Value and the newly proposed RFM variants (PQ and T) into a closed-loop model represents different variations in customer purchase behavior. The enhanced model has the capability to simultaneously analyze millions of raw POS records and identify groups of customers by criteria the retailer may never have considered. This knowledge is expected to help marketers avoid assumptions when doing customer deep-dive and trend analysis, which in turn helps marketers to devise targeted marketing campaigns resulting in sales growth and higher ROI. The RFMPQ and RFMT datasets concentrate on the idea of identifying the purchasing power history of an individual customer or segment. The P variable represents the average purchasing power per customer across all transactions, the Q variable represents the average purchasing power per product, and T represents the change in consumer buying behavior, or trend, measured using the change rate. The enhanced RFM model also incorporates CLTV for predicting future cash flow attributed to the customer's shopping period with the retailer [8], followed by applying a modified best-fit regression technique and the K-means++ and Expectation-Maximization (EM) clustering algorithms to analyze customer buying behavior, as well as to assess the clustering techniques' performance using cluster quality assessment. The analysis can also identify marketers' areas of focus and ensure the highest quality of customer service.

2. Market segmentation

Market segmentation is the process of categorizing a large market into smaller, homogeneous groups of consumers who share characteristics such as income, shopping habits, lifestyle, age, and personality traits [9]. These segments are relevant to marketing and sales and can be used to optimize products, customer service, and advertising for different consumers [6]. Many companies across the retail industry have identified customer service as a key market differentiator and tend to segment their customers for positive customer experience and service delivery [10]. There are three types of market segmentation bases, namely demographic, geographic, and behavioral. Demographic segmentation is the most commonly used basis when segmenting a market. It can provide the retailer with a clear vision for future advertising plans and a precise customer shopping profile, and it focuses on the measurable factors of consumers and their households. Furthermore, this segmentation is primarily descriptive in terms of gender, race, age, income, lifestyle, and family status [11]. In contrast, behavioral segmentation divides customers based on their attitude toward products. Many marketers consider behavioral variables such as occasions, benefits, usage rate, customer status, readiness, loyalty status, and attitude towards a product as the ideal starting points for creating market segmentation [11]. In behavioral segmentation, consumers are segmented based on their evaluation and buying activities, as well as the use and disposal of goods, in order to recognize consumer needs. These criteria can provide a thorough understanding of consumer behavior, as they draw on social psychology, anthropology, economics, sociology, and psychology, which influence consumers in their purchasing decisions [21–25]. To get a sense of the overall customer lifetime value for the customer base, [45] proposed a framework that integrates customers' distribution with iso-value curves by grouping customers on the basis of RFM characteristics, in order to understand the factors that trigger consumers' defection. [48] proposed an analytical model for consumer engagement related to the subsequent stages of the consumer life-cycle, such as customer development, customer acquisition, and customer retention. The authors concluded that the availability of data is vital to the development of advanced analysis at each consumer stage. However, several organizational issues remain, which constitute barriers to implementing analytics for customer engagement.
In order to address the problems of consumer behavior that evolves with time, this research examines the behavioral and demographic segmentation models and identifies customer behavior using the Customer Lifetime Value (CLTV) model and the Recency, Frequency, and Monetary (RFM) model.

2.1. Customer life time value (CLTV) model

Customer Lifetime Value (CLTV) is an important metric that measures the total worth or profit to a business obtained from a customer over the whole period of their relationship with the retailer [8]. The literature defines customer churn as the termination of the contract between the firm and the customer, whereas customer retention refers to the collection of activities organizations undertake to reduce the number of customer defections.

Churn rate and retention rate are critical metrics for any company and are considered primary components of future CLTV. CLTV is an estimate of the average profit a customer is expected to generate before he or she churns [48]. The concept of retention and churn is often correlated with the industry life-cycle. When the industry is in the growth phase of its life-cycle, sales increase exponentially. However, customer churn is the most challenging task for the retail industry. From this perspective, more insight is needed into the reasons for customer churn in a dynamic industry.

The three main components of CLTV are customer acquisition, customer expansion, and customer retention [46]. Nevertheless, it is crucial to consider COGS (Cost of Goods Sold) and acquisition cost to arrive at the real CLTV. The basic model to calculate CLTV is presented in Equation (1).

$$\mathrm{CLTV} = \sum_{t=1}^{n} \frac{p_t}{(1+d)^t}\,(r)^t \qquad (1)$$

The above CLTV formula is more of a proxy for an average customer who stays for X period of time and pays Y total amount of money. The t represents a specific period of time, where (t = 1) represents the first year and (t = 2) denotes the second year. The n represents the total time period the customer will stay with the retailer before churn occurs. The r represents the month-over-month retention rate. The p_t is the profit that the customer or customers will contribute to the retailer in period t, and finally, d refers to the churn rate. Additionally, the customer's loyalty can be calculated using the Retention Rate formula, as illustrated in Equation (2). In the Retention Rate formula, CE denotes the number of customers at the end of each time period, CN is the total number of new customers acquired in the chosen time period, and CS denotes the number of customers at the start of the time period.

$$\text{Retention rate} = \left(\frac{CE - CN}{CS}\right) \times 100 \qquad (2)$$
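As a rough illustration of Equations (1) and (2), the sketch below evaluates both formulas directly from their definitions. The function names `cltv` and `retention_rate_pct` and the sample figures are hypothetical and are not taken from the study; the code simply mirrors the symbols p_t, r, d, CE, CN, and CS as the text defines them.

```python
def cltv(profits, r, d):
    """Equation (1): sum over t = 1..n of p_t * r**t / (1 + d)**t.

    profits: list [p_1, ..., p_n] of profit per period
    r: retention rate as a fraction (e.g. 0.8), as defined in the text
    d: the rate used in the denominator (the text calls it the churn rate)
    """
    return sum(
        p_t * r**t / (1 + d) ** t
        for t, p_t in enumerate(profits, start=1)
    )


def retention_rate_pct(ce, cn, cs):
    """Equation (2): ((CE - CN) / CS) * 100."""
    if cs <= 0:
        raise ValueError("CS (customers at start of period) must be positive")
    return (ce - cn) * 100 / cs


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not data from the paper).
    print(cltv([120.0, 110.0, 100.0], r=0.8, d=0.1))
    print(retention_rate_pct(ce=180, cn=40, cs=200))
```

Multiplying before dividing in `retention_rate_pct` is only a small choice to keep whole-number inputs from picking up floating-point noise.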

Management of consumer retention requires tools that allow decision-makers to assess each consumer's risk of defecting and to understand the factors that trigger consumers' defection [47]. Customer retention strategy, also known as loyalty rate, is the collection of activities a retailer uses to maintain long-term relationships by engaging existing customers, increasing profitability by increasing the number of repeat customers. CLTV represents an improvement over traditional RFM analysis because the frequency of customers' purchases and the amount of customers' average purchase are used to segment the customer base. The CLTV matrix classifies customer purchase behavior into four segments, namely Best, Spender, Frequent, and Uncertain, as defined by Marcus (1998). Table 1 lists the criteria for each segment.

Table 1. Criteria of customers in each segment

| Segment | Criteria of customers |
|---|---|
| Best | Average purchase amount > total average purchase amount; average purchase frequency of customer > total average frequency |
| Spender | Average purchase amount > total average purchase amount; average frequency of customer < total average frequency |
| Frequent | Average purchase amount < total average purchase amount; average frequency of customer > total average frequency |
| Uncertain | Average purchase amount < total average purchase amount; average frequency of customer < total average frequency |
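The four segment criteria attributed to Marcus (1998) can be sketched as a simple two-threshold classification. This is a minimal illustration, not the authors' code: the function name `classify_segments` and the input layout are hypothetical, and because the criteria use strict inequalities, the treatment of customers exactly at the average (here, counted on the lower side) is an assumption.

```python
from statistics import mean


def classify_segments(customers):
    """Assign each customer to Best, Spender, Frequent, or Uncertain.

    customers: dict mapping customer id -> (average purchase amount,
               purchase frequency). Layout is a hypothetical choice.
    """
    total_avg_amount = mean(amount for amount, _ in customers.values())
    total_avg_freq = mean(freq for _, freq in customers.values())

    segments = {}
    for cid, (amount, freq) in customers.items():
        high_amount = amount > total_avg_amount  # ties fall on the "<" side
        high_freq = freq > total_avg_freq        # (assumption)
        if high_amount and high_freq:
            segments[cid] = "Best"
        elif high_amount:
            segments[cid] = "Spender"
        elif high_freq:
            segments[cid] = "Frequent"
        else:
            segments[cid] = "Uncertain"
    return segments


if __name__ == "__main__":
    sample = {"a": (100, 10), "b": (100, 1), "c": (10, 10), "d": (10, 1)}
    print(classify_segments(sample))
```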

2.2. Recency, frequency, and monetary (RFM) model

RFM is a standard statistical marketing model for customer behavior segmentation that assesses consumer lifetime value. The model is very popular in the retail industry as it groups customers based on their shopping power history: how recently, how often, and how much the customer bought. The RFM model helps retailers group customers into various segments or categories to identify customers who are more likely to respond to marketing promotions and future customer personalization services [17]. The R symbolizes recency and refers to the time elapsed since the customer's last purchase. The F symbolizes the frequency of consumer purchases in a time period, and the M symbolizes monetary value, referring to the amount of money spent in a period [18]. Quintile scoring is the most commonly used scoring in the RFM method, arranging customers in ascending or descending order (best to worst). Customers are divided into five equal groups, where the best group receives the highest score of 5 and the worst receives the lowest score of 1 [1]. The RFM score is the weighted average of its individual components and is calculated as shown in Equations (3) and (4) to derive a continuous RFM score. Finally, these scores can be rescaled to the 0–1 range [17].

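$$\text{RFM score} = (\text{recency score} \times \text{recency weight}) + (\text{frequency score} \times \text{frequency weight}) + (\text{monetary score} \times \text{monetary weight}) \qquad (3)$$

$$\text{Rescaled RFM score} = \frac{\text{RFM score} - \text{minimum RFM score}}{\text{maximum RFM score} - \text{minimum RFM score}} \qquad (4)$$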
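The quintile scoring and Equations (3) and (4) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the equal default weights are an assumption (the text does not give weight values), and ties within a column are broken by input order rather than by any rule from the source.

```python
def quintile_scores(values, higher_is_better=True):
    """Rank-based 1..5 scores: the best fifth gets 5, the worst fifth gets 1."""
    n = len(values)
    # Sort indices from worst to best.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 1 + (rank * 5) // n
    return scores


def rfm_scores(recency_days, frequency, monetary, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return (raw, rescaled) RFM scores following Equations (3) and (4)."""
    r = quintile_scores(recency_days, higher_is_better=False)  # recent = better
    f = quintile_scores(frequency)
    m = quintile_scores(monetary)
    w_r, w_f, w_m = weights
    raw = [w_r * ri + w_f * fi + w_m * mi for ri, fi, mi in zip(r, f, m)]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        rescaled = [0.0] * len(raw)  # all customers identical; avoid 0/0
    else:
        rescaled = [(s - lo) / (hi - lo) for s in raw]
    return raw, rescaled


if __name__ == "__main__":
    raw, rescaled = rfm_scores(
        recency_days=[1, 10, 20, 30, 40],
        frequency=[50, 40, 30, 20, 10],
        monetary=[500, 400, 300, 200, 100],
    )
    print(raw, rescaled)
```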

2.3. Market segmentation using data mining, RFM, CLTV models and clustering techniques

Market segmentation helps to differentiate and customize marketing strategies by segment. Market segmentation is a significant application of data mining, where data mining is used to interrogate segmentation data to create data-driven behavioral segments that are applied to detect meaningful patterns and rules underlying consumer behavior [19]. Furthermore, [26] and [27] were among the studies that performed market segmentation using data mining, RFM, CLTV, and clustering techniques to form a decision-making system. [28] proposed clustering and profiling of customers using customer relationship management (CRM) and RFM for recommendations. In addition, data mining was conducted on historical customer sales data using the RFM model with the K-Means algorithm, and the results outlined recommendations for customer relationship strategy. Also, a 2016 study proposed a three-dimensional market segmentation model based on customer lifetime value, customer activity, and customer satisfaction. For more accuracy, the author grouped customers into several different groups. The corresponding variables were obtained from the RFM, Kano, and BG/NBD models.

Furthermore, the market segmentation model helps enterprises to maximize their profits. In [29–35], customers were classified into various clusters using the RFM technique, and association rules were mined to identify high-profit customers. RFM statistical techniques and clustering methods for customer value analysis were combined in [26] for a company's online selling. As far as the methodology is concerned, there is no standard convention for measuring customer purchase behavior, as each study differs in how it examines customer purchase behavior. Nevertheless, the K-Means algorithm is noted as an extensively used clustering algorithm in previous research due to its simplicity and speed in working with a large amount of data [36]. Despite its strength, it has been recorded that the K-Means method uses a random distribution for the seeding positions and does not produce the same result each time it is run [37]. A new, improved K-means algorithm called K-means++ uses a more sophisticated seeding procedure for the initial choice of the center positions and is often twice as fast as the standard K-means. In contrast, Neha, Kirti, and Kanika (2012) noted that K-means and Expectation-Maximization (EM) are the two most commonly used algorithms for the identification of growth patterns. Even though EM is similar to the K-Means algorithm, it is based on two different steps that are iterated until there are no more changes in the current hypothesis [29]. Expectation (E) refers to computing the probability that each datum is a member of each class. Maximization (M) refers to altering the parameters of each class to maximize those probabilities. Eventually the estimates converge, although not necessarily to the correct solution. Furthermore, the EM algorithm has a significant feature in that it can be applied to problems with observed data that provide only "partial" information [30]. Based on several comparative studies of the EM and K-Means methods [31–34], it was observed that EM outperformed K-Means and that results improved when the two were hybridized.
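To make the seeding idea concrete, here is a minimal pure-Python sketch of K-means with k-means++ style initialization, in which each new center is sampled with probability proportional to its squared distance from the nearest center already chosen. It is an illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, points are assumed to be tuples of numbers, and the data must contain at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: first center uniform, the rest by D^2 sampling."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            raise ValueError("need at least k distinct points")
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd iterations started from k-means++ seeds; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, k=2))
```

Fixing the seed makes a run repeatable, which addresses the run-to-run variation noted above only in the sense of reproducibility; different seeds can still give different clusterings on harder data.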
The current study inte- grates two dynamics models, namely CLTV and RFM models, with the addition of new RFM variants, i.e., P, Q and T to cater the weakness and inaccuracy of consumer modeling that are caused by the limita- tions RFM. In addition, this study applies K-means++ and Expectation Maximization (EM) clustering algo- rithms to offer the retail industry with effective analy- sis of customer buying behavior through the combina- tion of customer profitability and product profitability in creating a strategic marketing campaign as explained previously in the introduction [38–40].

2.4. Mining big data

Data mining is the process of extracting information from large data sets and transforming it into an understandable form for further use. Data mining can be used where the database is large and the classification of such data is difficult [35]. The term Big Data is often used for very large databases whose size ranges from terabytes to many petabytes, which is beyond the ability of commonly used relational database management systems (RDBMS) to process within a tolerable elapsed time. Patel, Birla, & Nair (2012) conducted a number of experiments on the big data problem. Their findings pointed to the Hadoop Distributed File System (HDFS) for storage and the MapReduce method for parallel processing of large volumes of data. However, research on Big Data analysis using data mining, especially with clustering methods, is still considered young and therefore attracts many researchers to this area. The authors of [37, 38] proposed a fast parallel k-means clustering algorithm based on MapReduce, which has been widely embraced by both academia and industry. They used speedup, scaleup, and sizeup to evaluate the performance of their proposed algorithm. Their findings showed that the proposed model could process very large datasets on commodity (low-cost) hardware effectively.

Hadoop is becoming a commodity for every data-driven organization. Where data is large and comes in many formats, mining and extracting intelligence from it has always been a challenge [39]. This new dynamic has brought new challenges to current analytical models and traditional databases and emphasizes the need for a paradigm shift in data extraction and data analysis. These challenges include the performance of data retrieval and the variety of data sources, for which the relational database format may no longer be the best option. According to [39], traditional database systems fall short in scaling to improve performance and in dealing with Big Data effectively, and thus the adoption of systems such as Hadoop is increasing. Hadoop is an open-source framework for data-intensive distributed processing of large-scale data, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS).

The MapReduce programming model is a methodology for processing and generating large datasets, which makes Hadoop a preferred solution to the problems of traditional data mining [41].

The main components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is high-bandwidth clustered storage that allows an application to be written that rapidly processes massive data in parallel, which is vital for large files. MapReduce is the heart of Hadoop: it processes enormous amounts of data by dividing the input dataset into independent smaller pieces, which are distributed among multiple machines, referred to as nodes, and processed in parallel [42].
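The split, map, and reduce flow described above can be illustrated with a minimal in-memory sketch. It is not Hadoop code: the function names, the record layout (customer id, sale amount), and the sample splits are assumptions made purely for illustration, and each inner list stands in for the block that one node would process.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (customer_id, sale_amount) pair for one POS line."""
    customer_id, amount = record
    yield customer_id, amount

def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a total."""
    return key, sum(values)

def run_job(splits):
    # Each split is mapped independently, as it would be on a separate node.
    emitted = [pair for split in splits for rec in split for pair in map_phase(rec)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Hypothetical input: two splits of (customer_id, sale_amount) records.
splits = [[("c1", 10.0), ("c2", 5.0)], [("c1", 2.5)]]
totals = run_job(splits)
```

Because the map step never looks at more than one record and the reduce step never looks at more than one key, both can be spread across nodes without coordination, which is the property the model relies on.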

3. Research method

The methodology of this work, which involved five phases, is outlined in Fig. 1. The first phase focused on the implementation of our POS database on a single-node Hadoop distributed file system.

The experiments were performed on the following system. Single node using Java: Intel Core i5 hardware with 8 GB RAM and a 2.4 GHz CPU, running Java with JDK 1.8. Hadoop implementation: Ubuntu 16.10, Java with JDK 1.8, and Hadoop 2.7.0. The second phase focused on Extract, Transform, and Load (ETL), involving the data preprocessing steps of data cleansing, feature selection, and data transformation. Since the dataset used in the current research was different from those in the existing literature, a controlled experiment was performed in which the work of [20] was replicated as the baseline of this research.

The hybrid approach (RFMPQ & CLTV) was included in the third and fourth phases, where different methods were employed stepwise. As the data was transformed into three different variants, i.e., RFM, PQ, and T, the first processing step differed in one of them.

Classification was used to categorize customer purchase behavior into the CLTV matrix based on the RFM and PQ datasets, while modified best-fit regression was performed on the T dataset to find the customer purchase trend (curve). Even though [20] only employed the K-means technique, the present experiment was extended to include K-means++ and EM.

Subsequently, the outputs were fed into the clustering algorithms, i.e., K-means++ and EM, in the fourth phase for further demographic segmentation. The accuracy of these clustering algorithms was measured during the final phase using the cluster quality assessment introduced by Draghici & Kuklin (2003). Additionally, the retention rate was calculated, and human judgment was also included as a measure of the effectiveness of this method for a marketing campaign.

Fig. 1. Methodologies for the hybrid of classification and clustering of market segmentation.

The methodology for this research is illustrated in Fig. 1.

3.1. Proposed data transformation using RFM model, RFMPQ, RFMT

In this study, Apache Hadoop was run on a single-node Hadoop cluster. Ubuntu Linux 12.04 64-bit Server Edition was chosen as the operating system, and KVM (Kernel-based Virtual Machine) was selected as the virtualization environment. The Hadoop (HDFS) node was accessed via Secure Shell (SSH). No parameter or optimization adjustment was made to the operating system to improve performance. This type of Hadoop implementation is sufficient to provide a running Hadoop environment in which to conduct our experiment. The market segmentation model uses retail POS data acquired from a medium-sized retailer in the State of Kuwait. The POS data contains three years (2012 – 2015) of initial and repeated purchases by customers at different geographical branches. Each transaction represents a product purchased, with each line consisting of a cashier number, store code, item code, brand code, product quantity sold, product price, date and time of the transaction, sub-total, and grand total, as well as the customer’s demographic information.


Since the data was in separate compilations based on the stores’ geographic locations, a common data format with consistent definitions for more descriptive keys and fields was developed to merge the information. In this phase, the string variables were converted to numeric variables; subsequently, missing values were checked and manually replaced with default or mean values. In this research, the PQ variants are used to describe a customer’s purchase power in different demographic and behavioral eras and the customer’s attraction to a specific product or service. For the T variant, customers from the best segment are used to identify the customer purchase curve, and the market segmentation model is calibrated using repeated transactions for 3220 customers over a two-year period.

The age attribute was grouped into six ranges (i.e., ages 1–17, 18–24, 25–34, 35–44, 45–54, and 55 and above). The age group analysis was based on the premise that a typical customer’s needs change with age. Each age category was identified using a unique number: Category 1 = (1–17), Category 2 = (18–24), Category 3 = (25–34), Category 4 = (35–44), Category 5 = (45–54), and senior Category 6 = (55+). The Gender attribute was encoded as 1 for male customers, 2 for female customers, and 3 for companies. Furthermore, using the demographic concept hierarchy method, lower-level attributes such as city and country were replaced by the higher-level concept of nationality. Citizens of Middle Eastern nationalities, Asian nationalities, the USA, and Canada were assigned unique numeric (binary) values. Other nationalities were grouped by continent, namely Europeans and Africans. One exception was made for British nationals due to their high volume of purchases. To ensure the maximum accuracy of the RFM scores, the values of five dimension attributes from the POS data were necessary.
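The age and gender encodings above can be sketched as a small lookup. The category boundaries and the gender codes are taken from the text; the function and constant names, and the lower-case gender keys, are assumptions for illustration only.

```python
import bisect

# Upper bound of categories 1..5; anything above 54 falls into category 6 (55+).
AGE_UPPER_BOUNDS = [17, 24, 34, 44, 54]

# Codes as stated in the text: 1 = male, 2 = female, 3 = company.
GENDER_CODES = {"male": 1, "female": 2, "company": 3}

def age_category(age):
    """Return the age category number (1..6) for an age in years."""
    return bisect.bisect_left(AGE_UPPER_BOUNDS, age) + 1

def encode_customer(age, gender):
    """Return the (age category, gender code) pair used as clustering input."""
    return age_category(age), GENDER_CODES[gender.lower()]

example = encode_customer(29, "Female")  # category 3 (25-34), gender code 2
```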

The attributes are described in Table 2.

The next step involved the calculation of the RFM scores as well as the newly proposed variations PQ and T. The CLTV, retention rate, RFMPQ, and T calculations were implemented in PL/SQL. It must be noted that the RFMPQ score refers to the weighted average of its individual components, and that the scoring analysis typically involves grouping customers into equal-sized buckets (quantiles). In this study, the grouping procedure was applied independently to the five RFMPQ component measures.

Table 2 Attributes of RFMPQ

CUSTOMER ID: Unique customer identifier, used to capture customer-related information.
TRX DATE: Transaction date, used to capture the customer’s Recency (R).
TRX COUNT: Number of transactions made by a customer, used to capture the Frequency (F).
TRX TOTAL SALE: Total amount of each transaction, used to capture the Monetary (M) value of the customer.
TRX UNIT PRICE: Average purchase power (monetary), used to capture the Average Monetary (P) per customer.
TRX QTY: Average purchase power (quantity), used to capture the Average Items purchased (Q) per customer.

Customers were grouped according to the respective measure into classes of equal size. The derived R, F, M, P, and Q groups became the components for the RFMPQ cell assignment. The RFMPQ groups were aggregated with appropriate client-defined weights, and the score was the weighted average of its components.

In the next step, the values and scores of the RFMPQ variables were determined and used as inputs to the clustering algorithms. In line with this research and the client’s request, the RFMPQ variables had equal weights (1:1:1).
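The quantile grouping and weighted averaging described above can be sketched as follows. The study implemented this step in PL/SQL; the Python below is only an illustrative stand-in, and the function names, the choice of five buckets, the tie handling, and the sample recency values are all assumptions.

```python
def quantile_scores(values, buckets=5, higher_is_better=True):
    """Score each value 1..buckets by rank, using equal-sized rank buckets.

    For recency measured in days since the last purchase, pass
    higher_is_better=False so that the most recent buyers score highest.
    Ties are broken by input order (an arbitrary choice for this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, idx in enumerate(order):
        scores[idx] = rank * buckets // len(values) + 1
    return scores

def rfmpq_score(components, weights=None):
    """Weighted average of one customer's R, F, M, P, Q scores (equal weights by default)."""
    weights = weights or [1] * len(components)
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)

# Hypothetical recency values (days since last purchase) for five customers.
recency_days = [5, 40, 10, 90, 20]
r_scores = quantile_scores(recency_days, higher_is_better=False)
overall = rfmpq_score([5, 4, 3, 2, 1])
```

Each of the five measures would be scored independently in this way, and the five scores of a customer combined by `rfmpq_score`.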

The second variant proposed in this research was the T variable, which represents the trend of customer purchase behavior using the change rate. This study proposes a combination of two analytical data mining steps. To find T, the change in the consumer purchase behavior trend, it uses an enhanced best-fit regression algorithm. The T dataset is then passed to the unsupervised clustering algorithms to split the consumers into different groups based on pattern dissimilarities. The variable T should answer a very important question: whether the consumer is at high risk of shifting his or her business to another retailer. One of the most common indicators of high-risk consumers is a drop-off in purchase power and a decrease in visits. A major limitation of market segmentation and RFM models is that they ignore the behavioral changes of consumers during the time period of analysis. Although the recency variable acted as one indicator of such consumer behavior, it was affected by customers’ transient behavior, and it was based only on the last purchase date. Therefore, introducing a new analysis variable was essential for the retailer to narrow down high-risk consumers.

From the perspective of a retailer, average values change according to customers’ purchases and their satisfaction with the retailer in terms of price and services. If the average purchase amounts of a customer decrease continuously, it can be concluded that this customer is, or was, on the verge of shifting his or her business to another retailer. A similar conclusion can be drawn for a customer with an increasing average purchase value, which shows that the customer is very profitable and should be ranked accordingly. Customers with different trends must therefore be treated differently. To make this computable, a parameter was developed through some advanced programming. First, the purchase amounts of all customers in each selected time period of analysis were required.
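One simple way to turn a customer's per-period purchase amounts into a trend value is the slope of an ordinary least-squares line. This is a stand-in chosen for illustration: the study's enhanced best-fit regression is not specified in this section, so the function name, the use of plain least squares, and the sample amounts are assumptions.

```python
def trend_slope(amounts):
    """Least-squares slope of purchase amount against period index 0, 1, 2, ...

    A negative slope indicates falling purchase power; a positive slope
    indicates growth. Requires at least two periods.
    """
    n = len(amounts)
    mean_x = (n - 1) / 2
    mean_y = sum(amounts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(amounts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical average purchase amounts over four consecutive periods.
declining = trend_slope([100.0, 80.0, 60.0, 40.0])   # steadily falling
growing = trend_slope([50.0, 55.0, 65.0, 80.0])      # rising
```

A customer such as `declining` would be flagged as at risk of leaving, while one such as `growing` would be ranked as increasingly profitable.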

3.2. Market segmentation using K-means++ and EM clustering methods

The drawback of the classic k-means algorithm is that the user needs to define the initial centroid points, and it offers no accuracy guarantees. This becomes more critical when clustering documents, because each center point is represented by a word and the distance calculation between words is not a trivial task. To overcome this problem, k-means++ was introduced in order to find good initial center points. K-means++ is a simple probabilistic means of initializing k-means clustering that not only has the best known theoretical guarantees on the expected outcome quality but also works very well in practice. According to [43–51], the k-means++ algorithm is another variation of the k-means algorithm, in which a new approach is used to select the initial cluster centers: starting centers are chosen at random with specific probabilities. In this regard, the essential requirement is to preserve the diversity of the seeds while remaining robust to outliers. The primary concern of the k-means problem is to identify cluster centers that minimize the intra-class variance by reducing the distances from each clustered data point. This can be achieved through an effective and well-designed cluster-initialization technique. k-means++ was proposed in 2007 by [42] for choosing initial values (seeds) for the classic k-means clustering algorithm, to avoid the poor results sometimes found by that algorithm. The K-means++ algorithm initializes the means more intelligently, so that the distribution of cluster means is roughly even relative to the data. K-means++ accomplishes this by selecting the first cluster center at random and then drawing each next center from a distribution that puts heavy probability weight where there are data and no nearby cluster center.

The K-Means++ and EM algorithms were executed using the WEKA tools, and the generated results were then exported to Excel for easy comparative analysis. We evaluate the performance of the market basket based on the mean calculated across the three-year forecast of customers’ sales transactions.

The K-means++ algorithm comes with a theoretical guarantee of finding a solution that is O(log k)-competitive with the optimal k-means solution. It is also fairly simple to describe and implement.
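The seeding rule just described, in which the first center is drawn uniformly and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen, can be sketched in a few lines. This is an illustration rather than the WEKA implementation used in the study; the function names, the fixed random seed, and the sample points are assumptions, and the sketch assumes the data contain at least k distinct points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Choose k initial centers using k-means++ seeding."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight of each point = squared distance to its nearest chosen center,
        # so points far from every center are the most likely next seeds.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical 2-D data with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(points, k=2)
```

Points already chosen have weight zero and cannot be drawn again, which is what keeps the seeds spread out before the ordinary k-means iterations begin.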

The expectation-maximization (EM) algorithm, introduced by Dempster, Laird, & Rubin (1977), is an iterative estimation algorithm similar to the K-Means algorithm. EM is an important and powerful statistical method for obtaining the maximum likelihood estimate of the parameters of an underlying distribution when the data contain null and missing values, in order to generate an accurate hypothesis. There are three steps involved in the EM technique. The first was the EM clustering initialization. Every class j, of M classes (clusters), is described by a parameter vector (θ) composed of two parameters, the mean (μj) and the covariance matrix (Σj), which define the Gaussian (normal) probability distribution used to describe or classify the observed and unobserved entities of the data set x. The EM algorithm aims to approximate the parameter vector (θ) of the real distribution. Cluster convergence was the third step, following the M-step. After every iteration, a convergence check was conducted to verify whether the difference between the attribute vector of the current iteration and that of the previous iteration was smaller than an acceptable error tolerance given as a parameter.
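The initialization, E-step, M-step, and convergence check can be illustrated with a minimal one-dimensional Gaussian mixture. This is a simplification made for readability: a scalar variance replaces the covariance matrix, mixing weights are added, and the function names, variance floor, starting parameters, and sample data are assumptions rather than the WEKA configuration used in the study.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, tol=1e-6, max_iter=200):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(max_iter):
        # E-step: probability that each datum belongs to each class.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each class's parameters from those probabilities.
        new_means = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
            new_means.append(mu)
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed variance
            weights[j] = nj / len(data)
        # Convergence check: stop when the means move less than the tolerance.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights

# Hypothetical data drawn around two values, with rough starting guesses.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

Unlike k-means, each datum contributes to every class in proportion to its membership probability, which is why EM can still use records that carry only partial information.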

3.3. Quality assessment

One type of cluster quality assessment, as suggested in [41], was performed in this study: comparing the size (diameter) of the clusters with the distance to the nearest cluster, i.e., the inter-cluster distance versus the cluster size. This can also be understood in terms of the distance between the members of each cluster and the center of the cluster, and the diameter (size) of the smallest sphere containing the cluster. If the inter-cluster distance is much larger than the size of the clusters, then the clustering is considered trustworthy. The measure is min(Dij)/max(di), where Dij is the distance between cluster i and cluster j, with both i and j ranging from 1 to 6, and di is the diameter of cluster i.

Fig. 2. Cluster quality assessed by ratio of distance to the nearest cluster and cluster diameter: source (Draghici, 2003).

Figure 2 shows how the quality can be assessed by looking at the cluster diameter. A cluster can be formed by the heuristic even when there is no similarity between the clustered patterns, because the algorithm forces K clusters to be created.
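The ratio min(Dij)/max(di) used in this assessment can be computed as in the sketch below. Two modelling choices are assumptions of the sketch and are not stated in the text: the distance between two clusters is taken as the distance between their centroids, and the diameter of a cluster is taken as its largest pairwise member distance. The function names and sample clusters are likewise illustrative.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def diameter(cluster):
    """Largest distance between any two members (0.0 for a single point)."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def quality_ratio(clusters):
    """min over cluster pairs of centroid distance / max over clusters of diameter."""
    cents = [centroid(c) for c in clusters]
    min_inter = min(math.dist(a, b) for a, b in combinations(cents, 2))
    max_diam = max(diameter(c) for c in clusters)
    return min_inter / max_diam  # undefined if every cluster is a single point

# Hypothetical clusters: two tight groups ten units apart.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
ratio = quality_ratio(clusters)
```

A ratio well above 1 means the clusters are far apart relative to their size, which is the condition under which the clustering is treated as trustworthy.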

4. Results and discussion on RFM and RFMPQ dataset

The main objective of the customer pyramid classification is to determine the segment in which customers spend more with the retailer over a period of time, and also the segment that is less costly to maintain. The distribution of customers after applying the CLTV matrix to the RFMPQ dataset is illustrated in Fig. 3 with respect to their frequency and average monetary values. The customer value matrix classification revealed that 7024 customers were categorized under the Best segment, 25153 customers fell under the Spender segment, 39107 customers were grouped in the Frequent segment, and 39624 customers were classified in the Uncertain segment. These findings indicated that most customers were shoppers with a limited and high budget.
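A quadrant assignment of this kind can be sketched as a comparison of each customer's frequency and average monetary value against a cut-off. The cut-offs are not given in this section, so using the population means is an assumption, as are the function name and the reading of the segment names (Spender as high spend with few visits, Frequent as many visits with low spend).

```python
def value_segment(frequency, avg_monetary, mean_frequency, mean_monetary):
    """Place a customer in one quadrant of the customer value matrix."""
    high_f = frequency >= mean_frequency
    high_m = avg_monetary >= mean_monetary
    if high_f and high_m:
        return "Best"
    if high_m:
        return "Spender"    # spends a lot per visit but visits rarely
    if high_f:
        return "Frequent"   # visits often but spends little per visit
    return "Uncertain"

# Hypothetical cut-offs: mean of 5 visits and mean spend of 200 per visit.
segments = [
    value_segment(10, 500.0, 5, 200.0),
    value_segment(2, 500.0, 5, 200.0),
    value_segment(10, 50.0, 5, 200.0),
    value_segment(2, 50.0, 5, 200.0),
]
```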

Based on the results in Fig. 4, the analysis of the K-means++ and EM clustering algorithms showed that both algorithms agree on the gender and age segments. Cluster 1 is the most beneficial segment of customers. The best spenders in cluster 1 are predominantly females from the age groups (25–34) and (35–44). The cluster quality assessment of the RFM dataset, shown in Table 3, indicates that the K-means++ algorithm (328.4529657) is far more accurate than the EM algorithm (271.6329114), because it has the larger value of inter-cluster distance divided by the size of the cluster.

Fig. 3. Effective customer pyramid classification.

Since classification according to the CLTV matrix was already applied prior to clustering, the discussion of the results of this experiment is divided into subsections according to the quadrants of the CLTV matrix. The first section elaborates on the demographic clustering for gender and age, where summaries for both gender and age are provided for each segment. The following section discusses the cluster average shopping (visit) frequency, the cluster average monetary value, and the cluster average spending per visit.

4.1. RFMPQ results using K-means++ and EM on best, spender, frequent, and uncertain segments

The results in Fig. 5 show the summary for all four segments, namely Best, Spender, Frequent, and Uncertain, where most customers were females, together with their average age according to the cluster. In general, K-means++ is seen as the best clustering method, except in the Best segment, where the accuracy of the two methods is very close due to the smaller number of customers. Each cluster will be discussed in further detail with respect to nationality in the following sections.

Figure 6 shows the four clusters for the Uncertain segment as generated by K-means++ and EM clustering. The total number of customers in this segment is 39624. The analysis with both clustering algorithms showed that cluster 1 was the most beneficial cluster (segment) because it was superior to the other clusters in terms of the demographic variables.

Fig. 4. The average of four clusters generated by the K-means++ algorithm and EM algorithm for gender and age.

The average age results showed that customers between the ages of (45–54) were the most beneficial customers, and the average gender indicated that most customers were male. In terms of average nationality, citizens of the Philippines were the best customers in average monetary terms when using K-Means++, and citizens of India were the best customers based on EM clustering.

4.2. Summary of the market segmentation on RFMPQ dataset

With regard to the market segmentation on the RFMPQ dataset, it can be summarized that the Best segment was the most profitable segment, with a Total Average Spending Per Segment of (1109.37). Cluster 2 was the most profitable cluster in the Best segment, with mostly female customers in the age groups (25–34) and (35–44). Meanwhile, the results for average nationality showed that citizens of Kuwait and the UAE were the best customers in average monetary terms, and K-means++ and EM showed similar accuracy. The Spender segment was the second most profitable segment, with a Total Average Spending Per Segment of (834.01). Cluster 4 of this segment was the most profitable cluster, with an average monetary value of (924.19).

Female customers, those between the ages of (25–34) and (35–44), and citizens of Kuwait and Qatar were the best customers, and the K-Means++ algorithm was the most accurate. Next, the Frequent segment was the third most profitable segment, with a Total Average Spending Per Segment of (349.57), and cluster 1 was the most profitable cluster, with an average monetary value of (397.81). Female customers, the age group (35–44), and citizens of the UAE and Bahrain were the best customers. The K-means++ algorithm was regarded as the most accurate algorithm for this cluster in the Frequent segment.

The Uncertain segment was the least profitable segment, with a Total Average Spending Per Segment of (168.59). Cluster 1 was the most profitable cluster in this segment, with an average monetary value of (239.14). Female customers, those in the age group (45–54), and citizens of the Philippines and some Arab nationalities were the best customers, and the K-means++ algorithm was the most accurate algorithm for this cluster in the Uncertain segment. Based on the comprehensive information extracted from the RFMPQ dataset, it is concluded that (a) the newly proposed PQ method can provide both robust classification and robust segmentation for the market segmentation model, even in a dataset that has noisy data, and (b) PQ is the most important variable and has proven to be a crucial parameter for clustering.

4.3. Predictive customer lifetime value


Fig. 5. Summary for all segments in demographic clustering (Gender).

Fig. 6. Four clusters generated by K-Means++ and EM clustering for uncertain segment.



Fig. 7. The result from the strategies implemented using ranking matrix.


4.4. Predictive customer lifetime value

The predictive CLTV matrix projects new customers’ spending over their entire lifetime with the retailer. Most retailers measure CLTV in dollars, or in the retailer’s local currency, spent by the customer over the entire relationship with the retailer, i.e., from the first to the last transaction. Nevertheless, it is essential to consider COGS (Cost of Goods Sold) and acquisition cost to arrive at the real CLTV. The estimation of the customer’s future value in predictive CLTV also considers the customer’s expected lifetime and the potential monetary value from new purchases. Furthermore, at the client’s request, the static discount rate is set at 25%, i.e., 25% is deducted from the total sales. The acquisition cost, also set at the client’s request, is 400; this is the estimated figure for how much is spent to attract new customers.
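The adjustments named above can be combined in a rough sketch. The exact formula used in the study is not given in this section, so the arithmetic below is an assumption: it simply deducts the 25% discount from total sales and then subtracts COGS and the acquisition cost of 400. The function name and the sample sales and COGS figures are hypothetical.

```python
def net_customer_value(total_sales, cogs, discount_rate=0.25, acquisition_cost=400.0):
    """Sales after the static discount, less cost of goods sold and acquisition cost.

    Assumed form only; the defaults reflect the 25% discount rate and the
    acquisition cost of 400 stated in the text.
    """
    discounted_sales = total_sales * (1.0 - discount_rate)
    return discounted_sales - cogs - acquisition_cost

# Hypothetical customer: 2000 in lifetime sales with 600 in cost of goods sold.
example_value = net_customer_value(total_sales=2000.0, cogs=600.0)
```

Under these assumptions a customer whose discounted sales do not cover COGS plus the acquisition cost has a negative value, which is the case the ranking is meant to expose.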

4.5. Customer lifetime value ranking matrix

CLTV ranking groups customers into quadrants according to their profitability and retention propensity. CLTV ranking can assist marketers in performing more effective market segmentation [17]. This is accomplished through the incorporation of RFMPQ and RFMT values with the CLTV to rank the total sales quarterly and yearly using the ranking matrix analytical function, as shown in Fig. 7.

The experimental results on CLTV using the ranking matrix, presented in Fig. 7, reflect the success of the implementation of the market segmentation data mining model. It is important to note that the test POS dataset covers 2012 to 2015. The client noticed a drop in sales and a decline in foot traffic in the second half of 2011. However, the development of our market segmentation model started in 2013, and the model was implemented in the fourth quarter of 2014. The analysis indicated that the client lost customers and that sales dropped immediately in quarter 1 of 2015, as a direct result of the shift from a mass marketing strategy to a target marketing strategy using market segmentation methods. This is considered a healthy result of the shift from mass marketing to more customized market segmentation. Nonetheless, from quarter 2 onwards, the client gained a small increase in sales, and a new group of customers was retained. The results also revealed that the retailer started to turn profitable in quarter 3, with big growth in quarter 4 of 2015.

A consistent result was observed in each quarter of 2015. Thus, the client strengthened its market position and slowed the effects of a rapidly weakening overall retail market. The average customer lifetime was found to be only two years, and the churn rate 52% (see Table 6), which led to the development of a comprehensive marketing strategy.

The results have shown that the incorporation of the RFMPQ model and the CLTV matrix was the best method to categorize and classify different consumer segments and different potential consumer segments by long-term profitability. Each of the market segments will be further investigated and analyzed to provide specific characterizations, a better understanding and identification of opportunities, as well as the profitability and the existence of risks, as each segment no longer encompasses the entire market. Based on the segments identified, the retailer should create a tailored campaign for each segment and offer separate services that distinguish these segments, to maximize responses according to each segment’s behavior. Finally, based on the experimental results, it can be concluded that the newly proposed RFM variants (PQ) and (T) were far more efficient than the traditional RFM in classifying and clustering the segmentation of customer value via RFM attributes. Kindly note that the implementation of this work was done in 2015; however, due to some circumstances, our documentation and publication process was delayed for quite some time.

5. Conclusions

This study has introduced a new technique of CLTV and proposed three new RFM variations, P, Q, and T. Firstly, the new RFM variations P, Q, and T have some advantages over the standard RFM model. The advantage of these newly proposed methods is that the RFMPQ method takes into consideration the changes in the customer’s average shopping power over time as well as the average purchase power of products purchased over time by a given customer. “The new proposed PQ provided key attributes describing customer’s purchase behavior in different demographic eras. The P variant helped to identify which segment of customers spends more over a period of time and costs less to maintain. The Q variant helped to identify customer’s attractiveness to a specific product and service, and also helped to identify best-selling products, which resulted in increase of sales potential in terms of number of units of best-selling products that can be sold.” The market segments were constructed through several clustering algorithms such as K-Means++ and EM. To convert this idea into a computable dynamic parameter, a newly modified best-fit regression algorithm and the RFMT variable were proposed. The modified regression algorithm demonstrated a high degree of accuracy in strengthening and reinforcing the effect of the customer’s most recent purchases (T) while preserving the importance of the consumer’s previous purchases. However, the modified regression algorithm can be further extended to highlight key customer trends using new demographic variables, such as profession, income level, spending methods, and online spending. The results of applying the new RFM PQ and T variations have reflected their effectiveness for market segmentation and their ability to offer the retail industry intelligent analysis that combines customer and product profitability. This type of analysis can also identify the areas on which marketers should focus their attention, as well as ensuring the highest level of quality and customer service. Additionally, these models help retailers identify VIP consumers who are on the verge of taking their business to a competitor. Products that are highly profitable and purchased by the most profitable segments can also be easily identified. Both the analysis results and the marketing experts agreed that the classification of customer purchase behavior using the CLTV matrix against the RFMPQ dataset revealed the most accurate and crucial information about customer purchase behavior.

Furthermore, the results also indicated that the age and gender variables provided an accurate result, with an estimated analysis accuracy of 75%. Nevertheless, the nationality variable presented a low percentage of accuracy in both algorithms, possibly due to missing and noisy data related to this variable. The results of applying the new RFM PQ and T variations have shown their effectiveness for market segmentation, which resulted in a sales growth rate of up to 6%, and their ability to offer the SMR industry intelligent analysis that combines customer and product profitability.

Secondly, retail is a data-driven industry that processes large volumes of POS transactions. Growth in data collection is often seen as a bottleneck for Big Data analysis. Many retailers face the challenge of keeping data on a platform that gives them a single consistent view. Although Hadoop is not the main focus of this study, its capabilities were investigated. The Hadoop implementation provided highly scalable data distribution and lightning-fast data analysis performance.
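To make the data-distribution idea concrete, the following is a single-process sketch of the map, shuffle, and reduce phases that Hadoop MapReduce spreads across nodes. It does not use any Hadoop API; the "customer_id,sku,amount" line format and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one hypothetical POS line into a (customer_id, amount) pair."""
    customer_id, _sku, amount = line.split(",")
    yield customer_id, float(amount)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into the customer's total spend."""
    return key, sum(values)


# Hypothetical input split; on a real cluster each node would map its own split.
pos_lines = ["c1,sku9,20.0", "c2,sku3,5.5", "c1,sku2,30.0"]

mapped = (pair for line in pos_lines for pair in map_phase(line))
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on separate nodes, which is where the scalability comes from.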

Finally, it has been shown that sophisticated statistical modeling methods can provide useful information for experts, but at the same time they are costly and complex to implement and are likely to present a challenge to the implementation of marketing strategies. This study proposed a simple yet powerful approach to market segmentation. The results have shown that the model provides easy-to-implement and affordable market segmentation methodologies that deliver substantial value relative to the amount of effort involved. The analysis results obtained from the segments (Best, Spender, Frequent, and Uncertain) gave the client a guided vision of how to calculate customer lifetime value and retention.

A recommendation for further research is to examine the implementation of a large POS dataset on multiple clusters and to scale the cluster by adding extra nodes across several inexpensive servers. Finally, ignoring trends in consumer purchase behavior can be disastrous. Further research is needed on more explanatory data mining models, such as Market Basket Analysis, to help identify what the profile of the future customer might look like from a product perspective.
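As a rough indication of what such a Market Basket Analysis would compute, the sketch below derives the support of an item pair and the confidence of a simple rule from a handful of baskets. The baskets and helper names are hypothetical, and a real study would use a frequent-itemset algorithm such as Apriori rather than counting every pair directly.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is the items of one POS transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count how many baskets contain each item and each unordered item pair.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(item_a, item_b):
    """Share of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)


def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]
```

For these baskets, `support("bread", "milk")` is 0.5 and `confidence("butter", "bread")` is 1.0, meaning every basket with butter also contained bread.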

Funding

This research received no external funding.

Acknowledgments

We are grateful for the support given by Universiti Sains Malaysia, where Y.A.F. completed his Master's degree on this research under the supervision of N.H.A.H.M.

References

[1] A.B. Patel, M. Birla and U. Nair, Addressing Big Data Problem Using Hadoop and Map Reduce, in 2012 Nirma University International Conference on Engineering (NUiCONE) (2012), pp. 1–5.

[2] A. Chorianopoulos and K. Tsiptsis, Data Mining Techniques in CRM: Inside Customer Segmentation, no. 21. John Wiley & Sons., 2010.

[3] A. David and S. Vassilvitskii, k-means++: The advantages of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.

[4] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J of the R Stat Soc Ser B 39(1) (1977), 1–38.

[5] A. Kak, Expectation-Maximization Algorithm for Cluster- ing Multidimensional Numerical Data, Purdue University, 2017.

[6] A. Neha, A. Kirti and G. Kanika, Comparative Analysis of k-means and Enhanced K-means clustering algorithm for data mining, Int J Sci Eng Res 3(3), 2012.

[7] B. Sohrabi and A. Khanlari, Customer Lifetime Value (CLV) Measurement Based on RFM Model, Iran Account Audit Rev 14(47) (2007), 7–20.

[8] B. Tammo HA, et al., Analytics for customer engagement, Journal of Service Research 13(3) (2010), 341–356.

[9] C.C. Aggarwal, Data Mining: The textbook. New York, USA: Springer, 2015.

[10] C. Bauckhage, Lecture Notes on Data Science : k-Means Clustering Is Gaussian Mixture Modeling, B-IT, University of Bonn We, 2015.

[11] C. Marcus, A Practical Yet Meaningful Approach to Customer Segmentation, J Consum Mark 15(5) (1998), 494–504.

[12] D. Arthur and S. Vassilvitskii, k-means ++ : The Advan- tages of Careful Seeding, in SODA ’07 Proceedings of the

eighteenth annual ACM-SIAM symposium on Discrete algo- rithms, (2007), pp. 1027–1035.

[13] D. Birant, Data Mining Using RFM Analysis Knowledge- Oriented Applications in Data Mining, Prof Kimi InTech, 2011.

[14] D.R. Kishor and N.B. Venkateswarlu, Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance, Cybern Inf Technol 16(2) (2016), 16–34.

[15] F. Jiashuang, Y. Suihuai, Y. Mingjiu, et al., Optimal selection of design scheme in cloud environment: A novel hybrid approach of multi-criteria decision-making based on F-ANP and F-QFD, DOI: 10.3233/JIFS-190630, Journal: Journal of Intelligent & Fuzzy Systems, vol. Pre-press, no. Pre-press, (2019), pp. 1-18.

[16] F. Peter, S. Bruce, G.S. Hardie and K. Lok Lee, Count- ing your customers the easy way: An alternative to the Pareto/NBD model, Marketing Science 24(2) (2005), 275–284.

[17] G. Lefait and K. Tahar, Customer Segmentation Archi- tecture Based on Clustering Techniques, in 2010 Fourth International Conference on Digital Society, (2010), pp. 243–248.

[18] G. Sunil, C.F. Mela and J.M. Vidal-Sanz, The Value of A ‘Free’ Customer, 2006.

[19] I. He and C. Li, The Research and Application of Customer Segmentation on E-commerce Websites, in 2016 6th Inter- national Conference on Digital Home (ICDH), (2016), pp. 203–208.

[20] I. Kolyshkina, E. Nankani, S.J. Simoff and S.M. Denize, RDMA Aware Networks Programming User Manual, in Australian & New Zealand Marketing Academy. Confer- ence (ANZMAC), (2010), pp. 1–7.

[21] I. Yeh, K. Yang and T. Ting, Knowledge Discovery on RFM Model using Bernoulli Sequence, Expert Syst Appl 36(3) (2009), 5866–5871.

[22] J.A. McCarty and M. Hastak, Segmentation Approaches in Data-Mining: A Comparison of RFM, CHAID, and Logistic Regression, J Bus Res 60 (2007), pp. 656–662.

[23] J.L. Schafer, Analysis of Incomplete Multivariate Data, First edit. USA: CRC Press LLC, 1997.

[24] J. Maryani and D. Riana, Clustering and Profiling of Customers Using RFM For Customer Relationship Manage- ment Recommendations, in 5th International Conference on Cyber and IT Service Management (CITSM), (2017), pp. 1–7.

[25] J.R. Miglautsch, Thoughts on RFM scoring, J Database Mark Cust Strateg Manag 8(1) (2000), 67–72.

[26] J. Wei, S. Lin and H. Wu, A review of the application of RFM model, African J Bus Manag 4(19) (2010), 4199–4206.

[27] K. Dipanjan, S. Garla and C. Goutam, Comparison of Probabilistic-D and k-Means Clustering in Segment Pro- files for B2B Markets SAS Global Forum 2011 Management Comparison of Probabilistic-D and k-Means Clustering in Segment Profiles for B2B Markets, in SAS Global Forum, Las Vegas, NV, USA, 2011, no. April.

[28] K.R. Kashwan and C.M. Velu, Customer Segmentation Using Clustering and Data Mining Techniques, Int J Comput Theory Eng 5(6), 2013.

[29] M. Cleveland, N. Papadopoulos and M. Laroche, Identity, Demographics and Consumer Behaviors International mar- ket Segmentation Across Product Categories, Int Mark Rev 28(3) (2011), 244–266.

[30] M. Khajvand, Z. Kiyana, A. Sarah and A. Somayeh, Esti- mating Customer Lifetime Value Based on RFM Analysis of

F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques 6173

Customer Purchase Behavior : Case Study, Procedia Com- put Sci 3(2011) (2010), 57–63.

[31] M. Rezaeinia and R. Rahmani, Recommender System Based on Customer Segmentation (RSCS), Kybernetes 45(7) (2016), 1129–1157.

[32] M. Safari, Customer Lifetime Value to Managing Marketing Strategies in the Financial Services, Int Lett Soc Humanist Sci 42 (2015), 164–173.

[33] N. Li, L. Zeng, Q. He and Z. Shi, Parallel Implementation of Apriori Algorithm Based on MapReduce, in 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, (2012), pp. 236–241.

[34] O. Geman, et al., book chapter: From Fuzzy Expert Sys- tem to Artificial Neural Network: Application to Assisted Speech Therapy, INTECH, Artificial Neural Networks. Models and Applications, Edited by Joao Luis Garcia Rosa, Published: October 19th 2016, DOI: 10.5772/61493, ISBN: 978-953-51-2705, 2016.

[35] O. Geman, et al., book chapter Mathematical Models Used in Intelligent Assistive Technologies: Response Surface Methodology in Software Tools Optimization for Medical Rehabilitation, Intel Syst Ref Library, Vol. 170, Hari- ton Costin et al. (Eds): Recent Advances in Intelligent Assistive Technologies: Paradigms and Applications, 978- 3-030-30816-2, 463829 1 En, (4), 2019

[36] P.K. Srimani and M.S. Koti, Evaluation of Principal Com- ponents Analysis (PCA) and Data Clustering Technique (DCT) on Medical Data, Int J Knowl Eng 3(2) (2012), 202–206.

[37] P. Mishra and S. Dash, Developing RFM Model for Cus- tomer Segmentation in Retail Industry, Int J Mark Hum Resour Manag 1(1) (2010), 58–69.

[38] R.A.I.T. Daoud, B. Belaid, A. Abdellah and L. Rachid, Combining RFM Model and Clustering Techniques for Customer Value Analysis of a Company selling online, in 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, pp. 1–6.

[39] R. Soudagar, Customer Segmentation and Strategy Defini- tion in Segments Case Study: An Internet Service Provider in Iran, Lulea University of technology Department, 2012.

[40] R. Sethy and M. Panda, Big Data Analysis using Hadoop : A Survey International Journal of Advanced Research in Big Data Analysis using Hadoop : A Survey, Int J Adv Res Comput Sci Softw Eng 5(7), 2015.

[41] S. Chen, Cheetah : A High Performance, Custom Data Ware- house on Top of MapReduce, in Proceedings of the VLDB Endowmen, (2010), pp. 1459–1468.

[42] S. Draghici and A. Kuklin, Data Analysis Tools for DNA Microarrays, 1st editio., no. June 2003. CRC Press, 2003.

[43] S. Mohammad, S. Hosseini, A. Maleki and M.R. Gho- lamian, Cluster Analysis using Data Mining Approach to Develop CRM Methodology to Assess the Customer Loy- alty, Expert Syst Appl 37(2010) (2010), 5259–5264.

[44] S. Zahra, M.A. Ghazanfar, A. Khalid, M.A. Azam, U. Naeem and A. Prugel-bennett, Novel Centroid Selection Approaches for KMeans-Clustering Based Recommender Systems, Inf Sci (Ny) 320 (2015), 156–189.

[45] T.K. Hun and R. Yazdanifard, The Impact of Proper Market- ing Communication Channels on Consumer’s Behavior and Segmentation Consumers, Asian J Bus Manag 2(2) (2014), 155–159.

[46] T. White, Hadoop: The Definitive Guide, Third Edit. Sebastopol, United States: O’Reilly Media, Inc, USA, 2012.

[47] T. Upadhyay, V. Atma and D. Vishal, Customer Profiling and Segmentation using Data Mining Techniques, Int J Comput Sci Commun 7(2016) (2016), 65–67.

[48] V. Vandenbulcke, F. Lecron, C. Ducarroz and F. Fouss, Customer Segmentation Based On a Collaborative Recom- mendation System : Application to A Mass Retail Company, in Proceedings of the 42nd Annual Conference of the Euro- pean Marketing Academy, 2013, pp. 1–7.

[49] V.S. Jose, M. Carl, F. Mela and S. Gupta, The value of a free customer. No. wb092903. Universidad Carlos III de Madrid. Departamento de Econom ́ia de la Empresa, 2009.

[50] W. Dong, Y. Cai and T. Kwok Leung, Making EEMD more effective in extracting bearing fault features for intelligent bearing fault diagnosis by using blind fault component sep- aration, DOI: 10.3233/JIFS-169523, Journal: Journal of Intelligent & Fuzzy Systems 34(6) (2018), 3429–3441.

[51] W. Zhao, H. Ma and Q. He, Parallel K -Means Clustering Based on MapReduce, in In IEEE International Conference on Cloud Computing, (2009), pp. 674–679.

Copyright of Journal of Intelligent & Fuzzy Systems is the property of IOS Press and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.
