A pipeline for mining association rules from large datasets of retailers invoices
Agapito G.; Calabrese B.; Guzzi P. H.; Cannataro M.
2019-01-01
Abstract
Massive data generation now affects many domains, including marketing (for example, the electronic invoices, or e-invoices, of large retailers), web access logs, healthcare and the life sciences. Dataset sizes keep growing because of the availability of cheap connected devices, such as mobile devices, RFID tags and wireless sensor networks, from which data can be collected. The collected data often need to be gathered into a consistent, integrated and comprehensive form before they can be used for knowledge discovery. Without adequate cleaning, transformation and structuring of the data before analysis, it is hard to mine useful knowledge. With such preparation, data mining lets users extract knowledge from large collections of invoice documents. In this paper, a pipeline for preprocessing and mining association rules from the commercial documents of large retailers is proposed. The preprocessing stage performs merging, cleaning, formatting and summarization. The methodology can improve the quality of large retailers' data by reducing the quantity of irrelevant data, making the remaining data suitable for association rule mining (ARM). Applying the proposed methodology to a real invoice dataset, provided by an Italian retailer, made it possible to extract 36 significant association rules that highlight customers' purchasing behavior.
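As a rough illustration of the two stages named in the abstract, preprocessing followed by association rule mining, the Python sketch below may help. It is not the authors' implementation. The function names, the input layout (pairs of invoice identifier and item description), the thresholds and the sample rows are all assumptions made for illustration. The mining step is deliberately limited to rules between two single items and uses the standard definitions of support and confidence.

```python
from collections import defaultdict
from itertools import combinations


def build_transactions(rows):
    """Merge invoice lines into one item set per invoice (hypothetical helper).

    Lines with a missing invoice id or a blank item are dropped, and item
    names are trimmed and lower-cased so that variants of a name coincide.
    """
    transactions = defaultdict(set)
    for invoice_id, item in rows:
        if invoice_id is None or not item or not item.strip():
            continue  # cleaning: skip unusable lines
        transactions[invoice_id].add(item.strip().lower())  # formatting
    return list(transactions.values())


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between single items.

    support(lhs -> rhs)    = share of invoices containing both items
    confidence(lhs -> rhs) = invoices with both / invoices with lhs
    """
    n = len(transactions)
    if n == 0:
        return []
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in transactions:
        for item in items:
            item_count[item] += 1
        for pair in combinations(sorted(items), 2):
            pair_count[pair] += 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Invented sample data: (invoice id, item description) pairs.
rows = [
    ("A1", "Pasta"), ("A1", "tomato sauce "), ("A1", ""),
    ("A2", "pasta"), ("A2", "Tomato Sauce"), ("A2", "olive oil"),
    ("A3", "pasta"), ("A3", "olive oil"),
    ("A4", "bread"),
    (None, "pasta"),
]

rules = mine_pair_rules(build_transactions(rows))
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

With these thresholds the invented sample yields two rules, olive oil -> pasta and tomato sauce -> pasta, each with support 0.5 and confidence 1.0. A real invoice dataset would need a full frequent-itemset algorithm such as Apriori or FP-Growth to find rules over larger item sets.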