Data mining and extraction (ETL) techniques in data science

Data extraction is the process of collecting data from a website, database, or SaaS platform so that it can be replicated to a destination, such as a database designed to support online analytical processing (OLAP).

Data extraction is the first step in a data ingestion process called ETL: extract, transform, and load. The purpose of ETL is to prepare data for analytics or business intelligence (BI).
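The three ETL stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the review records, the `reviews` table, and the function names `extract`, `transform`, and `load` are hypothetical, and an in-memory SQLite database stands in for the analytics destination.

```python
import sqlite3

def extract(source_rows):
    """Extract: read raw records from the source (here, a plain list)."""
    return list(source_rows)

def transform(rows):
    """Transform: tidy the text and drop records that have no rating."""
    cleaned = []
    for row in rows:
        if row.get("rating") is None:
            continue
        cleaned.append((row["id"], row["text"].strip().lower(), int(row["rating"])))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews "
        "(id INTEGER PRIMARY KEY, text TEXT, rating INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()

# Hypothetical source data, invented for this sketch.
raw = [
    {"id": 1, "text": "  Great product ", "rating": "5"},
    {"id": 2, "text": "No rating given", "rating": None},
    {"id": 3, "text": "Slow delivery", "rating": 2},
]

warehouse = sqlite3.connect(":memory:")  # stand-in for the real destination
load(warehouse, transform(extract(raw)))
loaded = warehouse.execute("SELECT id, text, rating FROM reviews ORDER BY id").fetchall()
```

In a real pipeline each stage would talk to an actual source and destination; the point here is only the order of the steps.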

Suppose an organization wants to monitor its reputation in the market. It may hold data from a variety of sources, including online comments, media coverage, and online activity. An ETL tool can extract the data from these sources and load it into a database where it can be analyzed and mined for insight into how the organization is perceived.

Types of data extraction
Extraction jobs can be scheduled, or analysts can run them on demand according to business needs and analysis goals. Data can be extracted in three main ways:

Change notification
The easiest way to extract data from a source system is to have that system send a notification whenever a record changes. Most databases offer a mechanism for this in order to support database replication (change data capture or binary logs), and many SaaS applications offer webhooks, which provide similar functionality.
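As a rough sketch of the notification approach, the receiving side can be a small handler that applies each incoming event to a local copy of the data. The payload shape (`type` and `record` fields) and the name `handle_notification` are assumptions made for this example; a real webhook or change data capture feed defines its own format.

```python
import json

# Local replica of the source records, keyed by record id.
replica = {}

def handle_notification(body: str) -> None:
    """Apply one change event (hypothetical JSON format) to the replica."""
    event = json.loads(body)
    record = event["record"]
    if event["type"] == "deleted":
        replica.pop(record["id"], None)
    else:  # "created" or "updated"
        replica[record["id"]] = record

# Simulated notifications, in the order the source would send them.
handle_notification(json.dumps({"type": "created", "record": {"id": 7, "status": "open"}}))
handle_notification(json.dumps({"type": "updated", "record": {"id": 7, "status": "closed"}}))
handle_notification(json.dumps({"type": "created", "record": {"id": 8, "status": "open"}}))
handle_notification(json.dumps({"type": "deleted", "record": {"id": 8}}))
```

Because the source reports deletions as events, the replica ends up without record 8 and with the latest state of record 7.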

Incremental extraction
Some data sources cannot send a notification when an update occurs, but they can identify which records have been modified and provide an extract of those records. In subsequent ETL steps, the data extraction code needs to identify and propagate the changes. One drawback of incremental extraction is that it may not detect records deleted from the source, because there is no way to see a record that no longer exists.
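One common way to implement incremental extraction is a "high-water mark": remember the latest modification time seen so far and ask only for rows changed after it. The sketch below assumes a hypothetical `orders` table with an integer `updated_at` column, and uses an in-memory SQLite database as the source. It also shows the drawback described above, since the deleted row never appears in a later batch.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the real source system
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 100), (2, "new", 110)]
)

def extract_incremental(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# First run: nothing has been seen yet, so every row qualifies.
first_batch, watermark = extract_incremental(source, 0)

# The source changes: one row is updated, another is deleted.
source.execute("UPDATE orders SET status = 'shipped', updated_at = 120 WHERE id = 1")
source.execute("DELETE FROM orders WHERE id = 2")

# Second run: only the updated row is returned; the deletion is invisible.
second_batch, watermark = extract_incremental(source, watermark)
```

The destination would still hold order 2 after the second run unless deletions are handled some other way.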

Full extraction
The first time you replicate a source you have to extract all of its data, and some data sources have no way of identifying what has changed, so reloading the entire table may be the only way to get data from that source. Because full extraction involves high data transfer volumes, which can put a load on the network, it is not a good option if you can avoid it.
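Extracting the entire table can be sketched as a plain "read everything" query. The `customers` table and the helper `extract_full` are hypothetical, and SQLite again stands in for the source. The comparison of two snapshots at the end is an addition for illustration: when every row is reloaded, a diff against the previous copy can reveal changes and deletions, at the cost of transferring the whole table each time.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the real source system
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
)

def extract_full(conn):
    """Read every row of the table, keyed by primary key."""
    return {row[0]: row for row in conn.execute("SELECT id, name FROM customers")}

# Initial replication: the whole table is transferred.
previous = extract_full(source)

# The source changes between runs.
source.execute("DELETE FROM customers WHERE id = 2")
source.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")

# Next run: the whole table is transferred again, changed or not.
current = extract_full(source)

# Comparing the two snapshots shows what was removed or modified.
deleted_ids = sorted(previous.keys() - current.keys())
changed_ids = sorted(
    key for key in current if key in previous and current[key] != previous[key]
)
```

Note that row 3 was transferred twice even though it never changed, which is the network cost the paragraph above warns about.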