Today, data analytics plays a major role in corporate decision making. It is able to do this because data is culled from a variety of sources and then assembled in a single data repository that corporate decision makers can access. When data is combined from different areas throughout the company, corporate decision makers get a 360-degree view of what is going on. This enables them to make more informed decisions.
For example, if a vice president of sales wants to know why a certain product isn’t selling well, he/she can query a central data analytics repository which contains all of the information on that particular product from throughout the enterprise. The sales VP can see the customer complaints about the product that customer service logged, as well as the number of product returns that the warehouse processed. He/she can also see that engineering is working on a revision of the product to cure the defects that have been reported. The VP now has a thorough understanding of why the product hasn’t been doing as well in revenues as was projected.
A decade ago, this type of comprehensive analysis and visibility was difficult to achieve. Corporate departments were using their own systems and data, and this data stayed in data silos that weren’t always shared with others with a need to know. Now, with more modernized approaches to preparing and sharing data, a more complete picture of what is going on throughout the company is available to corporate decision makers.
How have organizations managed to pull data from a variety of internal and external sources, and then combine it into a single data repository that everyone can access?
They use extract, transform and load (ETL) software, commonly referred to as ETL tools, to move the data, transform it and then load it into a target data repository.
ETL software obtains data from a source, transforms it into a form that is acceptable to a target system and then moves the data into that target. ETL software is automated: when companies use it, they no longer have to convert data from one format to another by hand. This saves time and effort and reduces manual errors.
When an ETL tool extracts data, the data can be extracted from any internal or external data source, whether it is a file or a database.
Once the ETL tool has the data, it transforms the data into a form that is compatible with the target data repository that the data will be loaded into. This data transformation is based upon the data conversion rules that IT defines to the ETL software, which then performs the data transformation automatically, based upon those rules.
As a final step, the ETL software takes the transformed data and then moves it into the target data repository.
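The three steps described above can be sketched in a few lines of Python. This is only an illustration, not how any particular ETL product works: the sample order data, the `orders` table and the function names are all assumptions made for the example, and an in-memory SQLite database stands in for the target data repository.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a database query.
SOURCE_CSV = """order_id,customer,amount,order_date
1001, Acme Corp ,250.00,03/15/2023
1002,Globex,99.50,03/16/2023
"""

def extract(csv_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: convert each row into the form the target table expects."""
    transformed = []
    for row in rows:
        month, day, year = row["order_date"].split("/")
        transformed.append((
            int(row["order_id"]),
            row["customer"].strip(),      # remove stray whitespace
            float(row["amount"]),
            f"{year}-{month}-{day}",      # MM/DD/YYYY -> YYYY-MM-DD
        ))
    return transformed

def load(rows, conn):
    """Load: write the transformed rows into the target repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# An in-memory SQLite database stands in for the target data repository.
conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(result)
```

A commercial ETL tool wraps the same three stages in connectors, scheduling and error handling, but the basic flow from source to target is the same.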
ETL tools can be run for both batch and real-time data processing. These tools can also be used in both on premises and cloud environments.
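To make the batch versus real-time distinction concrete, here is a minimal Python sketch in which the same transformation is applied either to a complete set of records at once or to records one at a time as they arrive. The sensor readings and function names are hypothetical, and a plain iterator stands in for a live data stream.

```python
from typing import Iterable, Iterator

def to_celsius(reading: dict) -> dict:
    """Shared transformation: convert a Fahrenheit reading to Celsius."""
    return {
        "sensor": reading["sensor"],
        "temp_c": round((reading["temp_f"] - 32) * 5 / 9, 1),
    }

def run_batch(readings: list) -> list:
    """Batch mode: process the full accumulated set in one run."""
    return [to_celsius(r) for r in readings]

def run_streaming(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: transform and emit each record as it arrives."""
    for reading in source:
        yield to_celsius(reading)

# Hypothetical readings; a real stream would come from a queue or socket.
readings = [
    {"sensor": "a", "temp_f": 212.0},
    {"sensor": "b", "temp_f": 32.0},
]

batch_result = run_batch(readings)
stream_result = list(run_streaming(iter(readings)))
print(batch_result)
print(stream_result)
```

The transformation logic is identical in both modes; what differs is when the work is triggered and how much data is held at once.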
The value of ETL tools rests in their ability to automate the movement of data between systems, but they are only as good as the set of business and operational rules that IT provides them.
For instance, an organization will have a set of data governance and data cleaning standards. These might include the exclusion of certain data fields in data transfers between systems, or changes in the formatting of data so that data from an incoming data source will be able to conform and to interoperate with data in the target data repository that might be formatted differently.
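As a rough illustration of such rules, the Python sketch below expresses them as data: a set of fields to exclude and a table of per-field formatting functions. The field names and formats are assumptions chosen for the example and do not come from any specific governance standard or ETL product.

```python
# Hypothetical rules that IT might define; field names are illustrative only.
EXCLUDED_FIELDS = {"ssn", "internal_notes"}

FIELD_FORMATTERS = {
    "phone": lambda v: "".join(ch for ch in v if ch.isdigit()),  # digits only
    "email": lambda v: v.strip().lower(),
    "country": lambda v: v.strip().upper(),
}

def apply_rules(record: dict) -> dict:
    """Drop excluded fields and reformat the rest to match the target."""
    cleaned = {}
    for field, value in record.items():
        if field in EXCLUDED_FIELDS:
            continue  # governance rule: never transfer this field
        formatter = FIELD_FORMATTERS.get(field)
        cleaned[field] = formatter(value) if formatter else value
    return cleaned

incoming = {
    "name": "Pat Lee",
    "ssn": "123-45-6789",
    "phone": "(555) 010-2030",
    "email": " Pat.Lee@Example.COM ",
    "country": "us",
}

cleaned = apply_rules(incoming)
print(cleaned)
```

Keeping the rules separate from the code that applies them mirrors the division of labor the article describes: IT defines the rules, and the tool applies them automatically.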
In the past, IT had to make and execute these data transformation and quality rules manually. This was a time-consuming process that also had the potential to introduce errors. Now, with ETL tools that automate major portions of the extract, transform and load process, IT can be largely hands-off in these operations, although it still must define the operating, data quality and governance rules so the ETL software can do its job.
It is also up to IT to continuously monitor the ETL process in the same way that IT monitors the performance of any other piece of software. This way, if there is a problem, IT can intervene and solve it.
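One simple form of monitoring is to count how many rows a job loads and rejects, and to flag the run when too many rows fail. The Python sketch below shows the idea using the standard `logging` module. The function name, the 10% threshold and the sample rows are assumptions for illustration; commercial ETL tools expose their own metrics and alerting.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_monitor")

def run_with_checks(rows, transform_row, max_reject_ratio=0.1):
    """Transform rows, log rejects and report whether the run looks healthy."""
    loaded, rejected = [], []
    for row in rows:
        try:
            loaded.append(transform_row(row))
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))
            log.warning("Rejected row %r: %s", row, exc)

    total = len(rows)
    reject_ratio = len(rejected) / total if total else 0.0
    healthy = reject_ratio <= max_reject_ratio
    if not healthy:
        # In a real deployment this is where IT would be alerted.
        log.error("Reject ratio %.0f%% exceeds threshold", reject_ratio * 100)

    return {
        "total": total,
        "loaded": len(loaded),
        "rejected": len(rejected),
        "healthy": healthy,
    }

# Hypothetical input: two good rows, one bad value, one missing field.
rows = [{"amount": "10.5"}, {"amount": "n/a"}, {"amount": "7"}, {}]
report = run_with_checks(rows, lambda r: float(r["amount"]))
print(report)
```

Here half of the rows are rejected, so the run is marked unhealthy and would warrant IT intervention.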
Companies of all sizes need to move data from point to point and then aggregate it in order to support more holistic and informed decision making.
With the advent of analytics and a need to understand the business more holistically, IT and end business decision makers want to derive more value from their data, and they want it faster. This is where ETL tools fit in. They automate data moving that used to be manual, and they come with pre-packaged APIs (application programming interfaces) that automatically connect to many popular databases and applications, without IT having to do these integrations “by hand.”
That being said, there are several factors that companies should consider before purchasing an ETL solution.
What do you need the ETL for?
Are you going to be pulling data from different sources that range from unstructured or semi-structured IoT data to legacy system data that resides on internal servers and mainframes? Or is your company almost wholly cloud-based, with a clear preference for an ETL solution that operates within the cloud where most of your data and applications are hosted? What if your company has data and systems that are both on premises and cloud based? What’s the best choice for that scenario?
How do you want to prepare your data?
Is the generic formatting (system to system or database to database) that your ETL tool comes pre-packaged with going to meet your data cleaning and formatting needs, or do you need to add extra edit rules to data?
How well can you support and leverage your ETL tool?
If you are a smaller company, do you have skilled personnel on board who are trained in ETL methods and tools? Even if you do, do you also need your non-IT business users to be able to use the ETL software?
How much do you want to pay for an ETL tool?
Do you prefer an ETL tool that is wholly based upon usage that you can control and monitor for cost, or a cloud-based ETL tool that doesn’t require internal servers and storage from your data center? What about the training and support that might be required for your IT staff and end users? Which ETL software option will be most cost-effective for you?
ETL tools can work in either cloud or on premises IT environments; they are also available as either proprietary or open source software. Here are some of the most popular ETL tools in those categories.
ETL in the cloud
AWS Glue is a nice fit for companies that use SQL databases, AWS and Amazon S3 storage services. AWS Glue enables you to clean, validate, organize and load data from disparate static or streaming data sources into a data warehouse or a data lake. It also allows you to process semi-structured data such as clickstream (e.g., website hyperlinks) and process logs. Its strength is in its ability to work with SQL, which many companies have competence in. On the programming side, AWS Glue executes jobs using either Scala or Python code.
With AWS Glue, you can run ETL jobs on a schedule or in response to an event, or you can trigger jobs as soon as data becomes available. AWS Glue is an on-demand tool that automatically scales to provide the processing and storage resources that you need, and it gives you visibility into runtime metrics while jobs are running.
AWS Glue integrates well with other AWS systems and processes, so if AWS is your primary data repository and processor, AWS Glue works well. It also has connectors for third party JDBC (Java)-accessible databases such as DB2, MySQL, Oracle and Sybase, as well as for Apache Kafka and MongoDB.
AWS offers free online courses. It also provides certification programs.
There is no charge for the first million accesses/objects stored; after that, usage is billed monthly.
Azure Data Factory is a pay-as-you-go cloud-based ETL tool that automatically scales processing and storage to meet your data and processing demands. Its strength is that it can be used by both IT professionals and end users. This is because the tool has both a no-code graphical user interface for end users and a code-based interface for IT. Both code and no-code interfaces feature data pulls from more than 90 connectors. Among these connectors are AWS, DB2, MongoDB, Oracle, MySQL, SQL, Sybase, Salesforce and SAP.
Azure Data Factory is a nice choice for Microsoft shops, and for companies that want both their business end users and IT group to have access to ETL tools that enable them to pull data into data repositories.
Microsoft offers free online training. It also offers certifications for Azure Data Factory. Its standard technical support package provides 24×7 access to support engineers via email and phone, with a guaranteed response time that is within one hour.
Pricing is based on usage.
Google Cloud Dataflow is part of the Google Cloud platform, and is well integrated with other Google services. Dataflow uses Apache Beam open source technology to orchestrate the data pipelines that are used in Dataflow’s ETL operations. Google Cloud Dataflow requires IT expertise in SQL databases, and in the Java and Python programming languages. This software can be deployed for both batch and real-time processing, and in either a scheduled or a real-time, on-demand mode. Because Google Cloud Dataflow is cloud-based, it can automatically scale to accommodate the processing and storage that you need for any ETL job. Google Cloud Dataflow is ideal for shops that heavily use the Google Cloud platform.
Through its Cloud Academy, Google offers a free online tutorial on Dataflow, hands-on training at $34/month and a Google certification program at $39/month.
Google Cloud has several technical support options that start at the Basic Level (billing/payment support) and increase to Standard (unlimited technical support), Enhanced (faster response technical support) and Premium support (a dedicated support representative).
Pricing is based on usage.
On premises or hybrid ETL tools
InfoSphere DataStage is part of the IBM Information Server Platform. It uses a client/server design where jobs are created and administered via a Windows client against a central repository on a server. This server can be Intel-based, UNIX-based, Linux-based or even an IBM mainframe. Regardless of platform, the IBM InfoSphere DataStage ETL software can integrate data on demand across multiple high-volume data sources and can target applications using a high performance parallel framework. InfoSphere DataStage also facilitates extended metadata management and enterprise connectivity.
InfoSphere DataStage is well suited for large enterprises that have mainframes or large servers, and high volume processing and data. These organizations tend to run on multiple clouds, and also in on premises data centers. The connectors supported by IBM InfoSphere DataStage range from AWS, Azure and Google, to Sybase, Hive, JSON, Kafka, Oracle, Salesforce, Snowflake, Teradata and others.
IBM InfoSphere DataStage is a robust ETL solution, and also a costly one. This tool is designed for IT professionals who have a sound understanding of SQL and also knowledge of the BASIC programming language, which InfoSphere DataStage uses.
IBM offers paid online and classroom training and certifications for DataStage. It also provides 24/7 technical support packages.
Pricing is available upon request.
Oracle Data Integrator (ODI) is a strong platform for larger enterprises that run other Oracle applications such as Enterprise Resource Planning (ERP). ODI is designed to move data from point to point across an entire company’s business functions. Like ERP, it can support integrated workflows across entire organizations.
ODI can process data integration requests that range from high-volume batch loads to service-oriented architecture (SOA) data services that enable software components to be called and reused in new processes. ODI also supports parallel task execution for faster data processing and offers built-in integrations with other Oracle tools, such as Oracle GoldenGate and Oracle Warehouse Builder.
ODI ETL software supports data integration for both structured and unstructured data. It supports relational databases, and has a library of APIs for third party data and applications. On the big data side, ODI also supports Spark Streaming, Hive, Kafka, Cassandra, HBase, Sqoop and Pig. ODI is a sophisticated and proprietary tool that requires IT expertise and experience in Java programming.
On a subscription basis, Oracle offers access to online training and certifications for ODI.
Technical support is available at an additional cost on top of licensing fees.
Pricing is license based.
Informatica PowerCenter is an enterprise-strength ETL tool that is best utilized by large organizations with the need to move data across many different business functions. PowerCenter extracts, transforms and loads data from a variety of different structured and unstructured data sources that span internal and external (cloud-based) enterprise applications. PowerCenter has many APIs to a variety of different third party applications and data.
Common data formats that PowerCenter works with include JSON, XML, PDF and Internet of Things (IoT) machine data. PowerCenter can work with many different third party databases, such as SQL and Oracle databases. PowerCenter will transform data based upon the transformation rules that are defined by IT.
Informatica PowerCenter furnishes a user-friendly graphical interface that is designed for the use of business users, but the tool is best used by IT, as it is highly sophisticated. PowerCenter can automatically scale to meet processing and data needs at the same time that it works to optimize performance.
Although PowerCenter is a proprietary ETL tool, it can work in both cloud and on premises environments.
Informatica offers PowerCenter online training subscriptions and provides learning paths for developers, administrators and data integrators through its Informatica University.
It also offers technical support options that companies can subscribe to.
Pricing is based upon usage.
Open source ETL tools
Talend is open source software that can quickly build data pipelines for ETL operations. It is a tool best utilized by IT, because it requires changes to code every time you need to change a job. That being said, Talend is a highly user-friendly tool for IT professionals that uses a graphical user interface to effect connections to data and applications.
Talend comes with more than 900 different connectors to commercial and open source data sources and applications. Its graphical user interface enables you to point and click on connections to commonly used corporate data sources, such as Excel, Dropbox, Oracle, Salesforce, Microsoft Dynamics and others. Talend Open Studio can pull both structured and unstructured data from relational databases, software applications and files. It can be used with on premises, cloud and multi-cloud platforms, so Talend is a good fit for companies that operate in a hybrid computing mode that includes both in-house and on-cloud systems and data.
Talend’s ability to work easily in on premises, cloud and multi-cloud environments simplifies work for IT and speeds productivity in the process.
The Talend Academy is available by subscription, and offers a variety of online and instructor-led courses. Talend certification programs are also available.
Talend technical support provides access to a wide user community, an online library and a one-stop customer portal. Technical support services are priced on a per customer basis.
A basic version of Talend is available for free. The enhanced version of Talend is priced on a per user basis.
Pentaho Data Integration (PDI) is an open source ETL tool that also provides data mining, reports and information dashboards. Pentaho works with either structured or unstructured data. As an in-house ETL resource, Pentaho can be hosted on either Intel or Apple servers. Pentaho uses JDBC to connect to a variety of relational databases such as SQL, but it can also connect to proprietary enterprise databases like DB2. Pentaho captures, cleans and loads standard and unstructured systems data, and it works equally well processing incoming IoT data from the field or from factory floors.
Pentaho’s strength is its ability to be used by citizen developers (i.e., business end users), and not just by IT. This makes it a good fit for small and medium sized businesses that may not have the resident IT expertise onboard to run ETLs. It does this by offering no-code capabilities that enable end users without IT programming knowledge to extract, transform and load data from a multitude of sources on their own. Users can use a drag and drop graphical user interface to get their jobs done.
There are two different versions of Pentaho: a Community edition that is easy to use and that contains basic ETL functions; and an Enterprise edition that is more robust and includes more features.
Pentaho offers online, self-paced learning and instructor-led education for a fee.
It offers technical support options that range from 8/5 to 24/7 coverage, and that are customized per client.
The Community edition of Pentaho is free of charge, and the Enterprise edition is priced on a per subscription basis.
Data integration is one of the most persistent challenges for IT teams. What ETL tools bring to the table is a simplified way of moving data from system to system and from data repository to data repository. These ETL tools come in a wide variety of flavors that can meet needs ranging from those of enterprises with complex data and system integration requirements in hybrid environments to those of smaller companies that lack IT expertise and must watch their budgets. The ETL tool your business chooses will depend on its specific use cases and budget.