Find Duplicates Service Installer FAQ

Sueann See · September 2021

Here are some FAQs on the Find Duplicates Service Installer, which has been available since Aperture Data Studio v2.4.8.

What is the purpose of the new Find Duplicates Service Installer?

The Find Duplicates Service Installer was created to simplify installing a separate instance of Find Duplicates on Windows. Previously, this required a long list of manual steps, including installing a web server. Users often ran into issues during installation, especially when installing the web server (Tomcat). With the new installer, there is no need to install the web server separately. You can also install the JRE (Java runtime), Standardize Service and Find Duplicates Workbench Service alongside the Find Duplicates Service, all at once.

What does this mean for Aperture Data Studio deployments using Linux/Docker containers?

There is no change to Linux/Docker container deployments. There is already a Find Duplicates image which contains the Standardize server and web server (Tomcat). There is also a separate image for Find Duplicates workbench.

Why do we need a separate instance of Find Duplicates in the first place?

The default setup of the Find duplicates step is to run embedded within Data Studio. While suitable for testing and small data sets, this is not recommended for production use. Processing large data sets through Find duplicates can use a significant proportion of system resources (CPU/memory) and impact other users and performance of other workflows when run within Data Studio. When ready to move to production, or to process larger volumes, it is recommended to run Find duplicates as a separate instance.

There is currently a default processing limit of 1 million records when using an embedded instance of the Find Duplicates server. To process more than 1 million records, you will have to configure a separate instance.

Can customers still choose to use the existing manual installation method?

Yes, you can select the Find Duplicates Server option in the installer, which will install the .war files only.

You can also choose to install or update the other components manually. For example, if you want to update the Java runtime environment manually according to your company’s requirements, you can exclude the JRE installation when running the installer.

How do customers move from the existing manual method for installation to using the new installer?

There will be documented manual steps for users upgrading from a Tomcat deployment to the Find Duplicates service, which will include steps to stop the Tomcat service and/or configure the Find Duplicates service on a different port. Note that the new installer will detect running Find Duplicates services (including Workbench and Standardize) and attempt to stop them before installing and starting the new service. However, it will not detect a running Tomcat server.

Does the installer consider the common configuration and settings required for Standardize, Find Duplicates and Find Duplicates workbench?

The following list contains the common configuration/settings which we have considered. Do let us know if you have specific feedback that can help us improve in any of these areas.

SSL

SSL is still disabled by default. To simplify configuration, a find_duplicates.properties file that includes all the settings needed for SSL will be provided. We considered turning SSL on by default, but that would require configuring a certificate as part of the installation, which may not be straightforward for the typical user and would add an unnecessary step for a development setup. We are also conscious that not all clients would opt for SSL.

Memory

We do not configure minimum and maximum heap sizes by default, because we do not want a hard-coded limit to restrict more powerful systems where Java would calculate a higher default maximum size. The sizes can be configured like the other JVM settings in the Find Duplicates.ini file, using the -Xms and -Xmx parameters.
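To illustrate the reasoning above, here is a minimal sketch in Python. It assumes the common HotSpot behavior of defaulting the maximum heap to roughly a quarter of physical memory; the exact figure depends on your JVM version and options, so treat the fraction as an assumption. The helper names (default_max_heap_gb, heap_flags) are hypothetical and only build the -Xms/-Xmx option strings; how they are placed in the Find Duplicates.ini file is not shown here.

```python
from typing import List


def default_max_heap_gb(physical_memory_gb: float, fraction: float = 0.25) -> float:
    """Estimate the JVM's default maximum heap (assumed: a fraction of physical RAM)."""
    return physical_memory_gb * fraction


def heap_flags(min_heap_gb: int, max_heap_gb: int) -> List[str]:
    """Build the -Xms/-Xmx options for explicit heap sizes given in GB."""
    if min_heap_gb <= 0 or max_heap_gb <= 0:
        raise ValueError("heap sizes must be positive")
    if min_heap_gb > max_heap_gb:
        raise ValueError("minimum heap cannot exceed maximum heap")
    # -Xms sets the initial (minimum) heap size, -Xmx the maximum heap size.
    return [f"-Xms{min_heap_gb}g", f"-Xmx{max_heap_gb}g"]


# On a 64 GB server the assumed default would be about 16 GB,
# so a hard-coded 4 GB cap would leave most of that unused.
print(default_max_heap_gb(64))
print(heap_flags(2, 8))
```

If you do set explicit sizes, it makes sense to pick them with the machine's physical memory and the other services on it in mind.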

Logging

The log file name, format and location can all be customized in the Find Duplicates.ini file. Logging options that override the existing log4j2.xml configuration can also be passed as parameters.

Port numbers

Default ports are set as follows:

Find Duplicates Service: 8080
Find Duplicates Workbench Service: 26312
Standardize Service: 5000

These can be configured in the installer as well as in the Find Duplicates.ini file. (Note: Port changes for the Standardize Service will be supported in a later release.)
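Because the installer does not detect a running Tomcat server, it can help to check whether anything is already listening on the default ports before you install. The following is a rough Python sketch, not part of the installer; the function name port_in_use is hypothetical and the check only covers the local machine.

```python
import socket

# Default ports quoted in this FAQ.
DEFAULT_PORTS = {
    "Find Duplicates Service": 8080,
    "Find Duplicates Workbench Service": 26312,
    "Standardize Service": 5000,
}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


for name, port in DEFAULT_PORTS.items():
    state = "in use" if port_in_use(port) else "free"
    print(f"{name}: port {port} is {state}")
```

If port 8080 shows as in use, that may be an existing Tomcat deployment, in which case you would either stop it or choose a different port in the installer.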

Duplicates store path

The default path is C:\ApertureDataStudio\data\experianmatch. This can be configured in the installer as well as in the Find Duplicates.ini file.

Duplicates maximum cluster size

The default maximum cluster size is 500. This can be configured in the installer as well as in the Find Duplicates.ini file.

CORS

The following Java system properties are configurable and are equivalent to the Tomcat CORS filter properties:

cors.allowed.origins
cors.allowed.methods
cors.allowed.headers
cors.exposed.headers
cors.preflight.maxage
cors.support.credentials
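Java system properties are conventionally passed to the JVM as -Dname=value options. As a small sketch of what that looks like for the properties above (the helper name system_property_flags is hypothetical, the values are placeholders rather than recommended settings, and where these options go in your configuration is not shown here):

```python
from typing import Dict, List


def system_property_flags(properties: Dict[str, str]) -> List[str]:
    """Render each name/value pair as a JVM -Dname=value option."""
    return [f"-D{name}={value}" for name, value in properties.items()]


# Placeholder values for illustration only.
cors_settings = {
    "cors.allowed.origins": "https://datastudio.example.com",
    "cors.allowed.methods": "GET,POST,OPTIONS",
}

for flag in system_property_flags(cors_settings):
    print(flag)
```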

License path

The following system properties can be configured:

matching.licensing.libpath
matching.licensing.licensefolder

Default country for Standardize

We recommend setting the default country in the Find duplicates rules and blocking keys. Defaults are already set in all of the out-of-the-box rules and blocking key configurations.

Will an existing Find Duplicates store created with Find Duplicates deployed under Tomcat be compatible with the new installer?

The new installer does not change how the duplicate store is established or maintained. You will just need to ensure that the duplicate store path has been configured correctly in the Find Duplicates.ini file.

Will customized settings be retained upon upgrade once configured with the new installer?

Custom settings such as database path, cluster size, port, SSL and logging are retained by default when you upgrade with the Find Duplicates installer (that is, unless you uncheck the migrate check box).

KatriM · February 2022

Where can we find those documented manual steps for upgrading from a Tomcat deployment to the Find Duplicates service?

We have been checking these instructions: Data Quality user documentation | Installing a separate instance (experianaperture.io)

Sueann See · March 2022

@KatriM You should treat this as a new installation of the Find Duplicates service. Once that is up and running, you can uninstall Tomcat.

Ensure you have backup copies of the match stores that you want to retain.
Shut down Tomcat.
Install and configure the Find Duplicates Service following the instructions in the documentation.
Test to confirm everything is working.
Uninstall Tomcat.
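For the backup step in the list above, one possible approach is to copy the whole match store folder to a timestamped location. This is only a sketch in Python: the function name backup_match_store and the backup location are hypothetical, and the store path in the example call is the default quoted in this FAQ, so adjust it if yours differs.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_match_store(store_path: Path, backup_root: Path) -> Path:
    """Copy the whole match store folder into a timestamped backup folder."""
    if not store_path.is_dir():
        raise FileNotFoundError(f"match store not found: {store_path}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{store_path.name}-{stamp}"
    # copytree creates the target folder and fails if it already exists.
    shutil.copytree(store_path, target)
    return target


# Example call (placeholder backup location):
# backup_match_store(Path(r"C:\ApertureDataStudio\data\experianmatch"), Path(r"D:\backups"))
```

It is probably safest to take the copy while nothing is writing to the store, so that the files do not change mid-copy.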
