Skip to main content
All CollectionsDeveloper ToolkitGLUE
Seamless Integration with AWS Redshift Using Scispot Glue
Seamless Integration with AWS Redshift Using Scispot Glue

This article is to help you set up an Amazon Redshift connector in Scispot Glue.

S
Written by Satya Singh
Updated over 7 months ago

Syncing Labsheets data from Scispot to Amazon Redshift helps you keep all your experimental data in one place, making it easier to analyze and share with your team. This streamlined process saves time and improves collaboration, helping you focus more on your research.

What are the prerequisites?

  • AWS Account: Ensure you have an existing AWS account with a Redshift cluster created.

  • SSH Tunnel Setup: Ensure you have SSH access with credentials (<ssh_user>@<ssh_host>) and a secured private key file.

  • Network Access: Verify that an available local port and unrestricted network access to the Redshift database endpoint and port are in place.

How do I use this integration?

Scispot now offers a powerful integration with AWS Redshift, enabling users to effortlessly push their data from Scispot Labsheets to an AWS Redshift database. This integration is managed through the intuitive Scispot GLUE Interface, streamlining the process for users.

Connecting to AWS Redshift can be complex due to its secure nature, residing within a user’s Virtual Private Cloud (VPC) and security group. However, Scispot ensures secure and efficient data transfer methods through the Scispot Agent and SSH Tunnel workflow.

To facilitate this connection, users must set up an SSH Tunnel on their end. The Scispot Agent, running on their system, will then leverage this tunnel to securely transmit data to the AWS Redshift database. This method ensures data integrity and security throughout the transfer process.

Step 1: Create a Scispot Agent

The Scispot Agent is a Java application that runs locally on your machine and manages the flow of information from Scispot servers to your Amazon Redshift data warehouse.

  1. Navigate to the GLUE application on the left-hand menu of Scispot.

  2. Select Agents in the top bar of GLUE.

  3. Select Add Agent, choose a name, select your device's operating system, and confirm to download the agent.

Step 2: Add Source

The source is the location that will be providing data to be synchronized with Amazon Redshift. In this case, it will be Scispot Labsheets.

  1. Navigate to GLUE within Scispot.

  2. Select Sources in the top bar of GLUE.

  3. Select Add Source and from Data Sources choose Scispot Labsheets.

  4. Select the Labsheet(s) to be synced by using the Add button and typing the name of a Labsheet.

  5. Select Create Source once all desired Labsheets have been added.

Step 3: Add Destination

The destination is the location where the data from the source will be sent.

  1. Navigate to Destinations within Scispot Glue.

  2. Select Add Destination and from 'Data Sources' choose Amazon Redshift.

  3. In the field 'Cluster Endpoint' input your Redshift cluster endpoint.

  4. Select and input a port value.

  5. In the 'Database Name' field input the name of the desired database destination within Amazon Redshift.

  6. In the 'Schema Name' field input the corresponding schema name from Amazon Redshift Database (by default it will be public).

  7. For authentication, users need to input the cluster's user name and password. If you prefer not to share your credentials directly in the UI, Scispot provides an alternative option:

    1. You can check 'Use file for credentials' option and provide the path to a text file(.txt) on your local system containing your username and password. Ensure that the credentials are stored in the following format within the text file:

      REDSHIFT_USERNAME="user" 
      REDSHIFT_PASSWORD="password"
  8. Select the agent created above from the corresponding dropdown menu.

  9. Select Create Destination to finish the process.

Step 4: Create the Connection

To connect the configured source and destination:

  1. Select Connections in the top bar of GLUE.

  2. Select Add Connection in the top right corner of GLUE.

  3. Define the source as the created Scispot Labsheets source.

  4. Define the destination as the created Amazon Redshift destination.

  5. Select Create Connection.

Step 5: Setup the SSH Tunnel

To set up a secure connection between your device and Amazon Redshift servers:

In a terminal session run the following SSH command:

ssh -L <local_port>:<redshift_endpoint>:<redshift_port> -i <path_to_pem_file> <ssh_user>@<ssh_host>

For example:

ssh -L 5439:testing-workspace.872198.us-east-1.redshift-serverless.amazonaws.com:5439 -i aws-pem.pem [email protected]

Step 6: Run the Agent

  1. Unzip the downloaded agent file to get a command line file and the agent application.

  2. Execute the command line file within the directory containing the agent file to launch the application.

  3. Confirmation that the Scispot Agent is running will appear in your terminal.

Important Notes

  • The integration works on scheduled algorithms, creating records every hour for your data inside Labsheets. When you connect your agent and set up the SSH tunnel, it sends the data to AWS Redshift.

  • For successful data sync, ensure the agent is up and running.

  • Labsheet names and column names must follow Redshift database naming conventions.

  • Avoid manipulating Labsheets or Redshift tables once a connection is set up to prevent unexpected behavior.

  • Only values from primitive datatype columns will be synced to Redshift tables.

The integration is now complete! Monitor file synchronization within the Connections tab of GLUE, with syncs occurring every 60 minutes.

FAQ

What happens if I add or delete the columns of a synced Labsheet?

In the current version of the integration, if you add a column to your existing Labsheet then this new column will be added in the redshift table. The user can see the logs related to this in the transactions. Deleting columns in a Labsheet will not change the existing Redshift table.

What happens if I change the name of a Labsheet?

We recommend not changing the Labsheet name after adding it in the source configuration to push to Redshift. Changing the name will create a new table inside Redshift and start populating data there.

What happens if I delete a synced Labsheet within Scispot?

The synced table in Redshift will not be deleted.

What happens if I delete a row inside Labsheet? Will it delete the rows inside Redshift Tables as well?

No, the rows will not be deleted within the corresponding Redshift table. In the current implementation of the integration, the sync will not delete information from Redshift tables.

How does the sync work now?

We are running an orchestrator inside Scispot that monitors your Labsheet data. Every hour, it creates transactions and records for any new row additions or updates to existing rows. Once you start running the Scispot Agent and setup the SSH tunnel, these records will begin pushing to your Redshift database.

What are the column types from Scispot Labsheets that are pushed in the sync to Redshift?

We are currently syncing only primitive columns. We are not pushing columns like Sequences, Connection, Image, Chem Data, Magic Columns, or Files.

What will be the data type of columns that are pushed to Redshift?

We are using three types of data types. For text columns, we are using the TEXT type, for date columns the DATE type, and for numeric value columns, we are using DECIMAL(20,2).

How can I see the logs and monitor the transactions?

In the Glue dashboard under 'Connections', you can click on a specific connection to view all the transactions. If you expand a transaction, you will see the records, which represent the rows synced and their status. A PENDING status for a record denotes that that record is waiting for the agent and SSH tunnel to be up in order to be pushed to Redshift.

What are the name criteria for Labsheet file names and column names to sync properly?

We follow Redshift database conventions. Therefore Labsheet file names must not contain any spaces. Spaces in column names are fine and are handled with the space being replaced by an underscore character.

The other rules of the Redshift conventions are:

Table Name Validation Rules:

  1. Length: Must not exceed 127 characters.

  2. Starting Character: Must start with an alphabetic character (a-z, A-Z).

  3. Allowed Characters: Only alphabetic characters, digits (0-9), and underscores (_) are permitted.

  4. Case Sensitivity: Case-insensitive by default; use double quotes to preserve case.

  5. Reserved Words: Avoid reserved words unless enclosed in double quotes.

  6. Schema Prefix: Use schema name as a prefix to avoid conflicts.

  7. No Trailing Spaces: Must not have trailing spaces.

Column Name Validation Rules:

  1. Length: Must not exceed 127 characters.

  2. Starting Character: Must start with an alphabetic character (a-z, A-Z).

  3. Allowed Characters: Only alphabetic characters, digits (0-9), and underscores (_) are permitted.

  4. Reserved Words: Avoid reserved words; can use them with proper quoting.

  5. No Trailing Spaces: Must not have trailing spaces.

NOTE: Not every account will have these integrations. Please contact [email protected] if you don't see these features enabled.

Did this answer your question?