Installation

1. Preparation

To successfully install StreamPark, the following environments need to be prepared:

| Item | Version | Required | Other |
| --- | --- | --- | --- |
| Operating System | Linux/macOS | Required | Windows is not supported |
| Java | 1.8+ | Required | Java version >= 1.8 |
| Scala | 2.12+ | Required | Scala version >= 2.12 |
| Database | MySQL 5.6+ / PostgreSQL 9.6+ | Optional | Defaults to H2; MySQL and PostgreSQL are supported |
| Flink | 1.12+ | Required | Minimum supported Flink version is 1.12 |
| Hadoop | 2+ | Optional | Required for YARN deployment |
| Kubernetes | 1.16+ | Optional | Required for Kubernetes deployment |
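
Before moving on, it may help to confirm that the Java and Scala versions on the machine meet these requirements. A quick check from the shell (assuming both are already on your PATH):

# Verify the JDK and Scala versions against the table above
java -version    # should report 1.8 or later
scala -version   # should report 2.12 or later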

2. Download

You can download the latest version of StreamPark directly from the official website. The version used in this document is 2.1.5, available at: https://streampark.apache.org/download

After the download is complete, extract the package:

# Extract the StreamPark installation package.
tar -zxvf apache-streampark_2.12-2.1.5-incubating-bin.tar.gz

The extracted directory will look like this:

├── bin
│   ├── startup.sh             // Startup script
│   ├── shutdown.sh            // Shutdown script
│   └── ......
├── conf
│   ├── config.yaml            // Project configuration file
│   └── logback-spring.xml     // Logging configuration file
├── lib
│   └── *.jar                  // Project JAR files
├── logs                       // Log directory
├── script
│   ├── data
│   │   ├── mysql-data.sql     // MySQL initialization data
│   │   └── pgsql-data.sql     // PostgreSQL initialization data
│   ├── schema
│   │   ├── mysql-schema.sql   // MySQL DDL (table creation) script
│   │   └── pgsql-schema.sql   // PostgreSQL DDL (table creation) script
│   └── upgrade
│       ├── mysql
│       │   ├── 1.2.3.sql      // SQL script for upgrading to version 1.2.3
│       │   ├── 2.0.0.sql      // SQL script for upgrading to version 2.0.0
│       │   └── ......
│       └── pgsql
│           └── ......
└── temp                       // Temporary directory for internal use; do not delete
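
One note on the upgrade directory: the versioned SQL files there are only needed when upgrading an existing installation rather than installing from scratch. Applying one against MySQL could look like the sketch below, where the host, user, and database name are placeholders for your environment:

# Hypothetical example: apply the 2.0.0 upgrade script to an existing MySQL database
mysql -h localhost -u root -p streampark < script/upgrade/mysql/2.0.0.sql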

3. Start

Navigate to the bin directory under the installation directory and start the program:

# Go to the bin directory under the installation directory
cd bin

# Start the program
./startup.sh
Warning

After starting the program, you may encounter the following error:
streampark.workspace.local: "/tmp/streampark" is an invalid path, please reconfigure in xxx/conf/config.yaml

This is because the local workspace directory for StreamPark does not exist or is not set to a valid path. The solution is simple: either keep the default configuration and create the streampark directory under /tmp:

mkdir -p /tmp/streampark

Alternatively, configure a valid local path in the streampark.workspace.local property in the conf/config.yaml file.
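
For example, pointing the workspace at a dedicated directory could look like this in conf/config.yaml (the path below is only a placeholder; any existing, writable local directory works):

streampark:
  workspace:
    local: /opt/streampark_workspace   # placeholder path; must exist and be writable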

Run startup.sh again, and you should see that the program starts successfully.

Visit StreamPark at: http://127.0.0.1:10000
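
If you want a quick check from the terminal before opening a browser, you can probe the port (10000 is the default port set in conf/config.yaml):

# Confirm the web service is listening on the default port
curl -I http://127.0.0.1:10000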

Tip

Username and password: admin / streampark


You should see that the login was successful.

4. More

At this point, the project is running successfully. The installation above uses the default local H2 database. Let's now look at how to integrate an external database (MySQL or PostgreSQL).

4.1 Use External Database

If you want to use an external database, you need to modify the conf/config.yaml file in the installation directory. The core modification is as follows (using MySQL as an example):

datasource:
  dialect: mysql   # Change to mysql; default is h2, supports mysql and pgsql
  # h2-data-dir only applies to H2. For MySQL or PostgreSQL, fill in the connection details below instead.
  h2-data-dir: ~/streampark/h2-data
  username:        # Username for the data source connection
  password:        # Password for the data source connection
  url:             # JDBC connection URL, e.g., jdbc:mysql://localhost:3306/streampark?useUnicode=true&characterEncoding=UTF-8&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=GMT%2B8
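
For instance, a filled-in MySQL configuration might look like the following; the host and credentials here are placeholders for your environment, while the URL parameters mirror the example above:

datasource:
  dialect: mysql
  username: streampark        # placeholder username
  password: streampark_pass   # placeholder password
  url: jdbc:mysql://localhost:3306/streampark?useUnicode=true&characterEncoding=UTF-8&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=GMT%2B8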

Next, manually connect to the external database and run the table creation script (location: installation_directory/script/schema/mysql-schema.sql), followed by the data initialization script (location: installation_directory/script/data/mysql-data.sql).
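
With the mysql command-line client, that initialization could look like the sketch below; the host and credentials are placeholders, and it assumes the schema script does not create the database itself:

# Hypothetical example: create the database, then run the two scripts in order
mysql -h localhost -u root -p -e "CREATE DATABASE IF NOT EXISTS streampark DEFAULT CHARACTER SET utf8mb4;"
mysql -h localhost -u root -p streampark < script/schema/mysql-schema.sql
mysql -h localhost -u root -p streampark < script/data/mysql-data.sql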

Once the above steps are complete, start the service; it will automatically connect to and use the external database.

4.2 Use Hadoop

If you need to deploy jobs to YARN, you must configure the Hadoop environment variables. If you are using a Hadoop environment installed via CDH, the relevant environment variables can be configured as follows:

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop # Hadoop installation directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=$HADOOP_HOME/../hive
export HBASE_HOME=$HADOOP_HOME/../hbase
export HADOOP_HDFS_HOME=$HADOOP_HOME/../hadoop-hdfs
export HADOOP_MAPRED_HOME=$HADOOP_HOME/../hadoop-mapreduce
export HADOOP_YARN_HOME=$HADOOP_HOME/../hadoop-yarn
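
Note that these exports only last for the current shell session. To persist them, you might append them to a profile file and then verify that the Hadoop client resolves (the choice of profile file depends on your environment):

# Persist the variables for future sessions, then verify
echo 'export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop' >> /etc/profile
echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> /etc/profile
source /etc/profile
$HADOOP_HOME/bin/hadoop version   # sanity check: should print the Hadoop version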
Tip

StreamPark will automatically read the Hadoop configuration from the environment variables, connect to Hadoop, upload resources to HDFS, and deploy jobs to YARN.

In addition, you may need to modify the conf/config.yaml file (for example, to specify the Hadoop user or Kerberos authentication). The core modifications are as follows:

streampark:
  workspace:
    # Root path for storing resources in HDFS
    remote: hdfs:///streampark/
  proxy:
    # Hadoop YARN proxy URL, e.g., a Knox proxy address such as https://streampark.com:8443/proxy/yarn
    yarn-url:
  yarn:
    # Authentication type: simple (default) or kerberos
    http-auth: 'simple'
    # HADOOP_USER_NAME used for Flink on YARN or Spark on YARN
    hadoop-user-name: hdfs

# If Kerberos authentication is enabled, configure the following:
security:
  kerberos:
    login:
      debug: false
      enable: false
      keytab:
      krb5:
      principal:
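
For reference, an enabled Kerberos setup might look like this; the principal and file paths below are placeholders for your environment:

security:
  kerberos:
    login:
      debug: false
      enable: true
      keytab: /etc/security/keytabs/streampark.keytab   # placeholder keytab path
      krb5: /etc/krb5.conf                              # placeholder krb5.conf path
      principal: streampark@EXAMPLE.COM                 # placeholder principal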

4.3 Other Configurations

Here is a detailed explanation of the config.yaml file, which also makes it easy to integrate SSO, LDAP, and similar features:

########################################### Logging Configuration ###########################################
logging:
  level:
    root: info

########################################### Basic Service Configuration ###########################################
server:
  port: 10000
  session:
    ttl: 2h  # User login session expiration time; users are automatically logged out after this period
  undertow:  # Undertow server settings
    buffer-size: 1024
    direct-buffers: true
    threads:
      io: 16
      worker: 256

########################################### Database Configuration ###########################################
datasource:
  dialect: h2                         # Database dialect, supports h2, mysql, pgsql
  h2-data-dir: ~/streampark/h2-data   # If using H2, configure this directory
  username:
  password:
  url:                                # Database connection URL, e.g., jdbc:mysql://localhost:3306/streampark?......

########################################### Project Configuration ###########################################
streampark:
  ## Workspace configuration: local and HDFS workspaces for different types of resources
  workspace:
    local: /tmp/streampark
    remote: hdfs:///streampark/
  ## Proxy settings
  proxy:
    lark-url:   # Feishu proxy URL
    yarn-url:   # YARN proxy URL
  ## YARN-related configuration
  yarn:
    http-auth: 'simple'      # Authentication method
    hadoop-user-name: hdfs   # Hadoop username
  ## Project management settings
  project:
    max-build: 16            # Maximum number of concurrent project builds
  ## Open API settings
  openapi:
    white-list:              # Whitelist for the open API

########################################### Kerberos Authentication Configuration ###########################################
security:
  kerberos:
    login:
      debug: false
      enable: false
      keytab:
      krb5:
      principal:
    ttl: 2h


We look forward to your further exploration!