Commit 50dd117

[Branch-1.4] Port #9851 #9879 to fix release issue (#9895)
* [VL] Fix link issues found in release process (#9851)
* [GLUTEN-9878] Update LICENSE and NOTICE to list all licenses used for copied code. (#9879)
  * Update LICENSE and NOTICE to list all licenses used for copied code.
  * Update script from velox, gluten 2025, NOTICE-binary.

Co-authored-by: PHILO-HE <philo@apache.org>
1 parent bb28bb7 commit 50dd117

File tree

7 files changed, +105 -29 lines changed


LICENSE

Lines changed: 60 additions & 0 deletions
@@ -200,3 +200,63 @@
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
+
+This product bundles various third-party components also under the
+Apache Software License 2.0.
+
+Apache DataFusion(https://github.com/apache/datafusion)
+./.github/workflows/take.yml
+
+Apache Spark(https://github.com/apache/spark)
+./backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/CHColumnarWrite.scala
+./backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/SparkWriteFilesCommitProtocol.scala
+./cpp-ch/local-engine/Parser/aggregate_function_parser/BloomFilterAggParser.cpp
+./gluten-substrait/src/main/scala/org/apache/spark/sql/execution/GlutenExplainUtils.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/FileSourceScanExecShim.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala.deprecated
+./shims/spark32/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+./shims/spark32/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
+./shims/spark33/src/main/scala/org/apache/spark/sql/execution/FileSourceScanExecShim.scala
+./shims/spark33/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
+./shims/spark33/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
+./shims/spark33/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+./tools/gluten-it/common/src/main/scala/org/apache/spark/sql/TestUtils.scala
+
+Delta Lake(https://github.com/delta-io/delta)
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/DeltaLog.scala
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/Snapshot.scala
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/commands/DeleteCommand.scala
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableCommand.scala
+./backends-clickhouse/src-delta-20/main/scala/org/apache/spark/sql/delta/commands/UpdateCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/DeltaLog.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/Snapshot.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/commands/DeleteCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/commands/UpdateCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/commands/VacuumCommand.scala
+./backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/delta/stats/PrepareDeltaScan.scala
+./backends-clickhouse/src-delta-33/main/scala/org/apache/spark/sql/delta/DeltaLog.scala
+./backends-clickhouse/src-delta-33/main/scala/org/apache/spark/sql/delta/PreprocessTableWithDVs.scala
+./backends-clickhouse/src-delta-33/main/scala/org/apache/spark/sql/delta/Snapshot.scala
+./backends-clickhouse/src-delta-33/main/scala/org/apache/spark/sql/delta/commands/DMLWithDeletionVectorsHelper.scala
+./backends-clickhouse/src-delta-33/main/scala/org/apache/spark/sql/delta/commands/VacuumCommand.scala
+
+The Velox Project(https://github.com/facebookincubator/velox)
+./cpp/velox/udf/examples/MyUDAF.cc
+./cpp/velox/utils/Common.cc
+./ep/build-velox/src/setup-centos7.sh
+./ep/build-velox/src/setup-centos8.sh
+./ep/build-velox/src/setup-openeuler24.sh
+./ep/build-velox/src/setup-rhel.sh
+
+ClickHouse(https://github.com/ClickHouse/ClickHouse)
+./cpp-ch/local-engine/AggregateFunctions/AggregateFunctionPartialMerge.h
+./cpp-ch/local-engine/Functions/SparkFunctionArrayDistinct.cpp
+

NOTICE

Lines changed: 17 additions & 1 deletion
@@ -1,7 +1,23 @@
 Apache Gluten(incubating)
-Copyright 2023-2024 The Apache Software Foundation
+Copyright 2023-2025 The Apache Software Foundation
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
 The initial codebase was donated to the ASF by Intel and Kyligence, copyright 2023-2024.
+
+Apache DataFusion
+Copyright 2019-2025 The Apache Software Foundation
+
+Apache Spark
+Copyright 2014 and onwards The Apache Software Foundation
+
+Delta Lake
+Copyright (2021) The Delta Lake Project Authors.
+
+The Velox Project
+Copyright © 2024 Meta Platforms, Inc.
+
+ClickHouse
+Copyright 2016-2025 ClickHouse, Inc.
+

NOTICE-binary

Lines changed: 20 additions & 0 deletions
@@ -18,6 +18,11 @@ Copyright 2022-2024 The Apache Software Foundation.
 
 ---------------------------------------------------------
 
+Apache DataFusion
+Copyright 2019-2025 The Apache Software Foundation
+
+---------------------------------------------------------
+
 Apache Uniffle (incubating)
 Copyright 2022 and onwards The Apache Software Foundation.
 

@@ -43,6 +48,21 @@ Copyright (C) 2006 - 2019, The Apache Software Foundation.
 
 ---------------------------------------------------------
 
+Delta Lake
+Copyright (2021) The Delta Lake Project Authors.
+
+---------------------------------------------------------
+
+The Velox Project
+Copyright © 2024 Meta Platforms, Inc.
+
+---------------------------------------------------------
+
+ClickHouse
+Copyright 2016-2025 ClickHouse, Inc.
+
+---------------------------------------------------------
+
 This project includes code from Daniel Lemire's FrameOfReference project.
 
 https://github.com/lemire/FrameOfReference/blob/6ccaf9e97160f9a3b299e23a8ef739e711ef0c71/src/bpacking.cpp

tools/gluten-it/README.md

Lines changed: 5 additions & 5 deletions
@@ -2,27 +2,27 @@
 
 The project makes it easy to test Gluten build locally.
 
-## Gluten ?
+## Gluten
 
 Gluten is a native Spark SQL implementation as a standard Spark plug-in.
 
 https://github.com/apache/incubator-gluten
 
 ## Getting Started
 
-### 1. Install Gluten in your local machine
+### 1. Build Gluten
 
-See official Gluten build guidance https://github.com/apache/incubator-gluten#how-to-use-gluten
+See official Gluten build guidance https://github.com/apache/incubator-gluten#build-from-source.
 
-### 2. Install and run gluten-it with Spark version
+### 2. Build and run gluten-it
 
 ```sh
 cd gluten/tools/gluten-it
 mvn clean package -P{Spark-Version}
 sbin/gluten-it.sh
 ```
 
-> Note: *Spark-Version* support *spark-3.2* and *spark-3.3* only
+Note: **Spark-Version** can only be **spark-3.2**, **spark-3.3**, **spark-3.4** or **spark-3.5**.
 
 ## Usage
 

tools/gluten-it/sbin/gluten-it.sh

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ SPARK_JVM_OPTIONS=$($JAVA_HOME/bin/java -cp $JAR_PATH org.apache.gluten.integrat
 
 EMBEDDED_SPARK_HOME=$BASEDIR/../spark-home
 
+mkdir $EMBEDDED_SPARK_HOME && ln -snf $BASEDIR/../package/target/lib $EMBEDDED_SPARK_HOME/jars
+
 # We temporarily disallow setting these two variables by caller.
 SPARK_HOME=""
 SPARK_SCALA_VERSION=""
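The added line builds the embedded Spark home at launch time, with `jars` as a symlink to the built `package/target/lib` directory, instead of relying on a symlink checked into the source tree (the `tools/gluten-it/spark-home/jars` file deleted in this commit, which presumably is one of the link issues found in the release process). A minimal sketch of the same pattern, using throwaway temp paths rather than the real repository layout:

```shell
#!/bin/sh
# Sketch of the pattern in the added gluten-it.sh line; all paths here are
# stand-ins created under a temp directory, not the real project layout.
set -e
work=$(mktemp -d)

# Stand-in for the lib directory produced by the Maven build.
mkdir -p "$work/package/target/lib"
echo demo > "$work/package/target/lib/a.jar"

EMBEDDED_SPARK_HOME="$work/spark-home"
# -s: symbolic link; -n: treat an existing destination link as a plain file
# instead of descending into it; -f: replace the destination if present.
mkdir "$EMBEDDED_SPARK_HOME" && ln -snf "$work/package/target/lib" "$EMBEDDED_SPARK_HOME/jars"

# The jar is now reachable through the embedded Spark home's jars/ link.
cat "$EMBEDDED_SPARK_HOME/jars/a.jar"
```

Note the `&&`: on a second invocation `mkdir` fails because the directory already exists, so the `ln` is skipped, but the symlink created by the first run is still in place.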

tools/gluten-it/spark-home/jars

Lines changed: 0 additions & 1 deletion
This file was deleted.

tools/workload/tpch/README.md

Lines changed: 1 addition & 22 deletions
@@ -1,7 +1,7 @@
 # Test on Velox backend with TPC-H workload
 
 ## Test datasets
-Parquet and DWRF(a fork of the ORC file format) format files are both supported. Here are the steps to generate the testing datasets:
+Parquet and DWRF (a fork of the ORC file format) format files are both supported. Here are the steps to generate the testing datasets:
 
 ### Generate the Parquet dataset
 Please refer to the scripts in [parquet_dataset](./gen_data/parquet_dataset/) directory to generate parquet dataset. Note this script relies on the [spark-sql-perf](https://github.com/databricks/spark-sql-perf) and [tpch-dbgen](https://github.com/databricks/tpch-dbgen) package from Databricks. Note in the tpch-dbgen kits, we need to do a slight modification to allow Spark to convert the csv based content to parquet, please make sure to use this commit: [0469309147b42abac8857fa61b4cf69a6d3128a8](https://github.com/databricks/tpch-dbgen/commit/0469309147b42abac8857fa61b4cf69a6d3128a8)

@@ -26,27 +26,6 @@ val rootDir = "/PATH/TO/TPCH_PARQUET_PATH" // root directory of location to crea
 val dbgenDir = "/PATH/TO/TPCH_DBGEN" // location of dbgen
 ```
 
-Currently, Gluten with Velox can support both Parquet and DWRF file format and three compression codec including snappy, gzip, zstd.
-Below step, to convert Parquet to DWRF, is optional if you are using Parquet format to run the testing.
-
-### Convert the Parquet dataset to DWRF dataset(OPTIONAL)
-And then please refer to the scripts in [dwrf_dataset](./gen_data/dwrf_dataset/) directory to convert the Parquet dataset to DWRF dataset.
-
-In tpch_convert_parquet_dwrf.sh, spark configures should be set according to the system.
-
-```
-export GLUTEN_HOME=/PATH/TO/gluten
-...
---executor-cores 8 \
---num-executors 14 \
-```
-
-In tpch_convert_parquet_dwrf.scala, the table path should be configured.
-```
-val parquet_file_path = "/PATH/TO/TPCH_PARQUET_PATH"
-val dwrf_file_path = "/PATH/TO/TPCH_DWRF_PATH"
-```
-
 ## Test Queries
 We provide the test queries in [TPC-H queries](../../../tools/gluten-it/common/src/main/resources/tpch-queries).
 We also provide a scala script in [Run TPC-H](./run_tpch/) directory about how to run TPC-H queries.
