swissql - ΡΡΠ°ΡΠΈΡΠ΅ΡΠΊΠΈΠΉ Π°Π½Π°Π»ΠΈΠ· SparkSQL
ΠΠ»ΡΡΠΊΠΈ β’ ΠΠ°ΠΊ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ? β’ Π£ΡΡΠ°Π½ΠΎΠ²ΠΊΠ° β’ Π‘ΡΠ΅ΠΊ β’ ΠΠ²ΡΠΎΡΡ
- ΠΠ΅ΠΉΡ: "Π‘ΡΠ°ΡΠΈΡΠ΅ΡΠΊΠΈΠΉ Π°Π½Π°Π»ΠΈΠ·Π°ΡΠΎΡ SparkSQL Ρ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡΡΡ Π΄ΠΎΠ±Π°Π²Π»Π΅Π½ΠΈΡ ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΈΡ ΠΏΡΠ°Π²ΠΈΠ»"
- Π Π΅ΡΠ°ΡΡ: ΠΊΠΎΠΌΠ°Π½Π΄Π° I PRO
- Π‘ΡΡΠ»ΠΊΠ° Π½Π° ΡΠ΅ΠΏΠΎΠ·ΠΈΡΠΎΡΠΈΠΉ
- Π‘ΡΡΠ»ΠΊΠ° Π½Π° PyPi
- Π‘ΡΡΠ»ΠΊΠ° Π½Π° Google Colab
ΠΠΎΡΡΠ°Π½ΠΎΠ²ΠΊΠ° Π·Π°Π΄Π°ΡΠΈ
ΠΠ΅ΠΎΠ±Ρ ΠΎΠ΄ΠΈΠΌΠΎ ΡΠ°Π·ΡΠ°Π±ΠΎΡΠ°ΡΡ ΡΡΠ°ΡΠΈΡΠ΅ΡΠΊΠΈΠΉ Π°Π½Π°Π»ΠΈΠ·Π°ΡΠΎΡ SparkSQL ΠΊΠΎΠ΄Π°. ΠΡΠ°ΠΌΠΌΠ°ΡΠΈΠΊΠ° Π΄ΠΎΠ»ΠΆΠ½Π° ΡΠ°ΡΡΠΈΡΡΡΡΡΡ ΡΠ΅ΡΠ΅Π· ΠΎΡΠ΄Π΅Π»ΡΠ½ΡΠΉ BNF ΡΠ°ΠΉΠ» (https://en.wikipedia.org/wiki/BackusβNaur_form). ΠΡΡΡ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡΡ Π΄ΠΎΠ±Π°Π²Π»ΡΡΡ ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΈΠ΅ ΠΏΡΠΎΠ²Π΅ΡΠΊΠΈ Π² ΠΊΠΎΠ΄Π΅. ΠΡΡΡ ΠΠΠ ΠΈΠ½ΡΠ΅ΡΡΠ΅ΠΉΡ Π΄Π»Ρ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΡ ΠΈΠ· ΠΊΠΎΠ΄Π°. ΠΠ½ΡΡΡΡΠΌΠ΅Π½Ρ Π΄ΠΎΠ»ΠΆΠ΅Π½ Π±ΡΡΡ Π±ΡΡΡΡΡΠΌ ΠΈ ΠΈΠ·ΠΎΠ»ΠΈΡΠΎΠ²Π°Π½Π½ΡΠΌ.
Π Π΅ΡΠ΅Π½ΠΈΠ΅ - Π§ΡΠΎ ΡΠ°ΠΊΠΎΠ΅ swissql
swissql - ΡΡΠΎ python Π±ΠΈΠ±Π»ΠΈΠΎΡΠ΅ΠΊΠ° Π΄Π»Ρ ΡΡΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠ³ΠΎ Π°Π½Π°Π»ΠΈΠ·Π° SparkSQL, ΠΊ ΠΊΠΎΡΠΎΡΠΎΠΉ ΡΠ°ΠΊΠΆΠ΅ ΠΏΡΠ΅Π΄ΠΎΡΡΠ°Π²Π»ΡΠ΅ΡΡΡ CLI. ΠΠ½Π° ΡΠΎΡΡΠΎΠΈΡ ΠΈΠ· ΡΠ»Π΅Π΄ΡΡΡΠΈΡ ΠΌΠΎΠ΄ΡΠ»Π΅ΠΉ:
- Syntax - ΡΠΈΠ½ΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΠΈΠΉ ΡΠ°Π·Π±ΠΎΡ ΠΈ ΠΏΠΎΡΡΡΠΎΠ΅Π½ΠΈΠ΅ AST
- Format - ΡΠΎΡΠΌΠ°ΡΠΈΡΠΎΠ²Π°Π½ΠΈΠ΅
- Style - ΡΡΠΈΠ»ΠΈΡΡΠΈΡΠ΅ΡΠΊΠΈΠΉ Π°Π½Π°Π»ΠΈΠ·
- Optimize - ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΡ SparkSQL Π·Π°ΠΏΡΠΎΡΠ°
- Anti_pattern - Π²ΡΡΠ²Π»Π΅Π½ΠΈΠ΅ Π°Π½ΡΠΈΠΏΠ°ΡΡΠ΅ΡΠ½ΠΎΠ²
- Rule - ΠΏΠΎΠ΄Π΄Π΅ΡΠΆΠΊΠ° ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΈΡ ΠΏΡΠ°Π²ΠΈΠ» Π½Π° ΠΎΡΠ½ΠΎΠ²Π΅ BNF
- Extract - Π°Π½Π°Π»ΠΈΠ· SQL ΡΡΡΠΎΠΊ Π² ΠΊΠΎΠ΄ΠΎΠ²ΠΎΠΉ Π±Π°Π·Π΅
ΠΠ»ΡΡΠΊΠΈ
...ΠΈΠ»ΠΈ ΡΡΠΎ Π΄Π΅Π»Π°Π΅Ρ ΡΡΠΎ ΡΠ΅ΡΠ΅Π½ΠΈΠ΅ ΠΎΡΠΎΠ±Π΅Π½Π½ΡΠΌ.
- CLI ΠΈΠ½ΡΠ΅ΡΡΠ΅ΠΉΡ - ΡΠ΄Π΅Π»Π°Π΅Ρ ΠΈΠ½ΡΠ΅Π³ΡΠ°ΡΠΈΡ Π² CI/CD ΠΊΠ°ΠΊ Π½ΠΈΠΊΠΎΠ³Π΄Π° ΠΏΡΠΈΠ²ΡΡΠ½ΠΎΠΉ.
- Π Π°Π·Π½ΠΎΡΠ²Π΅ΡΠ½ΡΠΉ ΡΠ΅ΠΊΡΡ Π΄Π»Ρ ΠΎΠ±Π»Π΅Π³ΡΠ΅Π½ΠΈΡ Π²ΠΈΠ·ΡΠ°Π»ΡΠ½ΠΎΠ³ΠΎ Π²ΠΎΡΠΏΡΠΈΡΡΠΈΡ Π² ΠΊΠΎΠ½ΡΠΎΠ»ΠΈ.
- Π£ΡΠΈΠ»ΠΈΡΠ° Π²ΡΠ»ΠΎΠΆΠ΅Π½Π° ΠΈ Π΄ΠΎΡΡΡΠΏΠ½Π° Π½Π° PyPi, ΡΡΠΎ ΠΎΠ±Π»Π΅Π³ΡΠ°Π΅Ρ ΠΈΠ½ΡΠ΅Π³ΡΠ°ΡΠΈΡ Π² ΡΠ°Π±ΠΎΡΠΈΠΉ ΠΏΡΠΎΡΠ΅ΡΡ.
- ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΏΠΎΠ΄Π΄Π΅ΡΠΆΠΈΠ²Π°Π΅ΠΌΡΡ ΡΠ΅ΡΠ΅Π½ΠΈΠΉ - ΡΡΠΎ ΠΏΠΎΠ·Π²ΠΎΠ»ΡΠ΅Ρ Π±Π΅Π· ΠΏΡΠΎΠ±Π»Π΅ΠΌ ΠΌΠ°ΡΡΡΠ°Π±ΠΈΡΠΎΠ²Π°ΡΡ ΠΏΡΠΎΠ΅ΠΊΡ ΠΈ Π»Π΅Π³ΡΠ΅ Π°Π΄Π°ΠΏΡΠΈΡΠΎΠ²Π°ΡΡ Π΅Π³ΠΎ ΠΏΠΎΠ΄ Π΄ΡΡΠ³ΠΈΠ΅ Π΄ΠΈΠ°Π»Π΅ΠΊΡΡ SQL.
- ΠΠΎΠΌΠΌΠ΅Π½ΡΠ°ΡΠΈΠΈ Π² ΠΊΠΎΠ΄Π΅ ΠΏΠΎΠΌΠΎΠ³ΡΡ Π±Π΅Π· Π»ΠΈΡΠ½ΠΈΡ ΠΏΡΠΎΠ±Π»Π΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Π±ΠΈΠ±Π»ΠΈΠΎΡΠ΅ΠΊΡ Π² Π΄ΡΡΠ³ΠΈΡ python ΡΠΊΡΠΈΠΏΡΠ°Ρ .
- ΠΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΡ Π·Π°ΠΏΡΠΎΡΠΎΠ² ΠΏΠΎΠ·Π²ΠΎΠ»ΠΈΡ ΡΡΠΊΠΎΠ½ΠΎΠΌΠΈΡΡ ΡΠ΅ΡΡΡΡΡ ΠΏΡΠΈ ΡΠ°Π±ΠΎΡΠ΅ Ρ BigQuery.
Π£ΡΡΠ°Π½ΠΎΠ²ΠΊΠ°
# Π£ΡΡΠ°Π½ΠΎΠ²ΠΊΠ° sqlcheck ( https://github.com/jarulraj/sqlcheck#installation )
# Debian
wget https://github.com/jarulraj/sqlcheck/releases/download/v1.3/sqlcheck-x86_64.deb
dpkg -i sqlcheck-x86_64.deb
# Windows
# ΠΡΠΈ ΡΡΡΠ°Π½ΠΎΠ²ΠΊΠ΅ Π΄Π»Ρ Windows Π½Π΅ Π·Π°Π±ΡΠ΄ΡΡΠ΅ Π΄ΠΎΠ±Π°Π²ΠΈΡΡ sqlcheck Π² PATH
wget https://github.com/jarulraj/sqlcheck/releases/download/v1.3/sqlcheck.exe
# ΠΠΎΡΠ»Π΅ sqlcheck, ΡΡΡΠ°Π½ΠΎΠ²ΠΈΠΌ swissql
pip install swissql
Π€ΡΠ½ΠΊΡΠΈΠΎΠ½Π°Π»
ΠΠ°Π³Π»ΡΠ΄Π½ΠΎ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ swissql ΠΎΠΏΠΈΡΠ°Π½ΠΎ Π² Π½Π°ΡΠ΅ΠΌ Π½ΠΎΡΡΠ±ΡΠΊΠ΅ Π½Π° Google Colab. ΠΠΈΠΆΠ΅ ΠΏΡΠΈΠ²Π΅Π΄Π΅Π½Ρ Π΅Π³ΠΎ ΡΡΠ°Π³ΠΌΠ΅Π½ΡΡ Ρ ΠΎΠ±ΡΡΡΠ½Π΅Π½ΠΈΠ΅ΠΌ ΡΠΎΠ³ΠΎ, ΠΊΠ°ΠΊ ΠΈ ΡΡΠΎ ΡΠ°Π±ΠΎΡΠ°Π΅Ρ.
Syntax
Π‘ΠΈΠ½ΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΠΈΠΉ ΡΠ°Π·Π±ΠΎΡ SparkSQL ΠΈ ΠΏΠΎΡΡΡΠΎΠ΅Π½ΠΈΠ΅ AST Ρ ΠΏΠΎΠΌΠΎΡΡΡ sqlglot
ΠΏΠΎΠ·Π²ΠΎΠ»ΡΠ΅Ρ ΠΏΡΠΎΠ²ΠΎΠ΄ΠΈΡΡ RBO (rule-based) ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΈ Π² ΠΌΠΎΠ΄ΡΠ»Π΅ Optimize. ΠΡΠΈΠΌΠ΅Ρ ΠΏΠΎΡΡΡΠΎΠ΅Π½Π½ΠΎΠ³ΠΎ AST:
$ # ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Syntax
!python -m swissql syntax -q "SELECT * from x join y on x.id = y.id where x.id > 2"
[Generating syntax tree using sqlglot]
(SELECT expressions:
(STAR ), from:
(FROM expressions:
(TABLE this:
(IDENTIFIER this: x, quoted: False))), joins:
(JOIN this:
(TABLE this:
(IDENTIFIER this: y, quoted: False)), on:
(EQ this:
(COLUMN this:
(IDENTIFIER this: id, quoted: False), table:
(IDENTIFIER this: x, quoted: False)), expression:
(COLUMN this:
(IDENTIFIER this: id, quoted: False), table:
(IDENTIFIER this: y, quoted: False)))), where:
(WHERE this:
(GT this:
(COLUMN this:
(IDENTIFIER this: id, quoted: False), table:
(IDENTIFIER this: x, quoted: False)), expression:
(LITERAL this: 2, is_string: False))))
Style
Π‘ ΠΏΠΎΠΌΠΎΡΡΡ sqlfluff
ΠΌΡ ΠΏΠΎΠ»ΡΡΠ°Π΅ΠΌ ΡΠ΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°ΡΠΈΠΈ ΠΏΠΎ ΡΡΠΈΠ»Ρ ΠΊΠΎΠ΄Π°. ΠΡΠΈΠΌΠ΅Ρ ΡΠ΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°ΡΠΈΠΉ ΠΌΠΎΠΆΠ½ΠΎ ΡΠ²ΠΈΠ΄Π΅ΡΡ Π½ΠΈΠΆΠ΅:
# ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Style
!python -m swissql style -q "SELECT * from x join y on x.id = y.id where x.id > 2"
[Style sql query use sqlfluff]
/home/quakumei/Workspace/Code/sparkySQL/SparkSQL-Analyzer/swissql/analyzers/style/intermediate.sql
== [/home/quakumei/Workspace/Code/sparkySQL/SparkSQL-Analyzer/swissql/analyzers/style/intermediate.sql] FAIL
L: 1 | P: 1 | L044 | Query produces an unknown number of result columns.
L: 1 | P: 8 | L027 | Unqualified reference '*' found in select with more than
| one referenced table/view.
L: 1 | P: 10 | L010 | Keywords must be consistently upper case.
L: 1 | P: 17 | L010 | Keywords must be consistently upper case.
L: 1 | P: 17 | L051 | Join clauses should be fully qualified.
L: 1 | P: 24 | L010 | Keywords must be consistently upper case.
L: 1 | P: 39 | L010 | Keywords must be consistently upper case.
L: 1 | P: 53 | L009 | Files must end with a single trailing newline.
All Finished!
Format
Π€ΠΎΡΠΌΠ°ΡΠΈΡΠΎΠ²Π°Π½ΠΈΠ΅ ΠΊΠΎΠ΄Π° ΠΎΡΡΡΠ΅ΡΡΠ²Π»ΡΠ΅ΡΡΡ ΡΠ°ΠΊ ΠΆΠ΅ Ρ ΠΏΠΎΠΌΠΎΡΡΡ sqlglot. ΠΡΠΈΠΌΠ΅Ρ ΠΎΡΡΠΎΡΠΌΠ°ΡΠΈΡΠΎΠ²Π°Π½Π½ΠΎΠ³ΠΎ Π·Π°ΠΏΡΠΎΡΠ°:
[Formatting sql query using sqlglot]
SELECT
*
FROM x
JOIN y
ON x.id = y.id
WHERE
x.id > 2
Optimize
ΠΠ΅ΡΠ²ΠΈΡΠ½Π°Ρ ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΡ Π·Π°ΠΏΡΠΎΡΠ° ΠΏΡΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΡΡΡ Ρ ΠΏΠΎΠΌΠΎΡΡΡ sqlglot
Π½Π° ΠΎΡΠ½ΠΎΠ²Π΅ ΠΏΠΎΡΡΡΠΎΠ΅Π½Π½ΠΎΠ³ΠΎ ΡΠ°Π½Π΅Π΅ AST Π΄Π΅ΡΠ΅Π²Π° ΠΈ ΡΡ
Π΅ΠΌΡ Π΄Π°Π½Π½ΡΡ
. ΠΠΎΠ»Π½ΡΠΉ ΡΠΏΠΈΡΠΎΠΊ
ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΉ Π·Π°ΠΏΡΠΎΡΠ° Π²ΠΊΠ»ΡΡΠ°Π°Π΅Ρ Π² ΡΠ΅Π±Ρ ΡΠ°ΠΊΠΈΠ΅ ΠΎΠΏΠ΅ΡΠ°ΡΠΈΠΈ ΠΊΠ°ΠΊ
expression = qualify_tables(expression, db=db, catalog=catalog)
expression = isolate_table_selects(expression)
expression = qualify_columns(expression, schema)
expression = pushdown_projections(expression)
expression = normalize(expression)
expression = unnest_subqueries(expression)
expression = expand_multi_table_selects(expression)
expression = pushdown_predicates(expression)
expression = optimize_joins(expression)
expression = eliminate_subqueries(expression)
expression = quote_identities(expression)
return expression
ΠΡΠΈΠΌΠ΅Ρ ΠΎΠΏΡΠΈΠΌΠΈΠ·ΠΈΡΠΎΠ²Π°Π½Π½ΠΎΠ³ΠΎ Π·Π°ΠΏΡΠΎΡΠ°:
# ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Optimize
!python -m swissql optimize -q "SELECT * from x join y on x.id = y.id where x.id > 2" -s '{"x":{"id":"INT", "name":"STRING"},"y":{"id":"INT","name":"STRING"}}' -o optimize
[Optimizing sql query using sqlglot]
Optimization: optimize
SELECT
"x"."id" AS "id",
"x"."name" AS "name",
"y"."id" AS "id",
"y"."name" AS "name"
FROM (
SELECT
"x"."id" AS "id",
"x"."name" AS "name"
FROM "x" AS "x"
WHERE
"x"."id" > 2
) AS "x"
JOIN (
SELECT
"y"."id" AS "id",
"y"."name" AS "name"
FROM "y" AS "y"
) AS "y"
ON "x"."id" = "y"."id"
Π‘ΡΠΎΠΈΡ ΠΎΡΠΌΠ΅ΡΠΈΡΡ, ΡΡΠΎ Ρ ΠΏΠΎΠΌΠΎΡΡΡ ΡΠ»Π°Π³Π° -o
Π²Ρ ΠΌΠΎΠΆΠ΅ΡΠ΅ ΡΠΊΠ°Π·Π°ΡΡ ΠΊΠΎΠ½ΠΊΡΠ΅ΡΠ½ΠΎ ΠΊΠ°ΠΊΠΈΠ΅ ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΈ Π²Π°ΠΌ ΡΡΠ΅Π±ΡΡΡΡΡ.
Anti_pattern
Π‘ ΠΏΠΎΠΌΠΎΡΡΡ sqlcheck
ΠΌΡ ΠΏΡΠΎΠ²Π΅ΡΡΠ΅ΠΌ Π½Π°ΡΠΈ Π·Π°ΠΏΡΠΎΡΡ Π½Π° Π½Π°Π»ΠΈΡΠΈΠ΅ Π² Π½ΠΈΡ
Π°Π½ΡΠΈΠΏΠ°ΡΡΠ΅ΡΠ½ΠΎΠ². ΠΠΎΠ΄ΡΠ»Ρ ΠΎΡΠΎΠ±Π΅Π½Π½ΠΎ ΠΏΠΎΠ»Π΅Π·Π΅Π½, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ ΠΏΡΠ΅Π΄ΠΎΡΡΠ°Π²Π»ΡΠ΅Ρ ΠΏΠΎΠ΄ΡΠΎΠ±Π½ΠΎΠ΅ ΠΎΠ±ΡΡΡΠ½Π΅Π½ΠΈΠ΅, ΠΏΠΎΡΠ΅ΠΌΡ ΡΠΎ ΠΈΠ»ΠΈ ΠΈΠ½ΠΎΠ΅ ΡΠ΅ΡΠ΅Π½ΠΈΠ΅ - ΠΏΠ»ΠΎΡ
ΠΎΠ΅, ΠΈ ΠΏΡΠ΅Π΄Π»Π°Π³Π°Π΅Ρ ΠΌΠ΅ΡΠΎΠ΄Ρ Π΅Π³ΠΎ ΡΠ΅ΡΠ΅Π½ΠΈΡ.
ΠΠΊΡΡΠ°Π»ΡΠ½ΡΠΉ ΡΠΏΠΈΡΠΎΠΊ Π²ΡΡΠ²Π»ΡΠ΅ΠΌΡΡ
Π°Π½ΡΠΈΠΏΠ°ΡΡΠ΅ΡΠ½ΠΎΠ² ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½ Π·Π΄Π΅ΡΡ
# ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Anti_Pattern
!python -m swissql anti_pattern -q "SELECT * from x join y on x.id = y.id where x.id > 2"
οΏ½[33m[Detecting anti-patterns using sqlcheck]οΏ½[0m
sqlcheck version 1.2.1
Debug build (NDEBUG not #defined)
+-------------------------------------------------+
| SQLCHECK |
+-------------------------------------------------+
> RISK LEVEL :: ALL ANTI-PATTERNS
> SQL FILE NAME :: temp.sql
> COLOR MODE :: ENABLED
> VERBOSE MODE :: ENABLED
> DELIMITER :: ;
-------------------------------------------------
==================== Results ===================
-------------------------------------------------
SQL Statement at line 1: οΏ½[1mοΏ½[31mselect * from x join y on x.id = y.id where x.id > 2;οΏ½[0mοΏ½[39m
[temp.sql]: (οΏ½[1mοΏ½[32mHIGH RISKοΏ½[0mοΏ½[39m) οΏ½[1mοΏ½[34mSELECT *οΏ½[0mοΏ½[39m
β Inefficiency in moving data to the consumer:
When you SELECT *, you're often retrieving more columns from the database than
your application really needs to function. This causes more data to move from
the database server to the client, slowing access and increasing load on your
machines, as well as taking more time to travel across the network. This is
especially true when someone adds new columns to underlying tables that didn't
exist and weren't needed when the original consumers coded their data access.
β Indexing issues:
Consider a scenario where you want to tune a query to a high level of
performance. If you were to use *, and it returned more columns than you
actually needed, the server would often have to perform more expensive methods
to retrieve your data than it otherwise might. For example, you wouldn't be able
to create an index which simply covered the columns in your SELECT list, and
even if you did (including all columns [shudder]), the next guy who came around
and added a column to the underlying table would cause the optimizer to ignore
your optimized covering index, and you'd likely find that the performance of
your query would drop substantially for no readily apparent reason.
β Binding
Problems:
When you SELECT *, it's possible to retrieve two columns of the same name from
two different tables. This can often crash your data consumer. Imagine a query
that joins two tables, both of which contain a column called "ID". How would a
consumer know which was which? SELECT * can also confuse views (at least in some
versions SQL Server) when underlying table structures change -- the view is not
rebuilt, and the data which comes back can be nonsense. And the worst part of it
is that you can take care to name your columns whatever you want, but the next
guy who comes along might have no way of knowing that he has to worry about
adding a column which will collide with your already-developed names.
[Matching Expression: οΏ½[1mοΏ½[34mselect *οΏ½[0mοΏ½[39m at line 1]
==================== Summary ===================
All Anti-Patterns and Hints :: 1
> High Risk :: 1
> Medium Risk :: 0
> Low Risk :: 0
> Hints :: 0
Rule
ΠΠΎΠ΄Π΄Π΅ΡΠΆΠΊΠ° ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΈΡ
ΠΏΡΠ°Π²ΠΈΠ» ΠΎΡΡΡΠ΅ΡΡΠ²Π»ΡΠ΅ΡΡΡ Ρ ΠΏΠΎΠΌΠΎΡΡΡ Lark
ΠΈ Π΄ΠΎΠ±Π°Π²Π»Π΅Π½ΠΈΡ ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»Π΅ΠΌ ΡΠ°ΠΉΠ»ΠΎΠ² Ρ ΠΏΡΠΎΠΏΠΈΡΠ°Π½Π½ΠΎΠΉ BNF Π³ΡΠ°ΠΌΠΌΠ°ΡΠΈΠΊΠΎΠΉ Π½Π΅ΠΎΠ±Ρ
ΠΎΠ΄ΠΈΠΌΠΎΠ³ΠΎ ΠΏΠ°ΡΡΠ΅ΡΠ½Π°. ΠΡΠΈΠΌΠ΅Ρ Π³ΡΠ°ΠΌΠΌΠ°ΡΠΈΠΊΠΈ ΠΈ ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ ΠΏΠ°ΡΡΠ΅ΡΠ½Π° - Π½ΠΈΠΆΠ΅
# ΠΠ°ΠΊ Π²ΡΠ³Π»ΡΠ΄ΠΈΡ ΡΠ°ΠΉΠ» Ρ BNF Π³ΡΠ°ΠΌΠΌΠ°ΡΠΈΠΊΠΎΠΉ
!cat rules/filter2.lark
SELECT WILDCARD ANTIPATTERN :END_OF_COMMENT:
start: /./ start | pattern start | /./ | pattern
pattern: "SELECT *"i
%import common.INT
//%ignore " "
%ignore "\n"
ΠΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΠ΅ ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΎΠ³ΠΎ ΠΏΡΠ°Π²ΠΈΠ»Π°:
# ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Rules
!python -m swissql rule -r filter2.lark -q "SELECT * from x join y on x.id = y.id where x.id > 2"
[Finding rules using lark]
Rule filter2.lark found:
Positions: (1, 1)
Comment: SELECT WILDCARD ANTIPATTERN
Extract
Extract - ΠΌΠΎΠ΄ΡΠ»Ρ, Π½Π° ΠΎΡΠ½ΠΎΠ²Π΅ ΡΠ΅Π³ΡΠ»ΡΡΠ½ΡΡ
Π²ΡΡΠ°ΠΆΠ΅Π½ΠΈΠΉ Π΄Π»Ρ ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ ΡΡΡΠΎΠΊ ΠΈ ΠΈΡ
ΠΏΠ°ΡΡΠΈΠ½Π³Π° Ρ ΠΏΠΎΠΌΠΎΡΡΡ Lark
Π΄Π»Ρ ΠΎΡΡΠ΅ΠΈΠ²Π°Π½ΠΈΡ Π½Π΅ SQL-Π·Π°ΠΏΡΠΎΡΠΎΠ². ΠΠΎΠ·Π²ΠΎΠ»ΡΠ΅Ρ ΠΈΠ·Π²Π»Π΅ΡΡ ΠΈΠ· ΠΊΠΎΠ΄ΠΎΠ²ΠΎΠΉ Π±Π°Π·Ρ SQL-Π·Π°ΠΏΡΠΎΡΡ Π΄Π»Ρ Π΄Π°Π»ΡΠ½Π΅ΠΉΡΠ΅ΠΉ ΠΎΠ±ΡΠ°Π±ΠΎΡΠΊΠΈ ΠΈ ΡΠ΅ΡΠ°ΠΊΡΠΎΡΠΈΠ½Π³Π°. ΠΡΠΈΠΌΠ΅Ρ ΡΠ°Π±ΠΎΡΡ:
# ΠΠ½Π°Π»ΠΈΠ·ΠΈΡΡΠ΅ΠΌΡΠΉ ΠΊΡΡΠΎΠΊ ΠΊΠΎΠ΄Π°
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from datetime import datetime, date
from pyspark.sql import Row
def init_spark():
spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
sc = spark.sparkContext
return spark, sc
# SQl Queries
sql_queries = [
'SELECT * FROM table1;',
'SELECT * FROM table1 WHERE id = 1;',
'SELECT * FROM table1 WHERE id = 1 AND name = "John" OR name = "Jane" OR name = "Jack";',
]
def main():
spark_sql_queries = [
"SELECT * FROM table",
]
...
if __name__ == "__main__":
main()
Π Π΅Π·ΡΠ»ΡΡΠ°Ρ:
# ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠ³ΠΎ ΠΌΠΎΠ΄ΡΠ»Ρ - Swissql extractor
!python -m swissql extract -f examples/example3.py
[Extracting Spark SQLs from file using lark]
Found:
SELECT * FROM table1
SELECT * FROM table1 WHERE id = 1
SELECT * FROM table1 WHERE id = 1 AND name = "John" OR name = "Jane" OR name = "Jack"
SELECT * FROM table
select * from tableA
Π‘ΡΠ΅ΠΊ
- argparse - ΠΠΎΡΡΡΠΎΠ΅Π½ΠΈΠ΅ CLI
- sqlglot - Π‘ΠΈΠ½ΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΠΈΠΉ ΡΠ°Π·Π±ΠΎΡ SparkSQL, ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΡ ΠΈ ΡΠΎΡΠΌΠ°ΡΠΈΡΠΎΠ²Π°Π½ΠΈΠ΅
- sqlfluff - Π‘ΡΠΈΠ»ΠΈΡΡΠΈΡΠ΅ΡΠΊΠΈΠΉ Π°Π½Π°Π»ΠΈΠ· SparkSQL, ΡΠ΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°ΡΠΈΠΈ ΠΏΠΎ ΡΡΠΈΠ»Ρ ΠΈ ΡΠΎΡΠΌΠ°ΡΠΈΡΠΎΠ²Π°Π½ΠΈΠ΅
- sqlcheck - ΠΠ΅ΡΠ΅ΠΊΡΠΈΡ Π°Π½ΡΠΈ-ΠΏΠ°ΡΡΠ΅ΡΠ½ΠΎΠ² Π² Π·Π°ΠΏΡΠΎΡΠ°Ρ SparkSQL. Π Π΅ΠΊΠΎΠΌΠΌΠ΅Π½Π΄Π°ΡΠΈΠΈ ΠΏΠΎ ΡΠ»ΡΡΡΠ΅Π½ΠΈΡ ΠΊΠΎΠ΄Π°.
- Lark - ΠΠ°ΡΡΠΈΠ½Π³ Π½Π° ΠΎΡΠ½ΠΎΠ²Π΅ BNF Π³ΡΠ°ΠΌΠΌΠ°ΡΠΈΠΊΠΈ. ΠΡΠΎΠ²Π΅ΡΠΊΠ° ΡΡΡΠΎΠΊ Π½Π° Π½Π°Π»ΠΈΡΠΈΠ΅ SQL Π·Π°ΠΏΡΠΎΡΠ°, ΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΠ΅Π»ΡΡΠΊΠΈΠ΅ ΠΏΡΠ°Π²ΠΈΠ»Π°.
ΠΠ°ΡΠ΅ΡΡΠ²ΠΎ ΠΊΠΎΠ΄Π°
ΠΠ°ΡΠ΅ΡΡΠ²ΠΎ ΠΈ Π΅Π΄ΠΈΡΠ½ΡΠ²ΠΎ ΡΡΠΈΠ»Ρ ΠΊΠΎΠ΄Π° Π² ΡΠ΅ΠΏΠΎΠ·ΠΈΡΠΎΡΠΈΠΈ ΠΎΠ±Π΅ΡΠΏΠ΅ΡΠΈΠ²Π°Π΅ΡΡΡ Ρ ΠΏΠΎΠΌΠΎΡΡΡ Π»ΠΈΠ½ΡΠ΅ΡΠ° black
Π§ΡΠΎ ΠΌΡ ΠΎΡΠ΅Π½Ρ Ρ ΠΎΡΠ΅Π»ΠΈ, Π½ΠΎ Π½Π΅ ΡΡΠΏΠ΅Π»ΠΈ ΡΠ΄Π΅Π»Π°ΡΡ
- Π Π°ΡΠΏΠ°ΡΠ°Π»Π»Π΅Π»ΠΈΠ²Π°Π½ΠΈΠ΅ Π²ΡΡΠΈΡΠ»Π΅Π½ΠΈΠΉ ΠΏΡΠΈ Π²ΡΠ·ΠΎΠ²Π΅
swissql all
- ΠΏΠΎΠ·Π²ΠΎΠ»ΠΈΠ»ΠΎ Π±Ρ Π² Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΎ ΡΠ°Π· ΡΡΠΊΠΎΡΠΈΡΡ Π²ΡΠΏΠΎΠ»Π½Π΅Π½ΠΈΠ΅ ΠΏΡΠΎΠ³ΡΠ°ΠΌΠΌΡ - ΠΡΠ°ΡΠΈΠ²ΡΠΉ Π²Π΅Π±-ΠΈΠ½ΡΠ΅ΡΡΠ΅ΠΉΡ
- ΠΠΎΠΈΡΠΊ Π°Π½ΡΠΈΠΏΠ°ΡΡΠ΅ΡΠ½ΠΎΠ² Π² SparkSQL Ρ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π½ΠΈΠ΅ΠΌ ML/GNN
ΠΠ²ΡΠΎΡΡ
I PRO - ΡΡΠΎ ΠΌΡ!
- ΠΠ³ΠΎΡ ΠΠΎΠ»ΡΠ±Π΅Π² - Π±Π΅ΠΊΠ΅Π½Π΄-ΡΠ°Π·ΡΠ°Π±ΠΎΡΡΠΈΠΊ
- ΠΠ½Π΄ΡΠ΅ΠΉ ΠΠ°ΡΠ°Π½ΠΎΠ² - Π±Π΅ΠΊΠ΅Π½Π΄-ΡΠ°Π·ΡΠ°Π±ΠΎΡΡΠΈΠΊ
- ΠΠ³ΠΎΡ Π‘ΠΌΡΡΡΠ³ΠΈΠ½ - ΠΏΡΠΎΠ΄ΡΠΊΡ-ΠΌΠ΅Π½Π΅Π΄ΠΆΠ΅Ρ
- ΠΠ΅Π±Π΅Π΄Π΅Π²Π° Π’Π°ΡΡΡΠ½Π° - Π±Π΅ΠΊΠ΅Π½Π΄-ΡΠ°Π·ΡΠ°Π±ΠΎΡΡΠΈΠΊ
- Π’Π°ΠΌΠΏΠΈΠΎ ΠΠ»ΡΡ - Π±Π΅ΠΊΠ΅Π½Π΄-ΡΠ°Π·ΡΠ°Π±ΠΎΡΡΠΈΠΊ