Spark SQL analytic engine


License
GPL-3.0
Install
pip install swissql==0.2.2

Documentation

swissql - статичСский Π°Π½Π°Π»ΠΈΠ· SparkSQL

swissql

Плюшки β€’ Как ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚ΡŒ? β€’ Установка β€’ Π‘Ρ‚Π΅ΠΊ β€’ Авторы

ΠŸΠΎΡΡ‚Π°Π½ΠΎΠ²ΠΊΠ° Π·Π°Π΄Π°Ρ‡ΠΈ

НСобходимо Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Π°Ρ‚ΡŒ статичСский Π°Π½Π°Π»ΠΈΠ·Π°Ρ‚ΠΎΡ€ SparkSQL ΠΊΠΎΠ΄Π°. Π“Ρ€Π°ΠΌΠΌΠ°Ρ‚ΠΈΠΊΠ° Π΄ΠΎΠ»ΠΆΠ½Π° Ρ€Π°ΡΡˆΠΈΡ€ΡΡ‚ΡŒΡΡ Ρ‡Π΅Ρ€Π΅Π· ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½Ρ‹ΠΉ BNF Ρ„Π°ΠΉΠ» (https://en.wikipedia.org/wiki/Backus–Naur_form). Π•ΡΡ‚ΡŒ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡ‚ΡŒ Π΄ΠΎΠ±Π°Π²Π»ΡΡ‚ΡŒ ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»ΡŒΡΠΊΠΈΠ΅ ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΊΠΈ Π² ΠΊΠΎΠ΄Π΅. Π•ΡΡ‚ΡŒ ООП интСрфСйс для использования ΠΈΠ· ΠΊΠΎΠ΄Π°. Π˜Π½ΡΡ‚Ρ€ΡƒΠΌΠ΅Π½Ρ‚ Π΄ΠΎΠ»ΠΆΠ΅Π½ Π±Ρ‹Ρ‚ΡŒ быстрым ΠΈ ΠΈΠ·ΠΎΠ»ΠΈΡ€ΠΎΠ²Π°Π½Π½Ρ‹ΠΌ.

РСшСниС - Π§Ρ‚ΠΎ Ρ‚Π°ΠΊΠΎΠ΅ swissql

swissql - это python Π±ΠΈΠ±Π»ΠΈΠΎΡ‚Π΅ΠΊΠ° для статичСского Π°Π½Π°Π»ΠΈΠ·Π° SparkSQL, ΠΊ ΠΊΠΎΡ‚ΠΎΡ€ΠΎΠΉ Ρ‚Π°ΠΊΠΆΠ΅ прСдоставляСтся CLI. Она состоит ΠΈΠ· ΡΠ»Π΅Π΄ΡƒΡŽΡ‰ΠΈΡ… ΠΌΠΎΠ΄ΡƒΠ»Π΅ΠΉ:

Плюшки

...ΠΈΠ»ΠΈ Ρ‡Ρ‚ΠΎ Π΄Π΅Π»Π°Π΅Ρ‚ это Ρ€Π΅ΡˆΠ΅Π½ΠΈΠ΅ особСнным.

  • CLI интСрфСйс - сдСлаСт ΠΈΠ½Ρ‚Π΅Π³Ρ€Π°Ρ†ΠΈΡŽ Π² CI/CD ΠΊΠ°ΠΊ Π½ΠΈΠΊΠΎΠ³Π΄Π° ΠΏΡ€ΠΈΠ²Ρ‹Ρ‡Π½ΠΎΠΉ.
  • Π Π°Π·Π½ΠΎΡ†Π²Π΅Ρ‚Π½Ρ‹ΠΉ тСкст для облСгчСния Π²ΠΈΠ·ΡƒΠ°Π»ΡŒΠ½ΠΎΠ³ΠΎ восприятия Π² консоли.
  • Π£Ρ‚ΠΈΠ»ΠΈΡ‚Π° Π²Ρ‹Π»ΠΎΠΆΠ΅Π½Π° ΠΈ доступна Π½Π° PyPi, Ρ‡Ρ‚ΠΎ ΠΎΠ±Π»Π΅Π³Ρ‡Π°Π΅Ρ‚ ΠΈΠ½Ρ‚Π΅Π³Ρ€Π°Ρ†ΠΈΡŽ Π² Ρ€Π°Π±ΠΎΡ‡ΠΈΠΉ процСсс.
  • ИспользованиС ΠΏΠΎΠ΄Π΄Π΅Ρ€ΠΆΠΈΠ²Π°Π΅ΠΌΡ‹Ρ… Ρ€Π΅ΡˆΠ΅Π½ΠΈΠΉ - это позволяСт Π±Π΅Π· ΠΏΡ€ΠΎΠ±Π»Π΅ΠΌ ΠΌΠ°ΡΡˆΡ‚Π°Π±ΠΈΡ€ΠΎΠ²Π°Ρ‚ΡŒ ΠΏΡ€ΠΎΠ΅ΠΊΡ‚ ΠΈ Π»Π΅Π³Ρ‡Π΅ Π°Π΄Π°ΠΏΡ‚ΠΈΡ€ΠΎΠ²Π°Ρ‚ΡŒ Π΅Π³ΠΎ ΠΏΠΎΠ΄ Π΄Ρ€ΡƒΠ³ΠΈΠ΅ Π΄ΠΈΠ°Π»Π΅ΠΊΡ‚Ρ‹ SQL.
  • ΠšΠΎΠΌΠΌΠ΅Π½Ρ‚Π°Ρ€ΠΈΠΈ Π² ΠΊΠΎΠ΄Π΅ ΠΏΠΎΠΌΠΎΠ³ΡƒΡ‚ Π±Π΅Π· Π»ΠΈΡˆΠ½ΠΈΡ… ΠΏΡ€ΠΎΠ±Π»Π΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚ΡŒ Π±ΠΈΠ±Π»ΠΈΠΎΡ‚Π΅ΠΊΡƒ Π² Π΄Ρ€ΡƒΠ³ΠΈΡ… python скриптах.
  • ΠžΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΡ запросов ΠΏΠΎΠ·Π²ΠΎΠ»ΠΈΡ‚ ΡΡΠΊΠΎΠ½ΠΎΠΌΠΈΡ‚ΡŒ рСсурсы ΠΏΡ€ΠΈ Ρ€Π°Π±ΠΎΡ‚Π΅ с BigQuery.

Установка

# Установка sqlcheck ( https://github.com/jarulraj/sqlcheck#installation )
# Debian
wget https://github.com/jarulraj/sqlcheck/releases/download/v1.3/sqlcheck-x86_64.deb
dpkg -i sqlcheck-x86_64.deb

# Windows
# ΠŸΡ€ΠΈ установкС для Windows Π½Π΅ Π·Π°Π±ΡƒΠ΄ΡŒΡ‚Π΅ Π΄ΠΎΠ±Π°Π²ΠΈΡ‚ΡŒ sqlcheck Π² PATH
wget https://github.com/jarulraj/sqlcheck/releases/download/v1.3/sqlcheck.exe

# ПослС sqlcheck, установим swissql
pip install swissql

Π€ΡƒΠ½ΠΊΡ†ΠΈΠΎΠ½Π°Π»

Наглядно использованиС swissql описано Π² нашСм Π½ΠΎΡƒΡ‚Π±ΡƒΠΊΠ΅ Π½Π° Google Colab. НиТС ΠΏΡ€ΠΈΠ²Π΅Π΄Π΅Π½Ρ‹ Π΅Π³ΠΎ Ρ„Ρ€Π°Π³ΠΌΠ΅Π½Ρ‚Ρ‹ с объяснСниСм Ρ‚ΠΎΠ³ΠΎ, ΠΊΠ°ΠΊ ΠΈ Ρ‡Ρ‚ΠΎ Ρ€Π°Π±ΠΎΡ‚Π°Π΅Ρ‚.

Syntax


БинтаксичСский Ρ€Π°Π·Π±ΠΎΡ€ SparkSQL ΠΈ построСниС AST с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ sqlglot позволяСт ΠΏΡ€ΠΎΠ²ΠΎΠ΄ΠΈΡ‚ΡŒ RBO (rule-based) ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΈ Π² ΠΌΠΎΠ΄ΡƒΠ»Π΅ Optimize. ΠŸΡ€ΠΈΠΌΠ΅Ρ€ построСнного AST:

$ # ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Syntax
!python -m swissql syntax -q "SELECT * from x join y on x.id = y.id where x.id > 2"

[Generating syntax tree using sqlglot]
(SELECT expressions: 
  (STAR ), from: 
  (FROM expressions: 
    (TABLE this: 
      (IDENTIFIER this: x, quoted: False))), joins: 
  (JOIN this: 
    (TABLE this: 
      (IDENTIFIER this: y, quoted: False)), on: 
    (EQ this: 
      (COLUMN this: 
        (IDENTIFIER this: id, quoted: False), table: 
        (IDENTIFIER this: x, quoted: False)), expression: 
      (COLUMN this: 
        (IDENTIFIER this: id, quoted: False), table: 
        (IDENTIFIER this: y, quoted: False)))), where: 
  (WHERE this: 
    (GT this: 
      (COLUMN this: 
        (IDENTIFIER this: id, quoted: False), table: 
        (IDENTIFIER this: x, quoted: False)), expression: 
      (LITERAL this: 2, is_string: False))))

Style


Π‘ ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ sqlfluff ΠΌΡ‹ ΠΏΠΎΠ»ΡƒΡ‡Π°Π΅ΠΌ Ρ€Π΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°Ρ†ΠΈΠΈ ΠΏΠΎ ΡΡ‚ΠΈΠ»ΡŽ ΠΊΠΎΠ΄Π°. ΠŸΡ€ΠΈΠΌΠ΅Ρ€ Ρ€Π΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°Ρ†ΠΈΠΉ ΠΌΠΎΠΆΠ½ΠΎ ΡƒΠ²ΠΈΠ΄Π΅Ρ‚ΡŒ Π½ΠΈΠΆΠ΅:

# ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Style
!python -m swissql style -q "SELECT * from x join y on x.id = y.id where x.id > 2"

[Style sql query use sqlfluff]
/home/quakumei/Workspace/Code/sparkySQL/SparkSQL-Analyzer/swissql/analyzers/style/intermediate.sql
== [/home/quakumei/Workspace/Code/sparkySQL/SparkSQL-Analyzer/swissql/analyzers/style/intermediate.sql] FAIL
L:   1 | P:   1 | L044 | Query produces an unknown number of result columns.
L:   1 | P:   8 | L027 | Unqualified reference '*' found in select with more than
                       | one referenced table/view.
L:   1 | P:  10 | L010 | Keywords must be consistently upper case.
L:   1 | P:  17 | L010 | Keywords must be consistently upper case.
L:   1 | P:  17 | L051 | Join clauses should be fully qualified.
L:   1 | P:  24 | L010 | Keywords must be consistently upper case.
L:   1 | P:  39 | L010 | Keywords must be consistently upper case.
L:   1 | P:  53 | L009 | Files must end with a single trailing newline.
All Finished!

Format


Π€ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΡ€ΠΎΠ²Π°Π½ΠΈΠ΅ ΠΊΠΎΠ΄Π° осущСствляСтся Ρ‚Π°ΠΊ ΠΆΠ΅ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ sqlglot. ΠŸΡ€ΠΈΠΌΠ΅Ρ€ ΠΎΡ‚Ρ„ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΡ€ΠΎΠ²Π°Π½Π½ΠΎΠ³ΠΎ запроса:

[Formatting sql query using sqlglot]
SELECT
  *
FROM x
JOIN y
  ON x.id = y.id
WHERE
  x.id > 2

Optimize


ΠŸΠ΅Ρ€Π²ΠΈΡ‡Π½Π°Ρ оптимизация запроса производится с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ sqlglot Π½Π° основС построСнного Ρ€Π°Π½Π΅Π΅ AST Π΄Π΅Ρ€Π΅Π²Π° ΠΈ схСмы Π΄Π°Π½Π½Ρ‹Ρ…. ΠŸΠΎΠ»Π½Ρ‹ΠΉ список ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΉ запроса Π²ΠΊΠ»ΡŽΡ‡Π°Π°Π΅Ρ‚ Π² сСбя Ρ‚Π°ΠΊΠΈΠ΅ ΠΎΠΏΠ΅Ρ€Π°Ρ†ΠΈΠΈ ΠΊΠ°ΠΊ

    expression = qualify_tables(expression, db=db, catalog=catalog)
    expression = isolate_table_selects(expression)
    expression = qualify_columns(expression, schema)
    expression = pushdown_projections(expression)
    expression = normalize(expression)
    expression = unnest_subqueries(expression)
    expression = expand_multi_table_selects(expression)
    expression = pushdown_predicates(expression)
    expression = optimize_joins(expression)
    expression = eliminate_subqueries(expression)
    expression = quote_identities(expression)
    return expression

ΠŸΡ€ΠΈΠΌΠ΅Ρ€ ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·ΠΈΡ€ΠΎΠ²Π°Π½Π½ΠΎΠ³ΠΎ запроса:

# ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Optimize
!python -m swissql optimize -q "SELECT * from x join y on x.id = y.id where x.id > 2" -s '{"x":{"id":"INT", "name":"STRING"},"y":{"id":"INT","name":"STRING"}}' -o optimize

[Optimizing sql query using sqlglot]
Optimization: optimize
SELECT
  "x"."id" AS "id",
  "x"."name" AS "name",
  "y"."id" AS "id",
  "y"."name" AS "name"
FROM (
  SELECT
    "x"."id" AS "id",
    "x"."name" AS "name"
  FROM "x" AS "x"
  WHERE
    "x"."id" > 2
) AS "x"
JOIN (
  SELECT
    "y"."id" AS "id",
    "y"."name" AS "name"
  FROM "y" AS "y"
) AS "y"
  ON "x"."id" = "y"."id"

Π‘Ρ‚ΠΎΠΈΡ‚ ΠΎΡ‚ΠΌΠ΅Ρ‚ΠΈΡ‚ΡŒ, Ρ‡Ρ‚ΠΎ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ Ρ„Π»Π°Π³Π° -o Π²Ρ‹ ΠΌΠΎΠΆΠ΅Ρ‚Π΅ ΡƒΠΊΠ°Π·Π°Ρ‚ΡŒ ΠΊΠΎΠ½ΠΊΡ€Π΅Ρ‚Π½ΠΎ ΠΊΠ°ΠΊΠΈΠ΅ ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΈ Π²Π°ΠΌ Ρ‚Ρ€Π΅Π±ΡƒΡŽΡ‚ΡΡ.

Anti_pattern


Π‘ ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ sqlcheck ΠΌΡ‹ провСряСм наши запросы Π½Π° Π½Π°Π»ΠΈΡ‡ΠΈΠ΅ Π² Π½ΠΈΡ… Π°Π½Ρ‚ΠΈΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½ΠΎΠ². ΠœΠΎΠ΄ΡƒΠ»ΡŒ особСнно ΠΏΠΎΠ»Π΅Π·Π΅Π½, ΠΏΠΎΡΠΊΠΎΠ»ΡŒΠΊΡƒ прСдоставляСт ΠΏΠΎΠ΄Ρ€ΠΎΠ±Π½ΠΎΠ΅ объяснСниС, ΠΏΠΎΡ‡Π΅ΠΌΡƒ Ρ‚ΠΎ ΠΈΠ»ΠΈ ΠΈΠ½ΠΎΠ΅ Ρ€Π΅ΡˆΠ΅Π½ΠΈΠ΅ - ΠΏΠ»ΠΎΡ…ΠΎΠ΅, ΠΈ ΠΏΡ€Π΅Π΄Π»Π°Π³Π°Π΅Ρ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ Π΅Π³ΠΎ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ. ΠΠΊΡ‚ΡƒΠ°Π»ΡŒΠ½Ρ‹ΠΉ список выявляСмых Π°Π½Ρ‚ΠΈΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½ΠΎΠ² прСдставлСн здСсь

# ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Anti_Pattern
!python -m swissql anti_pattern -q "SELECT * from x join y on x.id = y.id where x.id > 2"

οΏ½[33m[Detecting anti-patterns using sqlcheck]οΏ½[0m
sqlcheck version 1.2.1
Debug build (NDEBUG not #defined)
+-------------------------------------------------+
|                   SQLCHECK                      |
+-------------------------------------------------+
> RISK LEVEL    :: ALL ANTI-PATTERNS
> SQL FILE NAME :: temp.sql
> COLOR MODE    :: ENABLED
> VERBOSE MODE  :: ENABLED
> DELIMITER     :: ;
-------------------------------------------------
==================== Results ===================

-------------------------------------------------
SQL Statement at line 1: οΏ½[1mοΏ½[31mselect * from x join y on x.id = y.id where x.id > 2;οΏ½[0mοΏ½[39m
[temp.sql]: (οΏ½[1mοΏ½[32mHIGH RISKοΏ½[0mοΏ½[39m) οΏ½[1mοΏ½[34mSELECT *οΏ½[0mοΏ½[39m
● Inefficiency in moving data to the consumer:
When you SELECT *, you're often retrieving more columns from the database than
your application really needs to function. This causes more data to move from
the database server to the client, slowing access and increasing load on your
machines, as well as taking more time to travel across the network. This is
especially true when someone adds new columns to underlying tables that didn't
exist and weren't needed when the original consumers coded their data access.


● Indexing issues:
Consider a scenario where you want to tune a query to a high level of
performance. If you were to use *, and it returned more columns than you
actually needed, the server would often have to perform more expensive methods
to retrieve your data than it otherwise might. For example, you wouldn't be able
to create an index which simply covered the columns in your SELECT list, and
even if you did (including all columns [shudder]), the next guy who came around
and added a column to the underlying table would cause the optimizer to ignore
your optimized covering index, and you'd likely find that the performance of
your query would drop substantially for no readily apparent reason.

● Binding
Problems:
When you SELECT *, it's possible to retrieve two columns of the same name from
two different tables. This can often crash your data consumer. Imagine a query
that joins two tables, both of which contain a column called "ID". How would a
consumer know which was which? SELECT * can also confuse views (at least in some
versions SQL Server) when underlying table structures change -- the view is not
rebuilt, and the data which comes back can be nonsense. And the worst part of it
is that you can take care to name your columns whatever you want, but the next
guy who comes along might have no way of knowing that he has to worry about
adding a column which will collide with your already-developed names.
[Matching Expression: οΏ½[1mοΏ½[34mselect *οΏ½[0mοΏ½[39m at line 1]


==================== Summary ===================
All Anti-Patterns and Hints  :: 1
>  High Risk   :: 1
>  Medium Risk :: 0
>  Low Risk    :: 0
>  Hints       :: 0

Rule


ΠŸΠΎΠ΄Π΄Π΅Ρ€ΠΆΠΊΠ° ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»ΡŒΡΠΊΠΈΡ… ΠΏΡ€Π°Π²ΠΈΠ» осущСствляСтся с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ Lark ΠΈ добавлСния ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»Π΅ΠΌ Ρ„Π°ΠΉΠ»ΠΎΠ² с прописанной BNF Π³Ρ€Π°ΠΌΠΌΠ°Ρ‚ΠΈΠΊΠΎΠΉ Π½Π΅ΠΎΠ±Ρ…ΠΎΠ΄ΠΈΠΌΠΎΠ³ΠΎ ΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½Π°. ΠŸΡ€ΠΈΠΌΠ΅Ρ€ Π³Ρ€Π°ΠΌΠΌΠ°Ρ‚ΠΈΠΊΠΈ ΠΈ обнаруТСния ΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½Π° - Π½ΠΈΠΆΠ΅

# Как выглядит Ρ„Π°ΠΉΠ» с BNF Π³Ρ€Π°ΠΌΠΌΠ°Ρ‚ΠΈΠΊΠΎΠΉ
!cat rules/filter2.lark

SELECT WILDCARD ANTIPATTERN :END_OF_COMMENT:
start: /./ start | pattern start | /./ | pattern
pattern: "SELECT *"i
%import common.INT
//%ignore " "
%ignore "\n"

ΠžΠ±Π½Π°Ρ€ΡƒΠΆΠ΅Π½ΠΈΠ΅ ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»ΡŒΡΠΊΠΎΠ³ΠΎ ΠΏΡ€Π°Π²ΠΈΠ»Π°:

# ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Rules
!python -m swissql rule -r filter2.lark -q "SELECT * from x join y on x.id = y.id where x.id > 2"

[Finding rules using lark]
Rule filter2.lark found:
Positions: (1, 1)
Comment: SELECT WILDCARD ANTIPATTERN 

Extract


Extract - ΠΌΠΎΠ΄ΡƒΠ»ΡŒ, Π½Π° основС рСгулярных Π²Ρ‹Ρ€Π°ΠΆΠ΅Π½ΠΈΠΉ для обнаруТСния строк ΠΈ ΠΈΡ… парсинга с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ Lark для отсСивания Π½Π΅ SQL-запросов. ΠŸΠΎΠ·Π²ΠΎΠ»ΡΠ΅Ρ‚ ΠΈΠ·Π²Π»Π΅Ρ‡ΡŒ ΠΈΠ· ΠΊΠΎΠ΄ΠΎΠ²ΠΎΠΉ Π±Π°Π·Ρ‹ SQL-запросы для дальнСйшСй ΠΎΠ±Ρ€Π°Π±ΠΎΡ‚ΠΊΠΈ ΠΈ Ρ€Π΅Ρ„Π°ΠΊΡ‚ΠΎΡ€ΠΈΠ½Π³Π°. ΠŸΡ€ΠΈΠΌΠ΅Ρ€ Ρ€Π°Π±ΠΎΡ‚Ρ‹:

# АнализируСмый кусок ΠΊΠΎΠ΄Π°
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from datetime import datetime, date
from pyspark.sql import Row

def init_spark():
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


# SQl Queries
sql_queries = [
    'SELECT * FROM table1;',
    'SELECT * FROM table1 WHERE id = 1;',
    'SELECT * FROM table1 WHERE id = 1 AND name = "John" OR name = "Jane" OR name = "Jack";',
]

def main():
    spark_sql_queries = [
        "SELECT * FROM table",
    ]
...

if __name__ == "__main__":
    main()

Π Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚:

# ИспользованиС ΠΎΡ‚Π΄Π΅Π»ΡŒΠ½ΠΎΠ³ΠΎ модуля - Swissql extractor
!python -m swissql extract -f examples/example3.py
[Extracting Spark SQLs from file using lark]
Found:
SELECT * FROM table1
SELECT * FROM table1 WHERE id = 1
SELECT * FROM table1 WHERE id = 1 AND name = "John" OR name = "Jane" OR name = "Jack"
SELECT * FROM table
select * from tableA

Π‘Ρ‚Π΅ΠΊ

  • argparse - ΠŸΠΎΡΡ‚Ρ€ΠΎΠ΅Π½ΠΈΠ΅ CLI
  • sqlglot - БинтаксичСский Ρ€Π°Π·Π±ΠΎΡ€ SparkSQL, оптимизация ΠΈ Ρ„ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΡ€ΠΎΠ²Π°Π½ΠΈΠ΅
  • sqlfluff - БтилистичСский Π°Π½Π°Π»ΠΈΠ· SparkSQL, Ρ€Π΅ΠΊΠΎΠΌΠ΅Π½Π΄Π°Ρ†ΠΈΠΈ ΠΏΠΎ ΡΡ‚ΠΈΠ»ΡŽ ΠΈ Ρ„ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΡ€ΠΎΠ²Π°Π½ΠΈΠ΅
  • sqlcheck - ДСтСкция Π°Π½Ρ‚ΠΈ-ΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½ΠΎΠ² Π² запросах SparkSQL. Π Π΅ΠΊΠΎΠΌΠΌΠ΅Π½Π΄Π°Ρ†ΠΈΠΈ ΠΏΠΎ ΡƒΠ»ΡƒΡ‡ΡˆΠ΅Π½ΠΈΡŽ ΠΊΠΎΠ΄Π°.
  • Lark - ΠŸΠ°Ρ€ΡΠΈΠ½Π³ Π½Π° основС BNF Π³Ρ€Π°ΠΌΠΌΠ°Ρ‚ΠΈΠΊΠΈ. ΠŸΡ€ΠΎΠ²Π΅Ρ€ΠΊΠ° строк Π½Π° Π½Π°Π»ΠΈΡ‡ΠΈΠ΅ SQL запроса, ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»ΡŒΡΠΊΠΈΠ΅ ΠΏΡ€Π°Π²ΠΈΠ»Π°.

ΠšΠ°Ρ‡Π΅ΡΡ‚Π²ΠΎ ΠΊΠΎΠ΄Π°

ΠšΠ°Ρ‡Π΅ΡΡ‚Π²ΠΎ ΠΈ Сдиснтво стиля ΠΊΠΎΠ΄Π° Π² Ρ€Π΅ΠΏΠΎΠ·ΠΈΡ‚ΠΎΡ€ΠΈΠΈ обСспСчиваСтся с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ Π»ΠΈΠ½Ρ‚Π΅Ρ€Π° black

Π§Ρ‚ΠΎ ΠΌΡ‹ ΠΎΡ‡Π΅Π½ΡŒ Ρ…ΠΎΡ‚Π΅Π»ΠΈ, Π½ΠΎ Π½Π΅ успСли ΡΠ΄Π΅Π»Π°Ρ‚ΡŒ

  • РаспараллСливаниС вычислСний ΠΏΡ€ΠΈ Π²Ρ‹Π·ΠΎΠ²Π΅ swissql all - ΠΏΠΎΠ·Π²ΠΎΠ»ΠΈΠ»ΠΎ Π±Ρ‹ Π² нСсколько Ρ€Π°Π· ΡƒΡΠΊΠΎΡ€ΠΈΡ‚ΡŒ Π²Ρ‹ΠΏΠΎΠ»Π½Π΅Π½ΠΈΠ΅ ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΡ‹
  • ΠšΡ€Π°ΡΠΈΠ²Ρ‹ΠΉ Π²Π΅Π±-интСрфСйс
  • Поиск Π°Π½Ρ‚ΠΈΠΏΠ°Ρ‚Ρ‚Π΅Ρ€Π½ΠΎΠ² Π² SparkSQL с использованиСм ML/GNN

Авторы

I PRO - это ΠΌΡ‹!

  • Π•Π³ΠΎΡ€ Π“ΠΎΠ»ΡƒΠ±Π΅Π² - Π±Π΅ΠΊΠ΅Π½Π΄-Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Ρ‡ΠΈΠΊ
  • АндрСй Π‘Π°Ρ€Π°Π½ΠΎΠ² - Π±Π΅ΠΊΠ΅Π½Π΄-Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Ρ‡ΠΈΠΊ
  • Π•Π³ΠΎΡ€ Π‘ΠΌΡƒΡ€Ρ‹Π³ΠΈΠ½ - ΠΏΡ€ΠΎΠ΄ΡƒΠΊΡ‚-ΠΌΠ΅Π½Π΅Π΄ΠΆΠ΅Ρ€
  • Π›Π΅Π±Π΅Π΄Π΅Π²Π° Π’Π°Ρ‚ΡŒΡΠ½Π° - Π±Π΅ΠΊΠ΅Π½Π΄-Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Ρ‡ΠΈΠΊ
  • Π’Π°ΠΌΠΏΠΈΠΎ Илья - Π±Π΅ΠΊΠ΅Π½Π΄-Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Ρ‡ΠΈΠΊ