SQL¶
Use SQL functions to operate on the underlying database storing the chain data. Useful
for operations like DataChain.filter
and DataChain.mutate
. Import
these functions from datachain.sql.functions
.
avg
¶
avg(col: str) -> Func
Returns the AVG aggregate SQL function for the given column name.
The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.
Parameters:
-
col
(str
) –The name of the column for which to calculate the average.
Returns:
-
Func
(Func
) –A Func object that represents the AVG aggregate function.
Notes
- The
avg
function should be used on numeric columns. - Result column will always be of type float.
Source code in datachain/func/aggregate.py
count
¶
Returns the COUNT aggregate SQL function for the given column name.
The COUNT function returns the number of rows in a table.
Parameters:
-
col
(str
, default:None
) –The name of the column for which to count rows. If not provided, it defaults to counting all rows.
Returns:
-
Func
(Func
) –A Func object that represents the COUNT aggregate function.
Notes
- Result column will always be of type int.
Source code in datachain/func/aggregate.py
greatest
¶
Returns the greatest (largest) value from the given input values.
Parameters:
-
args
(ColT | str | int | float | Sequence
, default:()
) –The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
-
Func
(Func
) –A Func object that represents the greatest function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
least
¶
Returns the least (smallest) value from the given input values.
Parameters:
-
args
(ColT | str | int | float | Sequence
, default:()
) –The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
-
Func
(Func
) –A Func object that represents the least function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
max
¶
max(col: str) -> Func
Returns the MAX aggregate SQL function for the given column name.
The MAX function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.
Parameters:
-
col
(str
) –The name of the column for which to find the maximum value.
Returns:
-
Func
(Func
) –A Func object that represents the MAX aggregate function.
Notes
- The
max
function can be used with numeric, date, and string columns. - Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
min
¶
min(col: str) -> Func
Returns the MIN aggregate SQL function for the given column name.
The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.
Parameters:
-
col
(str
) –The name of the column for which to find the minimum value.
Returns:
-
Func
(Func
) –A Func object that represents the MIN aggregate function.
Notes
- The
min
function can be used with numeric, date, and string columns. - Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
rand
¶
Returns the random integer value.
Returns:
-
Func
(Func
) –A Func object that represents the rand function.
Note
- Result column will always be of type integer.
Source code in datachain/func/random.py
sum
¶
sum(col: str) -> Func
Returns the SUM aggregate SQL function for the given column name.
The SUM function returns the total sum of a numeric column in a table. It sums up all the values for the specified column.
Parameters:
-
col
(str
) –The name of the column for which to calculate the sum.
Returns:
-
Func
(Func
) –A Func object that represents the SUM aggregate function.
Notes
- The
sum
function should be used on numeric columns. - Result column type will be the same as the input column type.
Source code in datachain/func/aggregate.py
array
¶
cosine_distance
¶
Computes the cosine distance between two vectors.
The cosine distance is derived from the cosine similarity, which measures the angle between two vectors. This function returns the dissimilarity between the vectors, where 0 indicates identical vectors and values closer to 1 indicate higher dissimilarity.
Parameters:
-
args
(str | Sequence
, default:()
) –Two vectors to compute the cosine distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
-
Func
(Func
) –A Func object that represents the cosine_distance function.
Example
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
euclidean_distance
¶
Computes the Euclidean distance between two vectors.
The Euclidean distance is the straight-line distance between two points in Euclidean space. This function returns the distance between the two vectors.
Parameters:
-
args
(str | Sequence
, default:()
) –Two vectors to compute the Euclidean distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
-
Func
(Func
) –A Func object that represents the euclidean_distance function.
Example
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
length
¶
Returns the length of the array.
Parameters:
-
arg
(str | Sequence | Func
) –Array to compute the length of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values. If a Func is provided, it is assumed to be a function returning an array.
Returns:
-
Func
(Func
) –A Func object that represents the array length function.
Example
Note
- Result column will always be of type int.
Source code in datachain/func/array.py
sip_hash_64
¶
Computes the SipHash-64 hash of the array.
Parameters:
-
arg
(str | Sequence
) –Array to compute the SipHash-64 hash of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values.
Returns:
-
Func
(Func
) –A Func object that represents the sip_hash_64 function.
Example
Note
- This function is only available for the ClickHouse warehouse.
- Result column will always be of type int.
Source code in datachain/func/array.py
path
¶
file_ext
¶
Returns the extension of the given path.
Parameters:
-
col
(str | literal
) –String to compute the file extension of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) –A Func object that represents the file extension function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
file_stem
¶
Returns the path without the extension.
Parameters:
-
col
(str | literal
) –String to compute the file stem of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) –A Func object that represents the file stem function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
name
¶
Returns the final component of a posix-style path.
Parameters:
-
col
(str | literal
) –String to compute the path name of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) –A Func object that represents the path name function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
parent
¶
Returns the directory component of a posix-style path.
Parameters:
-
col
(str | literal | Func
) –String to compute the path parent of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) –A Func object that represents the path parent function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
string
¶
length
¶
Returns the length of the string.
Parameters:
-
col
(str | literal | Func
) –String to compute the length of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) –A Func object that represents the string length function.
Example
Note
- Result column will always be of type int.
Source code in datachain/func/string.py
regexp_replace
¶
Replaces substring that match a regular expression.
Parameters:
-
col
(str | literal
) –Column to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
regex
(str
) –Regular expression pattern to replace.
-
replacement
(str
) –Replacement string.
Returns:
-
Func
(Func
) –A Func object that represents the regexp_replace function.
Note
- Result column will always be of type string.
Source code in datachain/func/string.py
replace
¶
Replaces substring with another string.
Parameters:
-
col
(str | literal
) –Column to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
pattern
(str
) –Pattern to replace.
-
replacement
(str
) –Replacement string.
Returns:
-
Func
(Func
) –A Func object that represents the replace function.
Note
- Result column will always be of type string.
Source code in datachain/func/string.py
split
¶
Takes a column and split character and returns an array of the parts.
Parameters:
-
col
(str | literal
) –Column to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
sep
(str
) –Separator to split the string.
-
limit
(int
, default:None
) –Maximum number of splits to perform.
Returns:
-
Func
(Func
) –A Func object that represents the split function.
Example
Note
- Result column will always be of type array of strings.