| Title: | An R Tool for Streamlining Database Joins |
|---|---|
| Description: | Simplifies and automates the process of exploring and merging data from relational databases. This package allows users to discover table relationships, create a map of all possible joins, and generate executable plans to merge data based on a structured metadata framework. |
| Authors: | Akshat Maurya [aut, cre], David Shilane [aut] |
| Maintainer: | Akshat Maurya <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9000 |
| Built: | 2026-05-11 08:28:14 UTC |
| Source: | https://github.com/akshat09867/dbmaps |
A generic function to add new table metadata to a registry object.
add_table(registry, table_metadata)add_table(registry, table_metadata)
registry |
The registry object to which metadata will be added. |
table_metadata |
A data.table object created by |
The updated registry object.
This function acts as a "planner." It takes a user's request for a final dataset, finds a path using a join map, and creates a structured plan (or "recipe") of the necessary steps.
create_join_plan( base_table, selections, metadata_dt, join_map = NULL, tables_dis = NULL )create_join_plan( base_table, selections, metadata_dt, join_map = NULL, tables_dis = NULL )
base_table |
A character string specifying the main table. |
selections |
A named list specifying the columns or aggregations to include. |
metadata_dt |
The master metadata data.table. |
join_map |
An optional "Join Map" data.table produced by |
tables_dis |
An optional named list of data.tables used for data‑driven (inferred) join discovery. If |
A list object representing the "join plan."
# --- 1. Define Metadata (Prerequisite) --- customers_meta <- table_info( table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id", key_outcome_specs = list( list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region") )) ) ) transactions_meta <- table_info( "transactions", "t.csv", "tx_id", key_outcome_specs = list(list(OutcomeName = "Revenue", ValueExpression = quote(r), AggregationMethods = list(list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id")))) ) master_metadata <- data.table::rbindlist(list(customers_meta, transactions_meta)) # --- 2. Define the Desired Output --- user_selections <- list( customers = "region", transactions = "RevenueByCustomer" ) # --- 3. Create the Join Plan WITHOUT providing the join_map --- # The function will now generate it automatically. join_plan <- create_join_plan( base_table = "customers", selections = user_selections, metadata_dt = master_metadata ) # --- 4. Inspect the Plan --- str(join_plan)# --- 1. Define Metadata (Prerequisite) --- customers_meta <- table_info( table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id", key_outcome_specs = list( list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region") )) ) ) transactions_meta <- table_info( "transactions", "t.csv", "tx_id", key_outcome_specs = list(list(OutcomeName = "Revenue", ValueExpression = quote(r), AggregationMethods = list(list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id")))) ) master_metadata <- data.table::rbindlist(list(customers_meta, transactions_meta)) # --- 2. Define the Desired Output --- user_selections <- list( customers = "region", transactions = "RevenueByCustomer" ) # --- 3. Create the Join Plan WITHOUT providing the join_map --- # The function will now generate it automatically. join_plan <- create_join_plan( base_table = "customers", selections = user_selections, metadata_dt = master_metadata ) # --- 4. Inspect the Plan --- str(join_plan)
Initializes an empty data.table with a custom class "MetadataRegistry" to store and manage metadata definitions.
create_metadata_registry()create_metadata_registry()
An empty data.table with the class "MetadataRegistry".
A sample dataset containing demographic information for customers included with the DBmaps package.
customerscustomers
A data.table with 5 variables:
A unique identifier for each customer.
The age of the customer in years.
The gender of the customer.
The income level of the customer.
The geographical region where the customer resides.
Generated for package examples.
Takes a plan generated by create_join_plan() and executes it sequentially
to produce a final, merged data.table.
execute_join_plan(join_plan, data_list)execute_join_plan(join_plan, data_list)
join_plan |
A data.table created by |
data_list |
A named list of the source data.tables. |
A final, merged data.table.
Reads metadata from a master data.table and generates executable data.table code strings for performing aggregations.
generate_aggregation_code(table_name_filter, metadata_dt)generate_aggregation_code(table_name_filter, metadata_dt)
table_name_filter |
Character string, the name of the table for which to generate aggregation code. |
metadata_dt |
A data.table containing the master metadata, created by
calling |
A named character vector where each element is a runnable
data.table code string, and the names correspond to the grouping variables.
# First, create some metadata customers_info <- table_info( table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id", key_outcome_specs = list( list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region") )) )) transactions_info <- table_info( table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = "transaction_id", key_outcome_specs = list( list(OutcomeName = "Revenue", ValueExpression = quote(amount), AggregationMethods = list( list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"), list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id") )), list(OutcomeName = "Transactions", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "TransactionsByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id") )) )) master_metadata <- data.table::rbindlist(list(customers_info, transactions_info)) # Now, generate the code for the "transactions" table generated_code <- generate_aggregation_code("transactions", master_metadata) print(generated_code) # To demonstrate execution: # 1. Create the sample data transactions <- data.table::data.table( transaction_id = c("T001", "T002", "T003"), customer_id = c("C001", "C002", "C001"), product_id = c("P001", "P002", "P001"), amount = c(10.0, 20.0, 15.0) ) # 2. Parse and evaluate the first generated statement revenue_by_customer_code <- generated_code["customer_id"] cat("Executing code:\n", revenue_by_customer_code) revenue_by_customer_dt <- eval(parse(text = revenue_by_customer_code)) print(revenue_by_customer_dt)# First, create some metadata customers_info <- table_info( table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id", key_outcome_specs = list( list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region") )) )) transactions_info <- table_info( table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = "transaction_id", key_outcome_specs = list( list(OutcomeName = "Revenue", ValueExpression = quote(amount), AggregationMethods = list( list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"), list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id") )), list(OutcomeName = "Transactions", ValueExpression = 1, AggregationMethods = list( list(AggregatedName = "TransactionsByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id") )) )) master_metadata <- data.table::rbindlist(list(customers_info, transactions_info)) # Now, generate the code for the "transactions" table generated_code <- generate_aggregation_code("transactions", master_metadata) print(generated_code) # To demonstrate execution: # 1. Create the sample data transactions <- data.table::data.table( transaction_id = c("T001", "T002", "T003"), customer_id = c("C001", "C002", "C001"), product_id = c("P001", "P002", "P001"), amount = c(10.0, 20.0, 15.0) ) # 2. Parse and evaluate the first generated statement revenue_by_customer_code <- generated_code["customer_id"] cat("Executing code:\n", revenue_by_customer_code) revenue_by_customer_dt <- eval(parse(text = revenue_by_customer_code)) print(revenue_by_customer_dt)
Analyzes metadata for explicit joins and optionally scans data to infer additional joins. Handles single- and multi-variable join keys.
map_join_paths(metadata_dt, data_list = NULL)map_join_paths(metadata_dt, data_list = NULL)
metadata_dt |
A data.table containing the master metadata. |
data_list |
A named list of data.tables (names match |
A data.table representing the "Join Map" with columns:
table_from, table_to, key_from, key_to
Takes a plan generated by create_join_plan() and creates a flowchart
visualizing the sequence of aggregations and merges.
plot_join_plan(join_plan)plot_join_plan(join_plan)
join_plan |
A |
A DiagrammeR graph object that can be printed to the RStudio
Viewer pane.
A sample dataset containing product information included with the DBmaps package.
productsproducts
A data.table with 3 variables:
A unique identifier for each product.
The category to which the product belongs.
The original price of the product.
Generated for package examples.
Takes descriptive information about a table and returns a tidy data.table.
table_info( table_name, source_identifier, identifier_columns, key_outcome_specs )table_info( table_name, source_identifier, identifier_columns, key_outcome_specs )
table_name |
Character string, the conceptual name of the table. |
source_identifier |
Character string, the file name or DB table identifier. |
identifier_columns |
Character vector, names of column(s) acting as primary key(s). |
key_outcome_specs |
A list of 'OutcomeSpec' lists. |
A tidy data.table with the table's metadata. The identifier_columns and
grouping_variables columns are list-columns.
transactions_info <- table_info( table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = c("customer_id", "product_id", "time"), key_outcome_specs = list( list( OutcomeName = "Revenue", ValueExpression = quote(price * quantity), AggregationMethods = list( list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"), list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id") ) ) ) ) # Note the structure of the list-columns print(transactions_info) str(transactions_info[, .(identifier_columns, grouping_variable)])transactions_info <- table_info( table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = c("customer_id", "product_id", "time"), key_outcome_specs = list( list( OutcomeName = "Revenue", ValueExpression = quote(price * quantity), AggregationMethods = list( list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"), list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id") ) ) ) ) # Note the structure of the list-columns print(transactions_info) str(transactions_info[, .(identifier_columns, grouping_variable)])
A sample dataset of transaction events, linking customers and products. This is a typical "fact" table in a relational schema.
transactionstransactions
A data.table with 5 variables:
Identifier for the customer making the transaction.
Identifier for the product being purchased.
The timestamp of the transaction (POSIXct format).
The number of units of the product purchased.
The price per unit at the time of transaction.
Generated for package examples.
A sample dataset of product view events, linking customers and products. This is a smaller, sampled version of a potentially very large event log.
viewsviews
A data.table with 3 variables:
Identifier for the customer viewing the product.
Identifier for the product being viewed.
The timestamp of the view event (POSIXct format).
Generated for package examples.