| Title: | Efficient Data Filtering and Aggregation Using Grep |
|---|---|
| Description: | Provides an interface to the system-level 'grep' utility for efficiently reading, filtering, and aggregating data from multiple flat files. By pre-filtering data at the command line before it enters the R environment, the package reduces memory overhead and improves ingestion speed. Includes functions for counting records across large file systems and supports recursive directory searching. |
| Authors: | David Shilane [aut], Atharv Raskar [aut], Akshat Maurya [aut, cre] |
| Maintainer: | Akshat Maurya <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-05-29 10:55:37 UTC |
| Source: | https://github.com/akshat09867/grepreaper |
Constructs a safe and properly formatted grep command string for system execution. This function handles input sanitization by utilizing R's internal shell quoting mechanism, ensuring compatibility across different operating systems.
build_grep_cmd(pattern, files, options = "", fixed = FALSE)
| Argument | Description |
|---|---|
| pattern | Character vector of patterns to search for. |
| files | Character vector of file paths to search in. |
| options | Character string containing grep flags (e.g., "-i", "-v"). |
| fixed | Logical; if TRUE, grep is told to treat patterns as fixed strings. |
A properly formatted command string ready for system execution.
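The quoting step described above can be sketched in base R. This is a minimal illustration and not the package's implementation: `sketch_grep_cmd` is a hypothetical helper, and the use of one `-e` flag per pattern and of `-F` for fixed strings are assumptions about how such a command might be assembled. It relies only on `shQuote()`, R's built-in shell quoting function.

```r
# Hypothetical helper for illustration only; not grepreaper's build_grep_cmd().
sketch_grep_cmd <- function(pattern, files, options = "", fixed = FALSE) {
  # Drop empty option strings and add -F when fixed-string matching is requested.
  flags <- c(options[nzchar(options)], if (fixed) "-F")
  # Quote every pattern and file so spaces and shell metacharacters stay literal.
  pattern_args <- paste("-e", shQuote(pattern, type = "sh"))
  file_args <- shQuote(files, type = "sh")
  paste(c("grep", flags, pattern_args, file_args), collapse = " ")
}

cmd <- sketch_grep_cmd("3.94", c("my data.csv", "b.csv"), options = "-i", fixed = TRUE)
print(cmd)
```

Quoting each file separately keeps a path such as `my data.csv` from being split into two arguments by the shell.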
grep_count: Efficiently count the number of relevant records from one or more files using grep
grep_count( files = NULL, path = NULL, file_pattern = NULL, pattern = "", invert = FALSE, ignore_case = FALSE, fixed = FALSE, recursive = FALSE, word_match = FALSE, only_matching = FALSE, skip = 0, header = TRUE, include_filename = FALSE, show_cmd = FALSE, show_progress = FALSE, ... )
| Argument | Description |
|---|---|
| files | Character vector of file paths to read. |
| path | Optional. Directory path to search for files. |
| file_pattern | Optional. A pattern to filter filenames when using the |
| pattern | Pattern to search for within files (passed to grep). |
| invert | Logical; if TRUE, return non-matching lines. |
| ignore_case | Logical; if TRUE, perform case-insensitive matching (default: FALSE). |
| fixed | Logical; if TRUE, pattern is a fixed string, not a regular expression. |
| recursive | Logical; if TRUE, search recursively through directories. |
| word_match | Logical; if TRUE, match only whole words. |
| only_matching | Logical; if TRUE, return only the matching part of the lines. |
| skip | Integer; number of rows to skip. |
| header | Logical; if TRUE, treat first row as header. |
| include_filename | Logical; if TRUE, include source filename as a column. |
| show_cmd | Logical; if TRUE, return the grep command string instead of executing it. |
| show_progress | Logical; if TRUE, show progress indicators. |
| ... | Additional arguments passed to fread. |
A data.table containing file names and counts.
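A hedged usage sketch follows. The temporary files and their contents are invented for illustration. The base R part computes how many lines in each file contain the literal string "error", which is the kind of per-file tally a grep-based count produces. The `grep_count()` call runs only if the grepreaper package and a system `grep` are available, and its exact output (column names, treatment of the header row) is not asserted here.

```r
# Illustrative data: two small CSV files in a temporary directory.
demo_dir <- tempfile("grep_count_demo_")
dir.create(demo_dir)
f1 <- file.path(demo_dir, "a.csv")
f2 <- file.path(demo_dir, "b.csv")
writeLines(c("id,status", "1,error", "2,ok", "3,error"), f1)
writeLines(c("id,status", "4,ok", "5,error"), f2)

# Base R reference: number of lines per file containing the literal "error".
expected <- vapply(
  c(f1, f2),
  function(f) sum(grepl("error", readLines(f), fixed = TRUE)),
  integer(1)
)
print(unname(expected))

# Assumption: grepreaper is installed and grep is on the PATH.
if (requireNamespace("grepreaper", quietly = TRUE) && nzchar(Sys.which("grep"))) {
  counts <- grepreaper::grep_count(files = c(f1, f2), pattern = "error", fixed = TRUE)
  print(counts)
}
```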
grep_read: Efficiently read and filter lines from one or more files using grep, returning a data.table.
grep_read( files = NULL, path = NULL, file_pattern = NULL, pattern = "", invert = FALSE, ignore_case = FALSE, fixed = FALSE, show_cmd = FALSE, recursive = FALSE, word_match = FALSE, show_line_numbers = FALSE, only_matching = FALSE, nrows = Inf, skip = 0, header = TRUE, col.names = NULL, include_filename = FALSE, show_progress = FALSE, ... )
| Argument | Description |
|---|---|
| files | Character vector of file paths to read. |
| path | Optional. Directory path to search for files. |
| file_pattern | Optional. A pattern to filter filenames when using the |
| pattern | Pattern to search for within files (passed to grep). |
| invert | Logical; if TRUE, return non-matching lines. |
| ignore_case | Logical; if TRUE, perform case-insensitive matching (default: FALSE). |
| fixed | Logical; if TRUE, pattern is a fixed string, not a regular expression. |
| show_cmd | Logical; if TRUE, return the grep command string instead of executing it. |
| recursive | Logical; if TRUE, search recursively through directories. |
| word_match | Logical; if TRUE, match only whole words. |
| show_line_numbers | Logical; if TRUE, include line numbers from source files. Headers are automatically removed and lines renumbered. |
| only_matching | Logical; if TRUE, return only the matching part of the lines. |
| nrows | Integer; maximum number of rows to read. |
| skip | Integer; number of rows to skip. |
| header | Logical; if TRUE, treat first row as header. If FALSE, the first row is read as a row of data. |
| col.names | Character vector of column names. |
| include_filename | Logical; if TRUE, include source filename as a column. |
| show_progress | Logical; if TRUE, show progress indicators. |
| ... | Additional arguments passed to fread. |
A data.table whose structure depends on the options; with show_cmd = TRUE, a character string is returned instead:
Default: Data columns with original types preserved
show_line_numbers=TRUE: Additional 'line_number' column (integer) with source file line numbers
include_filename=TRUE: Additional 'source_file' column (character)
only_matching=TRUE: Single 'match' column with matched substrings
show_cmd=TRUE: Character string containing the grep command
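The idea of filtering lines before parsing them can be sketched as follows. The file and its contents are made up for illustration. The base R part mimics the effect (keep the header, keep only matching data lines, then parse) without calling grep. The `grep_read()` call runs only if grepreaper and a system `grep` are available, and its result is printed without any claim about its exact shape.

```r
# Illustrative data file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,price,status", "1,3.94,ok", "2,3894,error", "3,12.50,error"), path)

# Base R imitation of pre-filtering: keep the header plus matching data lines.
lines <- readLines(path)
body <- lines[-1]
keep <- grepl("error", body, fixed = TRUE)
filtered <- read.csv(text = c(lines[1], body[keep]))
print(filtered)

# Assumption: grepreaper is installed and grep is on the PATH.
if (requireNamespace("grepreaper", quietly = TRUE) && nzchar(Sys.which("grep"))) {
  dt <- grepreaper::grep_read(files = path, pattern = "error", fixed = TRUE)
  print(dt)
}
```

Only the two rows whose status is "error" reach the parser in the base R version, which is the memory saving the package aims for on large files.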
When searching for literal strings (not regex patterns), set
fixed = TRUE to avoid regex interpretation. For example, searching for
"3.94" with fixed = FALSE will match "3894" because "." is a regex
metacharacter.
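The difference can be checked in base R, whose `grepl()` treats "." the same way as grep does in a regular expression. The sample values are illustrative.

```r
values <- c("3.94", "3894", "13.940")

# As a regular expression, "." matches any single character.
regex_hits <- grepl("3.94", values, fixed = FALSE)

# As a fixed string, "." matches only a literal dot.
literal_hits <- grepl("3.94", values, fixed = TRUE)

print(data.frame(values, regex_hits, literal_hits))
```

With regex matching all three values match, while fixed matching excludes "3894".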
Header rows are automatically handled:
With show_line_numbers=TRUE: Headers (line_number=1) are removed and lines renumbered
Without line numbers: Headers matching column names are removed
Empty rows and all-NA rows are automatically filtered out
Efficiently splits character vectors into multiple columns based on a specified delimiter. This function is optimized for performance and handles common use cases like parsing grep output or other delimited text data.
split_columns( x, column.names = NA, split = ":", resulting.columns = 3, fixed = TRUE )
| Argument | Description |
|---|---|
| x | Character vector to split |
| column.names | Names for the resulting columns (optional) |
| split | Delimiter to split on (default: ":") |
| resulting.columns | Number of columns to create (default: 3) |
| fixed | Whether to use fixed string matching (default: TRUE) |
A data.table with split columns. Column names are automatically assigned
as V1, V2, V3, etc. unless custom names are provided via column.names.
    # Split grep-like output with colon delimiter
    data <- c("file.txt:15:error message", "file.txt:23:warning message")
    result <- split_columns(data, resulting.columns = 3)
    print(result)

    # With custom column names
    result_named <- split_columns(data, column.names = c("filename", "line", "message"), resulting.columns = 3)
    print(result_named)

    # Split into 2 columns (combining remaining elements)
    result_2col <- split_columns(data, resulting.columns = 2)
    print(result_2col)
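To make the "combining remaining elements" behaviour concrete, here is a base R sketch of splitting into a limited number of columns. `split_limited` is a hypothetical helper written for illustration. It returns a data.frame rather than a data.table and is not the package's implementation. Padding short rows with NA is an assumption.

```r
# Hypothetical helper for illustration only; not grepreaper's split_columns().
split_limited <- function(x, split = ":", n = 3) {
  parts <- strsplit(x, split, fixed = TRUE)
  rows <- lapply(parts, function(p) {
    if (length(p) > n) {
      # Fold everything from the n-th piece onward back into one field.
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = split))
    }
    length(p) <- n  # pad short rows with NA
    p
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- paste0("V", seq_len(n))
  out
}

x <- c("file.txt:15:error: disk full", "file.txt:23:warning")
res <- split_limited(x, n = 3)
print(res)
```

The first input contains an extra colon inside the message, and the sketch keeps that colon in the third column instead of producing a fourth.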