Can it replace the H2 database engine? #104
The functionality is very powerful. I have a question: if a CSV file is 1 TB in size, what would the query efficiency be? Does csvq read all 1 TB of content into memory before subsequent processing? Can it replace the H2 database engine?
No, csvq reads all the data into memory at runtime, so trying to handle a 1 TB file is nearly impossible.

The file size you can handle depends on the query you want to execute and on your system. csvq is not a DBMS, but an SQL interpreter that executes queries against text files such as CSV.
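For readers unfamiliar with csvq, a minimal sketch of such a query might look like the following; the file name `user.csv`, its columns `id` and `name`, and the filter are assumptions made only for illustration, not taken from this thread.

```sql
-- Run from the shell as, for example:  csvq 'SELECT id, name FROM `user.csv` WHERE id > 100'
-- csvq treats the CSV file itself as the table; no database is created.
SELECT id, name
FROM `user.csv`      -- user.csv and its columns are assumed for this example
WHERE id > 100;
```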
I want to use this SQL interpreter to process large text files, but I am not sure how efficient it is with large files. If it is very efficient, it could do many of the things that in-memory databases do.
Have you tried DuckDB?
I am not looking for a database, but for a solution that can process big data in memory. I feel that csvq is very powerful.
I agree that csvq is a very capable tool for querying CSV files using SQL, and I still use it where it makes sense. DuckDB can do many of the same things and can process as much data as will fit in memory without ever creating a single table. Like csvq, it can read this data directly from CSV files, but also from Parquet files, and it can write results to such files. You need not create a DuckDB database to query very large CSV files in memory!
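As a rough sketch of what that looks like in practice (the file names, column names, and filter below are assumptions for the example, not part of this thread), a DuckDB session can query a CSV file directly and write the result to Parquet without creating any table:

```sql
-- Query a CSV file directly; DuckDB infers the schema from the file.
SELECT category, count(*) AS n
FROM read_csv_auto('large_input.csv')
GROUP BY category;

-- Write a query result straight to a Parquet file, still without creating a table.
COPY (
    SELECT *
    FROM read_csv_auto('large_input.csv')
    WHERE amount > 0          -- 'amount' is an assumed column
) TO 'result.parquet' (FORMAT PARQUET);
```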
What type of database is DuckDB? I read the documentation and it seems similar to SQLite. What are the main scenarios in which DuckDB is used?
DuckDB is similar to SQLite in that it is an embedded database, but unlike SQLite, DuckDB is a column-oriented database: it stores data column-wise rather than row-wise. Column stores like DuckDB are designed for analysis of large data sets, while SQLite is designed more for transaction processing. Neither has a server component, so both are designed for local data processing on a personal computer. Column-oriented databases use various column-compression methods to store data more efficiently, scan and retrieve only the columns that the query selects (good for querying wide tables), and generally execute aggregate queries on columns more quickly.
DuckDB streams the input and the results of most query operations, so many DuckDB queries can process very large data sets, even 1 TB, on computers that have far less memory.
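A sketch of how that can be combined with an explicit memory cap and a spill directory is shown below; the limit, path, file name, and columns are illustrative values, not recommendations from this thread.

```sql
-- Cap DuckDB's memory usage and give it a directory to spill to,
-- so aggregations over files larger than RAM can still complete.
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';

SELECT category, sum(amount) AS total   -- columns are assumed for the example
FROM read_csv_auto('huge_input.csv')
GROUP BY category
ORDER BY total DESC;
```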
@hw2499, @derekmahar Maybe the end of this discussion should be moved to the DuckDB Discussion Forum? IMHO, the …
@kpym, I agree. I actually didn't realise that this discussion was about a csvq issue; I thought it was a csvq discussion topic. In any case, in the DuckDB discussion forums on GitHub or on the DuckDB Discord server, the DuckDB developers and other users could better answer @hw2499's questions.

Okay, thank you. @kpym @derekmahar