Can it replace the H2 database engine? #104

Open

hw2499 opened this issue Mar 30, 2023 · 15 comments
hw2499 commented Mar 30, 2023

The functionality is very powerful. I have a question: if a CSV file is 1 TB in size, what would the query efficiency be?
Does csvq read all 1 TB of content into memory before any subsequent processing?
Can it replace the H2 database engine?

mithrandie (Owner) commented

No, csvq reads all the data into memory at runtime, so trying to handle a 1TB file is nearly impossible.

hw2499 (Author) commented Apr 2, 2023

  1. Is there any guidance on how large a file csvq is suited to processing?
  2. How can I handle the 1 TB file scenario?

mithrandie (Owner) commented

The file size you can handle depends on the query you want to execute and your system.

csvq is not a DBMS but an SQL interpreter that executes queries against text files such as CSV.
I don't know what you want to do with the file or how you want to handle it, so I can't say which method would be appropriate. In general, however, it is appropriate to use a DBMS that is designed to handle large amounts of data.

hw2499 (Author) commented Apr 3, 2023

I want to use this SQL interpreter to process large text files, but I am not sure how efficiently it handles them. If it is efficient enough, it could do many of the things an in-memory database does.

derekmahar (Contributor) commented

Have you tried DuckDB?

hw2499 (Author) commented Apr 8, 2023

I am not looking for a database, but for a solution that can process big data in memory. I feel that csvq is very powerful.

derekmahar (Contributor) commented Apr 8, 2023

I agree that csvq is a very capable tool for querying CSV files using SQL and I still use it where it makes sense. DuckDB can do many of the same things and can process as much data as will fit in memory without ever creating a single table. Like csvq, it can read this data directly from CSV files, but also from Parquet files, and write results to such files. You need not create a DuckDB database to query very large CSV files in memory!

hw2499 (Author) commented Apr 10, 2023

What type of database is DuckDB? I read the documentation, and it seems similar to SQLite. What are the main scenarios in which DuckDB is used?

derekmahar (Contributor) commented Apr 10, 2023

DuckDB is similar to SQLite in that it is an embedded database, but unlike SQLite, DuckDB is column-oriented: it stores data column-wise rather than row-wise. Column stores like DuckDB are designed for analysis of large data sets, while SQLite is designed more for transaction processing. Neither has a server component, so both are intended for local data processing on a personal computer. Column-oriented databases use various column compression methods to store data more efficiently, scan and retrieve only the columns that a query selects (good for querying wide tables), and generally execute aggregate queries over columns more quickly.
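
To illustrate the row-wise versus column-wise distinction, here is a toy sketch in Go (not how either engine is actually implemented). The `OrderRow` and `OrderColumns` types are hypothetical, but they show why an aggregate over one column only has to touch that column's data in a columnar layout:

```go
package main

import "fmt"

// Row-wise layout (how a row store like SQLite arranges data): each record's
// fields sit together, so scanning one column still walks over every field.
type OrderRow struct {
	ID     int
	Region string
	Amount float64
}

// Column-wise layout (the DuckDB-style arrangement): each column is a
// contiguous slice, so an aggregate over Amount touches only that slice
// and the other columns never need to be read. (Hypothetical types for
// illustration only.)
type OrderColumns struct {
	ID     []int
	Region []string
	Amount []float64
}

func sumRows(rows []OrderRow) float64 {
	var s float64
	for _, r := range rows {
		s += r.Amount // every record's other fields are loaded alongside
	}
	return s
}

func sumColumn(c OrderColumns) float64 {
	var s float64
	for _, v := range c.Amount { // only the Amount slice is scanned
		s += v
	}
	return s
}

func main() {
	rows := []OrderRow{{1, "east", 10}, {2, "west", 20}, {3, "east", 30}}
	cols := OrderColumns{
		ID:     []int{1, 2, 3},
		Region: []string{"east", "west", "east"},
		Amount: []float64{10, 20, 30},
	}
	fmt.Println(sumRows(rows), sumColumn(cols)) // both print 60
}
```

The contiguous column also compresses well and is friendly to CPU caches and vectorized execution, which is part of why column stores tend to win on analytical aggregates over wide tables.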

hw2499 (Author) commented Apr 11, 2023

  1. Can DuckDB process (query) big data like 1 TB in memory, or are there other solutions?
  2. Does the DuckDB client API support Go?

derekmahar (Contributor) commented

  1. Can DuckDB process (query) big data like 1 TB in memory, or are there other solutions?

DuckDB streams the input and results of many query operations, so it can query very large data sets, even 1 TB, on computers that have much lower memory capacity.

  2. Does the DuckDB client API support Go?

https://pkg.go.dev/github.com/benjajaja/go-duckdb

hw2499 (Author) commented Apr 12, 2023

  1. OK, I will look into DuckDB and then adapt it to the etl-engine product.
    https://github.com/hw2499/etl-engine
  2. What OLAP functions does DuckDB support?

kpym commented Apr 12, 2023

@hw2499, @derekmahar Maybe the end of this discussion should be moved to the DuckDB Discussion Forum? IMHO, the csvq issue tracker is not the place for it.

derekmahar (Contributor) commented

@kpym, I agree. I actually didn't realise that this discussion was about a csvq issue. I thought it was a csvq discussion topic. In any case, in the DuckDB discussion forums on GitHub or the DuckDB Discord server, the DuckDB developers and other users could better answer @hw2499's questions.

hw2499 (Author) commented Apr 12, 2023

Okay, thank you. @kpym @derekmahar
