Is there a way to combine multiple sources to query as one? #294
My requirement: I receive 30 CSV files. 10 files are companies, 10 are the partners of those companies, and 10 are details about their activity. I need to join data from these 3 types of files to create JSON that will be imported into a MongoDB instance. My idea is to loop row by row through each company CSV, take the Id, and query the partner and activity CSVs. With that data, I can build the JSON to import.
The CSVs are zipped; all together I have 5 GB of data, or 14 GB unzipped. There are about 100,000,000 lines. Is this possible? Is it the best way to approach this problem? What is the performance with huge files?
Replies: 2 comments 1 reply
The whole point of Etl.Net is to deal with many streams and combine them.
High-level example to give an idea:
    // Type1 and Type4 depend on nothing, so they can be parsed and saved right away.
    var type1Stream = fileType1Stream
        .CrossApplyTextFile("parse Type1 file", ...)
        .Select("create Type1", ...)
        .EfCoreSave("save Type1");
    var type4Stream = fileType4Stream
        .CrossApplyTextFile("parse Type4 file", ...)
        .Select("create Type4", ...)
        .EfCoreSave("save Type4");
    // Type2 looks up Type1, so it waits until every Type1 file is saved.
    var type2Stream = fileType2Stream
        .WaitWhenDone("wait until the Type1 files are processed", type1Stream)
        .CrossApplyTextFile("parse Type2 file", ...)
        .EfCoreLookup("get related Type1", ...)
        .Select("create Type2", ...)
        .EfCoreSave("save Type2");
    // Type3 looks up both Type2 and Type4, so it waits for each before the matching lookup.
    var type3Stream = fileType3Stream
        .WaitWhenDone("wait until the Type2 files are processed", type2Stream)
        .CrossApplyTextFile("parse Type3 file", ...)
        .EfCoreLookup("get related Type2", ...)
        .WaitWhenDone("wait until the Type4 files are processed", type4Stream)
        .EfCoreLookup("get related Type4", ...)
        .Select("create Type3", ...)
        .EfCoreSave("save Type3");
The problem is that no miracle is possible, whichever ETL you use.
Joining 10 files of 4 million rows each can be done efficiently only if you join all of them on the same key and every file is sorted on that key. In that case you can join them using the LeftJoin operator, as described here: https://paillave.github.io/Etl.Net/docs/recipes/linkStreams#join-and-leftjoin
Otherwise, no ETL can do this out of the box. The only solution is then to save the files one by one into the database, enforcing the order of dependencies with the WaitWhenDone operator. You can then look up the table each file depends on (the one you just imported) using the EfCoreLookup operator.
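To illustrate why the sorted-key condition matters for the join case: when both inputs are sorted on the same key, each file can be read once, front to back, with only the current group of rows held in memory. The following is a minimal plain C# sketch of that idea, not Etl.Net's implementation. The names `SortedJoin` and `LeftGroupJoinSorted` and the sample data are hypothetical, and the sketch assumes both inputs are sorted ascending and that keys on the left side (the companies) are unique.

```csharp
using System;
using System.Collections.Generic;

public static class SortedJoin
{
    // Left join over two inputs that are BOTH sorted ascending on the join key.
    // Each input is enumerated exactly once; only the matches for the current
    // left row are buffered. Assumes left keys are unique.
    public static IEnumerable<(TLeft Left, List<TRight> Matches)> LeftGroupJoinSorted<TLeft, TRight, TKey>(
        IEnumerable<TLeft> left,
        IEnumerable<TRight> right,
        Func<TLeft, TKey> leftKey,
        Func<TRight, TKey> rightKey)
        where TKey : IComparable<TKey>
    {
        using var r = right.GetEnumerator();
        bool hasRight = r.MoveNext();

        foreach (var l in left)
        {
            var key = leftKey(l);

            // Skip right rows whose key is smaller than the current left key
            // (rows that have no matching left row).
            while (hasRight && rightKey(r.Current).CompareTo(key) < 0)
                hasRight = r.MoveNext();

            // Collect every right row that shares the current left key.
            var matches = new List<TRight>();
            while (hasRight && rightKey(r.Current).CompareTo(key) == 0)
            {
                matches.Add(r.Current);
                hasRight = r.MoveNext();
            }

            // A left row with no match is still emitted, with an empty list.
            yield return (l, matches);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical sample data, already sorted by company id.
        var companies = new[] { (Id: 1, Name: "Acme"), (Id: 2, Name: "Globex"), (Id: 3, Name: "Initech") };
        var partners = new[] { (CompanyId: 1, Name: "Ann"), (CompanyId: 1, Name: "Bob"), (CompanyId: 3, Name: "Cid") };

        foreach (var (company, matches) in SortedJoin.LeftGroupJoinSorted(
                     companies, partners, c => c.Id, p => p.CompanyId))
        {
            Console.WriteLine($"{company.Name}: {matches.Count} partner(s)");
        }
        // Prints:
        // Acme: 2 partner(s)
        // Globex: 0 partner(s)
        // Initech: 1 partner(s)
    }
}
```

If the files are not sorted on the join key, this single-pass approach no longer works, which is why the database import plus lookup route is suggested instead.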