Organizing dumps
by uninspiredusername - Tuesday April 9, 2024 at 05:17 PM
#1
Rebuilding my collection after some unfortunate data loss (was careless) (still am)

Considering a more consistent solution this time around. Y'all have any interesting storage methods that might be worth exploring?

At this point, I'm probably gonna end up with a simple Postgres db with a table for each dump and consistent column names for common field types. Something that allows for more consistent processing, rather than dealing with a dozen different formats and such.

(I don't care what you do with the data -- just contemplating raw original formats vs. relational dbs vs. NoSQL vs. Elastic, etc.)
Reply
#2
I have the same issue. I'm considering Solr, as search.0t.rocks suggests. I've tried it, but honestly there's a lot of work per file. I'm also considering Elasticsearch with Kibana. Another option, if you have all your data on SSDs and a decent CPU, is simply running fgrep in parallel. But the Solr option still looks like the best one, since it lets you do as you please and associate the data across sources. You'll still be left with one "little" issue: some SQL dumps don't bother adding a newline after each record, which simply means you'll have to fix that before you can index them.
A Postgres db isn't really a priority for me, since it only lets me query things I already know about, not dive across everything at once. What do you think about that?
(If anyone has already done the indexing part, please share your Python script, since we need one script per file/dump/db. The "unstructured" option is really bad for the association feature.)
Reply
#3
(Apr 09, 2024, 05:33 PM)ygrek Wrote: [ . . . ]

I've actually never messed with Solr. I should set some time aside soon to try it out -- at least to have it on the back burner for later

> some sql dumps don't bother adding a newline after each record

Honestly, that doesn't bother me too much. Even if it's not consistent across dumps, it's (usually) consistent within any one file. Five minutes to figure out the formatting and you can load it into a db pretty easily. It's definitely an extra step, though.

> A psql db isn't really a priority for me, since it only lets me query things I already know about, not dive across everything at once. What do you think about that?

I don't see it that way actually

Let's say I want unique email addresses from one particular leak. Theoretically, I could just run something like SELECT DISTINCT email FROM whatever_tf_the_table_is_called, and then process the results however I normally would.
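To make that concrete, here's a minimal sketch of the "one table per dump, consistent column names" idea using stdlib sqlite3 (the table name, column names, and sample rows are all placeholders I made up, not anything from this thread):

```python
import sqlite3

# One table per dump, with consistent column names across tables.
# "dump_example" / "email" / "name" are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dump_example (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO dump_example VALUES (?, ?)",
    [("a@example.com", "A"), ("b@example.com", "B"), ("a@example.com", "A2")],
)

# Unique emails from a single source, as described above.
unique_emails = [row[0] for row in
                 conn.execute("SELECT DISTINCT email FROM dump_example ORDER BY email")]
print(unique_emails)  # ['a@example.com', 'b@example.com']
```

The point being: once column names are consistent, the same DISTINCT query works against any table without per-dump logic.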

Anything you think I'm overlooking on that part?
Reply
#4
Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...
Reply
#5
(Apr 11, 2024, 09:23 AM)joepa Wrote: Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...

too easy

if you're not gonna reinvent the wheel why bother (/s)
Reply
#6
(Apr 11, 2024, 09:23 AM)joepa Wrote: Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...

that's actually what i already have, but it's still not fast enough. Think a few TB of data, not a few GB.

(Apr 10, 2024, 11:45 PM)uninspiredusername Wrote:
[ . . . ]

Also, about selecting from only one leak or source: you simply pass the source leak as a field when indexing, and then when querying you have the option to filter on it. Solr is very useful and efficient for data association, OSINT in short Smile If you need the data for something else, like selling or exploiting it (I don't know how, but that looks evil to me at this point), it's better to use an SQL solution, since that fits an industrialized process.
Reply

