Data integrity concerns TRAID

docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Data integrity concerns TRAID

Post by docbill »

I just purchased an F4-423 and I have been testing it with TOS 5, TOS 6, and Unraid.

Overall, I think TOS 6 best fits my desired usage. I like the fact that pretty much all drive maintenance can be done on a live system, and I really don't have need of the extensive pool of Docker containers available for Unraid.

But I did a couple of drive recovery tests with TRAID. For this I used an 8TB, a 4TB, a 1TB, and a 500GB drive. All but the 8TB drive are old drives, because I want to test what happens when my drives are getting old. Before testing I went through a huge stack of old drives, testing them in TOS 5. I had one more 750 GB drive that tested as good, but I knew that one in particular would fail a long SMART test.

Test 1. I started loading data into the volume as quickly as possible via an NFS connection. It was from an older network across 1 Gbps hardware, so the upload maxed out at 80 MB/s and most of the time ran slower because of the read time of the data on my laptop. Still, while the upload was running at maximum speed, I opened a 120 Mbps Jellyfish test video in the NAS web interface. It played very smoothly and allowed me to seek back and forth. I have no idea if it was reading from disk or from memory cache, but the performance was quite good. At that point I opened the drive bay and removed the 4TB drive. Everything continued to function normally, and the NAS started beeping to let me know there had been a drive failure. After I got tired of the beeping, I reinserted the drive and began the repair.
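
(For anyone repeating this test: the console messages suggest TOS uses the standard Linux md layer, so a minimal way to confirm the degraded state over SSH would be something like the following. The device name md0 is an assumption; check /proc/mdstat for the real one.)

# List all md arrays; a degraded array shows a missing member,
# e.g. [4/3] [UU_U] instead of [4/4] [UUUU].
cat /proc/mdstat

# Member-by-member detail for the data array:
mdadm --detail /dev/md0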

Test 2. Next morning, everything had recovered. I decided I didn't need to stress test it again for this test. I opened the bay door and removed the 500 GB drive. Again the beeping started. This time I swapped in the 750 GB drive, as I was curious: would it detect that the drive is bad? And how would the process for increasing the size work if it didn't? I clicked the button for repair, but after a short period the dialog finished with no errors, and the volume remained in a degraded state. I tried again, this time watching the screen closely. I noticed it complaining about a missing RAID v1.2 superblock on a drive I didn't just replace... I tried again, same error. I swapped back in the 500 GB drive. Now it complained about a missing v1.2 superblock on the 500 GB drive. Keep in mind these messages only appear on the console. There is no logging in the UI to tell me anything is going wrong.
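
(If anyone wants to see what the repair dialog trips over, the superblock check is easy to sketch from the console, assuming TRAID members behave like ordinary md members; sdc4 and sdb4 here are illustrative partition names and may differ on another unit.)

# Print the md superblock on the complained-about partition;
# "No md superblock detected" confirms the metadata is gone.
mdadm --examine /dev/sdc4

# Compare against a member that is still healthy, and check that
# the event counters and array UUIDs of all members agree:
mdadm --examine /dev/sdb4
mdadm --examine /dev/sd?4 | grep -E '/dev/|Array UUID|Events'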

I tried upgrading to TOS 6, to see if that could recover the volume. No luck. At this point my only option would be to recreate the volume and restore the full array from backup.

Now, I think I know what is going wrong. TNAS is essentially building a RAID0 volume with the 4TB, 1TB, and 500GB drives, and then it uses that RAID0 volume as one side of a RAID1 volume with the 8TB. The order it assembles those drives in the RAID0 is probably potluck, so when it tests for the superblock, whichever drive comes first in that RAID0 is the one that generates the error. Either the superblock was never restored on the 4TB, or it was auto-erased by the failure. And the superblock is also missing from the 500 GB and 750 GB. In theory I could probably restore these superblocks manually, with a little research on the correct procedure. But I am highly concerned that this doesn't just work. I'm even more concerned that the error seems to be completely missed by the TRAID logging.
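
(The manual restore I would research first is the stock md procedure. This is a sketch only, destructive to the member being re-added, so assume a full backup first; device and array names are illustrative.)

# Wipe the stale metadata on the replaced drive's partition, then
# re-add it so md writes a fresh superblock and resyncs the member:
mdadm --zero-superblock /dev/sdc4
mdadm /dev/md0 --add /dev/sdc4

# Watch the resync run to completion:
watch cat /proc/mdstat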

If I can make TOS 6 pass my recovery tests consistently, it will be my preferred choice. But if TRAID doesn't work reliably for drive recovery, what is the point?
TMzethar
TerraMaster Team
Posts: 1240
Joined: 27 Oct 2020, 16:43

Re: Data integrity concerns TRAID

Post by TMzethar »

The actual situation differs from your test results and speculation.
You can consider TRAID as a more flexible RAID5.
To contact our team, please send email to the following addresses, remembering to replace (at) with @:
Technical team: support(at)terra-master.com (for technical support)
Service team: service(at)terra-master.com (for purchasing, return, replacement, RMA service)
docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Re: Data integrity concerns TRAID

Post by docbill »

You should stop using the word "actual". I do not think it means what you think it means.

My only speculation is about how the superblocks are cleared and what I might have been able to do to recover. How TerraMaster constructs the arrays is described in the TerraMaster documentation. If that is wrong, it is not speculation on my part; it is poorly written documentation on TerraMaster's part.

The fact remains that the only errors I saw on the console indicating what the problem could be were of the form:

md: md0: recovery done
md: sdc4 does not have a valid v1.2 superblock, not importing!
md: md_import_device returned -22

And:

md: md0: recovery done
md: sdc2 does not have a valid v1.2 superblock, not importing!
md: md_import_device returned -22

Now if this error (-22 is EINVAL, "invalid argument") is a wild goose chase because there is some recovery step internally, it would be highly appropriate to log that recovery step.

In the end, the problem remains: TerraMaster TRAID could not recover from a simple drive swap... even when restoring the original drive. That is a problem. And the fact that no error was reported in the UI to indicate the nature of the failure is an even bigger problem. It leaves the user with no idea what to try next and does not give them any useful details to report to technical support.
docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Re: Data integrity concerns TRAID

Post by docbill »

Admittedly, I am using RAID0 as a generic description of an operation rather than by its formal definition. It is like using the word aspirin as a generic name for pain medication. The documentation just says TRAID first combines the smaller drives. The md logging makes it appear to be standard RAID operations, so the generic use of the term seems appropriate in the absence of the detailed trade secrets.
TMzethar
TerraMaster Team
Posts: 1240
Joined: 27 Oct 2020, 16:43

Re: Data integrity concerns TRAID

Post by TMzethar »

We will discuss the issues you raise here with our relevant project teams.
TMroy
TerraMaster Team
Posts: 2633
Joined: 10 Mar 2020, 14:04
China

Re: Data integrity concerns TRAID

Post by TMroy »

docbill wrote: 28 Apr 2024, 20:26
It's obvious that your 750GB disk has a fault, which caused the array repair and synchronization to fail and be interrupted partway through. Currently, we cannot know at which stage the synchronization was interrupted. We infer that this is also the reason why the array failed to trigger a repair again after you reinserted the 500GB disk. However, we need to simulate similar situations in the laboratory to draw a conclusion about the specific situation.
To contact our team, please send email to the following addresses, remembering to replace (at) with @:
Support team: support(at)terra-master.com (for technical support only)
Service team: service(at)terra-master.com (for purchasing, return, replacement, RMA service)
docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Re: Data integrity concerns TRAID

Post by docbill »

It is not obvious to me that the 750 GB disk is at fault. I formatted it as a single volume multiple times and ran the short SMART test.

And don't get me wrong: these are the types of tests where I expect failures, with drives manufactured as long ago as 2007. It is more a test of whether the NAS tells me why it failed. Is it really the 750 GB drive that failed, or did another drive fail during my test?
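
(For what it is worth, the difference between the two test types matters here. Assuming smartmontools is available on the console, which I have not verified on TOS, the long test is the one that actually scans the disk surface; /dev/sdd is illustrative.)

# Quick test (minutes), mostly electrical/mechanical checks:
smartctl -t short /dev/sdd

# Long test (hours), full surface read scan:
smartctl -t long /dev/sdd

# Afterwards, read the verdict and the attribute table; reallocated
# and pending sector counts are the usual smoking gun:
smartctl -l selftest /dev/sdd
smartctl -A /dev/sdd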

The golden rule of data recovery is: do not overwrite good data. So if I were to build this type of array I would use RAID 101 (1+0+1), so I could, if desired, keep all the drives read-only except the one being replaced. Like many people I buy disks in pairs, so when one disk fails there is another disk getting ready to fail.

If I see in the recovery log that another disk is also failing, then I know I had better plan on ordering more disks and restoring from backup. If I see no errors, I am just guessing at what to do next.
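
(Until the UI logs this, the only way I know to watch for a second disk failing mid-rebuild is the kernel side directly; a sketch, assuming console access and the usual md device names.)

# Rebuild progress, speed, and ETA, refreshed every 5 seconds:
watch -n 5 cat /proc/mdstat

# Follow kernel messages from the md layer and the disk drivers;
# read errors on a second member show up here first:
dmesg -w | grep -Ei 'md|ata|i/o error'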
docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Re: Data integrity concerns TRAID

Post by docbill »

docbill wrote: 29 Apr 2024, 05:56 RAID 101 (1+0+1)
I do not know what I was thinking with 101? It would be a modified 10. First, reserve enough space on every drive to store the partition table/datamap. Then partition the largest drive into sizes matching the smaller drives. RAID 1 each partition with its matching drive. RAID 0 all the RAID 1s. If any of the smaller drives fails, restore it from the mirrored partition. If the larger drive fails, restore all three partitions from the smaller drives.
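
(A sketch of that layout with stock md tools, assuming one 2TB drive paired with two 1TB drives; device names and sizes are made up for illustration, and the partitioning step is omitted.)

# One RAID 1 per (large-drive partition, small drive) pair:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc1

# Stripe the mirrors together:
mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md1 /dev/md2

# A failed small drive resyncs only its own mirror; the other
# mirror's members are never written during that recovery.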

Of course I wouldn't really use this data storage method. It is just an example that shows it is possible to protect the data in a way where a working drive never needs to be written to during a recovery. So a failed recovery never blocks a future attempt, at least after a reboot to clear any errors in locked devices...
TMroy
TerraMaster Team
Posts: 2633
Joined: 10 Mar 2020, 14:04
China

Re: Data integrity concerns TRAID

Post by TMroy »

We appreciate your feedback. TRAID itself is an array combination of RAID 1 and RAID 5, which inherently has redundancy functionality. The working scheme of TRAID has been verified by our engineering team over a long period of time, and we do not believe there are obvious flaws in its working principles. If a disk failure causes the array repair to fail, the system cannot report disk failures in the middle of the repair process. Before you spend a lot of time verifying this issue, please perform a SMART long test on each disk to ensure the health of your disks. Regarding your 500GB and 750GB disks, they must all be antiques, more than ten years old.
docbill
Posts: 7
Joined: 27 Apr 2024, 21:46
Canada

Re: Data integrity concerns TRAID

Post by docbill »

I'm actually fairly convinced it was a legitimate failure. But the lack of logging is the serious issue. In the UI it is never even presented as a failure. The array simply remains in a degraded state after submitting the dialog. No % recovery progress followed by an error. No message that the drive could not be initialised. Nothing. That is the serious problem.

Now, if you are telling me TOS uses RAID 1 + 5, then there is no reason inserting a bad drive should cause the array to become non-recoverable. If it is using RAID 5 + 1, that is a disaster waiting to happen.

The reason is that there is no danger in striping individual RAID 1 volumes that can each be independently recovered in the event of any single drive failure. But striping volumes in RAID 5 and then trying to protect them with RAID 1 can easily result in a cascading failure across multiple hard drives.

Testing with older hard drives is a way to simulate several effects. First, it is well known that if you keep drives in operation past their useful EOL in a RAID array, there is a fair chance that once you manually fail an individual drive, you won't be able to insert that drive back into an array again. So you never want a case where you must manually fail a drive while recovery is in progress. Rebuilding a RAID 5 + 1 would require precisely that.
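
(For reference, the bay pull I keep doing is roughly equivalent to the manual failure path in stock md; a sketch, device names illustrative.)

# Mark a member failed, then remove it from the array:
mdadm /dev/md0 --fail /dev/sdc4
mdadm /dev/md0 --remove /dev/sdc4

# Whether it can come back cleanly is exactly what my tests probe:
mdadm /dev/md0 --add /dev/sdc4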

The second effect is that it is quite common to receive a brand new drive that is bad. When you insert such a drive and the recovery fails, you need two things: 1. A very clear error message that can be reported when requesting the RMA. 2. Practically zero risk that inserting a brand new drive that is bad will leave you unable to try again with another drive.