This is not a tutorial, more an account of what I did. It may be useful to someone else, or may not; it's mostly notes for myself.

### My situation ###

I had a 1 TB magnetic hard drive (HDD, hereafter "the failed drive") holding data in a LUKS-encrypted container. The container held two volumes (2 logical volumes through Logical Volume Management, LVM): a timeshift volume with backups of the OS, and a data volume. I did not care much about the timeshift volume, but very much about the data volume, for which I had no recent backup (ahem...).

One day, after a kernel update, the system booted in emergency mode :( with an error about the timeshift volume. A filesystem check (fsck) on the volume showed an input/output error. I should have stopped there: I/O errors show that the drive is in a bad state. I rebooted on a live USB, opened the container (it still worked, good sign) and tried fsck on the timeshift and data volumes, which both showed I/O errors again. I also looked at the S.M.A.R.T. data of the HDD, which showed 120 bad sectors but said in summary that the drive was okay (?!). That's when I realised for sure that I needed to take time and make a plan for data recovery (and buy a new hard drive). What I did is below.

DISCLAIMER: What I did worked for me; it may not work on another occasion and in different conditions, particularly if there were more bad sectors, if the LUKS container were damaged, or if the drive were an SSD instead of an HDD. There are professionals for data recovery. They have specialised software, but also specialised hardware to treat the drives, clean rooms etc., and most importantly experience. Do not follow the steps below if losing the data is not an option for you.

### Quick summary for experts ###

- get a big external drive with a partition large enough to hold an image of the failed drive. You also need more free space to copy the interesting partitions afterwards (can be on another drive if needed)
- download OpenSuperClone (OSC) live: https://sourceforge.net/projects/opensuperclone-live/ and burn it to a USB stick
- read the tutorial https://www.reddit.com/r/datarecoverysoftware/wiki/hddsuperclone_guide/ and the manual https://www.hddsuperclone.com/hddsuperclone/manual
- boot the USB, identify the failed drive's letter /dev/sdX, mount the destination drive, open OSC, tell it to put the log (project) and disk image (destination > image file) on the destination, and set the source drive = the failed drive. Check with the drive's model name and serial number. Connect, start, wait for it to complete (for me ~3 h for a 1 TB drive, with 640 bad LBAs in the end)
- hopefully the number of bad LBAs is not too high, otherwise it is probably better to go to a data recovery professional
- mount the image as a loop device, optionally decrypt the LUKS container. Copy the interesting partition(s) (encrypted or not) to empty space on an external drive (can be the same drive): find the exact size in MiB, make an unformatted partition of that size with gparted, then use dd (see the command sketch after this list)
- fsck the filesystems. Hopefully not too many inodes are discarded, otherwise you may need a file carver/scraper like photorec (https://www.cgsecurity.org/wiki/photoRec), or see a professional
- copy the repaired partitions to the new replacement drive. Change the UUIDs to prevent the OS confusing the replacement drive and the failed drive. Edit /etc/crypttab and /etc/fstab of the OS to put in the new UUIDs
- replace the failed drive with the replacement one in the computer. Boot, and voilĂ !
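To make the summary concrete, here is a minimal sketch of the image-side commands, assuming the names from my setup below (OSC-clone.img, /dev/loop1, a data volume at /dev/fabien/data, a destination partition /dev/sdd3); adapt them to your layout.

```
# attach the image read-only -> prints the loop device, e.g. /dev/loop1
sudo losetup --read-only --partscan --find --show OSC-clone.img
# (open the LUKS container / activate LVM here if needed; details further down)
# copy one volume onto a pre-created unformatted partition of the exact same size
sudo dd if=/dev/fabien/data of=/dev/sdd3 bs=4M status=progress conv=fsync
# repair the copy, never the image itself
sudo fsck -f /dev/sdd3
```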
### Longer version with details ###

### Prerequisites ###

Familiarity with:
- Linux in general
- Linux disk and partition tools, notably gparted and gnome-disks, or equivalent software
- the terminal, and particularly the utility dd. dd can easily destroy data, e.g. if you mix up the if and of arguments (if=input-file, of=output-file)
- burning a live USB and booting from it

### Preparations ###

* get a big external hard drive (hereafter "the destination drive") to store the clone of the failed drive.
It needs to be at minimum the same size as the failed drive, if you clone directly to it. I recommend at least twice as big, so that you can clone to a file (disk image), which is easier to manipulate, and then use the rest of the space to copy the partitions from the file, try to repair them, etc. I took a 4 TB drive, an HDD (slower than an SSD but more robust long term, so afterwards I could recycle it as backup storage). I made a 1.5 TB ext4 partition to store the disk image. You need the partition to be larger than the drive to account for filesystem overhead, and a filesystem does not like to be full. 20% should be enough (so 1.25 TB for me; 1.5 TB was a bit overkill).
Here a recap of memory sizes may be useful:
- base unit: byte (octet in French)
- some tools report in base 10: 1 kB = 1000 bytes = 10^3; 1 MB = 1000000 bytes = 10^6; 1 GB = 10^9; 1 TB = 10^12
- other tools report in base 2: 1 KiB = 1024 bytes = 2^10; 1 MiB = 1048576 bytes = 2^20; 1 GiB = 2^30; 1 TiB = 2^40
- others report in sectors; then you have to find the sector size, e.g. with gdisk or gparted or "lsblk -o NAME,PHY-SEC"
- OpenSuperClone reports in LBAs. LBA = Logical Block Addressing. It seems to be the same as a sector, but I'm not 100% sure. You can check the math with "hdparm -i /dev/sda" (see also the sketch at the end of this section)
* prepare a live USB for recovery operations
- have a USB burner utility (e.g. Rufus from Windows, balenaEtcher from any OS, or many alternatives on Linux)
- download OpenSuperClone (OSC) live: https://sourceforge.net/projects/opensuperclone-live/ https://github.com/ISpillMyDrink/OpenSuperClone
- burn the ISO to the USB. For me Rufus threw some warnings about secure boot, grub and syslinux; just disregard them
- read the tutorial on the datarecovery reddit: https://www.reddit.com/r/datarecoverysoftware/wiki/hddsuperclone_guide/
- read the documentation for OpenSuperClone and its predecessor HDDSuperClone (https://www.hddsuperclone.com/ ; manual at https://www.hddsuperclone.com/hddsuperclone/manual) to go more in depth, especially if results are not perfect and you want to retry with direct AHCI access to the failed drive (more on that later)
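To check the unit math from the recap above on a real drive, a few read-only queries can help. A sketch, assuming the failed drive shows up as /dev/sda (smartctl and hdparm may not be preinstalled on every live system):

```
sudo smartctl -a /dev/sda                         # full S.M.A.R.T. report, incl. reallocated sectors
sudo hdparm -i /dev/sda                           # model, serial number, geometry
lsblk -o NAME,SIZE,MODEL,SERIAL,PHY-SEC /dev/sda  # sizes and physical sector size
sudo blockdev --getsz /dev/sda                    # size in 512-byte sectors
sudo blockdev --getsize64 /dev/sda                # size in bytes: /10^9 = GB, /2^30 = GiB
```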
### Operations (what I did) ###

* Step 1: clone the failed drive into a disk image
- I booted the computer to the firmware interface (a.k.a. BIOS) and checked the SATA settings: it was in AHCI mode. The interface also gave the SATA port of the failed hard drive: 0, but I found later that this is not how Linux names it. It also nicely gave the model name of the drive.
- I booted the OSC live USB; it boots into a Linux with the Xfce desktop (Xubuntu, I guess) and the OSC utilities can be seen on the desktop (or launched from the menu). I used gparted to identify the letter assigned to the failed drive: it was /dev/sda, and the LUKS partition I wanted to recover in particular: /dev/sda3. With View > Device Information, I also found the drive's model name, serial number and sector size (512 bytes); these are all useful later. This can also be checked with gnome-disks and $ sudo gdisk -l /dev/sda.
- I plugged in the external drive / destination drive. I don't know why (impatience?) but at first it did not appear / did not seem to be recognized. In hindsight, this might have been fixable with $ sudo udevadm trigger (optionally paired with --subsystem-match=usb). Instead, I powered off and rebooted the OSC live, this time with the destination drive plugged in before boot => it was now recognized, it appeared :)
- make a large enough ext4 partition (1.5 TB) and sudo-make a directory there to hold the disk image and the cloning log.
Note: to be fair, I first did this on a first external drive which unfortunately turned out to be slightly too small to hold the image. I should have planned better. So I had to stop the cloning, buy a bigger drive, transfer everything there and resume the cloning. This is where OpenSuperClone is handy: it keeps a log, so you can stop and resume cloning whenever you want. This is described in more detail in the section "Additional gory details" below. For now let's pretend I did it right the first time.
- open the OSC program, File > New project > select the destination drive and directory there => call the file OSC-project.log. Then Drives > Choose destination > image file > again the destination drive and directory => call the file OSC-clone.img
- Drives > Choose source drive > choose /dev/sda. Then check the model name and serial number
- double check everything, including that you did not mix up source and destination
- click Connect, then Start
- cross fingers and wait. If needed, you can stop the cloning and resume it later: button Stop, then later Start. If you need to fully power down the computer: Stop, then Disconnect, before shutdown. When you relaunch OSC, point it to the already existing log file and clone file. I suggest not powering down if it can be avoided: every time a hard drive is powered on is a new occasion for it to fail further. Best to recover in as few tries as possible.
- for me the clone finished in ~3 hours (for 1 TB), indicating 640 bad LBAs (sectors?) at the end, corresponding to 0.000033% of the drive
- I considered this good enough and did not try anything more. I kept the failed drive in case it would need further forensics (e.g. sending it to a professional) and hoped for the best. From now on I only operated on the recovered disk image (the clone, OSC-clone.img).
Note: to be fair, I did try to go deeper, with direct AHCI access to the drive and playing with some OSC parameters. This is described in the section "Additional gory details" below. I skip it here because it did not recover any more data; I still had 640 bad LBAs at the end.

* Step 2: use the disk image to recover partitions and repair them
- boot a recent Linux, mount and open the external drive. Inspect the partition table of the disk image: $ sudo gdisk -l OSC-clone.img . It looked ok, identical to the original disk
- mount the image as a loop device. With a recent Linux and a .img extension it may be enough to double-click on the file. Otherwise, in a terminal: $ sudo losetup --read-only --partscan --find --show OSC-clone.img (read-only to protect the image). It outputs the loop device, /dev/loop1 in my case, and the partitions are at /dev/loop1p1, loop1p2, etc.
- use gparted, gnome-disks or similar to find the drive letter of the disk where I want to copy the partition(s) to be repaired. In my case it was the same 4 TB external HDD, in the free space, and I found /dev/sdd.
Note: the operations below might be faster with (yet) another drive as destination. In my case this setup meant I was reading and writing to the same drive, potentially affecting read and write speeds.
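Opening the container for the cases below can also be done from a terminal instead of gnome-disks. A sketch, assuming the LUKS partition is /dev/loop1p3 and the LVM volume group is called fabien (as in my setup):

```
# decrypt the container from the loop device; --readonly matches the read-only image
sudo cryptsetup open --readonly /dev/loop1p3 recovered
# list and activate the LVM volumes inside => /dev/fabien/data appears
sudo pvscan && sudo lvscan
sudo vgchange -ay fabien
ls /dev/mapper   # check the resulting mappings
```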
- case 1 (simplest): copy an intact unencrypted partition. $ sudo gparted /dev/loop1 /dev/sdd , select the desired partition in loop1, right click > Copy, select /dev/sdd, click on the empty space, right click > Paste, apply all operations. My speed here was ~40 MB/s, so it's a bit long.
- case 2 (slightly harder): copy an intact LVM volume in an encrypted LUKS container. Use gnome-disks or gparted to open the container. gnome-disks was nice because it immediately showed new devices for the volumes, e.g. the data volume appeared at /dev/fabien/data (and also at /dev/mapper/fabien-data). Then $ sudo gparted /dev/fabien/data /dev/sdd and copy the partition as in case 1. This will work only if the filesystem on /dev/fabien/data is intact (in my case it was not), and you cannot run a fsck yet because the disk image (and hence the loop device) is read-only, as far as I understand (or at least it's not advised to try a fsck here).
- case 3 (yet slightly harder): copy an LVM volume with a corrupted filesystem in a LUKS container. As in case 2, open the container => /dev/fabien/data appears. Find the exact size of this volume in MiB: personally, I took the size in bytes from gnome-disks and divided by 1024^2 = 1048576. Then with gparted (which wants MiB) create an unformatted partition of that size on the destination drive; in my case it was /dev/sdd3. Copy with $ sudo dd if=/dev/fabien/data of=/dev/sdd3 status=progress . Then repair the filesystem on sdd3 with gnome-disks or gparted. Check the output of fsck. In my case it discarded several inodes with garbage, not too many, so I thought it should be ok. Also check the folder lost+found/ at the root of the partition. If too many things were discarded, or if you later find that you are missing some files, that's where you may need to run a scraper like photorec (https://www.cgsecurity.org/wiki/photoRec) on the disk image or on the partition before fsck.
- case 4: copy the full LUKS container (useful if you not only care about the data but also about the full container, e.g. like me you want to copy it back and get your original OS running as it was before the crash). Find the partition number with gparted or gnome-disks, e.g. loop1p2. Then as in case 3, find the exact size in MiB, create an unformatted destination partition, say sdd5, then $ sudo dd if=/dev/loop1p2 of=/dev/sdd5 status=progress . Then in gparted decrypt the container and repair the filesystems of the volumes inside. Don't forget to close the encryption before shutdown.
- once all this is finished, unmount everything cleanly before shutdown. For /dev/sdd use gparted to close the encryptions if any, and unmount any partition. For the loop device, I first used gnome-disks to unmount the LVM volumes (e.g. /dev/fabien/data), then closed the encryption of loop1p2 (if it fails, use gparted to close the encryption). Then detach the loop device itself: $ sudo losetup -d /dev/loop1 . (The sketch at the very end of this write-up shows the same teardown from a terminal.)

* Step 3: copy partitions onto the new hard drive and get the system to boot again
- for an unencrypted partition, you can do it with gparted
- for the encrypted LUKS container, I brute-forced it with an unformatted partition and dd, as in cases 3 & 4 of step 2
- technical part: you probably want to change the UUIDs of the partitions and LVM volumes. Otherwise, if both the failed drive and the new drive are plugged in at the same time, the OS will be confused and will use one of them randomly.
- for an unencrypted partition or a decrypted volume, change the UUID with gparted: click on the partition, Partition > New UUID, apply all operations
- for a LUKS container on sdd5: $ uuidgen , copy the output, then $ sudo cryptsetup luksUUID --uuid=<the copied UUID> /dev/sdd5
- tell the system to use these new partitions: mount and open the partition or volume containing / and modify crypttab and fstab. $ sudo nano mountpoint/etc/crypttab and modify the line with the UUID of the new container; in my case it looked like: dtpart_crypt UUID=<new UUID> /root/dtpart_keyfile luks,discard . Then $ sudo nano mountpoint/etc/fstab and modify, for instance, the UUID of the data volume; for me it looked like: UUID=<new UUID> /home/fabien/data ext4 errors=remount-ro 0 2
- replace the failed drive with the new drive in the computer. Cross fingers and boot. It worked for me :)
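For reference, the UUID changes of Step 3 can also be done entirely from the terminal. A sketch, reusing my example devices /dev/sdd5 (the LUKS copy) and /dev/sdd3 (an ext4 copy):

```
NEW_UUID=$(uuidgen)                                    # generate a fresh random UUID
sudo cryptsetup luksUUID --uuid "$NEW_UUID" /dev/sdd5  # stamp it on the LUKS container
sudo tune2fs -U random /dev/sdd3                       # new UUID for an ext4 partition
sudo blkid /dev/sdd5 /dev/sdd3                         # verify before editing crypttab/fstab
```

The UUIDs printed by blkid are the ones to put into /etc/crypttab and /etc/fstab.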
### Additional gory details ###

* dealing with a too-small destination drive
- the first time I launched OpenSuperClone and started it, at some point I had a doubt, ran some calculations, and saw that the partition on the external drive I was using would be slightly too small to hold the disk image. I had not accounted for the overhead space taken by the filesystem itself (journal, tree structure). (Also, maybe I freaked out for nothing by looking at something reporting in GiB instead of GB.)
- I bought a bigger external HDD (4 TB), made a 1.5 TB ext4 partition and a destination directory, and copied there OSC-project.log, OSC-project.log.bak and OSC-clone.img
- I had some trouble ejecting the first (small) external HDD. I unmounted the partition with a right click, located the drive letter with gparted: sdc => then $ udisksctl power-off -b /dev/sdc
- then relaunch OSC: File > Open project > select OSC-project.log on the new external HDD. Drives > destination > image file > select OSC-clone.img. Drives > source > /dev/sda. Check the model name and serial number
- double check everything
- click Connect, then Start

* retrying with direct AHCI access
- find the ATA port of /dev/sda: $ ls -l /sys/block/sda . Check also with $ ls -l /dev/disk/by-path/pci-something-ata-1 . Confirm in OSC, following the manual of HDDSuperClone (https://www.hddsuperclone.com/hddsuperclone/manual): Mode > direct AHCI (wait a bit, see the next point below), create a dummy new project (you can discard it later), Drives > choose source => I saw an entry "ata1.00 sda (big number) model_name serial_number" => this confirms the SATA port is 1 = 1.00
- reboot the OSC live USB. At the GRUB menu go to Disable/hide ATA Ports => Boot with ATA 1.00 disabled
- launch OSC, reopen the project (OSC-project.log, OSC-clone.img), Drives > choose source => ata1, and check that it gives the good model name and serial number
- recheck everything. Connect, Start. In my case it did not do better, still 640 bad LBAs. I tried tweaking a few settings and retrying, to no avail; I gave up

* errors switching to direct AHCI mode
- the first time I selected direct AHCI mode I got some error message
- I switched back and forth between other modes and used Drives > Fix Driver Memory Error. At some point it worked and I got a message that the error was fixed. Maybe just waiting a bit would have solved it too

* if gnome-disks cannot relock a LUKS container
- open gparted, right click on the partition > Deactivate. As far as I understand, this unmounts the LVM volumes
- now you can close the encryption (in gparted) or lock the container (in gnome-disks), which are the same thing in different words
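The same deactivate-then-lock sequence from a terminal, as a sketch (the volume group fabien and the mapping name recovered are from my examples above; your mapping name is whatever appears under /dev/mapper/):

```
sudo vgchange -an fabien         # deactivate the LVM volumes (gparted's "Deactivate")
sudo cryptsetup close recovered  # lock the container, i.e. close the encryption
sudo losetup -d /dev/loop1       # if it came from the disk image, also detach the loop device
```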