Maintenance time

Written By: - Date published: 11:51 am, January 2nd, 2022 - 15 comments
Categories: infrastructure, The Standard, uncategorized - Tags:

The Standard server is going down for a few hours later today for maintenance.

Server waiting for maintenance

It needs a replacement for the steadily failing hard drives in the archive array (RAID-Z3), swapping the 2TB disks (some now over 7 years old) for new 4TB drives. Rather than keeping 8 in the array (one has fully died and another has faulted), I’ll replace them with 4x4TB drives. About to head to PBTech to get one more drive (I prefer getting them out of different batches).

The metaSAS card needs to move so that it stops getting occasional overheating errors. I also have a large, quiet exterior fan to attach to provide more even ventilation.

The SSDs that run The Standard array are starting to get wear issues, most notably in the log regions. Plus I need to start replacing those Intel 120GB SSDs with something a bit bigger.

A good vacuum of the chassis interior wouldn’t go amiss. Even with the dust covers, the fine dust is slowly building up. Plus I think that the cat hair is starting to get in. She does like to groom herself just in front (after having a good sunbath).

I was thinking about updating the motherboard, CPU and RAM. But that can wait for another year.

As usual, I can’t really give you a precise time-frame because of the vagaries of shopping (urrgh) and balky hardware. However, everything is backed up (and it takes a while to archive close to 10TB to AWS S3). It should be a matter of only a few hours at worst to get up and running again, and then days to pull the non-critical backup data down again.

In the meantime – enjoy the sunny day. I know that Luna will…

Luna in the sun
Luna bathing in the sun and preparing to shed cat hair into the server.

Updated: The debris of old drives.

The residual drives – to be stored or disposed of.

15 comments on “Maintenance time”

  1. Puckish Rogue 1

    So does this mean there'll be a dictionary function added? (I'd chip in a few bucks if it was a money thing)

  2. Shanreagh 2

    Good luck Luna, you know this is only temporary and you will soon be able to get back into cat shedding mode in front of that consistently nice warm 'thing'.

    Support PR. If there is anything that a few $$$$ will help I would be happy to help too.

    This site and ALL its users/posters/commenters bring joy to me, not the least of which is the intelligence and writing skills of all. And not forgetting the impeccable moderating. smiley

  3. Dennis Frank 3

    I wonder if an AI module will eventually be a worthwhile addition? It could self-diagnose times for regenerative replacements to the system, put in the order to PBTech.

    You'd need a slot in your front door for the courier driver to drop the packages through. Then a package-stripping robot (along the lines of them automatic vacuumers) to extricate the replacement modules from the packages.

    Writing the instruction program to teach the robot to insert the new bits would be the tricky part. Robot would need eyes to see the slots, and an internalised photo of the equipment from the inserter's perspective to match up with.

    Anyway, if you need a creative project for your retirement plan you could consider how to make the TS system resilient via self-maintenance… enlightened

    • Shanreagh 3.1

      I vote for Luna being trained to do all this. She is on site, fewer $$$$ in travel costs and may be keen. So often when recruiting for new tasks the existing 'employees' are overlooked. After all, AI is pretty scary, almost as scary as having a vaccine…'ooops who said that?'

      We should keep this within the animal kingdom. Failing Luna then Gezza's Pukeko or Elvira the eel may be free for training. wink

    • lprent 3.2

      Curiously enough, a server self-diagnosis software module is what I'm writing at present professionally. Probably not the AI yet, I'm just pulling enough data to allow alerts. But also to enable data mining over time for expert system rules.

  4. higherstandard 4

    Has your lord and master thought of doing a Mayor Stubbs …. might be just what the city needs.

    • lprent 4.1

      She is no Governor Grey, ruler of government house.

      I could say that she rules my partner – the soft-hearted food giver to a cat whose finicky eating habits are a legend. However Luna doesn't sleep on my side of the bed because I auto-kick when bitten through the covers. Luna comes to me when she wants to be rubbed up the wrong way – as a hair removal.

      However all these must be ridiculous rumours because I just got told so….

  5. lprent 5

    Ok – starting to shut down. I located where my screwdriver set went to.

    • lprent 5.1

      And we're back.

      Had a problem: I had three identical Intel SSD drives, and I'd forgotten to get the serial numbers before shutdown.

      Like this

      $ lsblk --nodeps -o name,serial
      NAME SERIAL
      sda S1DHNSAF817333Y
      sdb ZDH8TXGX
      sdc WD-WX42D31KS5SE
      sdd WD-WXA2D21RV3HE
      sde ZDHAR4D8
      sdf ZDHAQ4NC
      sdg CVCV4302022A120BGN
      sdh CVLT614108Z2120GGN
      sdi PHWL5433003G120LGN
      sdj S2R4NX0H410553P
      sdk 130801400034
      sdl S3YBNB0N402586X
      sdm CVMP2355042L120BGN
      nvme0n1 50026B76839DD779
      $ lsscsi
      [0:0:16:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdb
      [0:0:17:0] disk ATA WDC WD40EFZX-68A 0B81 /dev/sdc
      [0:0:18:0] disk ATA WDC WD40EFZX-68A 0B81 /dev/sdd
      [0:0:19:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sde
      [0:0:20:0] disk ATA ST4000VN008-2DR1 SC60 /dev/sdf
      [1:0:0:0] disk ATA Samsung SSD 840 BB6Q /dev/sda
      [2:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sdg
      [3:0:0:0] disk ATA INTEL SSDSC2KW12 G200 /dev/sdh
      [4:0:0:0] disk ATA INTEL SSDSC2BB12 0370 /dev/sdi
      [5:0:0:0] disk ATA Samsung SSD 850 2B6Q /dev/sdj
      [6:0:0:0] disk ATA SanDisk SDSSDP06 0 /dev/sdk
      [7:0:0:0] disk ATA Samsung SSD 860 4B6Q /dev/sdl
      [8:0:0:0] disk ATA INTEL SSDSC2CT12 300i /dev/sdm
      [N:0:1:1] disk KINGSTON SA2000M81000G__1 /dev/nvme0n1

      That would have chopped nearly half an hour off the time. I kept putting the wrong one back in.

      But also not getting the right battery for the UPS systems didn’t help.
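A minimal sketch of how the two listings above could be joined so the serial numbers are on hand before shutdown. It assumes the text has already been captured from `lsblk --nodeps -o name,serial` and `lsscsi`; the sample strings are a subset of the output above, and the function names (`parse_lsblk`, `parse_lsscsi`, `serials_matching`) are hypothetical helpers, not part of either tool.

```python
# Sample text copied from the listings above (subset of the devices).
LSBLK_SAMPLE = """\
NAME SERIAL
sdg CVCV4302022A120BGN
sdh CVLT614108Z2120GGN
sdi PHWL5433003G120LGN
sdj S2R4NX0H410553P
sdm CVMP2355042L120BGN
"""

LSSCSI_SAMPLE = """\
[2:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sdg
[3:0:0:0] disk ATA INTEL SSDSC2KW12 G200 /dev/sdh
[4:0:0:0] disk ATA INTEL SSDSC2BB12 0370 /dev/sdi
[5:0:0:0] disk ATA Samsung SSD 850 2B6Q /dev/sdj
[8:0:0:0] disk ATA INTEL SSDSC2CT12 300i /dev/sdm
"""


def parse_lsblk(text):
    """Map device name -> serial from 'NAME SERIAL' formatted output."""
    serials = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) == 2:
            serials[parts[0]] = parts[1]
    return serials


def parse_lsscsi(text):
    """Map device name -> description (everything between 'disk' and the device path)."""
    models = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[-1].startswith("/dev/"):
            continue
        name = parts[-1][len("/dev/"):]
        models[name] = " ".join(parts[2:-1])
    return models


def serials_matching(serials, models, needle):
    """Return name -> serial for devices whose description contains needle."""
    return {
        name: serials[name]
        for name, model in models.items()
        if needle.lower() in model.lower() and name in serials
    }


if __name__ == "__main__":
    found = serials_matching(
        parse_lsblk(LSBLK_SAMPLE), parse_lsscsi(LSSCSI_SAMPLE), "intel"
    )
    for name, serial in sorted(found.items()):
        print(name, serial)
```

Printing that list and taping it to the chassis before powering off is the low-tech version of the same idea.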

  6. lprent 6

    Turns out the drives were older than I thought.

    • 1 WD from 2012
    • 5 WD from 2013
    • 1 Seagate from 2013 (which was the dead one).
    • 1 WD from 2012 2016

    Let's hope that 5 x 4 terabyte drives (3 Seagate IronWolf and 2 WD Red Plus) last as long. Turns out I forgot that I had a 4TB IronWolf that I had already bought in 2020 for testing.

    Now there are just these 8 2TB antiques plus a few Intel 120GB to clean and dispose of. Most of them are still ok with few errors. I will tuck them away in a backup eSATA system.

  7. McFlock 7

    any reason wordfence would be throwing 403 errors?

    • lprent 7.1

      Yeah, zpool creation of archive created it at /archive rather than /mnt/archive

      The rsync to copy Hold to /mnt/archive carefully created a /mnt/archive directory and filled up the root filesystem. When it ran out of space, it started to fill RAM, and then quit. That meant that there wasn't a lot of spare RAM available, so large requests to use it (like wordfence) would cause a 403 error because it couldn't get enough RAM to do whatever it was wanting to do.

      Fixed finally. Pretty freaky that TS carried on working for reading. Not so hot on saving anything.

      The rsyncs and copies are in full overnight pulls. Hopefully all done before morning.
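One way to guard against this kind of mishap is to refuse to copy unless the destination really is a mounted filesystem. The sketch below is an illustration only: `require_mountpoint` is a hypothetical helper, and it assumes the pool is expected to show up as a mount point at the given path (as a mounted ZFS dataset normally would).

```python
import os


def require_mountpoint(path):
    """Return the resolved path if it is a mount point, otherwise raise.

    Intended to be called before a bulk copy, so that a missing mount
    does not silently turn into a plain directory on the root filesystem.
    """
    real = os.path.realpath(path)
    if not os.path.ismount(real):
        raise RuntimeError(
            f"{real} is not a mount point; refusing to copy into it"
        )
    return real


if __name__ == "__main__":
    # /mnt/archive is the destination named in the comment above.
    try:
        dest = require_mountpoint("/mnt/archive")
        print("ok to copy into", dest)
    except RuntimeError as err:
        print("stopping:", err)
```

The check costs nothing and fails loudly, rather than filling the root filesystem and then RAM.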

  8. Descendant Of Smith 8

    Not related to your rebuild but I thought interesting anyway.

    Son is just dealing with a few of these errors on different businesses' systems. Affecting email for some.

    The FIP-FS "Microsoft" scan engine failed to load. Can't convert "2201010001" to long.

    YYMMDDhhmm formatted time exceeds signed INT range.

    Seems 2022 is a problem not just 2000.

    Found a link.

    https://borncity.com/win/2022/01/02/microsoft-besttigt-exchange-year-2022-problem-fip-fs-scan-engine-failed-to-load-1-jan-2022/

    • lprent 8.1

      Oh dear. Who would code that into a 32 bit signed integer?
      If they'd done it in a 32 bit unsigned integer then it wouldn't be a problem until later.

      Hex 0xFFFFFFFF == Dec 4,294,967,295 (i.e. unsigned, with no sign bit)
      Hex 0x7FFFFFFF == Dec 2,147,483,647 (signed maximum, can be negative)
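The range arithmetic can be checked directly. This is a small sketch, not the Exchange code: `fits_int32` and `fits_uint32` are hypothetical helper names, and the stamps are YYMMDDhhmm strings as described in the comment above.

```python
INT32_MAX = 2**31 - 1   # 0x7FFFFFFF, largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # 0xFFFFFFFF, largest unsigned 32-bit value


def fits_int32(stamp):
    """True if the YYMMDDhhmm string, read as one number, fits a signed 32-bit int."""
    return int(stamp) <= INT32_MAX


def fits_uint32(stamp):
    """True if the YYMMDDhhmm string, read as one number, fits an unsigned 32-bit int."""
    return int(stamp) <= UINT32_MAX


if __name__ == "__main__":
    for stamp in ("2112312359", "2201010001", "4301010000"):
        print(stamp, fits_int32(stamp), fits_uint32(stamp))
```

The last stamp of 2021 still fits in a signed 32-bit integer, the first of 2022 does not, and an unsigned integer only postpones the same failure to 2043.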

      Some dumbarse corporate programming fool obviously thought it'd be more 'efficient' to convert a string to a numeric to break it up. Which is stupid on the face of it. It is more efficient to split the string and convert the date elements 2 string bytes at a time. That is a trivial conversion to a date (see std c++ datetime), whereas converting 10 bytes at a time to a number is laborious, especially when you then have to split that number to get to a datetime.
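A rough sketch of the split-then-convert approach, in Python rather than C++ for brevity. `parse_stamp` is a hypothetical name, and mapping the two-digit year onto 2000-2099 is an assumption made for the example.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a YYMMDDhhmm string two characters at a time.

    No field ever exceeds two digits, so no intermediate value can
    overflow a 32-bit integer. The century (2000) is an assumption.
    """
    if len(stamp) != 10 or not (stamp.isascii() and stamp.isdigit()):
        raise ValueError(f"expected 10 digits, got {stamp!r}")
    yy, mm, dd, hh, mi = (int(stamp[i:i + 2]) for i in range(0, 10, 2))
    return datetime(2000 + yy, mm, dd, hh, mi)


if __name__ == "__main__":
    print(parse_stamp("2201010001"))
```

As a side effect, `datetime` rejects impossible values such as month 13, which the single-number conversion would accept without complaint.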

      (reads article) Ok – Microsoft Exchange – figures. One bit of software that I swore I’d never recommend two decades ago.

The server will be getting hardware changes this evening starting at 10pm NZDT.
The site will be offline for some hours.