FEATURE: Add import script for Friends+Me Google+ Exporter JSON archives (#7334)

This script has been used to import over 50,000 Google+ posts and over 300,000 comments from 29 communities into a single Discourse instance, as well as for at least three other imports. Google+ has closed to the public, but it is still available at this time for GSuite customers. If GSuite customers decide to migrate from Google+ to Discourse, or if Google “sunsets” Google+ for GSuite customers as well, this importer may be useful (see the “FMGE for GSuite?” topic on the FMGE_Support forum).

Development and use of this script have been discussed in detail on Discourse Meta: “[bounty] Google+ (private) communities: export screenscraper + importer” (marketplace).
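
The script's header comments describe how it is invoked; as a concrete sketch, a run over one exported community might look like the following (all file paths and the Discourse checkout location are hypothetical, and `--dry-run` parses everything without creating records):

```shell
cd /var/www/discourse
RAILS_ENV=production bundle exec ruby script/import_scripts/friendsmegplus.rb \
  categories.json \
  google-plus-image-list.csv \
  my-community-export.json \
  upload-paths.txt \
  --dry-run
```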

diff --git a/script/import_scripts/friendsmegplus.rb b/script/import_scripts/friendsmegplus.rb
new file mode 100644
index 0000000..2704774
--- /dev/null
+++ b/script/import_scripts/friendsmegplus.rb
@@ -0,0 +1,684 @@
+require File.expand_path(File.dirname(__FILE__) + "/base.rb")
+
+require 'csv'
+
+# Importer for Friends+Me Google+ Exporter (F+MG+E) output.
+#
+# Takes the full path (absolute or relative) to
+# * each of the F+MG+E JSON export files you want to import
+# * the F+MG+E google-plus-image-list.csv file,
+# * a categories.json file you write to describe how the Google+
+#   categories map to Discourse categories, subcategories, and tags.
+#
+# You can provide all the F+MG+E JSON export files in a single import
+# run.  This will be the fastest way to do the entire import if you
+# have enough memory and disk space.  It will work just as well to
+# import each F+MG+E JSON export file separately.  This might be
+# valuable if you have memory or space limitations, as the memory to
+# hold all the data from the F+MG+E JSON export files is one of the
+# key resources used by this script.
+#
+# Create an initial empty ("{}") categories.json file, and the import
+# script will write a .new file for you to fill in the details.
+# You will probably want to use jq to reformat the .new file before
+# trying to edit it.  `jq . categories.json.new > categories.json`
+#
+# Provide a filename that ends with "upload-paths.txt" and the name
+# of each file uploaded will be written to that file.
+#
+# Edit values at the top of the script to fit your preferences.
+
+class ImportScripts::FMGP < ImportScripts::Base
+
+  def initialize
+    super
+
+    # Set this to the base URL for the site; required for importing videos
+    # typically just 'https:' in production
+    @site_base_url = 'http://localhost:3000'
+    @system_user = Discourse.system_user
+    SiteSetting.max_image_size_kb = 40960
+    SiteSetting.max_attachment_size_kb = 40960
+    # handle the same video extension as the rest of Discourse
+    SiteSetting.authorized_extensions = (SiteSetting.authorized_extensions.split("|") + ['mp4', 'mov', 'webm', 'ogv']).uniq.join("|")
+    @invalid_bounce_score = 5.0
+    @min_title_words = 3
+    @max_title_words = 14
+    @min_title_characters = 12
+    @min_post_raw_characters = 12
+    # Set to true to create categories in categories.json.  Does
+    # not honor parent relationships; expects categories to be
+    # rearranged after import.
+    @create_categories = false
+
+    # JSON files produced by F+MG+E as an export of a community
+    @feeds = []
+
+    # CSV maps URLs to downloaded images and/or videos (exported separately)
+    @images = {}
+
+    # map from Google ID to local system users where necessary
+    # {
+    #   "128465039243871098234": "handle"
+    # }
+    # GoogleID 128465039243871098234 will show up as @handle
+    @usermap = {}
+
+    # G+ user IDs to filter out (spam, abuse): no topics or posts; silence and suspend when creating
+    # loaded from blacklist.json as array of google ids `[ 92310293874, 12378491235293 ]`
+    @blacklist = Set[]
+
+    # G+ user IDs whose posts are useful; if this is set, include only
+    # posts (and non-blacklisted comments) authored by these IDs
+    @whitelist = nil
+
+    # Tags to apply to every topic; use an empty Array to apply no global tags
+    @globaltags = [ "gplus" ]
+
+    @imagefiles = nil
+
+    # categories.json file is map:
+    # "google-category-uuid": {
+    #   "name": 'google+ category name',
+    #   "category": 'category name',
+    #   "parent": 'parent name', # optional
+    #   "create": true, # optional
+    #   "tags": ['list', 'of', 'tags'] # optional
+    # }
+    # Start with '{}', let the script generate categories.json.new once, then edit and re-run
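+    # Example entry (all names, UUIDs, and tags below are hypothetical):
+    # "a1b2c3d4-0000-0000-0000-000000000000": {
+    #   "name": "General Discussion",
+    #   "category": "Archive",
+    #   "parent": "Google+",
+    #   "tags": ["general"]
+    # }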
+    @categories = {}
+
+    # keep track of the filename in case we need to write a .new file
+    @categories_filename = nil
+    # dry run parses but doesn't create
+    @dryrun = false
+    # @last_date cuts off at a certain date, for late-spammed abandoned communities
+    @last_date = nil
+    # @first_date starts at a certain date, for early-spammed rescued communities
+    @first_date = nil
+    # every argument is a filename or a --flag; do the right thing based on the name
+    ARGV.each do |arg|
+      if arg.end_with?('.csv')
+        # CSV files produced by F+MG+E have "URL";"IsDownloaded";"FileName";"FilePath";"FileSize"
+        CSV.foreach(arg, headers: true, col_sep: ';') do |row|
+          @images[row[0]] = {
+            filename: row[2],
+            filepath: row[3],
+            filesize: row[4]
+          }
+        end
+      elsif arg.end_with?("upload-paths.txt")
+        @imagefiles = File.open(arg, "w")
+      elsif arg.end_with?('categories.json')
+        @categories_filename = arg
+        @categories = load_fmgp_json(arg)
+      elsif arg.end_with?("usermap.json")
+        @usermap = load_fmgp_json(arg)
+      elsif arg.end_with?('blacklist.json')
+        @blacklist = load_fmgp_json(arg).map { |i| i.to_s }.to_set
+      elsif arg.end_with?('whitelist.json')
+        @whitelist = load_fmgp_json(arg).map { |i| i.to_s }.to_set
+      elsif arg.end_with?('.json')
+        @feeds << load_fmgp_json(arg)
+      elsif arg == '--dry-run'
+        @dryrun = true
+      elsif arg.start_with?("--last-date=")
+        @last_date = Time.zone.parse(arg.gsub(/.*=/, ''))
+      elsif arg.start_with?("--first-date=")
+        @first_date = Time.zone.parse(arg.gsub(/.*=/, ''))
+      else
+        raise RuntimeError.new("unknown argument #{arg}")
+      end
+    end
+
+    raise RuntimeError.new("Must provide a categories.json file") if @categories_filename.nil?
+
+    # store the actual category objects looked up in the database
+    @cats = {}
+    # remember google auth DB lookup results
+    @emails = {}
+    @newusers = {}
+    @users = {}
+    # remember uploaded images
+    @uploaded = {}
+    # counters for post progress
+    @topics_imported = 0
+    @posts_imported = 0
+    @topics_skipped = 0
+    @posts_skipped = 0
+    @topics_blacklisted = 0
+    @posts_blacklisted = 0
+    # count uploaded file size
+    @totalsize = 0
+
+  end
+
+  def execute
+    puts "", "Importing from Friends+Me Google+ Exporter..."
+
+    read_categories
+    check_categories
+    map_categories
+
+    import_users
+    import_posts
+
+    # No need to set trust level 0 for any imported users unless F+MG+E gets the
+    # ability to add +1 data, in which case users who have only done a +1 and
+    # neither posted nor commented should be TL0, in which case this should be
+    # called after all other processing done
+    # update_tl0
+
+    @imagefiles.close() if !@imagefiles.nil?
+    puts "", "Uploaded #{@totalsize} bytes of image files"
+    puts "", "Done"
+  end
+
+  def load_fmgp_json(filename)
+    raise RuntimeError.new("File #{filename} not found") if !File.exist?(filename)
+    JSON.parse(File.read(filename))
+  end
+
+  def read_categories
+    @feeds.each do |feed|
+      feed["accounts"].each do |account|
+        account["communities"].each do |community|
+          community["categories"].each do |category|
+            if !@categories[category["id"]].present?
+              # Create empty entries to write and fill in manually
+              @categories[category["id"]] = {
+                "name" => category["name"],
+                "community" => community["name"],
+                "category" => "",
+                "parent" => nil,
+                "tags" => [],
+              }
+            elsif !@categories[category["id"]]["community"].present?
+              @categories[category["id"]]["community"] = community["name"]
+            end
+          end
+        end
+      end
+    end
+  end
+
+  def check_categories
+    # raise a useful exception if necessary data not found in categories.json
+    incomplete_categories = []
+    @categories.each do |id, c|
+      if !c["category"].present?

[... diff too long, it was truncated ...]

GitHub sha: 9fc3de01