I don’t want to rant too much about WordPress, but the sloppy data model is the main reason why we decided to switch. Pictures in articles are just hard-coded URLs to the picture, links to other articles are also just HTML links, and you generally have a very weak separation between content and layout. If you have loads of articles created with all the different WordPress versions over the years, you end up with inhomogeneous content. Since many links are just hard-coded, you have no easy way of checking whether all the referenced files still exist and where they are used.
Another reason was the multi-language support of TYPO3. The main language of the site is German, but for some articles it would be beneficial to also offer the content in English.
The last reason was that we wanted to see how hard it is to migrate from WordPress to TYPO3 :-D
The goal was to import all YouTube videos and images to FAL and to import the posts to the news extension. The pictures should be referenced correctly as media files, and since we have a “read more” section in the posts containing hard-coded links to other news posts, these links should also be converted to the “related news” feature of the news extension.
The physical import of the images was the easy part: Copy all images to the fileadmin folder and delete all generated/resized pictures of WordPress (files with the resolution as suffix). I used this command:
find ./uploads/ -regextype posix-extended -regex '^.+?-[0-9]{3}x[0-9]{2,3}\.jpg' | xargs rm
The "Update Storage Index" scheduler task adds all the files to the FAL.
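Before deleting anything, it can help to dry-run the pattern and look at what it would hit. The following is a minimal sketch in Python (not part of the actual importer, which is written in PHP); the function name find_resized is made up for illustration, and the pattern is the one from the find command above without the lazy quantifier.

```python
import re
from pathlib import Path

# Same idea as the find command: "<anything>-<3 digits>x<2-3 digits>.jpg",
# matched against the whole path, as find -regex does.
RESIZED = re.compile(r".+-[0-9]{3}x[0-9]{2,3}\.jpg")


def find_resized(root):
    """Return all .jpg files below root that look like WordPress size variants."""
    return sorted(
        path for path in Path(root).rglob("*.jpg")
        if RESIZED.fullmatch(path.as_posix())
    )
```

Printing the result of find_resized("./uploads") gives a list to review before running the real deletion. Note that the pattern only covers three-digit widths, so a variant such as -1024x768.jpg is not matched and would have to be handled separately.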
For the migration of the news posts I created a command line controller, which we conveniently call via the typo3_console extension. The command is meant to be run only once for the actual migration, so it simply erases all news-related data before each run.
Users
The WordPress users are stored in the table wp_users. It’s very simple to import the users to the be_users table. Even the password hash can just be copied to the password field.
Categories and Tags
The first step is to import all categories and tags. Both are stored in the table wp_term_taxonomy, where the type is defined by the column taxonomy (category for categories, post_tag for tags).
Since the ids after import are different in TYPO3, it’s necessary to build up mapping arrays so the old ids can be mapped to the new ids.
Categories are imported as sys_category, tags are imported as tx_news_domain_model_tag.
WordPress stores the tag/category name in a separate table, so you have to query wp_term_taxonomy for a list of all Tags/Categories and then use the term_id to query the wp_terms table to get the name. Some error checking is necessary since in our instance some terms were missing while the entry in the taxonomy table still existed.
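The mapping and the error checking can be sketched roughly like this. It is an illustration in Python, not the PHP code of the importer: the rows are plain dictionaries, the insert callback is a hypothetical stand-in for whatever writes the TYPO3 record and returns its new uid, and keying the map by term_taxonomy_id is an assumption based on wp_term_relationships referencing that column.

```python
TARGET_TABLES = {
    "category": "sys_category",
    "post_tag": "tx_news_domain_model_tag",
}


def import_terms(taxonomy_rows, terms_by_id, insert):
    """Import categories and tags, returning (old id -> new uid maps, skipped ids).

    taxonomy_rows: rows from wp_term_taxonomy as dicts
    terms_by_id:   term_id -> name, read from wp_terms
    insert:        callable(table, name) -> new uid (hypothetical helper)
    """
    id_map = {taxonomy: {} for taxonomy in TARGET_TABLES}
    skipped = []
    for row in taxonomy_rows:
        table = TARGET_TABLES.get(row["taxonomy"])
        if table is None:
            continue  # other taxonomies are not part of this import
        name = terms_by_id.get(row["term_id"])
        if name is None:
            # Taxonomy entry whose term is missing: remember it and move on.
            skipped.append(row["term_taxonomy_id"])
            continue
        id_map[row["taxonomy"]][row["term_taxonomy_id"]] = insert(table, name)
    return id_map, skipped
```

The returned maps are what the later steps need to translate the old ids into the new TYPO3 uids, and the skipped list makes the orphaned taxonomy entries visible instead of silently dropping them.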
Posts
The initial import of the posts is not too hard. The posts are stored in the wp_posts table. As WordPress stores all kinds of stuff in this table, additional conditions have to be added: post_type='post' and post_status='publish' in our case.
We extended the tx_news model to store the original post id and post_name (permalink) in the imported post. After all posts are imported the tags and categories are assigned to the posts (as stored in the wp_term_relationships table).
The post text is split on the <!--more--> marker since news has a dedicated teaser column.
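A minimal sketch of the split, written in Python for illustration only. The function name split_post is made up, and leaving the teaser empty when a post has no marker is an assumption, not necessarily what the importer does.

```python
MORE_TAG = "<!--more-->"


def split_post(content):
    """Split WordPress post content into (teaser, bodytext) at the more marker."""
    teaser, marker, body = content.partition(MORE_TAG)
    if not marker:
        # No marker in this post: keep everything as body text (assumption).
        return "", content.strip()
    return teaser.strip(), body.strip()
```

For example, split_post("Intro text.<!--more-->Full story.") yields the teaser "Intro text." and the body "Full story.".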
Images
WordPress stores the metadata of images in the wp_posts table. Makes total sense, right? To get all the rows I used the conditions post_status='inherit' and post_type='attachment'. The filename is stored in the column guid and can be used to find the file in the sys_file table of TYPO3. With this information it’s relatively easy to create a sys_file_metadata entry with the imported title and description of the image.
Initially I assumed that the images also show up in the post they are attached to (post_parent column), but this is only the case for galleries. This feature was not used on our site, so this information could be left out. This might be different on other installations, of course.
Another important image is the preview image of each post. WordPress stores this image in the wp_postmeta table where the meta_key is '_thumbnail_id'. News stores this image as a sys_file_reference that assigns the image to the post, with the column showinpreview of the reference set to 1.
Are we done yet?
Basically, this was the initial import of all data. The main problem is that you have to output the post content with the f:format.raw view helper and still have no references on the data model level.
So I did not stop here and added a parser to convert the hard-coded URLs to relations in the database. This is where the fun started. Initially I tried to use some regular expressions to extract the image tags and a-tags, but as we all should know by now: parsing HTML with regular expressions is bad (tm), as the well-known Stack Overflow question "RegEx match open tags except XHTML self-contained tags" illustrates.
So after many hours of frustration I switched to the HTML parser in PHP. This did solve many problems, but it still took quite some time to handle all cases. The main problem was that the content had been created over a large span of time by different authors, so there were many different variants of how things were done. The details are specific to this installation and will be different on other sites. But I want to outline some points that are probably common to other sites as well:
Import all images
To include an image in a news post you have two options: use the RTE to insert an image or attach it as a media file. We wanted to avoid using images in the RTE, so we had to use the media file reference option. News usually shows all attached images after the text, but we have to include images inline in the text. To enable this, we implemented a ViewHelper that replaces [picture] references in the text with the actual images, in the order they are assigned in the relations tab. That way the editor can simply type [picture] and the image is inserted automatically.
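The replacement logic of such a ViewHelper boils down to consuming the attached images one by one. Here is a rough sketch of that idea in Python; the real ViewHelper is PHP/Fluid, the name render_pictures is invented for this example, and dropping placeholders that have no image left is an assumption.

```python
import re

PLACEHOLDER = re.compile(r"\[picture\]")


def render_pictures(text, rendered_images):
    """Replace the n-th [picture] in text with the n-th rendered image.

    rendered_images: already rendered HTML snippets, in relation order.
    Placeholders without a remaining image are replaced by an empty string.
    """
    remaining = iter(rendered_images)
    return PLACEHOLDER.sub(lambda match: next(remaining, ""), text)
```

Because the images are taken strictly in order, reordering the relations in the backend changes which image appears at which placeholder without touching the text.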
The parsing itself was not too hard. Just use the parser to get all img-tags in a post, check if it’s an internally stored image and replace the tag with [picture].
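To illustrate the approach, here is a small sketch using Python's html.parser instead of the PHP parser used in the importer. The class and function names are made up, the host www.example.org in the usage note is a placeholder, and treating relative URLs as internal is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PictureExtractor(HTMLParser):
    """Re-emit the HTML unchanged, except internal <img> tags become [picture]."""

    def __init__(self, internal_hosts):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.internal_hosts = set(internal_hosts)
        self.out = []
        self.images = []

    def _is_internal(self, src):
        host = urlparse(src).netloc
        return host == "" or host in self.internal_hosts

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "img" and src and self._is_internal(src):
            self.images.append(src)
            self.out.append("[picture]")
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: the raw tag text already contains the "/>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def replace_images(html, internal_hosts):
    """Return (rewritten html, list of internal image URLs in document order)."""
    parser = PictureExtractor(internal_hosts)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.images
```

Calling replace_images(post_html, ["www.example.org"]) returns the rewritten text together with the image URLs in the order they appeared, which is the order in which the file references would then have to be attached to the post.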
Import Videos
If a YouTube video is embedded in a post, it’s imported as well (not the video itself, but the link to the video can be added to the FAL):
OnlineMediaHelperRegistry::getInstance()->transformUrlToFile($url, $this->targetFolder, ['youtube']);
The FAL entry of this video is then assigned to the post and can be inserted just like an image with the [video] tag in the text.
Internal links
All internal links were replaced with linkhandler links. We had a custom BBCode tag at the end of each article that linked to related news. This information is imported into the tx_news_domain_model_link table.
Summary
It was both fun and frustrating to write this importer. WordPress spreads the information across many different tables in some places and uses the same table for different content in other places. The data model is a strange mix of “inline content” and relations all over the place. The integrity of the data is generally bad (at least I found many problems in our installation). Since there is no strong separation between data and presentation, you will probably run into trouble at some point.
Can this importer also import other instances? The users/tags/categories/posts import is the same for all sites. But all further processing of the old posts, and with it the better data structure that is the actual benefit to gain, depends on the content (which WordPress plugins were used, how the editors inserted images, …).
You can see the result at https://www.austrianwings.info
Feel free to contact us (https://reelworx.at) if you need support with a similar task.