URL rewriting using Apache

Several years ago, you made the grave error of putting an overly specific URL in some advertising (example.com/fall_2005_news.html). You’ve come to your senses and are re-organizing your website’s structure, and you really want to get rid of that html file. You’d like any user attempting to visit that ancient URL to instead be shunted to example.com/news.

Well, such a procedure is relatively easy. Apache .htaccess files can be used in any folder of your website, and will apply to any subfolders. Thus, a single .htaccess file in the root of your domain can apply to the entire website.
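
As a rough sketch, the redirect from the opening example could look something like this in the .htaccess file at the root of the site. It assumes mod_rewrite is available on your host and that /news is where the new page lives. R=301 marks the redirect as permanent and L stops further rules from being processed.

```apache
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /

# Permanently redirect the one retired page to the new news section.
# The dot is escaped so that it only matches a literal period.
RewriteRule ^fall_2005_news\.html$ /news [R=301,L]
</IfModule>
```

Any request for example.com/fall_2005_news.html should then receive a 301 response pointing at example.com/news, while every other URL is left alone.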

The aforementioned example is fairly simple, but URL rewriting of extreme complexity is possible once you know the rules. Generally, instead of simply changing file1.html to folder/file1.html using a single rule, you will instead change every file fitting the form fileX.html to folder/fileX.html. That is to say, any request for a file that matches a specific pattern will be rewritten to a new URL.
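
A minimal sketch of that pattern-based idea, assuming the X in fileX.html is a number and that the files now live in a directory called folder (both taken from the example above):

```apache
RewriteEngine on
RewriteBase /

# file1.html, file2.html, file37.html, ... are served from folder/ instead.
# ([0-9]+) captures the number so it can be reused as $1 in the new path.
RewriteRule ^file([0-9]+)\.html$ folder/file$1.html [L]
```

Because there is no R flag, this is an internal rewrite: the visitor's address bar keeps showing the old URL. Adding R=301 to the flags would turn it into a visible redirect instead.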

Regular Expressions

It’s a bit beyond the scope of this post to teach regex in depth, but here are a few pointers to help decipher the code in the next section.

Regex Pointers

^ start of line anchor
$ end of line anchor
. match any single character
? match 0 or 1 of the preceding element
* match 0 to N of the preceding element
+ match 1 to N of the preceding element
[abc] match any 1 character from the list abc
[^ab] match any 1 character except a and b
\. match a literal period
(.*) capture group that matches any run of characters and stores it as a backreference
! negate the match (a mod_rewrite prefix, not part of regex itself)

As a mildly complex example, consider ^www\.([^\.]+\.[^\.]+)$. This starts at the beginning of the line, matches the character string www, followed by a period, followed by 1 to N characters that are not a period, followed by a period, followed by 1 to N characters that are not a period, followed by the end of the line.

It matches any hostname that fits the form: www.example.com. It then stores a backreference for the example.com portion of the match. Think of a backreference as a saved variable containing the text that was matched inside a pair of parentheses. Multiple backreferences can be made per block of regex, and they can be used later on in your code.
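
To make the "saved variable" idea concrete, here is a small sketch that captures the bare domain and stores it in an environment variable without changing the URL. BARE_HOST is a made-up variable name used only for illustration; the E flag itself is a standard mod_rewrite flag.

```apache
RewriteEngine on

# If the host is www.<name>.<tld>, capture "<name>.<tld>" as %1 ...
RewriteCond %{HTTP_HOST} ^www\.([^.]+\.[^.]+)$ [NC]
# ... and copy it into an environment variable. The "-" means
# "do not substitute anything", so the URL itself is untouched.
RewriteRule ^ - [E=BARE_HOST:%1]
```

Inside a character class such as [^.] the period does not need a backslash, which is why this sketch drops it. Both spellings match the same hosts.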

URL Rewriting

Apache defines a few simple commands that allow you to use regex to dynamically alter a URL. The official documentation is a great help, but the examples on the web are of much greater instructional value. Generally you begin URL rewriting like so:

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /

and end it like so:

</IfModule>

The IfModule block encloses all URL rewriting commands, and Apache will only evaluate the block if the mod_rewrite module is loaded. Inside it, the first two lines turn on the rewrite engine and set the base of all rewriting to the root directory.

RewriteCond and RewriteRule control how a URL is rewritten. Generally you write 0 to N RewriteCond statements followed by a single RewriteRule statement. The RewriteRule will only be executed if every RewriteCond statement preceding it matches something.

RewriteCond takes two arguments: the string to match against, followed by the pattern to match. RewriteRule takes two arguments: the pattern to match, followed by the rewritten URL. Both accept an optional list of flags in square brackets at the end. RewriteRule always matches against the path portion of the requested URL; in a .htaccess file the leading slash is stripped first, so if the URL was http://www.example.com/folder1/file1.html, a rule in the root .htaccess would match against the string folder1/file1.html.
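
Put together, a condition and rule set might look like the sketch below. The folder2 path and the choice of conditions are invented for illustration; REQUEST_METHOD and QUERY_STRING are standard server variables.

```apache
RewriteEngine on
RewriteBase /

# Condition 1: the request uses the GET method ("=" is an exact string comparison).
RewriteCond %{REQUEST_METHOD} =GET
# Condition 2: the request has no query string.
RewriteCond %{QUERY_STRING} ^$
# Rule: only when both conditions hold, serve folder2/file1.html
# for requests to folder1/file1.html.
RewriteRule ^folder1/file1\.html$ folder2/file1.html [L]
```

Note that the RewriteRule pattern has no leading slash, because the rule lives in a .htaccess file.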

Strip off the www subdomain

RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]

Backreferences

%1 the first backreference from RewriteCond
$1 the first backreference from RewriteRule
$2 the second backreference from RewriteRule

This starts with the HTTP_HOST variable, which contains just the www.example.com portion of the incoming URL. It then matches www., and stores a backreference to all characters that come after that.

Assuming that the host did indeed start with www., the RewriteRule comes into play. The pattern ^(.*)$ will match the entire request path, and store a backreference to the string. The URL is then rewritten using the two backreferences: %1 (the host without www.) and $1 (the path).
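
The rule above always redirects to http://. If the site is also served over HTTPS, a variant along these lines may be closer to what you want. It assumes Apache itself terminates the TLS connection, so that the HTTPS variable is set to on for secure requests; behind a proxy or load balancer that assumption may not hold.

```apache
RewriteEngine on

# HTTPS requests: keep the https:// scheme when dropping www.
# %1 refers to the capture in the last RewriteCond, i.e. the host.
RewriteCond %{HTTPS} =on
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]

# Plain HTTP requests: same redirect, http:// scheme.
RewriteCond %{HTTPS} !=on
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
```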

Map subdomains to subfolders

RewriteCond %{HTTP_HOST} ^([^.]+)\.([^.]+\.[^.]+)$
RewriteRule ^(.*)$ http://%2/%1/$1 [R=301,L]

This starts with the HTTP_HOST variable, which contains subdomain.example.com. It finds the subdomain using ([^.]+)\. to match 1..N characters that are not periods until it reaches a period. A backreference is stored as %1.

Next, it finds the domain using ([^.]+\.[^.]+)$ to match 1..N characters that are not periods (example) followed by a period followed by 1..N characters that are not periods (com). It stores the domain in the backreference %2.

Finally, the RewriteRule uses the pattern ^(.*)$ to match the entire request path, and stores a backreference as $1. The URL is then rewritten using the three backreferences. So, subdomain.example.com now becomes example.com/subdomain.
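
One side effect worth knowing about: www is itself a subdomain, so the rule would also send www.example.com to example.com/www. If that is not what you want, an extra negated condition can exclude it. This is only a sketch of one way to do it:

```apache
RewriteEngine on

# Skip hosts that start with "www." ...
RewriteCond %{HTTP_HOST} !^www\. [NC]
# ... and map every other subdomain to a subfolder as before.
# %1 and %2 come from this last condition.
RewriteCond %{HTTP_HOST} ^([^.]+)\.([^.]+\.[^.]+)$
RewriteRule ^(.*)$ http://%2/%1/$1 [R=301,L]
```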

Prevent image hotlinking

If you host images on your website, you may want to prevent other websites from stealing your bandwidth by hotlinking to your images.

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/.*$ [NC]
RewriteRule \.(gif|jpg|png)$ - [F]

The first RewriteCond will match any request where the referring site is not empty. The second will match any request where the referring site is any site except your own site (example.com or www.example.com).

If both of those conditions are met, the RewriteRule kicks in, and matches any file that ends in gif, jpg, or png. So, if any outside website links to a file on your website with one of those three extensions, the request will get a forbidden (403) response.
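
Instead of a forbidden response, some sites serve a placeholder image to hotlinkers. A sketch of that variation follows; /images/nohotlink.png is a hypothetical path that you would need to create yourself, and the https? in the referrer pattern is an addition so that both schemes count as your own site.

```apache
RewriteEngine on

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Do not rewrite the placeholder itself, or it could never be served.
RewriteCond %{REQUEST_URI} !^/images/nohotlink\.png$
RewriteRule \.(gif|jpg|png)$ /images/nohotlink.png [L]
```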

WordPress URL Rewriting

If you use WordPress, when you customize your permalinks through the admin interface, WordPress will attempt to alter your .htaccess file to add the following lines:

RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

This checks whether the requested file or directory exists (example.com/directory). If the file or directory exists, nothing happens, and you are able to access the resource as usual. If it does not exist (example.com/postname), the RewriteRule activates, sending the request to the WordPress index.php. From here, the WordPress permalink PHP code takes over, translating your request into the WordPress resource you requested.
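
For reference, here is roughly how those lines fit into a complete block, using the opening and closing lines from earlier in this post. The extra index.php rule is an optional short-circuit that, to my knowledge, WordPress also writes in its generated block; treat the exact contents as an approximation and compare against what your own install produces.

```apache
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /
# Requests for index.php itself are left alone.
RewriteRule ^index\.php$ - [L]
# Anything that is not an existing file or directory goes to WordPress.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
```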

2 Responses to “URL rewriting using Apache”

  1. Hey
    Rewrite Conditions
    How do I check if my host has a subdomain (pl from pl.engadget.com)
    and
    How do I check the present URL’s path text (/iphone/ or /iphone from pl.engadget.com/iphone or pl.engadget.com/iphone#blahhh or pl.engadget.com/iphone/whatever/stuff/here/….)

    One more thing: if the above conditions are true, the pagination URL should be http://pl.engadget.com/iphone/page/pagenum, and the htaccess rule I thought of is:
    RewriteRule ^iphone/tag/([^/]+)/page/([0-9999]+)/?$ /index.php?a=iphone-tag&tag=$1&postpage=$2 [L,QSA]

    Nothing seems to work. Please help.

  2. Checking if a host has a subdomain

    In your example (pl.engadget.com) a simple workable pattern would be ^([^.]+)\.engadget\.com$. This pattern matches only when a subdomain exists, and stores the subdomain as a backreference. From there, just use the backreference in your RewriteRule.

    Now if you’re checking for a specific subdomain, the pattern is simply going to be ^pl\.engadget\.com$

    Check the present url’s path

    It looks like you’re trying to say ‘if the url’s path text matches pattern1 or pattern2 or pattern3, then rewrite the url like so’. Now if you were requiring all of these patterns to match at the same time, you’d use several RewriteCond statements with a closing RewriteRule.

    However, in your case it seems like you’re trying to perform a url rewrite on several different areas (categories, tags, pages, individual posts). For that, the best starting point is to set up several sets of RewriteCond and RewriteRule to handle every distinct case. Once those are working, you can try to combine the RewriteCond patterns into something more general.
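
    If you do end up wanting "pattern1 or pattern2 or pattern3", the OR flag on RewriteCond covers that case. A rough sketch, where /ipad and /android are made-up paths for illustration and the a and postpage parameters are borrowed from your rule:

    ```apache
    # Any one of these three conditions is enough for the rule to apply.
    RewriteCond %{REQUEST_URI} ^/iphone [OR]
    RewriteCond %{REQUEST_URI} ^/ipad [OR]
    RewriteCond %{REQUEST_URI} ^/android
    # $1 is the first path segment, $2 is the page number.
    RewriteRule ^([^/]+)/page/([0-9]+)/?$ /index.php?a=$1&postpage=$2 [L]
    ```

    Without the OR flag, consecutive RewriteCond lines are combined with AND.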

    Solution

    One thing to check in your RewriteRule is the [L,QSA] at the end. The QSA flag enables query string append mode: when the rewritten URL has its own query string, Apache appends the query string of the original request to it instead of discarding it.

    So, in your RewriteRule, any parameters on the incoming URL would be added after a, tag and postpage. If you don’t want that, simply remove the QSA. Also note that [0-9999]+ is just a longer way of writing [0-9]+.

    Combining all this together, a sample rule would look like so:

    RewriteCond %{HTTP_HOST} ^pl\.engadget\.com$
    RewriteCond %{REQUEST_URI} ^/iphone/?.*$
    RewriteRule ^iphone/tag/([^/]+)/page/([0-9]+)/?$ /index.php?a=iphonetag&tag=$1&postpage=$2 [L]
    

    And translated into English, it would read as follows: If the host of the url is pl.engadget.com AND the uri of the url starts with /iphone THEN take the current url (pl.engadget.com/iphone/tag/sampletag/page/1/) and rewrite the url to the form pl.engadget.com/index.php?a=iphonetag&tag=sampletag&postpage=1.

    Obviously, you’re going to need to put a lot more thought than I did into how the urls for this vary, and how to best match the arguments. Hopefully I’ve started you on the right track.
