How can I use regex with sed (or equivalent unix c

Posted 2019-07-14 05:41

regular expression attempt

(\\section\{|\\subsection\{|\\subsubsection\{|\\paragraph[^{]*\{)(\w)\w*([ |\}]*)

search text

\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
\subsubsection{good morning, San Francisco}
\paragraph{installation of backend services}

desired output

The initial letter of every word is capitalized, except for prepositions, conjunctions, and the other parts of speech that conventionally stay lower case in titles.

I suppose I should narrow this down, so let me borrow from the U.S. Government Printing Office Style Manual:

The articles a, an, and the; the prepositions at, by, for, in, of, on, to, and up; the conjunctions and, as, but, if, or, and nor; and the second element of a compound numeral are not capitalized.

Page 41

\subsection{Installation guide for the server-side app \myapp{webgen}}

changes to

\subsection{Installation Guide for the Server-side App \myapp{Webgen}}

OR

\subsection{Installation Guide for the Server-side App \myapp{webgen}}

How would you name this type of string modification?

  1. Applying REGEX to a string between strings?

  2. Applying REGEX to a part of a string when that part falls between two other strings of characters?

  3. Applying REGEX to a substring that occurs between two other substrings within a string?

  4. <something else>

problem

I match each LaTeX heading command, including the {. This means that my expression does not match more than the first word in the actual heading text. I cannot surround the whole heading code with an "OR space" because then I will find nearly every word in the document. Also, I have to be careful of formatting commands within the headings themselves.
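One way around the "first word only" problem is to split the job in two: first capture the whole heading argument, then work on the words inside it. The sketch below shows only the first half, in Python rather than sed, using a brace counter so that nested commands such as \myformatting{special} stay inside the captured text. The names HEADING and find_headings are made up for this illustration, and the sketch assumes the headings contain no escaped braces (\{ or \}).

```python
import re

# The four heading commands from the question, up to and including the opening brace.
HEADING = re.compile(r"\\(section|subsection|subsubsection|paragraph)\{")

def find_headings(source):
    """Yield (command, start, end); source[start:end] is the full heading argument."""
    for m in HEADING.finditer(source):
        depth = 1            # we are just past the opening brace
        pos = m.end()
        while pos < len(source) and depth > 0:
            ch = source[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            pos += 1
        if depth == 0:       # skip headings whose braces never balance
            yield m.group(1), m.end(), pos - 1

sample = r"""\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
"""
for command, start, end in find_headings(sample):
    print(command, "->", sample[start:end])
# section -> intro to installation of apps
# subsection -> another heading for \myformatting{special}
```

With the argument isolated like this, the capitalization rule can be applied to every word in the heading without touching the rest of the document.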


2 answers
何必那么认真
Reply #2 · 2019-07-14 06:12

Here is an example of how you could do it in Perl, using the module Lingua::EN::Titlecase and recursive regular expressions:

use strict;
use warnings;

use Lingua::EN::Titlecase;

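my $tc = Lingua::EN::Titlecase->new();
# Slurp all input (files named on the command line, or STDIN) into one string.
my $data = do {local $/; <> };

# Build one alternation that matches any of the heading command names.
my ($kw_regex) = map { qr/$_/ }
  join '|', qw(section subsection subsubsection paragraph);
# Group 2 matches a brace-balanced argument; (?2) recurses into it for nested braces.
$data =~ s/(\\(?: $kw_regex))(\{(?:[^{}]++|(?2))*\})/title_case($tc,$1,$2)/gex;
print $data;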

sub title_case {
    my ($tc, $p1, $p2) = @_;

    $p2 =~ s/^\{//;
    $p2 =~ s/\}$//;
    if ($p2 =~ /\\/ ) {
        while ($p2 =~ /\G(.*?)(\\.*?)(\{(?:[^{}]++|(?3))*\})/ ) {
            my $next_pos = $+[0];
            substr($p2, $-[1], $+[1] -$-[1], $tc->title($1));
            substr($p2, $-[3], $+[3] -$-[3], title_case($tc,'',$3));
            pos($p2) = $next_pos;
        }
        $p2 =~ s/\G(.+)$/$tc->title($1)/e;
    }
    else {
        $p2 = $tc->title($p2);
    }
    return $p1 . '{' . $p2 . '}';
}
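For readers without Perl, here is a rough Python counterpart of the same idea. It is a sketch, not a port of the code above. Python's standard re module has no recursive patterns, so this version splits the heading text on command names and changes only the pieces in between. SMALL_WORDS, WORD, COMMAND and title_heading are names invented for the example, and the word list is the one quoted in the question rather than whatever Lingua::EN::Titlecase uses internally.

```python
import re

# Words left in lower case, taken from the style rule quoted in the question.
SMALL_WORDS = {
    "a", "an", "the",
    "at", "by", "for", "in", "of", "on", "to", "up",
    "and", "as", "but", "if", "or", "nor",
}
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")   # hyphenated words count as one word
COMMAND = re.compile(r"(\\[A-Za-z]+)")      # a LaTeX command name such as \myapp

def title_heading(text):
    """Title-case heading text, leaving LaTeX command names untouched."""
    seen_word = False

    def fix(match):
        nonlocal seen_word
        word = match.group(0)
        is_first = not seen_word
        seen_word = True
        if not is_first and word.lower() in SMALL_WORDS:
            return word.lower()
        # Upper-case only the first letter, so "server-side" becomes "Server-side".
        return word[0].upper() + word[1:]

    # split() with a capturing group alternates plain text (even index)
    # and command names (odd index).
    pieces = COMMAND.split(text)
    return "".join(p if i % 2 else WORD.sub(fix, p) for i, p in enumerate(pieces))

print(title_heading(r"Installation guide for the server-side app \myapp{webgen}"))
# Installation Guide for the Server-side App \myapp{Webgen}
```

This version also capitalizes the argument of the nested command, which corresponds to the first of the two acceptable outputs in the question. Producing the second would mean skipping the braced group that follows each command name.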
爷的心禁止访问
Reply #3 · 2019-07-14 06:17

So it seems to me as if you need to implement pseudo-code like this:

  1. Are we on the first word? If yes, capitalize it and move on.
  2. Is the current word "reserved"? If yes, lower it and move on.
  3. Is the current word a numeral? If yes, lower it and move on.
  4. Are we still in the list? If yes, print the line verbatim and move on.

One other helpful rule might be to leave fully upper-case words as they are, just in case they're acronyms.
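The steps above, together with the acronym rule, can be sketched as a single per-word decision chain. This is shown in Python purely as an illustration. RESERVED, NUMERALS and title_words are hypothetical names, and the numerals list is only one guess at what "the second element of a compound numeral" should cover.

```python
# Lower-case words from the quoted style rule: articles, prepositions, conjunctions.
RESERVED = set("a an the at by for in of on to up and as but if or nor".split())
# Assumption: treat these as the "second element of a compound numeral".
NUMERALS = set("hundred thousand million billion".split())

def title_words(words):
    """Apply the capitalization steps to a list of whitespace-separated words."""
    result = []
    for index, word in enumerate(words):
        if index == 0:
            # Step 1: the first word is always capitalized.
            result.append(word[:1].upper() + word[1:])
        elif len(word) > 1 and word.isupper():
            # Extra rule: leave fully upper-case words alone (likely acronyms).
            result.append(word)
        elif word.lower() in RESERVED or word.lower() in NUMERALS:
            # Steps 2 and 3: reserved words and numerals go to lower case.
            result.append(word.lower())
        else:
            # Everything else gets an initial capital.
            result.append(word[:1].upper() + word[1:])
    return result

print(" ".join(title_words("guide to the NASA toolkit for IT staff".split())))
# Guide to the NASA Toolkit for IT Staff
```

Putting the acronym check ahead of the reserved-word check keeps "IT" from being folded into "it", at the cost of also preserving a reserved word that was deliberately typed in capitals.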

The following awk script might do what you need.

#!/usr/bin/awk -f

function toformal(subject) {
  return toupper(substr(subject,1,1)) tolower(substr(subject,2))
}

BEGIN {
  # Reserved word list gets split into an array for easy matching.
  # It includes the articles (a, an, the) from the quoted style rule.
  reserved="a an the at by for in of on to up and as but if or nor";
  split(reserved,a_reserved," "); for(i in a_reserved) r[a_reserved[i]]=1;
  # Same with the list of compound numerals. If this isn't what you mean, say so.
  numerals="hundred thousand million billion";
  split(numerals,a_numerals," "); for(i in a_numerals) n[a_numerals[i]]=1;
}

# This awk condition matches the lines we're interested in modifying.
/^\\(section|subsection|subsubsection|paragraph)[{]/ {

  # Separate the section command name and the heading text, then split the text into words.
  section=$0; sub(/\\/,"",section); sub(/[{].*/,"",section);
  # Strip from the last closing brace onward, so nested commands keep their own braces.
  text=$0; sub(/^[^{]*[{]/,"",text); sub(/[}][^}]*$/,"",text);
  size=split(text,atext,/[[:space:]]+/);

  # First word...
  newtext=toformal(atext[1]);

  for(i=2; i<=size; i++) {
    # Reserved word...
    if (r[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
    # Compound numerals...
    if (n[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
#    # Acronyms maybe...
#    if (atext[i] == toupper(atext[i])) { newtext=newtext " " atext[i]; continue; }
    # Everything else...
    newtext=newtext " " toformal(atext[i]);
  }

  # Re-wrap the converted text in its original heading command.
  print "\\" section "{" newtext "}";
  next;

}

# Print the line if we get this far. This is a non-condition with
# a print-only statement.
1